Tools: Jupyter Notebook | PostgreSQL | Python: Pandas | NumPy | Matplotlib | Seaborn | SciPy
## Project Overview

This project implements a multi-layer ETL pipeline for analyzing survey data using the Medallion Architecture pattern. The pipeline extracts raw survey data, performs data quality checks, applies transformations, and produces clean, normalized data for analysis using Python and PostgreSQL.
## Dataset

The dataset is a survey conducted among IT professionals, collected and published at the link below:

https://api.example.com/data

It contains 11,551 records and 84 survey columns (raw DataFrame shape: `(11552, 85)`).
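As a quick sanity check, the raw export can be loaded with pandas and its shape compared against the figures above. This is a sketch: the loader name and the inline sample are hypothetical, and the real notebook reads the full survey file from the source URL.

```python
import io

import pandas as pd

def load_survey(source) -> pd.DataFrame:
    """Read the raw survey export and report its shape."""
    df = pd.read_csv(source)
    print(f"Loaded {df.shape[0]} rows x {df.shape[1]} columns")
    return df

# Tiny inline sample standing in for the real 11,551-record file.
sample = io.StringIO(
    "respondent_id,job_title,salary\n"
    "1,Developer,90000\n"
    "2,Analyst,70000\n"
)
df = load_survey(sample)
```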
## Architecture
The project follows the **Medallion Architecture** pattern:
- Bronze Layer: Raw data ingestion from APIs and databases
- Silver Layer: Data cleaning, normalization, and transformation
- Gold Layer: (Future) Aggregated, business-ready analytics
## Pipeline Components
### Bronze Layer

The Bronze layer handles:
- Data extraction from external APIs
- Connection to the PostgreSQL database
- Loading raw survey data
- Initial column filtering and selection
- Raw data storage for future reference

Key Features:
- API data ingestion
- Database connectivity with PostgreSQL
- Column management and initial filtering
- Raw data preservation
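The Bronze steps above can be sketched roughly as follows. The function names, sample records, and the in-memory SQLite stand-in are assumptions for illustration; the notebook itself calls the survey API (e.g. via `requests`) and writes to PostgreSQL rather than SQLite.

```python
import sqlite3

import pandas as pd

def extract_from_api(records) -> pd.DataFrame:
    """Stand-in for the API call; the notebook would fetch JSON from the source URL."""
    return pd.DataFrame.from_records(records)

def load_bronze(df: pd.DataFrame, conn, table="bronze_survey", keep_cols=None) -> int:
    """Filter to the selected columns and persist the raw extract unchanged."""
    if keep_cols is not None:
        df = df[keep_cols]
    df.to_sql(table, conn, if_exists="replace", index=False)
    return len(df)

raw = extract_from_api([
    {"respondent_id": 1, "job_title": "Developer", "salary": 90000, "junk": "x"},
    {"respondent_id": 2, "job_title": "Analyst", "salary": 70000, "junk": "y"},
])
conn = sqlite3.connect(":memory:")  # the pipeline would connect to PostgreSQL instead
n = load_bronze(raw, conn, keep_cols=["respondent_id", "job_title", "salary"])
```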
### Silver Layer

The Silver layer performs:
- Data quality checks (duplicates, missing values)
- Data cleaning and imputation
- Currency and payment frequency normalization
- Data type standardization
- Statistical analysis preparation
- Chi-square tests for categorical variables
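For the chi-square tests on categorical variables, SciPy's `chi2_contingency` can be applied to a contingency table built with `pd.crosstab`. The columns below are hypothetical stand-ins for the survey's real categorical fields:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical columns for illustration only.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "remote": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
})

# Build the contingency table, then test for independence.
table = pd.crosstab(df["gender"], df["remote"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

A small p-value would suggest the two categorical variables are not independent; with real survey fields the table is typically much larger than 2x2.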
Key Operations:
- Duplicate detection and removal
- Missing value identification and imputation
- Earnings normalization across currencies
- Payment frequency standardization to annual income
- Data validation and quality reporting
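Duplicate removal and imputation might look like the following sketch: drop exact duplicates, then fill numeric gaps with the median and categorical gaps with the mode. The imputation strategy here is an assumption; the notebook may apply different rules per column.

```python
import pandas as pd

def clean_silver(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, then impute: median for numeric, mode for categorical."""
    df = df.drop_duplicates().copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Toy frame: one exact duplicate row and missing values in both column types.
raw = pd.DataFrame({
    "salary": [90000.0, None, 70000.0, 70000.0],
    "country": ["US", "US", None, None],
})
clean = clean_silver(raw)
```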
## Running the Pipeline

1. Bronze Layer: Run [Bronze_layer 1.38.59 PM.ipynb](Bronze_layer%201.38.59%20PM.ipynb)
   - Extracts raw data from sources
   - Loads it into PostgreSQL
2. Silver Layer: Run [Silver_layer 1.38.59 PM.ipynb](Silver_layer%201.38.59%20PM.ipynb)
   - Cleans and transforms data
   - Normalizes earnings and currencies
   - Performs quality checks
## Data Transformation Details

### Earnings Normalization

- Converts various payment frequencies to annual income
- Standardizes multiple currencies to USD
- Uses current exchange rates for accurate conversion
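A minimal sketch of this normalization, assuming hypothetical frequency multipliers and exchange rates (the pipeline uses current rates, which would be fetched at run time rather than hard-coded):

```python
import pandas as pd

# Assumed lookup tables; real FX rates would come from a live source.
FREQ_TO_ANNUAL = {"yearly": 1, "monthly": 12, "weekly": 52}
FX_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

def annual_usd(amount, frequency, currency):
    """Scale a reported payment to annual income, then convert to USD."""
    return amount * FREQ_TO_ANNUAL[frequency] * FX_TO_USD[currency]

df = pd.DataFrame({
    "amount": [5000, 60000],
    "frequency": ["monthly", "yearly"],
    "currency": ["EUR", "USD"],
})
df["annual_usd"] = [
    annual_usd(a, f, c)
    for a, f, c in df[["amount", "frequency", "currency"]].itertuples(index=False)
]
```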
### Data Quality Checks

- Duplicate record detection
- Missing value analysis
- Data type validation
- Statistical distribution analysis
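These checks can be bundled into a small audit function. The report keys and the toy frame below are illustrative, not the pipeline's actual output format:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize duplicates, missing values, and dtypes for a quick data audit."""
    return {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    }

# Toy frame: one duplicate row and one missing salary.
df = pd.DataFrame({"salary": [1.0, None, 1.0], "role": ["dev", "dev", "dev"]})
report = quality_report(df)
```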
## Security Notes

- Database credentials are stored separately (not in version control)
- API keys and sensitive data are managed in external configuration files
- Follow the principle of least privilege for database access
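One common way to keep credentials out of version control is to assemble the connection URL from environment variables. The variable names below are assumptions; the project's external configuration files may use different keys:

```python
import os

def build_db_url() -> str:
    """Assemble a PostgreSQL URL from environment variables so no secret is committed."""
    user = os.environ.get("PGUSER", "analyst")
    password = os.environ.get("PGPASSWORD", "")
    host = os.environ.get("PGHOST", "localhost")
    db = os.environ.get("PGDATABASE", "survey")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"

url = build_db_url()
```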
## Future Enhancements

- [ ] Implement Gold layer for business analytics
- [ ] Add automated data validation rules
- [ ] Create data quality dashboard
- [ ] Implement incremental data loading
- [ ] Add comprehensive error handling
- [ ] Create automated testing suite
- [ ] Add CI/CD pipeline integration
- [ ] Implement data versioning
## License

This project is for educational and analysis purposes.
## Author

Rana
**Note**: This project is under active development. Refer to individual notebooks for detailed implementation and analysis steps.
