IT professional survey data – ELT
Tools: Jupyter Notebook | PostgreSQL | Python: Pandas | NumPy | Matplotlib | Seaborn | SciPy
Project Overview
This project implements a multi-layer ETL pipeline for analyzing survey data using the Medallion Architecture pattern. The pipeline extracts raw survey data, performs data quality checks, applies transformations, and prepares clean, normalized data for analysis using Python and PostgreSQL.
Architecture
The project follows the **Medallion Architecture** pattern:
- Bronze Layer: Raw data ingestion from APIs and databases
- Silver Layer: Data cleaning, normalization, and transformation
- Gold Layer: (Future) Aggregated, business-ready analytics
Pipeline Components
Bronze Layer
The Bronze layer handles:
- Data extraction from external APIs
- Connection to PostgreSQL database
- Loading raw survey data
- Initial column filtering and selection
- Raw data storage for future reference
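The Bronze steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the column names, the `bronze_survey` table name, and the inline CSV are all hypothetical, and an in-memory `sqlite3` database stands in for PostgreSQL so the example runs without a server.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical raw export; the real pipeline pulls from APIs and PostgreSQL.
RAW_CSV = """respondent_id,country,currency,comp_total,comp_freq,free_text
1,Germany,EUR,60000,Yearly,some comment
2,India,INR,90000,Monthly,another comment
3,United States,USD,2500,Weekly,third comment
"""

# Columns kept for downstream layers (assumed names, for illustration only).
BRONZE_COLUMNS = ["respondent_id", "country", "currency", "comp_total", "comp_freq"]


def load_bronze(raw_csv: str, conn: sqlite3.Connection) -> pd.DataFrame:
    """Read raw records, keep the selected columns, and store them as-is."""
    raw = pd.read_csv(io.StringIO(raw_csv))
    bronze = raw[BRONZE_COLUMNS].copy()  # initial column filtering
    # Persist the untouched values so later layers can always re-derive from them.
    bronze.to_sql("bronze_survey", conn, if_exists="replace", index=False)
    return bronze


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
bronze_df = load_bronze(RAW_CSV, conn)
stored = pd.read_sql_query("SELECT * FROM bronze_survey", conn)
print(stored)
```

Keeping the Bronze table free of cleaning logic means any Silver transformation can be rerun from the stored raw rows.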
Silver Layer
The Silver layer performs:
- Data quality checks (duplicates, missing values)
- Data cleaning and imputation
- Currency and payment frequency normalization
- Data type standardization
- Statistical analysis preparation
- Chi-square tests for categorical variables
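As one illustration of the last item, a chi-square test of independence between two categorical columns could look like the sketch below. The column names and the toy responses are invented for the example; only `pd.crosstab` and `scipy.stats.chi2_contingency` are real library calls.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy responses with two categorical variables (hypothetical columns and values).
responses = pd.DataFrame({
    "remote_work": [
        "Remote", "Remote", "Remote", "Remote",
        "Hybrid", "Hybrid", "Hybrid", "Hybrid",
        "In-person", "In-person", "In-person", "In-person",
    ],
    "employment": [
        "Full-time", "Full-time", "Full-time", "Part-time",
        "Full-time", "Full-time", "Part-time", "Part-time",
        "Full-time", "Part-time", "Part-time", "Part-time",
    ],
})

# Contingency table of observed counts: rows = remote_work, columns = employment.
table = pd.crosstab(responses["remote_work"], responses["employment"])

# Test the null hypothesis that the two variables are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With samples this small the expected counts fall below the usual rule of thumb of 5 per cell, so the result here only demonstrates the mechanics.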
Earnings Normalization
- Converts various payment frequencies to annual income
- Standardizes multiple currencies to USD
- Uses current exchange rates for accurate conversion
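A minimal sketch of this normalization, assuming hypothetical column names (`currency`, `comp_total`, `comp_freq`) and lookup tables. The exchange rates below are placeholders chosen for easy arithmetic, not real market values; the actual pipeline would supply current rates.

```python
import pandas as pd

# Assumed number of pay periods per year for each frequency label.
PERIODS_PER_YEAR = {"Yearly": 1, "Monthly": 12, "Weekly": 52}

# Placeholder rates (USD per one unit of currency); NOT real exchange rates.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.25, "INR": 0.0125}


def to_annual_usd(df: pd.DataFrame) -> pd.Series:
    """Convert per-period compensation in local currency to annual USD."""
    periods = df["comp_freq"].map(PERIODS_PER_YEAR)  # unknown labels become NaN
    rate = df["currency"].map(USD_PER_UNIT)          # unknown currencies become NaN
    return df["comp_total"] * periods * rate


survey = pd.DataFrame({
    "currency": ["EUR", "INR", "USD"],
    "comp_total": [60000, 90000, 2500],
    "comp_freq": ["Yearly", "Monthly", "Weekly"],
})
survey["annual_usd"] = to_annual_usd(survey)
print(survey)
```

Because unmapped frequencies or currencies produce `NaN` rather than a silent default, rows that need manual review are easy to spot afterwards.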
Data Quality Checks
- Duplicate record detection
- Missing value analysis
- Data type validation
- Statistical distribution analysis
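These checks could be gathered into a single report along the following lines. The helper name `quality_report` and the sample columns are hypothetical; the notebooks may organize the checks differently.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize duplicates, missing values, and column dtypes for a DataFrame."""
    return {
        "rows": len(df),
        # Rows that exactly repeat an earlier row (first occurrence not counted).
        "duplicate_rows": int(df.duplicated().sum()),
        # Count of missing values per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Column dtypes as strings, for comparison against an expected schema.
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


sample = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3],
    "country": ["Germany", "India", "India", None],
    "comp_total": [60000.0, 90000.0, 90000.0, None],
})
report = quality_report(sample)
print(report)

# Basic distribution statistics for numeric columns.
print(sample.describe())
```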
**Note**: This project is under active development. Refer to individual notebooks for detailed implementation and analysis steps.
