Tools: Jupyter Notebook | PostgreSQL | Python: Pandas | NumPy | Matplotlib | Seaborn | SciPy
## Project Overview

This project implements a multi-layer ETL pipeline for analyzing survey data using the Medallion Architecture pattern. The pipeline extracts raw survey data, performs data quality checks, applies transformations, and produces clean, normalized data for analysis using Python and PostgreSQL.
## Dataset

The dataset is a survey conducted among IT professionals, collected and published at the link below:

https://api.example.com/data

It contains 11,551 records and 84 survey columns (raw DataFrame shape: `(11552, 85)`).
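As a quick sanity check, the raw export can be loaded with pandas and its shape compared against the figures above. This is a sketch: the loader name and the inline sample are hypothetical, and the real notebook reads the full survey file from the source URL.

```python
import io

import pandas as pd

def load_survey(source) -> pd.DataFrame:
    """Read the raw survey export and report its shape."""
    df = pd.read_csv(source)
    print(f"Loaded {df.shape[0]} rows x {df.shape[1]} columns")
    return df

# Tiny inline sample standing in for the real 11,551-record file.
sample = io.StringIO(
    "respondent_id,job_title,salary\n"
    "1,Developer,90000\n"
    "2,Analyst,70000\n"
)
df = load_survey(sample)
```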
## Architecture
The project follows the **Medallion Architecture** pattern:
- Bronze Layer: Raw data ingestion from APIs and databases
- Silver Layer: Data cleaning, normalization, and transformation
- Gold Layer: (Future) Aggregated, business-ready analytics
## Pipeline Components
### Bronze Layer

The Bronze layer handles:
- Data extraction from external APIs
- Connection to the PostgreSQL database
- Loading raw survey data
- Initial column filtering and selection
- Raw data storage for future reference

Key Features:
- API data ingestion
- Database connectivity with PostgreSQL
- Column management and initial filtering
- Raw data preservation
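The Bronze steps above can be sketched roughly as follows. The function names, sample records, and the in-memory SQLite stand-in are assumptions for illustration; the notebook itself calls the survey API (e.g. via `requests`) and writes to PostgreSQL rather than SQLite.

```python
import sqlite3

import pandas as pd

def extract_from_api(records) -> pd.DataFrame:
    """Stand-in for the API call; the notebook would fetch JSON from the source URL."""
    return pd.DataFrame.from_records(records)

def load_bronze(df: pd.DataFrame, conn, table="bronze_survey", keep_cols=None) -> int:
    """Filter to the selected columns and persist the raw extract unchanged."""
    if keep_cols is not None:
        df = df[keep_cols]
    df.to_sql(table, conn, if_exists="replace", index=False)
    return len(df)

raw = extract_from_api([
    {"respondent_id": 1, "job_title": "Developer", "salary": 90000, "junk": "x"},
    {"respondent_id": 2, "job_title": "Analyst", "salary": 70000, "junk": "y"},
])
conn = sqlite3.connect(":memory:")  # the pipeline would connect to PostgreSQL instead
n = load_bronze(raw, conn, keep_cols=["respondent_id", "job_title", "salary"])
```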
### Silver Layer

The Silver layer performs:
- Data quality checks (duplicates, missing values)
- Data cleaning and imputation
- Currency and payment frequency normalization
- Data type standardization
- Statistical analysis preparation
- Chi-square tests for categorical variables
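For the chi-square tests on categorical variables, SciPy's `chi2_contingency` can be applied to a contingency table built with `pd.crosstab`. The columns below are hypothetical stand-ins for the survey's real categorical fields:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical columns for illustration only.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "remote": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
})

# Build the contingency table, then test for independence.
table = pd.crosstab(df["gender"], df["remote"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

A small p-value would suggest the two categorical variables are not independent; with real survey fields the table is typically much larger than 2x2.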
Key Operations:
- Duplicate detection and removal
- Missing value identification and imputation
- Earnings normalization across currencies
- Payment frequency standardization to annual income
- Data validation and quality reporting
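Duplicate removal and imputation might look like the following sketch: drop exact duplicates, then fill numeric gaps with the median and categorical gaps with the mode. The imputation strategy here is an assumption; the notebook may apply different rules per column.

```python
import pandas as pd

def clean_silver(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, then impute: median for numeric, mode for categorical."""
    df = df.drop_duplicates().copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Toy frame: one exact duplicate row and missing values in both column types.
raw = pd.DataFrame({
    "salary": [90000.0, None, 70000.0, 70000.0],
    "country": ["US", "US", None, None],
})
clean = clean_silver(raw)
```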
## Running the Pipeline

1. Bronze Layer: Run [Bronze_layer 1.38.59 PM.ipynb](Bronze_layer%201.38.59%20PM.ipynb)
   - Extracts raw data from sources
   - Loads it into PostgreSQL
2. Silver Layer: Run [Silver_layer 1.38.59 PM.ipynb](Silver_layer%201.38.59%20PM.ipynb)
   - Cleans and transforms data
   - Normalizes earnings and currencies
   - Performs quality checks
## Data Transformation Details

### Earnings Normalization

- Converts various payment frequencies to annual income
- Standardizes multiple currencies to USD
- Uses current exchange rates for accurate conversion
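A minimal sketch of this normalization, assuming hypothetical frequency multipliers and exchange rates (the pipeline uses current rates, which would be fetched at run time rather than hard-coded):

```python
import pandas as pd

# Assumed lookup tables; real FX rates would come from a live source.
FREQ_TO_ANNUAL = {"yearly": 1, "monthly": 12, "weekly": 52}
FX_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

def annual_usd(amount, frequency, currency):
    """Scale a reported payment to annual income, then convert to USD."""
    return amount * FREQ_TO_ANNUAL[frequency] * FX_TO_USD[currency]

df = pd.DataFrame({
    "amount": [5000, 60000],
    "frequency": ["monthly", "yearly"],
    "currency": ["EUR", "USD"],
})
df["annual_usd"] = [
    annual_usd(a, f, c)
    for a, f, c in df[["amount", "frequency", "currency"]].itertuples(index=False)
]
```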
### Data Quality Checks

- Duplicate record detection
- Missing value analysis
- Data type validation
- Statistical distribution analysis
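These checks can be bundled into a small audit function. The report keys and the toy frame below are illustrative, not the pipeline's actual output format:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize duplicates, missing values, and dtypes for a quick data audit."""
    return {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    }

# Toy frame: one duplicate row and one missing salary.
df = pd.DataFrame({"salary": [1.0, None, 1.0], "role": ["dev", "dev", "dev"]})
report = quality_report(df)
```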
## Security Notes

- Database credentials are stored separately (not in version control)
- API keys and sensitive data are managed in external configuration files
- Follow the principle of least privilege for database access
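One common way to keep credentials out of version control is to assemble the connection URL from environment variables. The variable names below are assumptions; the project's external configuration files may use different keys:

```python
import os

def build_db_url() -> str:
    """Assemble a PostgreSQL URL from environment variables so no secret is committed."""
    user = os.environ.get("PGUSER", "analyst")
    password = os.environ.get("PGPASSWORD", "")
    host = os.environ.get("PGHOST", "localhost")
    db = os.environ.get("PGDATABASE", "survey")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"

url = build_db_url()
```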
## Future Enhancements

- [ ] Implement Gold layer for business analytics
- [ ] Add automated data validation rules
- [ ] Create data quality dashboard
- [ ] Implement incremental data loading
- [ ] Add comprehensive error handling
- [ ] Create automated testing suite
- [ ] Add CI/CD pipeline integration
- [ ] Implement data versioning
## License

This project is for educational and analysis purposes.
## Author

Rana
**Note**: This project is under active development. Refer to individual notebooks for detailed implementation and analysis steps.
