Project: Crypto Medallion ETL (Bronze → Silver → Gold)
Overview
A multi-layer (Bronze → Silver → Gold) ETL pipeline that ingests live cryptocurrency market data from the CoinGecko API, stores raw and cleaned records in PostgreSQL, and produces daily aggregated insights. The pipeline is orchestrated by pipeline.py, version-controlled on GitHub, and implemented in Python (Anaconda). The Claude AI agent assisted during development. Logs are written to etl.log.
Goals
Ingest top N coins from CoinGecko on an hourly cadence.
Preserve raw extracted records (Bronze) for auditing and reprocessing.
Enrich and normalize data (Silver) for analysis-ready consumption.
Produce daily aggregated summaries and top-movers (Gold) for reporting and downstream analytics.
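The first goal, ingesting the top N coins, maps onto CoinGecko's /coins/markets endpoint. A minimal sketch of that request (the function names and defaults here are illustrative assumptions, not the project's actual extract.py):

```python
# Sketch of the Bronze extraction step; /coins/markets is CoinGecko's real
# markets endpoint, but the function names and N=100 default are assumptions.
import json
import urllib.parse
import urllib.request

COINGECKO_MARKETS = "https://api.coingecko.com/api/v3/coins/markets"

def markets_url(n, vs_currency="usd"):
    """Build the top-N-by-market-cap request URL."""
    params = urllib.parse.urlencode({
        "vs_currency": vs_currency,
        "order": "market_cap_desc",
        "per_page": n,
        "page": 1,
    })
    return f"{COINGECKO_MARKETS}?{params}"

def fetch_top_coins(n=100):
    """Fetch the top N coins by market cap; returns a list of dicts."""
    with urllib.request.urlopen(markets_url(n), timeout=30) as resp:
        return json.load(resp)
```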
Architecture — Medallion Layers
CoinGecko API
▼
Bronze Layer → coins table
▼
Silver Layer → coins_silver table
▼
Gold Layer → gold_daily_summary + gold_top_movers
Bronze: extract.py → transform.py → load.py → coins table
Silver: silver.py → coins_silver table
Gold: gold.py → gold_daily_summary + gold_top_movers tables
Orchestration: pipeline.py runs all three layers end-to-end
Automation: a cron job runs pipeline.py every hour (add manually via crontab -e)
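The hourly schedule added via crontab -e could be a single entry like the following sketch (the interpreter path matches the Environment section; the repository path is a placeholder):

```shell
# m h dom mon dow  command: run the full pipeline at the top of every hour
0 * * * * $HOME/opt/anaconda3/bin/python /path/to/crypto-medallion-etl/pipeline.py >> /path/to/crypto-medallion-etl/etl.log 2>&1
```

Redirecting stdout and stderr into etl.log keeps cron's output alongside the pipeline's own log lines.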
Data Flow (High-level)
- pipeline.py triggers extraction, transformation, and loading for Bronze.
- Bronze layer persists cleaned raw snapshots to coins.
- Silver layer reads recent Bronze snapshots, enriches data (derived fields, currency normalization), and writes to coins_silver.
- Gold layer computes daily summaries and top movers and writes to gold_daily_summary and gold_top_movers.
- Cron schedules the pipeline hourly.
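The end-to-end run described above could take the following shape; the layer function names and the stop-on-failure policy are assumptions, not necessarily how pipeline.py is actually written:

```python
# Sketch of pipeline.py orchestration; run_bronze/run_silver/run_gold are
# hypothetical callables standing in for the real layer entry points.
import logging

logging.basicConfig(filename="etl.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_layer(name, step):
    """Run one medallion layer, logging success or failure."""
    try:
        step()
        logging.info("%s layer completed", name)
        return True
    except Exception:
        logging.exception("%s layer failed", name)
        return False

def run_pipeline(run_bronze, run_silver, run_gold):
    """Run Bronze -> Silver -> Gold in order, stopping at the first failure."""
    for name, step in [("Bronze", run_bronze),
                       ("Silver", run_silver),
                       ("Gold", run_gold)]:
        if not run_layer(name, step):
            return False
    return True
```

Stopping at the first failed layer prevents Silver or Gold from aggregating over a partial Bronze snapshot.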
Project Structure
├── pipeline.py # Orchestrates full Bronze → Silver → Gold run
├── extract.py # Fetch top N coins from CoinGecko API
├── transform.py # Select and clean fields (Bronze)
├── load.py # Insert into Bronze coins table
├── silver.py # Extract from Bronze, enrich, load into Silver
├── config.py # DB connection via .env credentials
├── db_setup.sql # SQL to create coins and coins_silver tables
├── etl.log # Auto-generated pipeline log
├── test_extract.py # Unit test for extract
├── test_transform.py # Unit test for transform
├── test_load.py # End-to-end Bronze pipeline test
└── test_silver.py # End-to-end Silver pipeline test
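The enrichment step in silver.py (derived fields, currency normalization) might look like this sketch; current_price and ath are real fields in CoinGecko's market payload, but the derived fields and the hard-coded rate are illustrative assumptions:

```python
# Hypothetical Silver enrichment of one Bronze record; the fixed USD→EUR
# rate and the derived field names are assumptions for illustration.
def enrich(row, usd_to_eur=0.92):
    """Return a copy of a Bronze record with derived fields added."""
    out = dict(row)
    price = row["current_price"]
    out["price_eur"] = round(price * usd_to_eur, 6)  # currency normalization
    ath = row.get("ath")
    # percent below all-time high, if an ATH is present
    out["pct_off_ath"] = round((ath - price) / ath * 100, 2) if ath else None
    return out
```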
Environment
- Python 3.8 (Anaconda) — ~/opt/anaconda3/bin/python
- PostgreSQL with psycopg2
- Credentials stored in .env (not committed to git)
- Logs written to etl.log
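config.py's job (DB connection via .env credentials) could be sketched as below; the DB_* variable names are conventional assumptions, and a loader such as python-dotenv is assumed to have populated the environment from .env:

```python
# Sketch of config.py; the DB_* variable names are assumptions, and a
# loader like python-dotenv would populate them from .env before this runs.
import os

def db_config():
    """Read PostgreSQL connection settings from the environment."""
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ.get("DB_NAME", "crypto"),
        "user": os.environ["DB_USER"],          # required, no default
        "password": os.environ["DB_PASSWORD"],  # required, no default
    }

def get_connection():
    """Open a psycopg2 connection; imported lazily inside the function."""
    import psycopg2
    return psycopg2.connect(**db_config())
```

Keeping credentials out of the repository and reading them only through the environment is what lets .env stay uncommitted.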