
Featured Projects:
A selection of data engineering and analytics work
A portfolio of ETL and Analytics projects built on the Databricks ecosystem using Medallion Architecture. Each project ingests data from a different source platform using a purpose-fit ingestion method, applies transformations through the Silver layer, and delivers a Gold layer ready for AI and analytics workloads. Every pipeline showcases a distinct Databricks capability — reflecting real-world design choices tailored to different data sources, volumes, and business needs.
Several projects include embedded analytics and visualizations to surface data quality issues and deliver actionable insights for business users. All pipelines are production-ready, implemented with DAB (Databricks Asset Bundles) for deployment automation and managed under version control via GitHub.
Note: These projects have been intentionally designed to demonstrate a variety of ingestion modes and methods. Certain pipelines have been restructured (for example, some sources use streaming rather than batch ingestion) to showcase different data engineering concepts across the pipeline inventory.
Data Ingestion Pipeline Inventory
| Pipeline Name | Source | Source Type | Ingestion Type | Load Mode | Target Architecture | Table Type | Schedule / Trigger | Monitoring | Data Quality (QA) |
|---|---|---|---|---|---|---|---|---|---|
| AirFlights | Aviation API | JSON | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Daily batch | Custom logging | Filter rules + row count checks |
| GA4 | Google Analytics 4 | Lakehouse Connect - Event Stream | Incremental (DLT live tables) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Wikimedia | Wikimedia API | JSON | Incremental (DLT live tables) | Append + Merge (SCD) | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Stocks | Alpha Vantage API | JSON | Incremental (DLT live tables) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| IBM | Alpha Vantage API | JSON | Incremental (Structured Streaming) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Survey | Survey Files | CSV / Excel | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |
| Uber | Uber Trip Files | CSV | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |
| Cat Breeds | Cat API | JSON | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |
AirFlights
An end-to-end ETL pipeline built with Python, SerpApi, and Delta Lake that ingests raw Google Flights data, transforms it through a medallion architecture, and produces aggregated flight intelligence for trip planning.
| Bronze Layer | Silver Layer | Gold Layer |
|---|---|---|
| Extracting and storing raw flight data to a Delta table in Databricks. | Processing raw flight data from Bronze tables into cleaned, curated Silver tables. | Processing cleaned flight data to create business-ready analytics for round-trip flight combinations. |
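
The Gold layer's round-trip combinations can be illustrated with a small sketch. This is not the project's code: the `Flight` fields, the airport codes, and the prices are hypothetical, and plain Python stands in for what would likely be a Spark join in Databricks.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Flight:
    origin: str
    destination: str
    date: str      # ISO date, so string comparison orders chronologically
    price: float


def round_trip_combinations(outbound, inbound):
    """Pair each outbound flight with every later matching return, cheapest first."""
    combos = [
        {"outbound": out, "return": back, "total_price": out.price + back.price}
        for out, back in product(outbound, inbound)
        # A valid return flies the reverse route and departs after the outbound leg.
        if back.origin == out.destination
        and back.destination == out.origin
        and back.date > out.date
    ]
    return sorted(combos, key=lambda c: c["total_price"])


# Hypothetical sample data
outbound = [
    Flight("YUL", "LIS", "2025-06-01", 420.0),
    Flight("YUL", "LIS", "2025-06-03", 380.0),
]
inbound = [
    Flight("LIS", "YUL", "2025-06-02", 300.0),
    Flight("LIS", "YUL", "2025-06-10", 350.0),
]
combos = round_trip_combinations(outbound, inbound)
best = combos[0]
```

The 2025-06-03 outbound cannot pair with the 2025-06-02 return, so only three of the four possible pairs survive the date filter.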
Stocks ETL Pipeline
A production-ready ETL pipeline for processing stock market data using the Databricks Lakehouse architecture. This project implements a medallion architecture (Bronze → Silver → Gold) with automated data quality checks and orchestrated execution.
| Bronze Layer | Silver Layer | Gold Layer |
|---|---|---|
| Raw, unprocessed stock data ingestion with full historical refresh capability | Cleaned and validated data with quality checks applied | Business-ready aggregations and analytics-optimized datasets |
IBM Stocks
This project is an automated, scheduled ETL pipeline built in Databricks that ingests IBM daily stock data from an external API and processes it through a two-layer Delta Lake architecture (Bronze → Silver).
| Bronze Layer | Silver Layer | Dashboard – Tableau |
|---|---|---|
| Incremental raw data ingestion that implements Change Data Capture (CDC) using a watermark approach. | Silver layer Delta upsert pipeline that turns raw ingested data into curated records. | Interactive Tableau dashboards surfacing IBM stock performance metrics and market trends for business consumption. |
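
The watermark-based incremental load and the upsert into Silver can be sketched in plain Python. This is an illustration only: the row shape (`date`, `close`), the sample values, and the `incremental_load` helper are assumptions, and a dictionary keyed by trade date stands in for a Delta table and its merge.

```python
def incremental_load(source_rows, silver, watermark):
    """Upsert rows newer than the watermark into `silver`, keyed by trade date.

    Returns the new watermark: the latest date ingested so far.
    """
    # Watermark filter: keep only rows that have not been ingested yet.
    new_rows = [r for r in source_rows if r["date"] > watermark]
    for row in new_rows:
        # Upsert: insert unseen keys, overwrite existing ones.
        silver[row["date"]] = row
    return max([watermark] + [r["date"] for r in new_rows])


# Hypothetical API payloads
api_rows = [
    {"date": "2025-01-02", "close": 220.1},
    {"date": "2025-01-03", "close": 222.4},
]
silver = {}
wm = incremental_load(api_rows, silver, "1900-01-01")  # initial load

api_rows.append({"date": "2025-01-06", "close": 219.8})
wm = incremental_load(api_rows, silver, wm)            # picks up only the new day
```

Because the target is keyed by date, re-running the load with the same watermark leaves the table unchanged, which is the property an upsert is meant to provide.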
GA4 Pipeline
This project builds a production-grade data pipeline that ingests raw Google Analytics 4 (GA4) event data, cleans and flattens it, and makes it ready for analysis. It is built on Databricks using Delta Live Tables (DLT) and PySpark, following the Medallion Architecture pattern.
| INGESTION | SILVER LAYER | GOLD LAYER |
|---|---|---|
| COMING SOON |
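
Until the layer write-ups are published, the flattening step described for the GA4 pipeline can be sketched as follows. The nested `event_params` shape is modeled on the GA4 BigQuery export schema and is an assumption here, as are the `param_` column prefix and the sample event. The actual pipeline uses DLT and PySpark, not plain Python.

```python
def flatten_event(event):
    """Lift each event_params entry of a GA4-style event into a top-level column."""
    flat = {k: v for k, v in event.items() if k != "event_params"}
    for param in event.get("event_params", []):
        # Each value struct carries one populated field (string_value, int_value, ...).
        value = next((v for v in param["value"].values() if v is not None), None)
        flat[f"param_{param['key']}"] = value
    return flat


# Hypothetical raw event
raw = {
    "event_name": "page_view",
    "event_timestamp": 1717200000000000,
    "event_params": [
        {"key": "page_location",
         "value": {"string_value": "https://example.com/", "int_value": None}},
        {"key": "engagement_time_msec",
         "value": {"string_value": None, "int_value": 1200}},
    ],
}
flat = flatten_event(raw)
```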
IT Professional Survey
A multi-layer ETL pipeline for analyzing IT professional survey data using Medallion Architecture with Python and PostgreSQL.
| ETL | Analysis – EDA |
|---|---|
| Processing a large IT professional survey dataset through a structured three-layer Medallion Architecture. | In-depth Exploratory Data Analysis (EDA). |
Medallion ETL – Dashboard – EDA
| Crypto – Medallion ETL | Uber Drive | Cat Breed |
|---|---|---|
| Processed cryptocurrency market data through a Medallion ETL pipeline. | Analyzed Uber ride data to improve operational efficiency metrics. DASHBOARD |
Coming Soon
Wikimedia Live Edit Stream — Custom PySpark Structured Streaming Pipeline
A real-time streaming data pipeline built on Apache Spark (Databricks) that continuously polls the Wikipedia/Wikimedia API for recent edits and processes them as a structured stream.
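
Because consecutive polls of a recent-changes endpoint usually overlap, a polling pipeline needs some form of deduplication before edits enter the stream. The sketch below shows one simple way to do that. It assumes each edit carries a unique `rcid` field. The sample records and the `new_edits` helper are hypothetical, and the project itself may handle this differently, for example with Spark's own deduplication.

```python
def new_edits(batch, seen_ids):
    """Return the edits in a polled batch that were not seen before, recording their ids."""
    fresh = []
    for edit in batch:
        if edit["rcid"] not in seen_ids:
            seen_ids.add(edit["rcid"])
            fresh.append(edit)
    return fresh


# Two hypothetical polls that overlap on rcid 102
seen = set()
first_poll = [
    {"rcid": 101, "title": "Delta Lake"},
    {"rcid": 102, "title": "Apache Spark"},
]
second_poll = [
    {"rcid": 102, "title": "Apache Spark"},
    {"rcid": 103, "title": "Databricks"},
]
first = new_edits(first_poll, seen)
second = new_edits(second_poll, seen)
```

In a long-running stream the set of seen ids would need to be bounded, for instance by keeping only ids from a recent time window.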
Databricks Pipeline Monitoring Framework
Built a lightweight, native monitoring framework for Databricks pipelines that tracks both operational health and data quality in a single queryable system. The framework captures per-run metrics — duration, row counts, SLO compliance — alongside granular quality check results, all linked by a shared run_id key stored in two Delta tables in Unity Catalog.
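
A minimal sketch of the two-table design, with run metrics and quality check results linked by `run_id`. The field names, the SLO threshold, and the helper functions are assumptions for illustration. Python lists stand in for the two Delta tables, and the join would be a SQL query in practice.

```python
import uuid
from dataclasses import dataclass


@dataclass
class RunMetric:
    run_id: str
    pipeline: str
    duration_s: float
    row_count: int
    slo_met: bool


@dataclass
class QualityCheck:
    run_id: str
    check_name: str
    passed: bool


def record_run(pipeline, duration_s, row_count, slo_s, checks):
    """Build one run-metrics row and its quality-check rows, sharing a run_id."""
    run_id = str(uuid.uuid4())
    metric = RunMetric(run_id, pipeline, duration_s, row_count, duration_s <= slo_s)
    results = [QualityCheck(run_id, name, ok) for name, ok in checks.items()]
    return metric, results


def failed_checks(metrics, checks):
    """Join the two 'tables' on run_id and list (pipeline, check) pairs that failed."""
    by_run = {m.run_id: m for m in metrics}
    return [
        (by_run[c.run_id].pipeline, c.check_name)
        for c in checks
        if not c.passed and c.run_id in by_run
    ]


# Hypothetical run: 95 s against a 120 s SLO, one failing check
metric, results = record_run(
    "stocks", duration_s=95.0, row_count=1200, slo_s=120.0,
    checks={"no_null_close": True, "price_in_range": False},
)
failures = failed_checks([metric], results)
```

Keeping operational metrics and check results in separate tables lets each grow at its own granularity, while the shared key still allows a single query to relate a failed check to the run that produced it.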
