A portfolio of ETL and Analytics projects built on the Databricks ecosystem using Medallion Architecture. Each project ingests data from a different source platform using a purpose-fit ingestion method, applies transformations through the Silver layer, and delivers a Gold layer ready for AI and analytics workloads. Every pipeline showcases a distinct Databricks capability — reflecting real-world design choices tailored to different data sources, volumes, and business needs.

Several projects include embedded analytics and visualizations to surface data quality issues and deliver actionable insights for business users. All pipelines are production-ready, implemented with DAB (Databricks Asset Bundles) for deployment automation and managed under version control via GitHub.

Note: These projects have been intentionally designed to demonstrate a variety of ingestion modes and methods. Certain pipelines have been restructured (for example, some sources use streaming rather than batch ingestion) to showcase different data engineering concepts across the pipeline inventory.


Data Ingestion Pipeline Inventory

| Pipeline Name | Source | Source Type | Ingestion Type | Load Mode | Target Architecture | Table Type | Schedule / Trigger | Monitoring | Data Quality (QA) |
|---|---|---|---|---|---|---|---|---|---|
| AirFlights | Aviation API | JSON | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Daily batch | Custom logging | Filter rules + row count checks |
| GA4 | Google Analytics 4 | Lakehouse Connect - Event Stream | Incremental (DLT live tables) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Wikimedia | Wikimedia API | JSON | Incremental (DLT live tables) | Append + Merge (SCD) | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Stocks | Alpha Vantage API | JSON | Incremental (DLT live tables) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| IBM | Alpha Vantage API | JSON | Incremental (Structured Streaming) | Append | Medallion (Bronze-Silver-Gold) | Streaming | Daily batch | DLT built-in event log | DLT built-in expectations |
| Survey | Survey Files | CSV / Excel | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |
| Uber | Uber Trip Files | CSV | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |
| Cat Breeds | Cat API | JSON | Full refresh | Overwrite | Medallion (Bronze-Silver-Gold) | Delta | Weekly batch | Custom logging | Filter rules |

A production-ready ETL pipeline for processing stock market data using the Databricks Lakehouse architecture. This project implements a medallion architecture (Bronze → Silver → Gold) with automated data quality checks and orchestrated execution.

This project is an automated, scheduled ETL pipeline built in Databricks that ingests IBM daily stock data from an external API and processes it through a two-layer Delta Lake architecture (Bronze → Silver).
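As a rough illustration of the Bronze → Silver step, the sketch below flattens a raw daily time-series payload into typed rows. It assumes the payload follows the JSON shape of Alpha Vantage's TIME_SERIES_DAILY endpoint; the function name `bronze_to_silver`, the output field names, and the sample values are hypothetical and not taken from the project, which runs on Spark and Delta Lake rather than plain Python.

```python
from datetime import date
from typing import Any


def bronze_to_silver(payload: dict[str, Any], symbol: str) -> list[dict[str, Any]]:
    """Flatten a raw daily time-series payload into typed rows sorted by date."""
    # Assumption: the raw (Bronze) payload keeps the API's nested structure.
    series = payload.get("Time Series (Daily)", {})
    rows = []
    for day, bar in series.items():
        rows.append(
            {
                "symbol": symbol,
                "trade_date": date.fromisoformat(day),
                "open": float(bar["1. open"]),
                "high": float(bar["2. high"]),
                "low": float(bar["3. low"]),
                "close": float(bar["4. close"]),
                "volume": int(bar["5. volume"]),
            }
        )
    return sorted(rows, key=lambda r: r["trade_date"])


# Made-up sample values, for illustration only.
sample_payload = {
    "Meta Data": {"2. Symbol": "IBM"},
    "Time Series (Daily)": {
        "2024-01-03": {"1. open": "161.0", "2. high": "162.0", "3. low": "159.5",
                       "4. close": "160.0", "5. volume": "4000000"},
        "2024-01-02": {"1. open": "162.5", "2. high": "163.0", "3. low": "160.5",
                       "4. close": "161.5", "5. volume": "3500000"},
    },
}

silver_rows = bronze_to_silver(sample_payload, "IBM")
```

Casting strings to numeric and date types at this stage means downstream queries do not have to repeat the parsing.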

A multi-layer ETL pipeline for analyzing IT professional survey data using Medallion Architecture with Python and PostgreSQL.

A real-time streaming data pipeline built on Apache Spark (Databricks) that continuously polls the Wikipedia/Wikimedia API for recent edits and processes them as a structured stream.
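One practical concern with repeated polling is that consecutive responses can overlap. The sketch below shows one way to keep only unseen edits between polls. It assumes each change record carries a unique `rcid` field, as in the MediaWiki `recentchanges` API; the function name `new_edits` and the sample records are hypothetical, and how the project itself handles overlap is not stated here.

```python
from typing import Any


def new_edits(batch: list[dict[str, Any]], seen_ids: set[int]) -> list[dict[str, Any]]:
    """Return only changes whose rcid has not been seen; update seen_ids in place."""
    fresh = []
    for change in batch:
        rcid = change["rcid"]
        if rcid not in seen_ids:
            seen_ids.add(rcid)
            fresh.append(change)
    return fresh


# Two simulated polls with one overlapping record (rcid 102). Sample data is made up.
poll_1 = [
    {"rcid": 101, "title": "Page A", "type": "edit"},
    {"rcid": 102, "title": "Page B", "type": "edit"},
]
poll_2 = [
    {"rcid": 102, "title": "Page B", "type": "edit"},
    {"rcid": 103, "title": "Page C", "type": "new"},
]

seen: set[int] = set()
first = new_edits(poll_1, seen)
second = new_edits(poll_2, seen)
```

In a long-running job the set of seen ids would need to be bounded, for example by dropping ids older than a time window.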

Databricks Pipeline Monitoring Framework

Built a lightweight, native monitoring framework for Databricks pipelines that tracks both operational health and data quality in a single queryable system. The framework captures per-run metrics — duration, row counts, SLO compliance — alongside granular quality check results, all linked by a shared run_id key stored in two Delta tables in Unity Catalog.
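The two-table design can be sketched in plain Python as two record types joined on `run_id`. This is a minimal illustration under assumptions: the class names, field names, and the `run_summary` helper are hypothetical, and the actual framework stores these records in Delta tables rather than in-memory objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class RunMetrics:
    """One row per pipeline run (operational health)."""
    run_id: str
    pipeline: str
    started_at: datetime
    finished_at: datetime
    row_count: int
    slo_seconds: float  # assumed SLO: maximum allowed run duration

    @property
    def duration_seconds(self) -> float:
        return (self.finished_at - self.started_at).total_seconds()

    @property
    def slo_met(self) -> bool:
        return self.duration_seconds <= self.slo_seconds


@dataclass
class QualityCheck:
    """One row per quality check, linked to its run by run_id."""
    run_id: str
    check_name: str
    passed: bool


def run_summary(run: RunMetrics, checks: list[QualityCheck]) -> dict[str, Any]:
    """Join a run with its quality checks on run_id."""
    own = [c for c in checks if c.run_id == run.run_id]
    return {
        "run_id": run.run_id,
        "duration_seconds": run.duration_seconds,
        "slo_met": run.slo_met,
        "failed_checks": [c.check_name for c in own if not c.passed],
    }


run = RunMetrics("run-001", "stocks", datetime(2024, 1, 1, 2, 0),
                 datetime(2024, 1, 1, 2, 5), row_count=1200, slo_seconds=600)
checks = [
    QualityCheck("run-001", "row_count_positive", True),
    QualityCheck("run-001", "no_null_symbol", False),
    QualityCheck("run-002", "row_count_positive", False),  # belongs to another run
]
summary = run_summary(run, checks)
```

Keeping run metrics and check results in separate records avoids repeating run-level fields for every check, while the shared key still allows a single query to answer both "did the run meet its SLO?" and "which checks failed?".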