Tools: Databricks | Pyspark | Pandas| Numpy | delta.tables
Description:
This project is an automated, scheduled ETL pipeline built in Databricks that ingests IBM daily stock data from an external API and processes it through a two-layer Delta Lake architecture (Bronze → Silver).
The pipeline connects to a financial API that returns the latest 100 trading days of IBM stock data in JSON format — including open, high, low, close prices, and volume — updated every two to three days. Rather than reloading the full dataset on each run, it implements a Change Data Capture (CDC) approach that processes only new or changed records, keeping dashboards current without redundant data movement.
The Bronze layer serves as the raw ingestion and long-term historical archive. On each scheduled run, it compares incoming data against the existing table using a date watermark and appends only new records — ensuring no data is ever overwritten or lost as the 100-day API window shifts forward over time.
IBM_ETL Job – Complete Workflow Description
This is a 3-stage sequential ETL pipeline for processing IBM stock market data with email notifications and performance optimization.
Job Configuration
- Name: IBM_ETL
- Execution Mode: Queue-enabled (allows multiple runs to queue if previous run is still active)
- Performance Target: PERFORMANCE_OPTIMIZED (uses faster compute resources)
- Notifications: Sends emails on both success and failure
Task 1: API-Ingestion
Notebook: IBM_landing
Dependencies: None (first task)
What it does:
- Loads configuration from config_Parms (catalog, schemas, API key)
- Calls Alpha Vantage API to fetch IBM daily stock data (Open, High, Low, Close, Volume)
- Transforms API JSON response into pandas DataFrame
- Incremental Logic: Queries existing workspace.bronze.ibm table to find the latest date
- Filters only NEW records (dates newer than what exists in bronze)
- Writes new records as Parquet files to landing zone: /Volumes/workspace/bronze/landing_zone/ibm_landing/
Output: Parquet files ready for streaming ingestion
Task 2: Auto_Loader_bronze
Notebook: IBM_autoLoader_bronze
Dependencies: Waits for API-Ingestion to complete
What it does:
- Loads configuration from config_Parms
- Uses Auto Loader (cloudFiles) to automatically detect new Parquet files in landing zone
- Streams data with:
- Schema location: /Volumes/workspace/bronze/schemas/ibm_stream
- Checkpoint location: /Volumes/workspace/bronze/checkpoints/ibm_stream
- Merge schema enabled (handles schema evolution)
- Writes streaming data to Delta table: workspace.bronze.ibm
- Uses trigger(availableNow=True) for micro-batch processing (processes all available data then stops)
Output: Raw data in bronze Delta table with exactly-once processing semantics
