Rana Nasri Ghazzi

Data Analyst & Visualization Expert

Turning Data into Decisions

Stocks

IBM Stocks ETL – Bronze Layer

Tools: Databricks | Pyspark | Pandas| `Numpy` | `delta.tables`

GitHub

Digram

Description:

This project is an automated, scheduled ETL pipeline built in Databricks that ingests IBM daily stock data from an external API and processes it through a two-layer Delta Lake architecture (Bronze → Silver).

The pipeline connects to a financial API that returns the latest 100 trading days of IBM stock data in JSON format — including open, high, low, close prices, and volume — updated every two to three days. Rather than reloading the full dataset on each run, it implements a Change Data Capture (CDC) approach that processes only new or changed records, keeping dashboards current without redundant data movement.

The Bronze layer serves as the raw ingestion and long-term historical archive. On each scheduled run, it compares incoming data against the existing table using a date watermark and appends only new records — ensuring no data is ever overwritten or lost as the 100-day API window shifts forward over time.

IBM_ETL Job – Complete Workflow Description

This is a 3-stage sequential ETL pipeline for processing IBM stock market data with email notifications and performance optimization.

Job Configuration

Name: IBM_ETL
Execution Mode: Queue-enabled (allows multiple runs to queue if previous run is still active)
Performance Target: PERFORMANCE_OPTIMIZED (uses faster compute resources)
Notifications: Sends emails on both success and failure

Task 1: API-Ingestion

Notebook: IBM_landing
Dependencies: None (first task)

What it does:

Loads configuration from config_Parms (catalog, schemas, API key)
Calls Alpha Vantage API to fetch IBM daily stock data (Open, High, Low, Close, Volume)
Transforms API JSON response into pandas DataFrame
Incremental Logic: Queries existing workspace.bronze.ibm table to find the latest date
Filters only NEW records (dates newer than what exists in bronze)
Writes new records as Parquet files to landing zone: /Volumes/workspace/bronze/landing_zone/ibm_landing/

Output: Parquet files ready for streaming ingestion

Task 2: Auto_Loader_bronze

Notebook: IBM_autoLoader_bronze
Dependencies: Waits for API-Ingestion to complete