Tools: Databricks | PySpark | Pandas | NumPy | delta.tables
Description:
This repo contains a Databricks notebook that is intended to run on a schedule to pull IBM daily stock data from an API and process changes through a layered (Bronze → Silver) Delta Lake design.
The API returns data in JSON format, including daily open, high, low, close prices, and volume for IBM.
The dataset includes the latest 100 trading days. Data is updated every two to three days.
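For reference, the payload is expected to look roughly like the sketch below. The exact key names ("Time Series (Daily)", "1. open", etc.) are an assumption based on an Alpha Vantage-style TIME_SERIES_DAILY response; adjust them to match the actual API.

```python
# Assumed shape of the API's JSON response (Alpha Vantage-style key names are a guess)
sample_response = {
    "Meta Data": {
        "2. Symbol": "IBM",
        "3. Last Refreshed": "2024-01-05",
    },
    "Time Series (Daily)": {
        "2024-01-05": {
            "1. open": "160.00",
            "2. high": "162.50",
            "3. low": "159.10",
            "4. close": "161.80",
            "5. volume": "3500000",
        },
        # ... one entry per trading day, covering the latest ~100 days
    },
}
```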
What it does:
To ensure that business intelligence tools and dashboards display the most current data, we build a CDC (change data capture) pipeline that enables continuous, incremental updates by propagating only the changes instead of reloading complete datasets. We also keep all historical data in the Bronze staging table as the API's rolling 100-day window moves forward.
Bronze Layer
Bronze layer (workspace.bronze.ibm) — ingestion + history
- Connects to the API and ingests daily IBM stock data (JSON → tabular).
- Appends only new records on each scheduled run, based on a watermark (the max Date already in the Bronze table); see the sketch after this list.
- Keeps all historical data (Bronze is the long-term, append-friendly history layer).
Key idea: Bronze grows over time and preserves what was ingested in each run (historic retention).
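A minimal sketch of one scheduled run, assuming the payload shape above and a Databricks notebook where `spark` is predefined; the endpoint URL and JSON keys are assumptions:

```python
import requests
from pyspark.sql import functions as F

# Assumed endpoint; swap in the real API URL and key
API_URL = ("https://www.alphavantage.co/query"
           "?function=TIME_SERIES_DAILY&symbol=IBM&apikey=demo")

# 1. Pull the latest ~100 trading days and flatten JSON -> rows
raw = requests.get(API_URL, timeout=30).json()
series = raw["Time Series (Daily)"]  # assumed key; see the sample payload above
rows = [
    (day, float(v["1. open"]), float(v["2. high"]),
     float(v["3. low"]), float(v["4. close"]), int(v["5. volume"]))
    for day, v in series.items()
]
incoming = (
    spark.createDataFrame(rows, ["Date", "Open", "High", "Low", "Close", "Volume"])
         .withColumn("Date", F.to_date("Date"))
)

# 2. Watermark: keep only rows newer than the max Date already in Bronze
if spark.catalog.tableExists("workspace.bronze.ibm"):
    watermark = spark.table("workspace.bronze.ibm").agg(F.max("Date")).first()[0]
    if watermark is not None:
        incoming = incoming.filter(F.col("Date") > F.lit(watermark))

# 3. Append-only write, so Bronze keeps the full ingestion history
incoming.write.format("delta").mode("append").saveAsTable("workspace.bronze.ibm")
```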
Silver Layer
Silver layer (workspace.silver.ibm) — curated “latest version” per record
- Reads from the Bronze table.
- Applies cleaning/validation steps (casting types, dropping nulls on critical columns, de-duplication on the key).
- Loads into Silver using a Delta MERGE (upsert) keyed by Date; see the sketch after this list.
Key idea: Silver represents the latest version of each record (per Date):
- If a Date already exists → it is updated (latest values kept)
- If a Date is new → it is inserted
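A minimal sketch of the Silver upsert, assuming the Bronze schema above and that `workspace.silver.ibm` already exists as a Delta table:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Clean and validate Bronze: cast types, drop nulls on critical columns,
# and de-duplicate on the key so there is exactly one row per Date
updates = (
    spark.table("workspace.bronze.ibm")
         .withColumn("Date", F.col("Date").cast("date"))
         .withColumn("Close", F.col("Close").cast("double"))
         .dropna(subset=["Date", "Close"])
         .dropDuplicates(["Date"])
)

# MERGE (upsert) keyed by Date: update rows whose Date already exists,
# insert rows whose Date is new
silver = DeltaTable.forName(spark, "workspace.silver.ibm")
(
    silver.alias("t")
          .merge(updates.alias("s"), "t.Date = s.Date")
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
```

Because Bronze keeps every ingested row while Silver is de-duplicated per Date, re-running the job is idempotent: a re-ingested day simply overwrites its own row in Silver.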
