Tools: Databricks / PySpark / Pandas
API Description:
This is an open API to fetch the daily time series data for IBM stock.
The API returns data in JSON format, including daily open, high, low, close prices, and volume for IBM.
Each response covers only the latest 100 trading days.
Data is updated every two to three days.
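As a minimal sketch of how such a response could be flattened into rows, the snippet below parses a small sample payload. The key names ("Time Series (Daily)", "1. open", and so on) are an assumed layout, not confirmed by the description above, and the sample values are invented for illustration only.

```python
import json
from datetime import date

# Made-up sample payload. The nesting and key names are ASSUMPTIONS about
# the response layout; adjust them to match the real API output.
SAMPLE_PAYLOAD = """
{
  "Time Series (Daily)": {
    "2024-01-03": {"1. open": "101.0", "2. high": "102.5", "3. low": "100.5",
                   "4. close": "102.0", "5. volume": "2000"},
    "2024-01-02": {"1. open": "100.0", "2. high": "101.5", "3. low": "99.5",
                   "4. close": "101.0", "5. volume": "1500"}
  }
}
"""

# Assumed mapping from API field names to clean column names.
FIELD_MAP = {
    "1. open": "open",
    "2. high": "high",
    "3. low": "low",
    "4. close": "close",
    "5. volume": "volume",
}


def parse_daily_series(payload_text, series_key="Time Series (Daily)"):
    """Flatten the nested JSON into a list of row dicts sorted by trade date."""
    payload = json.loads(payload_text)
    rows = []
    for day, values in payload[series_key].items():
        row = {"trade_date": date.fromisoformat(day)}
        for api_name, column in FIELD_MAP.items():
            # Prices become floats, volume becomes an integer.
            caster = int if column == "volume" else float
            row[column] = caster(values[api_name])
        rows.append(row)
    rows.sort(key=lambda r: r["trade_date"])
    return rows


rows = parse_daily_series(SAMPLE_PAYLOAD)
for r in rows:
    print(r["trade_date"], r["close"], r["volume"])
```

The resulting list of dicts can be handed to a Pandas or Spark DataFrame constructor for the later steps.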
Incremental Ingestion – Append Only
To ensure that business intelligence tools and dashboards display the most current data, we will build a CDC (change data capture) pipeline that enables continuous, incremental updates to the data warehouse by propagating only the changes instead of reloading the complete dataset. We will also keep all historical data in our staging table as the API's 100-day window moves forward.
Timestamp-based CDC
We will use the latest date in the target table as a watermark: only incoming records dated after it are treated as new records to load.
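The timestamp-based selection can be sketched in plain Python as follows. The function name `select_new_records`, the `trade_date` column, and the sample rows are hypothetical and only illustrate the idea; in the real pipeline the same comparison would run against the target table.

```python
from datetime import date


def select_new_records(incoming, target):
    """Return the incoming rows dated after the latest date in the target."""
    if not target:
        # First load: the target is empty, so every incoming row is new.
        return list(incoming)
    watermark = max(row["trade_date"] for row in target)
    return [row for row in incoming if row["trade_date"] > watermark]


# Made-up rows for illustration only.
target = [
    {"trade_date": date(2024, 1, 2), "close": 100.0},
    {"trade_date": date(2024, 1, 3), "close": 101.0},
]
incoming = [
    {"trade_date": date(2024, 1, 3), "close": 101.0},  # already loaded
    {"trade_date": date(2024, 1, 4), "close": 102.0},  # new
    {"trade_date": date(2024, 1, 5), "close": 103.0},  # new
]

new_rows = select_new_records(incoming, target)
target.extend(new_rows)  # append only: existing rows are never rewritten
print([r["trade_date"].isoformat() for r in new_rows])
```

One consequence of a strict "later than the watermark" filter is that a correction to an already loaded date is not picked up; catching those requires matching on a key instead.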
Bronze Layer
History retention: the Bronze table preserves the complete history of the data in the warehouse, so:
Deleted records: Kept (no change).
New records: Added to the table.
Modified records: Updated.
Use a unique identifier (e.g., primary key) to match incoming records with existing ones.
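The three rules above can be sketched as a key-based upsert. This is an illustration under stated assumptions, not the pipeline's actual implementation: `trade_date` is assumed to be the unique identifier, and `upsert_by_key` and the sample rows are hypothetical. On Databricks the same semantics would more likely be expressed as a Delta Lake merge.

```python
def upsert_by_key(existing, incoming, key="trade_date"):
    """Merge incoming rows into existing rows, matching on a unique key.

    - Keys missing from `incoming` are kept untouched (history retention).
    - Keys not yet in `existing` are inserted.
    - Keys present in both are overwritten with the incoming row.
    """
    merged = {row[key]: row for row in existing}
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row in incoming:
        current = merged.get(row[key])
        if current is None:
            counts["inserted"] += 1
        elif current != row:
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
        merged[row[key]] = row
    # ISO date strings sort chronologically, so a plain sort is enough here.
    return [merged[k] for k in sorted(merged)], counts


# Made-up rows for illustration only.
existing = [
    {"trade_date": "2024-01-02", "close": 100.0},  # no longer in the API window
    {"trade_date": "2024-01-03", "close": 101.0},
]
incoming = [
    {"trade_date": "2024-01-03", "close": 101.5},  # modified value
    {"trade_date": "2024-01-04", "close": 102.0},  # new record
]

table, counts = upsert_by_key(existing, incoming)
print(counts)
```

In this sample the row for 2024-01-02 is absent from the incoming batch and still remains in the result, which is the behaviour described for deleted records.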

E: Extract Data

T: Transformation

L: Data Load

Silver Layer


