IBM ETL Project Overview:
This project builds an ETL pipeline that pulls IBM stock from a public API, stages the raw history in PostgreSQL, then prepares cleaner and analytics-ready layers for reporting.
The API returns data in JSON format, including daily open, high, low, close prices, and volume for IBM.
The dataset includes the latest 100 trading days.
Data is updated every two to three days.
Project Flow
1. Extract daily IBM stock data from the Alpha Vantage API.
2. Transform the API response into a pandas DataFrame.
3. Load new records into the staging table IBM_history.
4. Refresh the silver layer IBM to keep one clean latest record per Date.
5. Refresh the gold layer IBM_gold to calculate business metrics for analysis.
How Python and PostgreSQL Stored Procedures Work Together
This pipeline uses a split-responsibility approach:
1. Python orchestrates ETL stages.
– Loads local settings from .env.
– Calls the Alpha Vantage API and transforms the payload to a pandas DataFrame.
– Appends incremental records to the stage table IBM_history.
2. Python deploys SQL objects from IBM_StoredProcedure.sql.
– Creates indexes and target tables if they do not exist.
– Creates or replaces stored procedures for silver and gold refresh steps.
3. Python calls PostgreSQL procedures.
– Executes CALL refresh_ibm_silver();
– Executes CALL refresh_ibm_gold();
4. PostgreSQL performs set-based processing.
– refresh_ibm_silver() deduplicates stage records by Date using the newest Last_updated, then upserts into IBM.
– refresh_ibm_gold() calculates analytics metrics (daily return, 7-day MA, 30-day MA, 30-day volatility, 30-day volume MA, trend flag), then upserts into IBM_gold.
Why this design:
– Python is used for connectivity, API extraction, and orchestration.
– PostgreSQL is used for batch SQL logic and metric calculations close to the data.
– This keeps the pipeline easier to maintain and improves refresh performance.
Bronze Layer
– Table: IBM_history
– Purpose: store newly extracted API records before downstream processing
– Notes: this layer can contain repeated dates from multiple loads
Silver Layer
– Table: IBM
– Purpose: keep the latest clean record for each business date
– Logic: use an upsert process on Date and keep the newest Last_updated value
Gold Layer
– Table: IBM_gold
– Purpose: provide analytics-ready metrics from the silver table
– Metrics: daily return, 7-day moving average, 30-day moving average, 30-day volatility, and volume moving average
