AirFlights Bronze Layer
Tools: Databricks | SerpAPI | Pandas | PyArrow | PySpark | Delta Lake | datetime
Description:
This layer is responsible for extracting and storing raw flight data, no transformations or business logic applied. It consists of two identical notebooks:
– The first notebook extracts outbound flights between the defined origin airports , in this Example :(DC metro — IAD, DCA, BWI) to destination airports (Florida — 5 airports)
– The second notebook runs the same logic to capture return flights.
Each notebook calls SerpAPI’s Google Flights engine across all airport combinations (15 API calls per run), parses the raw response into a flat pandas DataFrame enforced by a PyArrow schema, and writes the result to a Delta table in Databricks.
The raw fetched_at timestamp on every record ensures full traceability back to the moment of ingestion.
Saved as:
- Airflights_Outbound — DC → FL flights
- Airflights_Return — FL → DC flights
The configuration and setup
— SerpAPI is a service that scrapes Google search results (including Google Flights) and returns structured data SERPAPI_KEY is retrieved securely from Databricks Secrets rather than hardcoded
Using PyArrow to define the structure of Delta/Parquet table before any data is written. Defining the schema upfront enforces type safety .
Code Explanation
Three functions that together form the extract and load step of your Bronze ingestion pipeline.
extract_flights(origin, destination)
Fetches flight data for a single airport pair from SerpAPI.
Sends a request to SerpAPI’s Google Flights engine and gets back a dictionary of results. For each flight offer, it grabs segs[0] (first segment) and segs[-1] (last segment) .
Parsing the response:
SerpAPI returns two buckets of flights:
other_flights — remaining available options
best_flights — Google’s top recommended options
write_to_delta_table(df, table_name)
Converts the pandas DataFrame to Spark and writes it as a Delta table.
Using mode(overwrit) replaces the table on each run — appropriate for Bronze since you’re re-fetching fresh raw data.
extract_all_flights(origins, destinations)
Orchestrates the full extraction across all airport combinations.
The function accepts origins and destinations as parameters — the loop hardcodes FL_AIRPOSRTS and DC_AIRPORTS directly.
Collects all non-empty DataFrames, concatenates them, then writes once to Airflights_Outbound and Airflights_Return.
Execution Flow
extract_all_flights(FL_AIRPORTS, DC_AIRPORTS)
└── for each of 15 pairs (5 FL × 3 DC airports):
extract_flights(origin, destination)
└── SerpAPI call → parse best + other flights → pandas df
└── pd.concat all results
└── write_to_delta_table → "Airflights_Bronze_Return"
