Why Observability Matters in Data Engineering

Observability in data engineering is the ability to understand the internal state of your entire data platform — not just knowing something broke, but knowing why, where, when, and what was affected — before your users or business stakeholders notice.

The Core Problem It Solves

Data pipelines are inherently complex. Data flows through dozens of systems, transformations, and dependencies. Without observability you are essentially flying blind:

A job silently produces wrong data instead of failing
A cost spike goes unnoticed until the monthly bill arrives
A slow query kills performance but no one knows which team caused it
A table gets dropped and lineage is lost forever
Compliance requires an audit trail but no logs exis

Observability flips this — from reactive firefighting to proactive intelligence.

The 3 Pillars of Data Observability

1. 📊 Metrics

Quantitative measurements over time — query durations, job success rates, DBU consumption, row counts, cost per pipeline.

2. 📋 Logs

Detailed event records — who ran what query, when a job started/failed, what permissions changed.

3. 🔍 Traces

End-to-end tracking of a data record as it moves through the system — from raw ingestion through bronze → silver → gold.

Observability Strategies in Databricks:

Strategy 1 — Unity Catalog System Tables (Native)

System tables are essentially a built-in observability layer — useful for monitoring costs, auditing access, tracking data lineage, and managing compute resources without any custom instrumentation.

Table Name	Purpose
`system.access.assistant_events`	Tracks user messages sent to the Databricks Assistant
`system.access.audit`	Audit events from all workspaces in your region
`system.access.table_lineage`	Records every read/write event on a Unity Catalog table or path
`system.billing.list_prices`	Historical log of SKU pricing changes
`system.billing.usage`	All billable usage across your account
`system.compute.warehouse_events`	SQL warehouse lifecycle events (start, stop, scale)
`system.compute.warehouses`	Full configuration history for SQL warehouses
`system.data_classification.results`	Column-level sensitive data detections across catalogs
`system.lakeflow.job_run_timeline`	Start and end times of job runs
`system.lakeflow.job_tasks`	All job tasks running in the account
`system.lakeflow.jobs`	All jobs created in the account
`system.query.history`	All queries run on SQL warehouses and serverless compute

Best for: Cost monitoring, governance, compliance, job health, query analysis.

Strategy 2 — Delta Table Audit Logging (Medallion Layer)

Build a monitoring data lake on top of system tables using the medallion architecture.

Bronze → Raw system table snapshots (incremental, append-only)
Silver → Cleaned, joined, enriched metrics
Gold → Aggregated KPIs ready for dashboards and alerts

This decouples monitoring layer from live system tables, improves query performance, and creates a historical record you own and control.

Strategy 3 — Databricks Lakehouse Monitoring

To draw useful insights from your data, you must have confidence in the quality of your data. Monitoring your data provides quantitative measures that help you track and confirm the quality and consistency of your data over time.

Attach a monitor to any Delta table

To monitor a table in Databricks, you create a monitor attached to the table. To monitor the performance of a machine learning model, you attach the monitor to an inference table that holds the model’s inputs and corresponding predictions.

This automatically tracks:

Schema drift — did a column change type or disappear?
Data distribution shifts — are values suddenly out of range?
Null rates — are critical columns going empty?
Row count anomalies — did a pipeline produce zero rows?

Strategy 4 — Structured Streaming Metrics

Structured Streaming Metrics are used for observability—the telemetry data generated by Apache Spark engines to monitor the health and performance of real-time data pipelines

Where you access this — two options:

Option 1 — Spark UI (visual, built-in) : You see graphs of throughput, batch duration, and input rate updating live — no code needed.

Option 2 — Code (programmatic): Query lastProgress

Spark’s StreamingQuery.lastProgress gives a live snapshot of your streaming job’s throughput and latency after each micro-batch.

python

query = df.writeStream.format(“console”).start()

# After each micro-batch, inspect the metrics:

import time
time.sleep(10)
print(query.lastProgress)

Limitation:

StreamingQuery.lastProgress exists purely in memory inside the Spark driver process, storing only the most recent micro-batch snapshot. It does not survive query restarts or cluster shutdowns. On Databricks, however, streaming metrics are automatically persisted to system tables, so the data is not lost when a query stops.

What gets saved automatically on Databricks:

Databricks automatically logs structured streaming metrics into Unity Catalog system tables. Two tables are relevant: system.compute.query_history stores query-level metadata, while system.lakeflow.job_run_timeline captures job and pipeline run information. For Delta Live Tables pipelines specifically, Spark writes richer streaming metrics into a dedicated event_log table that you can query directly like any Delta table.

StreamingQueryListener:

For raw Structured Streaming jobs, Spark does not automatically save metrics anywhere permanent. StreamingQueryListener solves this — it is a built-in Spark hook that fires automatically after every micro-batch completes, handing you the full progress payload so you can forward it wherever you need: a Delta table, a logging system, or an alerting tool — without any manual polling.

# Option 1 — write to Delta manually in foreachBatch

def log_metrics(batch_df, batch_id):
progress = query.lastProgress
spark.createDataFrame([progress]).write \
.mode(“append”).saveAsTable(“monitoring.stream_metrics”)

#Option 2 — use a StreamingQueryListener

class MetricsListener(StreamingQueryListener):
def onQueryProgress(self, event):
progress = event.progress
# save to Delta, send to Slack, push to Prometheus, etc.

spark.streams.addListener(MetricsListener())

Strategy 5 — Alerting & Notification Pipelines

Observability is only complete when it triggers action automatically. Using the Databricks SQL Alert API, you can run a query against system tables on a schedule and fire an alert the moment a threshold is crossed

For example,

Counting failed jobs in the last hour from system.lakeflow.job_run_timeline and notifying your team if that count exceeds three. These alerts can chain directly into webhooks that post to Slack, PagerDuty, or Microsoft Teams, or they can auto-trigger a remediation notebook to respond without any human intervention.

Observability in Databricks