Data Quality Checks - Rana Nasri Ghazzi

Quality Checks:

1. Lakeflow Declarative Pipelines expectations:

Expectations are optional clauses in pipeline materialized view, streaming table, or view creation statements that apply data quality checks on each record passing through a query. Expectations use standard SQL Boolean statements to specify constraints. You can combine multiple expectations for a single dataset and set expectations across all dataset declarations in a pipeline.

Behavior on Violation:

Specify an action to determine what happens when a record fails the validation check. The following table describes the available actions:

dp.expect_or_fail – This expectation causes a failure of a single flow

dp.expect_or_drop – Invalid records are dropped before data is written to the target

dp.expect – Invalid records are written to the target.

The key advantage over external quality tools is that the check happens at write time inside the pipeline, not as a separate job after the fact. Bad data never reaches the next layer.
You can also implement advanced logic to quarantine invalid records without failing or dropping data.
You can group multiple expectations together and specify collective actions using the functions expect_all, expect_all_or_drop, and expect_all_or_fail.
Note: @dp pipelines can’t run in a regular notebook — only streaming tables, materialized views, and temporary views support expectations, data quality metrics are supported only for these object type.

2. Manage data quality – Regular Delta Tables

for regular delta table No built-in expectation framework. You have to enforce quality manually three ways:

Option 1 — Filter before writing

df_clean = df.filter(“amount > 0 AND customer_id IS NOT NULL”) df_clean.write.format(“delta”).mode(“append”).saveAsTable(“silver_transactions”)\ Simple but silent — violations are just dropped with no logging.

Option 2 — Write bad records to a quarantine table

Quarantining separates good data from bad data during pipeline execution — good rows flow into the target table while bad rows are stored separately for review and reprocessing. Once a quarantined record is reviewed and corrected, it can be rerun through the pipeline and inserted into the target table.

Databricks provides two built-in quarantine mechanisms:

1. badRecordsPath — captures bad records or files encountered during data loading and writes them to a specified path:

spark.read \    
.option("badRecordsPath", "/Volumes/workspace/quarantine/ibm") \    .parquet("/Volumes/workspace/bronze/landing_zone/ibm_landing")

2. Auto Loader rescued data column — any data that does not match the target schema is automatically captured in a _rescued_data column instead of failing or being silently dropped:

stream_df = spark.readStream \   
 .format("cloudFiles") \    
.option("cloudFiles.schemaEvolutionMode", "rescue") \    .load("/Volumes/workspace/bronze/landing_zone/ibm_landing")

3. Custom quarantine pattern — split good and bad rows explicitly:

good_rows = source_casted.filter(col("Close") > 0)
bad_rows  = source_casted.filter(col("Close") <= 0)
good_rows.write.mode("append").table(f"{catalog}.silver.ibm")
bad_rows.write.mode("append").table(f"{catalog}.silver.ibm_quarantine")

Option 3 — Use Great Expectations or Soda

External libraries that add @dp -like rule definitions to regular Spark pipelines.- More setup but works outside Databricks too.
Bottom line: @dp is the cleaner approach if you’re fully on Databricks. Regular Delta gives you more portability but you own all the quality instrumentation yourself

3. Delta table constraints

Delta table constraints are an automated mechanism for enforcing data quality and integrity by verifying that any data added to a table meets specific rules. Unlike passive monitoring, enforced constraints strictly prevent “bad” data from being committed to the table; if a record violates a constraint, the operation fails

Core Enforced Constraints:

Delta Lake currently supports two types of strictly enforced constraints for data quality:

NOT NULL Constraints: These ensure that specific columns never contain null values.
CHECK Constraints: These ensure that every row satisfies a specific Boolean SQL expression.

Informational Constraints:

Databricks also support Primary Key and Foreign Key constraints.

Non-Enforced: These are typically informational only and are not strictly enforced during writes.
Query Optimization: While they don’t block bad data, they help query engines optimize joins and aggregations.

Comparison: Constraints vs. DLT Expectation

Expectations provide a more flexible alternative to standard constraints:

Expectations: Allow for multiple “actions” on failure—warning (keep record), dropping the row, or failing the entire pipeline.
Standard Constraints: Always result in a “Fail Update” behavior, making them better for mission-critical data where no errors can be tolerated

Comparison to other strategies:

	Constraints	Quarantine	DLT expect
Bad row behavior	Fails batch	Saves separately	Configurable
Audit trail	✅ Table metadata	✅ Quarantine table	✅ DLT metrics
Granularity	Row level	Row level	Row level
Pipeline stops?	✅ Yes	❌ No	Configurable

Schema handling:

Schema handling in Databricks primarily revolves around ensuring data quality while allowing for structural changes over time. It is managed through two main mechanisms: Schema Enforcement (preventing bad data) and Schema Evolution (adapting to new structures).

Core Schema Mechanisms

Schema Enforcement (Validation): By default, Delta Lake enforces that all data written to a table matches the existing schema. It rejects any writes containing extra columns or incompatible data types to prevent “data pollution”.
Schema Evolution: This allows a table’s schema to change automatically to accommodate new data. It is often triggered during append or overwrite operations by setting the .option(“mergeSchema”, “true”).ction.

Common Handling Strategies

Inference & Auto Loader: When ingesting data from cloud storage, Auto Loader can automatically infer schemas and detect changes. It uses a schemaLocation to track structural versions over time.
Rescued Data Column: To prevent data loss when a schema mismatch occurs, you can enable a “rescued data column” which captures unexpected fields or type mismatches in a JSON blob for later inspection.
Explicit Schema Definition: For critical production tables, engineers often explicitly define the schema using DDL (Data Definition Language) or pyspark.sql.types.StructType to maintain strict control.