Effortless Change Tracking in Delta Lake: A Guide to Delta Change Data Feed

Track changes in your Delta Lake tables using Delta Change Data Feed. Learn how to enable CDF for efficient incremental processing, ETL, and CDC.


In the world of data, keeping track of changes is crucial, especially when working with a medallion architecture. Whether you're dealing with refined (silver) or gold-tier data, understanding what changed, when, and why can significantly enhance data reliability, trust, and downstream processes.

Before I get into the solution at hand, I just want to call out that depending on your platform of choice (Databricks, Azure Synapse, Fabric, Snowflake, and so on), there will be a plethora of ways to achieve something similar to what I'm setting out to do here. There's no right or wrong; this is just an approach to a problem I had, and it worked for me, so I wanted to share it. One caveat: this solution only works if you're using Delta Lake tables.

In this case, that solution is Delta Change Data Feed (CDF), a feature designed specifically to help you track changes in your Delta Lake tables efficiently.

In this post, we’ll explore how Delta Change Data Feed can help you monitor modifications in medallion tables, making your data pipelines more robust and auditable.

The Problem

Before getting into the ins and outs of CDF, let me talk about my problem and what I was looking to achieve.

I'm old school. In the days of SQL Server data warehouse solutions, we used to build various log and audit tables to track what had changed, usually driven by TRY...CATCH blocks and the OUTPUT clause in the stored procedures that populated our dimensions and facts. These could sometimes be tricky and time-consuming to set up, but they were a good tool.

The world has moved on, but some organisations still like the concept of audit tables. As I've already said, they build trust and reliability, and they demonstrate what's changed and when in a simple way. So when moving into the new world of data platforms, as part of a proof-of-concept solution I was putting together, I still wanted to provide that trust using the newer technology: a nice audit table that stakeholders or data engineers can review to see, really easily, what has changed and when. Overall, it's useful for compliance, debugging, and governance. Specifically, I wanted to keep an audit of key tables in my gold layer, given that slowly changing dimensions weren't being implemented for the solution in question. OK, back to CDF.

What is Delta Change Data Feed (CDF)?

Delta Change Data Feed (CDF) is a Delta Lake feature that enables you to capture row-level changes (inserts, updates, and deletes) in a structured and efficient way. Instead of running expensive and time-consuming comparisons across entire datasets, CDF allows you to extract only the changed data since the last checkpoint.

With CDF, you can:

  • Identify inserts, updates, and deletes in a Delta table.
  • Process only changed records in downstream applications.
  • Improve efficiency and performance in ETL and data warehousing pipelines.
  • Enable incremental processing for machine learning, analytics, and reporting workflows.

Medallion Architecture Recap

Now, before jumping into the implementation, let’s refresh our memory on the Medallion Architecture:

  • Bronze Layer: Raw data ingestion, typically from source systems like logs, CDC streams, and APIs.
  • Silver Layer: Data cleaning, deduplication, transformations, and feature engineering.
  • Gold Layer: High-quality, aggregated, and business-consumable data for BI and analytics.

Each layer has its own data evolution over time, and tracking changes at each stage can help maintain data integrity, lineage, and operational efficiency. Personally, I wouldn't use CDF in a Bronze setting, but that's just my personal preference. I'm just laying out the facts for you here. Do with this what you wish!

Why Use Delta Change Data Feed in Medallion Tables?

I've already touched on this, but medallion tables undergo frequent updates due to incremental ingestions, transformations, and corrections. Manually tracking these changes can be painful and resource-intensive. Although I was really only trying to solve a single problem, using CDF provides several benefits:

  • Better Auditability: Track data lineage and compliance requirements.
  • Faster Incremental Loads: Process only what changed instead of full table scans.
  • Enhanced Streaming Workflows: Push changes to downstream consumers in near real-time.
  • Optimised Cost & Performance: Reduce compute and storage costs by avoiding redundant operations.

How to Enable Change Data Feed

To start using CDF, you need to enable it at the table level. Here’s how:

ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true);

That's it! Once enabled, Delta Lake will track row-level changes automatically for any table you've enabled CDF on.
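If you're creating the table from scratch, you can also switch CDF on at creation time rather than altering the table afterwards (the table definition here is just an example):

```sql
CREATE TABLE my_table (id INT, name STRING)
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
```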

Woah there! Before you get too excited and ahead of yourself, there's an important thing to understand about Delta tables. Every change you make (an insert, a delete, an update, a column addition, and so on) creates a new version of the table. This is important, because you'll need those version numbers to query your changes. It may be obvious, but it's worth saying: the version at which you enable CDF on an existing table almost certainly won't be version 0, and CDF only captures changes made from that point onwards.

To identify the history of your table, and also see those useful version numbers, you can run the following SQL:

DESCRIBE HISTORY my_table

This will output everything that's happened to your Delta table since its inception.

Querying Changes in Delta Tables

Once you have CDF enabled, extracting changed data is straightforward. Here’s an example of how you can query changes in a Delta table:

SELECT * FROM table_changes('my_table', 10);

This query fetches all changes from version 10 of the table onwards. The results include a _change_type column indicating whether a row was inserted, updated, or deleted (updates appear as update_preimage and update_postimage rows, showing the row before and after the change), along with _commit_version and _commit_timestamp columns telling you which commit produced each change.

Example Output:

  id  name     _change_type
  1   Alice    insert
  2   Bob      update_postimage
  3   Charlie  delete

Solving the Problem

So, if you recall the problem I was trying to solve when I first started using Change Data Feed, it was about recording audit information on certain key tables in the gold layer of my medallion lakehouse.

For this to work, I had to store where I was up to in logging the changes. I didn't want to reload all of the changes into my audit table each time I ran my ETL. So I created a watermark table. In here I would store the last version number for each required table.

For example:

CREATE OR REPLACE TABLE audit_watermark (
    table_name STRING NOT NULL,
    last_cdfversion BIGINT
);

Initially, I had to populate last_cdfversion manually, based on the version of the table at the point CDF was enabled. Then, on each ETL run, I would run a table_changes query (see the Querying Changes in Delta Tables section above), passing in the stored version number as a variable to get all the changes since the last run. Those records would be inserted into an audit table, and a final step would collect the table's max commit version and update the last_cdfversion field in the watermark table. And the cycle continues.
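In plain Python terms, the cycle looks roughly like the sketch below. This is a simplified, in-memory stand-in for what was really Spark SQL against Delta tables, and all the names (run_audit_cycle, gold_customer, and so on) are illustrative:

```python
# Simplified sketch of the watermark/audit cycle. Dicts and lists stand in
# for the Delta tables; all names here are hypothetical.

def run_audit_cycle(change_feed, watermark, audit_log, table_name):
    """Append changes newer than the stored watermark to the audit log,
    then advance the watermark to the latest commit version seen."""
    last_version = watermark[table_name]

    # Equivalent of table_changes('my_table', last_version + 1):
    # keep only commits made after the last audited version.
    new_changes = [c for c in change_feed if c["_commit_version"] > last_version]

    audit_log.extend(new_changes)

    if new_changes:
        watermark[table_name] = max(c["_commit_version"] for c in new_changes)

    return len(new_changes)

# Example run: version 3 was already audited, so only 4 and 5 are picked up
watermark = {"gold_customer": 3}
audit_log = []
feed = [
    {"id": 1, "_change_type": "insert", "_commit_version": 3},
    {"id": 2, "_change_type": "update_postimage", "_commit_version": 4},
    {"id": 3, "_change_type": "delete", "_commit_version": 5},
]

processed = run_audit_cycle(feed, watermark, audit_log, "gold_customer")  # → 2
```

After the run, the watermark for gold_customer sits at 5, so the next ETL cycle starts from version 6.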

💡
Given this was a POC, with more time I would have automated the initial entry of last_cdfversion by checking through the history of the table and grabbing the version number that represents the latest enabling of CDF.
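That automation could look something like this: walk the DESCRIBE HISTORY output (newest first) and pick the most recent SET TBLPROPERTIES commit that turned CDF on. The rows here are plain dicts mirroring the shape of the history output; treat the whole thing as an illustrative sketch rather than production code:

```python
def find_cdf_enable_version(history_rows):
    """Given DESCRIBE HISTORY rows (newest first), return the version of the
    most recent commit that enabled delta.enableChangeDataFeed, or None."""
    for row in history_rows:  # newest first, so the first match is the latest
        if row.get("operation") == "SET TBLPROPERTIES":
            props = row.get("operationParameters", {}).get("properties", "")
            if "delta.enableChangeDataFeed" in props and "true" in props:
                return row["version"]
    return None

# Example: history rows as they might come back from DESCRIBE HISTORY
history = [
    {"version": 12, "operation": "MERGE", "operationParameters": {}},
    {"version": 7, "operation": "SET TBLPROPERTIES",
     "operationParameters": {"properties": '{"delta.enableChangeDataFeed":"true"}'}},
    {"version": 0, "operation": "CREATE TABLE", "operationParameters": {}},
]

find_cdf_enable_version(history)  # → 7
```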

At this point, I hope you get the idea of what I was trying to achieve and how I went about it using CDF.

Other Practical Use Cases

Incremental ETL pipelines

Instead of processing the entire table, ETL jobs can use CDF to extract only modified records. This reduces compute time and improves overall efficiency.

SELECT * FROM table_changes('silver_table', 100) WHERE _change_type != 'delete';
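To make the incremental idea concrete, here's a small Python sketch (plain dicts standing in for tables) of applying a batch of change-feed rows to a downstream copy: inserts and update post-images are upserted, deletes are removed, and update pre-images are skipped. All names are illustrative:

```python
def apply_changes(target, changes):
    """Apply CDF-style change rows to a dict keyed by id: upsert inserts
    and update post-images, remove deletes, skip update pre-images
    (they describe the row as it was before the change)."""
    for row in changes:
        change = row["_change_type"]
        if change in ("insert", "update_postimage"):
            target[row["id"]] = row["name"]
        elif change == "delete":
            target.pop(row["id"], None)
    return target

# Example: a downstream copy catches up using only the changed rows
target = {1: "Alice", 2: "Bob"}
changes = [
    {"id": 2, "name": "Bobby", "_change_type": "update_postimage"},
    {"id": 3, "name": "Charlie", "_change_type": "insert"},
    {"id": 1, "name": "Alice", "_change_type": "delete"},
]

apply_changes(target, changes)  # target is now {2: "Bobby", 3: "Charlie"}
```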

Change Data Capture (CDC) for Streaming

Delta CDF can power CDC workflows where changes are continuously streamed to downstream consumers such as Apache Kafka, Event Hubs, or cloud storage for further processing.

# Read the change feed as a batch, starting from version 10
df_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)
    .load("/mnt/delta/silver_table")
)

# Process changes downstream
Historical Auditing & Debugging (Alternate method)

Rather than storing the data in a separate audit table, if you know the version number of the specific changes you want to review, you could use something like the following:

SELECT * FROM my_table VERSION AS OF 50;

Wrapping up

Delta Change Data Feed is a game-changer for tracking data evolution in Medallion architectures if you're using Delta. It simplifies incremental processing, CDC workflows, and data auditing, making your data pipelines more efficient and scalable. Whether you're building ETL pipelines, real-time analytics, or compliance solutions, leveraging CDF in your Delta Lake setup can significantly improve performance and reliability.

Key Takeaways:

  • Delta CDF helps track inserts, updates, and deletes in Medallion tables.
  • Reduces compute cost by processing only changed data.
  • Enables efficient incremental processing for ETL and streaming.
  • Enhances data governance with better auditing and lineage tracking.

If you’re already using Delta Lake, enabling Change Data Feed is a no-brainer. Try it out, and let me know how it enhances your workflows.