Quick byte: Automating your Microsoft Fabric lakehouse maintenance programmatically

Learn how to optimize and vacuum your Microsoft Fabric lakehouse tables for better performance and efficiency. Keep your data running smoothly!


This is the first of my "quick bytes" - short, sharp insights fired off when needed. Today, we’re tackling something that might not be on your radar: maintaining your Microsoft Fabric lakehouse tables. Enough puns for the moment - let's go.

We were meant to do maintenance?

Fabric lakehouses are relatively low-maintenance compared to traditional databases, but they’re not entirely hands-off. If you want to keep performance smooth and storage efficient, there are still a few housekeeping tasks to handle.

The main two tasks we will focus on in this post are Optimize and Vacuum.

Microsoft have made it relatively straightforward to trigger these tasks via the Fabric UI, but at the time of writing you need to do this table by table, and it isn't a particularly quick process. Now imagine having separate bronze, silver, and gold lakehouses. That's a lot of clicking and waiting.

Before we get onto how to make the process easier, let's quickly cover what the Optimize and Vacuum operations do.

What Does Optimize Do?

When working with delta tables in Fabric, data gets written in small file chunks over time. This fragmentation can slow down queries.

Optimize compacts these small files into larger, more efficient ones. The benefits?

• Faster query performance by reducing the number of files that need to be scanned
• Better storage efficiency by reducing metadata overhead
• Improved partitioning for more effective data skipping

Running Optimize regularly is crucial for keeping lakehouse queries snappy and cost-effective.
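If you want to see the effect on a single table before automating anything, a one-off run from a notebook cell looks something like the sketch below - my_lakehouse.sales_orders is just a placeholder name:

# Compact the small files behind one Delta table (VORDER also applies V-Order sorting)
spark.sql('OPTIMIZE my_lakehouse.sales_orders VORDER')

# DESCRIBE DETAIL reports numFiles and sizeInBytes, so you can compare before and after
display(spark.sql('DESCRIBE DETAIL my_lakehouse.sales_orders'))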

What Does Vacuum Do?

Deleting data from delta tables doesn’t immediately free up storage. Instead, the underlying data files stick around, marked as removed (“tombstoned”) in the transaction log.

Vacuum clears out these obsolete files, preventing unnecessary storage bloat and improving performance. Key benefits:

• Reclaims storage space by permanently removing deleted data files
• Prevents a build-up of obsolete files from bloating the table
• Keeps the table's file footprint manageable for long-term stability

One important note: Vacuum permanently removes old data files, which also limits how far back you can time travel, so make sure your retention settings are configured correctly before running it.
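If you want to see what a Vacuum would do before committing to it, the sketch below shows how you might check a table's retention setting and do a dry run first - my_lakehouse.sales_orders is a placeholder, and DRY RUN lists the files Vacuum would remove without deleting anything:

# Inspect the table's properties - delta.deletedFileRetentionDuration controls Vacuum's default retention
display(spark.sql('SHOW TBLPROPERTIES my_lakehouse.sales_orders'))

# Optionally set an explicit retention period on the table (7 days here)
spark.sql("""
    ALTER TABLE my_lakehouse.sales_orders
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
""")

# DRY RUN lists the files that would be deleted, without actually removing them
display(spark.sql('VACUUM my_lakehouse.sales_orders RETAIN 168 HOURS DRY RUN'))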

Now that we understand why Optimize and Vacuum are important, let’s look at how to streamline these operations across multiple tables and lakehouses.

Performing maintenance using a Notebook

The process is split into three parts:

  1. Get a list of the lakehouses we want to run our operations on.
  2. Then get a list of tables from those lakehouses.
  3. Finally, run our Optimize and Vacuum commands on each table.

The commands we are going to run are simply:

OPTIMIZE lakename.tablename VORDER
VACUUM lakename.tablename RETAIN 168 HOURS -- You need to do this in hours. I don't know why. It's just the way it is.
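If you'd rather think in days than hours, it's easy enough to convert when building the statement. A quick sketch, using placeholder names:

# Convert a retention period expressed in days into the hours VACUUM expects
retention_days = 7
retention_hours = retention_days * 24  # 168

# my_lakehouse.my_table is a placeholder - swap in your own lakehouse and table
spark.sql(f'VACUUM my_lakehouse.my_table RETAIN {retention_hours} HOURS')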

But we are going to dynamically insert our lakehouses and tables into those queries and loop through them using functions.

Getting a list of lakehouses

# Populate lakehouses with list of all the lakehouses in the current workspace
lakehouses = spark.catalog.listDatabases()

# Create an array to populate the lakehouse names to
lakehouse_list = []

# Add the lakehouse names to the array, filtering out the ones we don't want
for lakehouse in lakehouses:
    if lakehouse.name not in ["Lakehouse_I_dont_want_to_maintain", "another_lakehouse_not_to_maintain"]:
        lakehouse_list.append(lakehouse.name)

print(lakehouse_list)

Note that we can exclude lakehouses we aren't interested in by adding them to the "not in" list above.

Get the tables in our lakehouse list

# Get all the tables in each of our lakehouses
for lake in lakehouse_list:
    tables = spark.sql(f'SHOW TABLES IN {lake}')
    tablenames = list(tables.toPandas()['tableName'])
    for tb in tablenames:
        print(tb)

Running this after the previous code will return a complete list of tables that live in the included lakehouses.

Putting it all together

TLDR? Create a notebook containing the code below. Don't forget to update the list of lakehouses you don't want to include and your Vacuum retention period.

def listlakes():
    # Return every lakehouse in the current workspace, minus any we want to skip
    lakehouses = spark.catalog.listDatabases()
    lakehouse_list = []
    for lakehouse in lakehouses:
        if lakehouse.name not in ["Lakehouse_I_dont_want_to_maintain", "another_lakehouse_not_to_maintain"]:
            lakehouse_list.append(lakehouse.name)
    return lakehouse_list

def cleanLakehouse(lakename):
    # Find every table in the lakehouse and run maintenance on each one
    tables = spark.sql(f'SHOW TABLES IN {lakename}')
    tablenames = list(tables.toPandas()['tableName'])
    for tb in tablenames:
        cleanTables(lakename, tb)

def cleanTables(lakename, tablename):
    # Compact small files (with V-Order) and remove files older than the retention period
    spark.sql(f'OPTIMIZE {lakename}.{tablename} VORDER')
    spark.sql(f'VACUUM {lakename}.{tablename} RETAIN 168 HOURS')

    print(f'\nTable {lakename}.{tablename} OPTIMIZE and VACUUM successful')

lakelist = listlakes()

for lake in lakelist:
    cleanLakehouse(lake)

Now add the notebook to a pipeline on a regular schedule and you are done.
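One optional tweak (just a sketch, not something the approach above requires): if you'd like the pipeline to control things per run, you could move the exclusion list and retention period into a cell marked as a parameter cell and have the functions read from those variables instead of the hard-coded values, so the pipeline's notebook activity can pass different values in.

# Hypothetical parameter cell - toggle "parameter cell" on this cell in the notebook
# so a pipeline notebook activity can override these values at run time
excluded_lakehouses = ["Lakehouse_I_dont_want_to_maintain", "another_lakehouse_not_to_maintain"]
retention_hours = 168  # 7 days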

Final Thoughts

While Microsoft Fabric handles a lot for you, keeping your lakehouse tables optimized ensures better performance and lower costs. Hopefully, this quick byte makes maintenance a little less painful.

If you’ve found a better way to automate this, I’d love to hear about it! Drop a comment or connect with me.