Microsoft Fabric Lakehouse - Time Travel - Looking into the Past

Amit Chandak
11 min read · Jul 23, 2024


Microsoft Fabric Lakehouse is a versatile data platform that brings all of your structured and unstructured data into a single location. Flexibility and scalability are designed in, making it easier for organizations to manage and analyze large volumes of data. With a broad set of tools and frameworks, you can process and analyze your data efficiently, and the platform integrates well with other data management and analytics tools. This gives you a single-stop solution for all of your data engineering and analytics needs.

Structured Data Format: Delta Parquet

Microsoft Fabric Lakehouse consolidates structured and unstructured data into one manageable location. To provide seamless access to data across all compute engines within Microsoft Fabric, Delta Lake is used as the unified table format. When you save structured data in the Lakehouse, whether through features like Load to Tables or the various options for getting data into the Fabric Lakehouse (Dataflow Gen2, Data Pipelines, etc.), it is stored in Delta format. This approach ensures efficiency, reliability, and scalability in handling your data.
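
As a quick illustration of this Delta-by-default behavior, here is a minimal PySpark sketch that writes a small DataFrame as a managed Lakehouse table. The table name sales_demo and the sample rows are hypothetical; saveAsTable persists the data as a Delta table.

# Minimal sketch: write a small DataFrame as a managed Lakehouse table.
# "sales_demo" is a hypothetical table name; managed tables in Fabric are stored in Delta format.
data = [("Laptop", 2, 1200.0), ("Mouse", 5, 25.0)]
df = spark.createDataFrame(data, ["Item", "Qty", "Price"])
df.write.mode("overwrite").saveAsTable("sales_demo")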

Microsoft Fabric Lakehouse also keeps a history of the data. In this article, we will discuss how to view historical data using PySpark and Spark SQL.

What is a Lakehouse?

Microsoft Fabric Lakehouse is an end-to-end data architecture platform developed to store, manage, and analyze structured and unstructured data in one place. It is a flexible and scalable solution that lets organizations process and analyze large volumes of data using different tools and frameworks, with easy integration into other data management and analytics tools.

Key Features:

Lakehouse SQL Analytics Endpoint:

  • Automatically creates a SQL analytics endpoint and a default semantic model, exposing a serving layer without extra setup.
  • It enables the direct interaction with Delta tables, making the whole experience from data ingestion to reporting as smooth as possible.
  • Note: This endpoint is read-only and only works with Delta tables.

Automatic Table Discovery and Registration:

  • Facilitates a managed file-to-table experience.
  • Automatically validates and registers files in supported structured formats (currently Delta) into the metastore with metadata.

Interacting with the Lakehouse:

  • Lakehouse Explorer: Main interaction page for loading, exploring data, setting labels, etc.
  • Notebooks: For reading, transforming, and writing data using code.
  • Pipelines: For data integration and pulling data from other sources.
  • Apache Spark Job Definitions: For developing and orchestrating Spark jobs.
  • Dataflows Gen 2: For data ingestion and preparation.

Multitasking Capabilities:

  • Enhanced multitasking experience with browser tab design for seamless switching between tasks.
  • Features include preserving running operations, retaining context, non-blocking list reload, and clear notifications.

The Lakehouse in Microsoft Fabric ensures an efficient, user-friendly, and inclusive data management experience, supporting various tasks and workflows for data engineers and scientists.

How to create a notebook?

The best way to open a notebook is from the Lakehouse. However, you can also open it from your workspace and choose the Lakehouse:

  1. Go to the left pane and select “Workspace.”
  2. Choose your workspace.
  3. In the top left, click “New,” then “More.”
  4. Under “Data Engineering,” select “Notebooks.”

From Lakehouse:

  1. Go to the left pane and select “Workspace.”
  2. Choose your workspace.
  3. Select “Lakehouse Explorer.”
  4. In the top menu, click “Open Notebook.”
  5. Open a new or existing notebook.

In this blog, we will focus on functions and commands needed for time travel in Lakehouse.

You can also watch the same content in the video below.

We will use a file from my GitHub account: Sales Data for Fabric.xlsx. This file is specifically designed with no spaces in the column headers, allowing you to save it directly to Lakehouses without renaming. I have already loaded it into my Lakehouse Lake02 in the "01-GA-Fabric" workspace.

I created my notebook from the new notebook option in the Lakehouse, renamed it "Time Travel in Lakehouse", and selected the Lakehouse.

Let's start by displaying the records of the sales_delta table.

df = spark.sql("SELECT * FROM Lake02.sales_delta LIMIT 100")
display(df)

spark.sql("SELECT * FROM Lake02.sales_delta LIMIT 100"):

  • This code is using the spark.sql method to execute a SQL query.
  • spark.sql allows you to run SQL queries directly against tables registered in the Spark catalog and returns the result as a DataFrame.
  • The query "SELECT * FROM Lake02.sales_delta LIMIT 100" is a standard SQL query that selects all columns (*) from the table Lake02.sales_delta and limits the result to the first 100 rows.
  • Lake02.sales_delta is the table in the Lakehouse from which data is being queried.

df:

  • The result of the SQL query is assigned to the variable df. This df is a DataFrame containing the first 100 rows of data from the Lake02.sales_delta table.

display(df):

  • This function is used to display the content of the DataFrame df in a tabular format.
  • It is often used in notebook environments (such as Microsoft Fabric notebooks) to visualize the DataFrame.

We can also use Spark SQL for the same.

%%sql 
SELECT * FROM Lake02.sales_delta LIMIT 100

%%sql:

  • This is a cell magic command that tells the notebook to interpret the contents of the cell as SQL code.
  • Everything written in the cell after %%sql is considered SQL and is executed as such.

SELECT * FROM Lake02.sales_delta LIMIT 100:

  • This SQL query selects all columns (*) from the table Lake02.sales_delta and limits the result to the first 100 rows.

To check the difference between data at different timestamps, we will use sum(Qty) and sum(Qty*Price) as reference values. Let us run the SQL query and note down the values.

%%sql  
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta

SELECT sum(Qty) qty, sum(Qty*Price) Gross:

  • This SQL query selects two calculated columns from the Lake02.sales_delta table:
  • sum(Qty) qty: This calculates the total quantity (Qty) of all items and gives it the alias qty.
  • sum(Qty*Price) Gross: This calculates the total gross sales by multiplying the quantity (Qty) by the price (Price) for each item and then summing the results. This is given the alias Gross.

Let us look at the history of changes (commits) made to the Delta table. I have already updated and restored it once, so we should see some history.

%%sql 
DESCRIBE HISTORY sales_delta

DESCRIBE HISTORY sales_delta:

  • This SQL command is used to show the history of changes (commits) made to the Delta table sales_delta.
  • DESCRIBE HISTORY provides a log of operations performed on the table, such as inserts, updates, and deletes.

To demonstrate the Delta Table’s time travel feature and how it maintains historical data, we will update specific records in our sales_delta table. This update will allow us to observe changes and revert to previous states if needed. Let’s proceed with modifying the quantity for entries where it is currently set to 1.

%%sql  
update sales_delta
set Qty = 5
where Qty = 1

  • update sales_delta:
  • This SQL command specifies that we want to update records in the sales_delta table.
  • set Qty = 5:
  • This part of the SQL command sets the value of the Qty column to 5 for the rows that meet the condition specified in the where clause.
  • where Qty = 1:
  • This condition specifies that only the rows where the Qty column has a value of 1 should be updated.

Now, let’s explore the history of our Delta Table to observe the changes and see how time travel is supported. We will retrieve the historical metadata for the sales_delta table.

%%sql 
DESCRIBE HISTORY sales_delta

Let us have a look at the current data of the sales_delta table.

%%sql  
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta

Now let us look at the data just before the current update, by querying the table as of an earlier timestamp.

%%sql  
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta TIMESTAMP AS OF '2024-07-07T14:47:00Z'

We can also use the version to retrieve data from the past. To analyze the initial state of our Delta Table, we’ll perform a query on the very first version of the sales_delta table.

df = spark.sql("SELECT sum(Qty) AS qty, sum(Qty * Price) AS Gross FROM Lake02.sales_delta VERSION AS OF 0")
display(df)

FROM Lake02.sales_delta VERSION AS OF 0:

  • This specifies that the query should be run against version 0 of the Lake02.sales_delta table. Delta tables keep a version history of all changes, and this clause allows you to query a specific version of the table.

Version 3 is the current version.
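
To double-check, you can run the same aggregate against the latest version explicitly. A short sketch, assuming version 3 is the most recent version shown in your DESCRIBE HISTORY output; the result should match the current data.

# Query the latest version explicitly (assumes version 3 is current in your history).
df = spark.sql("SELECT sum(Qty) AS qty, sum(Qty * Price) AS Gross FROM Lake02.sales_delta VERSION AS OF 3")
display(df)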

We can also use PySpark to get the history of our Delta Table and visualize the changes over time. We will use the Delta Lake API to retrieve the full history of operations performed on the sales_delta table, which provides a detailed log of all modifications.

from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "Lake02.sales_delta")
history_df = delta_table.history()
display(history_df)

from delta.tables import DeltaTable:

  • This code imports the DeltaTable class from the delta.tables module. This class provides functionality to interact with Delta tables.

delta_table = DeltaTable.forName(spark, "Lake02.sales_delta"):

  • This creates an instance of DeltaTable for the table named Lake02.sales_delta.
  • The forName method is used to reference an existing Delta table by its name. The spark session is passed as an argument to this method.

history_df = delta_table.history():

  • This retrieves the history of changes (commits) made to the Lake02.sales_delta table and returns it as a DataFrame.
  • The history method provides a log of operations performed on the Delta table, such as inserts, updates, and deletes.
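
If the full history output is too wide, you can keep only the columns that matter most for time travel. A small optional sketch:

# Optional: show only the columns most useful for time travel.
display(history_df.select("version", "timestamp", "operation", "operationParameters"))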

We will now use PySpark to explore the time travel feature by reading the sales_delta table as it existed at a specific point in time. This allows us to analyze the state of the data as of July 7, 2024, at 13:35 UTC.

df = spark.read.option("timestampAsOf", "2024-07-07T13:35:00Z").table("sales_delta")

spark.read:

  • This initiates a read operation using Spark’s DataFrameReader.

.option("timestampAsOf", "2024-07-07T13:35:00Z"):

  • This sets an option for the read operation, specifying that you want to read the table as it was at a specific point in time.
  • The timestampAsOf option is used to query the Delta table at the state it was at a particular timestamp.
  • "2024-07-07T13:35:00Z" is the specified timestamp (in ISO 8601 format, with "Z" indicating UTC time) as of which you want to read the table.

.table("sales_delta"):

  • This specifies the name of the Delta table to be read, in this case, sales_delta.

To verify the total quantity at the specified point in time, we will aggregate the Qty column and print the result. This will help us confirm the state of our data as of the given timestamp.

print(df.agg({"Qty": "sum"}).collect()[0][0])

df.agg({"Qty": "sum"}):

  • df: This is the DataFrame created from the sales_delta table as of the specified timestamp.
  • .agg({"Qty": "sum"}): The agg function is used to perform aggregate calculations on the DataFrame. In this case, it is creating a dictionary {"Qty": "sum"} that specifies you want to calculate the sum of the Qty column.

.collect():

  • The collect method retrieves the aggregated result from the executors to the driver as a list of Rows. This is useful for getting the actual data out of the DataFrame after aggregation.

[0][0]:

  • [0]: Accesses the first Row in the list returned by collect.
  • [0]: Accesses the first element of the Row, which is the sum of the Qty column.

print(...):

  • This prints the value retrieved in the previous steps, which is the sum of the Qty column.

Next, we will read the sales_delta table as it existed in version 2. By aggregating the Qty column and printing the result, we can verify the total quantity at this specific version, allowing us to track changes over time.

df = spark.read.option("versionAsOf", "2").table("sales_delta")
print(df.agg({"Qty": "sum"}).collect()[0][0])

To demonstrate the ability to revert our Delta Table to a previous state, we will restore the sales_delta table to its original version. This action allows us to roll back any changes and return the table to its initial condition.

%%sql
RESTORE TABLE Lake02.sales_delta TO VERSION AS OF 0
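
The same restore can also be done with the Delta Lake Python API. A minimal sketch of that alternative, which the bullets below walk through:

from delta.tables import DeltaTable

# Restore the table to version 0 using the Delta Lake Python API.
delta_table = DeltaTable.forName(spark, "Lake02.sales_delta")
delta_table.restoreToVersion(0)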

from delta.tables import DeltaTable:

  • Imports the DeltaTable class, which provides functionality to interact with Delta tables.

delta_table = DeltaTable.forName(spark, "Lake02.sales_delta"):

  • Creates an instance of DeltaTable for the table named Lake02.sales_delta.

delta_table.restoreToVersion(0):

  • Restores the Lake02.sales_delta table to version 0 using the restoreToVersion method.

After restoring the sales_delta table to its original version, we will read the table again and aggregate the Qty column to confirm that the table has been successfully reverted to its initial state.
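
A short sketch of that check, reusing the same aggregation pattern as before; the total should match what we saw for version 0.

# Read the restored table and confirm the total quantity matches version 0.
df = spark.read.table("sales_delta")
print(df.agg({"Qty": "sum"}).collect()[0][0])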

To verify and understand the changes made, including our recent restoration, we will review the historical metadata of the sales_delta table again. This will provide a comprehensive log of all operations and help us confirm the restoration process.

%%sql 
DESCRIBE HISTORY sales_delta

Microsoft Fabric Lakehouse offers a robust and scalable solution for managing both structured and unstructured data, integrating seamlessly with various data tools. The use of Delta Lake as a unified table format ensures efficient data handling. The time travel feature in Microsoft Fabric Lakehouse allows users to view and revert to historical data states, providing flexibility in data management. By leveraging PySpark and Spark SQL, users can efficiently interact with and analyze historical data, demonstrating the platform’s capabilities in maintaining and exploring data history. This feature is crucial for data reliability and recovery, making Microsoft Fabric Lakehouse a powerful tool for comprehensive data management and analysis.

Also, refer to:

Complete Power BI in one Video in 11 hours

Mastering Power BI: 230+ Videos

Expertise Power BI: 150+ Videos

Power BI 50 Interview Questions and 10 Advanced Use Cases

My Medium blogs can be found here if you are interested

Click here to access all my blogs and videos in a jiffy via an exclusive glossary using Power BI. Please like, share, and comment on these blogs. I would appreciate your suggestions for improvement, challenges, and topics that I can explore in more depth.

In addition, I have over 750 videos on my YouTube channel related to Power BI, Tableau, and Incorta. With the help of these videos, you will learn hundreds of DAX and Power Query functions, along with hundreds of Power BI use cases that will help you become an expert in Power BI. Make sure you subscribe, like, and share them with your friends.

Master Power BI
Expertise in Power BI
Power BI For Tableau User
DAX for SQL Users
Learn SQL


Amit Chandak

Amit has 20+ years of experience in Business Intelligence, Analytics, and Data Science. He is Chief Analytics Officer at Kanerika and a Power BI Community Super User.