Microsoft Fabric Lakehouse - Time Travel: Looking into the Past
Microsoft Fabric Lakehouse is a versatile data platform that brings all of your structured and unstructured data into a single location. It is designed to be flexible and scalable, making it easier for organizations to manage and analyze large volumes of data. With a broad set of tools and frameworks, you can process and analyze your data efficiently, and it integrates well with other data management and analytics tools, giving you a one-stop solution for your data engineering and analytics needs.
Structured Data format — Delta Parquet
Microsoft Fabric Lakehouse consolidates structured and unstructured data into one manageable location. To provide seamless access to data across all compute engines within Microsoft Fabric, Delta Lake is used as the unified table format. When saving structured data in the Lakehouse — whether through features like Load to Tables or the other options for getting data into the Fabric Lakehouse (Dataflow Gen2, Data Pipelines, etc.) — all data is stored in Delta format. This approach ensures efficiency, reliability, and scalability in handling your data.
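For example, here is a minimal sketch of writing a Spark DataFrame to a Lakehouse table from a notebook; the DataFrame name df_sales is illustrative, and the table is persisted in Delta format:
# Illustrative sketch: df_sales is assumed to be an existing Spark DataFrame.
# Saving it as a Lakehouse table stores the data in Delta format.
df_sales.write.format("delta").mode("overwrite").saveAsTable("sales_delta")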
Microsoft Fabric Lakehouse also keeps a history of the data. In this article, we will discuss how to view historical data using PySpark and Spark SQL.
What is a Lakehouse?
Microsoft Fabric Lakehouse is an end-to-end data architecture platform developed to store, manage, and analyze structured and unstructured data in one place. It is a flexible and scalable solution that lets organizations process and analyze large volumes of data using different tools and frameworks, with easy integration into other data management and analytics tools.
Key Features:
Lakehouse SQL Analytics Endpoint:
- A SQL analytics endpoint and a default semantic model are created automatically, exposing a serving layer over the Lakehouse data.
- This enables direct interaction with Delta tables, making the experience from data ingestion to reporting as smooth as possible.
- Note: This endpoint is read-only and only works with Delta tables.
Automatic Table Discovery and Registration:
- Facilitates a managed file-to-table experience.
- Automatically validates and registers files in supported structured formats (currently Delta) into the metastore with metadata.
Interacting with the Lakehouse:
- Lakehouse Explorer: Main interaction page for loading, exploring data, setting labels, etc.
- Notebooks: For reading, transforming, and writing data using code.
- Pipelines: For data integration and pulling data from other sources.
- Apache Spark Job Definitions: For developing and orchestrating Spark jobs.
- Dataflows Gen 2: For data ingestion and preparation.
Multitasking Capabilities:
- Enhanced multitasking experience with browser tab design for seamless switching between tasks.
- Features include preserving running operations, retaining context, non-blocking list reload, and clear notifications.
The Lakehouse in Microsoft Fabric ensures an efficient, user-friendly, and inclusive data management experience, supporting various tasks and workflows for data engineers and scientists.
How to create a notebook?
The best way to open a notebook is from the Lakehouse. However, you can also open it from your workspace and choose the Lakehouse:
- Go to the left pane and select “Workspace.”
- Choose your workspace.
- In the top left, click “New,” then “More.”
- Under “Data Engineering,” select “Notebooks.”
From Lakehouse:
- Go to the left pane and select “Workspace.”
- Choose your workspace.
- Select “Lakehouse Explorer.”
- In the top menu, click “Open Notebook.”
- Open a new or existing notebook.
In this blog, we will focus on functions and commands needed for time travel in Lakehouse.
You can also watch the same walkthrough in the video below.
We will use a file from my GitHub account: Sales Data for Fabric.xlsx. This file is specifically designed with no spaces in the column headers, allowing you to save it directly to Lakehouse tables without renaming the columns. I have already loaded it into the Lake02 Lakehouse in my “01-GA-Fabric” workspace.
I created my notebook from the new notebook option in the Lakehouse, renamed it “Time Travel in Lakehouse”, and selected the Lakehouse.
Let's start by displaying the records of the sales_delta table.
df = spark.sql("SELECT * FROM Lake02.sales_delta LIMIT 100")
display(df)
spark.sql("SELECT * FROM Lake02.sales_delta LIMIT 100")
:
- This code is using the
spark.sql
method to execute a SQL query. spark.sql
allows you to run SQL queries directly on your Spark DataFrame.- The query
"SELECT * FROM Lake02.sales_delta LIMIT 100"
is a standard SQL query that selects all columns (*
) from the tableLake02.sales_delta
and limits the result to the first 100 rows. Lake02.sales_delta
is the table in the Lakehouse from which data is being queried.
df
:
- The result of the SQL query is assigned to the variable
df
. Thisdf
is a DataFrame containing the first 100 rows of data from theLake02.sales_delta
table.
display(df)
:
- This function is used to display the content of the DataFrame
df
in a tabular format. - It is often used in notebook environments (like Databricks notebooks) to visualize the DataFrame.
We can also use Spark SQL for the same.
%%sql
SELECT * FROM Lake02.sales_delta LIMIT 100
%%sql:
- This is a cell magic command that tells the notebook to interpret the contents of the cell as SQL code.
- Everything written in the cell after %%sql is treated as SQL and executed as such.

SELECT * FROM Lake02.sales_delta LIMIT 100:
- This SQL query selects all columns (*) from the table Lake02.sales_delta and limits the result to the first 100 rows.
To check the difference between data at different timestamps, we will compare the totals of Qty and Qty * Price. Let us run the SQL query and note down the values.
%%sql
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta
SELECT sum(Qty) qty, sum(Qty*Price) Gross:
- This SQL query selects two calculated columns from the Lake02.sales_delta table:
- sum(Qty) qty: calculates the total quantity (Qty) of all items and gives it the alias qty.
- sum(Qty*Price) Gross: calculates the total gross sales by multiplying the quantity (Qty) by the price (Price) for each row and summing the results. This is given the alias Gross.
Let us look at the history of changes (commits) made to the Delta table. I have already updated and restored it once, so we should see some history.
%%sql
DESCRIBE HISTORY sales_delta
DESCRIBE HISTORY sales_delta:
- This SQL command shows the history of changes (commits) made to the Delta table sales_delta.
- DESCRIBE HISTORY provides a log of operations performed on the table, such as inserts, updates, and deletes.
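If you prefer PySpark, you can load the same history into a DataFrame and keep just the most useful columns. A minimal sketch (the column names version, timestamp, operation, and operationParameters come from the standard Delta history schema):
# Sketch: load the table history into a DataFrame and show only the key columns.
history = spark.sql("DESCRIBE HISTORY sales_delta")
display(history.select("version", "timestamp", "operation", "operationParameters"))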
To demonstrate the Delta Table’s time travel feature and how it maintains historical data, we will update specific records in our sales_delta table. This update will allow us to observe changes and revert to previous states if needed. Let’s proceed with modifying the quantity for entries where it is currently set to 1.
%%sql
update sales_delta
set Qty = 5
where Qty = 1
update sales_delta:
- This SQL command specifies that we want to update records in the sales_delta table.

set Qty = 5:
- This part of the SQL command sets the value of the Qty column to 5 for the rows that meet the condition specified in the where clause.

where Qty = 1:
- This condition specifies that only the rows where the Qty column has a value of 1 should be updated.
Now, let’s explore the history of our Delta table to observe the changes and see how time travel is supported. We will retrieve the historical metadata for the sales_delta table.
%%sql
DESCRIBE HISTORY sales_delta
Let us have a look at the current data of the sales_delta table.
%%sql
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta
Data just before the current update:
%%sql
SELECT sum(Qty) qty, sum(Qty*Price) Gross FROM Lake02.sales_delta TIMESTAMP AS OF '2024-07-07T14:47:00Z'
We can also use the version number to retrieve data from the past. To analyze the initial state of our Delta table, we will query the very first version of the sales_delta table.
df = spark.sql("SELECT sum(Qty) AS qty, sum(Qty * Price) AS Gross FROM Lake02.sales_delta VERSION AS OF 0")
display(df)
FROM Lake02.sales_delta VERSION AS OF 0:
- This specifies that the query should be run against version 0 of the Lake02.sales_delta table. Delta tables keep a version history of all changes, and this clause allows you to query a specific version of the table.
Version 3 is the current version.
We can also use PySpark. To get the history of our Delta table and see the changes over time, we will use the Delta Lake API to retrieve the full history of operations performed on the sales_delta table. This provides a detailed log of all modifications.
from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "Lake02.sales_delta")
history_df = delta_table.history()
display(history_df)
from delta.tables import DeltaTable:
- This imports the DeltaTable class from the delta.tables module. The class provides functionality to interact with Delta tables.

delta_table = DeltaTable.forName(spark, "Lake02.sales_delta"):
- This creates an instance of DeltaTable for the table named Lake02.sales_delta.
- The forName method is used to reference an existing Delta table by its name. The spark session is passed as an argument to this method.

history_df = delta_table.history():
- This retrieves the history of changes (commits) made to the Lake02.sales_delta table and returns it as a DataFrame.
- The history method provides a log of operations performed on the Delta table, such as inserts, updates, and deletes.
We will now use PySpark to explore the time travel feature by reading the sales_delta table as it existed at a specific point in time. This allows us to analyze the state of the data as of July 7, 2024, at 13:35 UTC.
df = spark.read.option("timestampAsOf", "2024-07-07T13:35:00Z").table("sales_delta")
spark.read:
- This initiates a read operation using Spark’s DataFrameReader.

.option("timestampAsOf", "2024-07-07T13:35:00Z"):
- This sets an option for the read operation, specifying that you want to read the table as it was at a specific point in time.
- The timestampAsOf option is used to query the Delta table at the state it was in at a particular timestamp. "2024-07-07T13:35:00Z" is the specified timestamp (in ISO 8601 format, with "Z" indicating UTC time) as of which you want to read the table.

.table("sales_delta"):
- This specifies the name of the Delta table to be read, in this case sales_delta.
To verify the total quantity at the specified point in time, we will aggregate the Qty column and print the result. This will help us confirm the state of our data as of the given timestamp.
print(df.agg({"Qty": "sum"}).collect()[0][0])
df.agg({"Qty": "sum"})
:
df
: This is the DataFrame created from thesales_delta
table as of the specified timestamp..agg({"Qty": "sum"})
: Theagg
function is used to perform aggregate calculations on the DataFrame. In this case, it is creating a dictionary{"Qty": "sum"}
that specifies you want to calculate the sum of theQty
column.
.collect()
:
- The
collect
method retrieves the aggregated result from the executors to the driver as a list of Rows. This is useful for getting the actual data out of the DataFrame after aggregation.
[0][0]
:
[0]
: Accesses the first Row in the list returned bycollect
.[0]
: Accesses the first element of the Row, which is the sum of theQty
column.
print(...)
:
- This prints the value retrieved in the previous steps, which is the sum of the
Qty
column.
Next, we will read the sales_delta table as it existed in version 2. By aggregating the Qty column and printing the result, we can verify the total quantity at this specific version, allowing us to track changes over time.
df = spark.read.option("versionAsOf", "2").table("sales_delta")
print(df.agg({"Qty": "sum"}).collect()[0][0])
To demonstrate the ability to revert our Delta table to a previous state, we will restore the sales_delta table to its original version. This action rolls back the changes and returns the table to its initial condition.
%%sql
RESTORE TABLE Lake02.sales_delta TO VERSION AS OF 0
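The same restore can also be performed with the Delta Lake API in PySpark. The explanation below refers to this equivalent snippet:
from delta.tables import DeltaTable

# Reference the existing Delta table by name and roll it back to version 0.
delta_table = DeltaTable.forName(spark, "Lake02.sales_delta")
delta_table.restoreToVersion(0)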
from delta.tables import DeltaTable:
- Imports the DeltaTable class, which provides functionality to interact with Delta tables.

delta_table = DeltaTable.forName(spark, "Lake02.sales_delta"):
- Creates an instance of DeltaTable for the table named Lake02.sales_delta.

delta_table.restoreToVersion(0):
- Restores the Lake02.sales_delta table to version 0 using the restoreToVersion method.
After restoring the sales_delta table to its original version, we will read the table again and aggregate the Qty column to confirm that the table has been successfully reverted to its initial state.
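A small sketch of that check, reusing the aggregation pattern from earlier:
# Read the restored table and confirm the total quantity matches version 0.
df = spark.read.table("sales_delta")
print(df.agg({"Qty": "sum"}).collect()[0][0])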
To verify and understand the changes made, including our recent restoration, we will review the historical metadata of the sales_delta table again. This provides a comprehensive log of all operations and helps us confirm the restoration.
%%sql
DESCRIBE HISTORY sales_delta
Microsoft Fabric Lakehouse offers a robust and scalable solution for managing both structured and unstructured data, integrating seamlessly with various data tools. The use of Delta Lake as a unified table format ensures efficient data handling. The time travel feature in Microsoft Fabric Lakehouse allows users to view and revert to historical data states, providing flexibility in data management. By leveraging PySpark and Spark SQL, users can efficiently interact with and analyze historical data, demonstrating the platform’s capabilities in maintaining and exploring data history. This feature is crucial for data reliability and recovery, making Microsoft Fabric Lakehouse a powerful tool for comprehensive data management and analysis.
Also, refer to:
Complete Power BI in one Video in 11 hours
Mastering Power BI: 230+ Videos
Expertise Power BI: 150+ Videos
Power BI 50 Interview Questions and 10 Advanced Use Cases
My Medium blogs can be found here if you are interested
Click here to access all my blogs and videos in a jiffy via an exclusive glossary built with Power BI. Please like, share, and comment on these blogs. I would appreciate your suggestions for improvements, challenges, and topics so that I can explore these in more depth.
In addition, I have over 750 videos on my YouTube channel related to Power BI, Tableau, and Incorta. With the help of these videos, you will learn hundreds of DAX and Power Query functions, in addition to hundreds of Power BI use cases that will help you become an expert in Power BI. Make sure you subscribe, like, and share them with your friends.
Master Power BI
Expertise in Power BI
Power BI For Tableau User
DAX for SQL Users
Learn SQL