As Microsoft Fabric is offering an analytics engineering certification, I decided to share some of the things I’m learning along the way. I felt particularly motivated when learning about the Lakehouse medallion architecture, because it lets you automate processes, improve data quality and serve several stakeholders at the same time. For this article I have uploaded the Contoso parquet files for the sales fact table, without making too many transformations between the bronze and silver layer.

In the article image illustration, you can see that there are several ways to get data into the OneLake of Microsoft Fabric; however, I have simply uploaded the files manually into a bronze layer lakehouse, sitting inside the bronze workspace I created and linked to a dedicated bronze Fabric domain. I believe the domain will serve a bigger purpose further downstream, with regards to data ownership, data governance and security. The bronze layer Lakehouse holds the raw data, without any transformation whatsoever. The data is saved in the parquet file format.

To start, I made a plan for the transformations necessary to load the files from the bronze layer into a single sales_fact delta table in the silver layer.

The first decision I made was to save the data as a managed delta table, as I experienced difficulties moving and locating external delta tables throughout the Fabric environment.

However, this would mean that if the managed delta table is deleted, its underlying data files disappear as well. If you delete an external table, you will still see the files holding the data in the Files section of the lakehouse, so in theory there is an additional safety layer. Also a note regarding the external delta table: it will not appear in the SQL Endpoint by default.
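To make that difference concrete, here is a minimal PySpark sketch of the two write modes, as you would run it in a Fabric notebook (the paths and the table name are hypothetical placeholders, and `spark` is the session the notebook already provides):

```python
# Minimal sketch: managed vs. external delta table writes in a Fabric notebook.
# The paths and table name are hypothetical placeholders.
df = spark.read.parquet("Files/bronze/sales_fact_1.parquet")

# Managed table: registered in the lakehouse and visible in the SQL Endpoint;
# dropping the table also deletes its data files.
df.write.format("delta").mode("overwrite").saveAsTable("sales_fact")

# External table: the delta files live under a Files path you control;
# deleting the table reference leaves the files in place, but the table
# does not show up in the SQL Endpoint by default.
df.write.format("delta").mode("overwrite").save("Files/external/sales_fact")
```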

The use case for the data transformation looks as follows:

  • Sales fact raw data in the bronze Lakehouse layer and workspace (numbered parquet files). These files need to be combined into a single table.
  • Silver Lakehouse layer in the silver workspace is the load and transform destination.
  • We have several parquet files that need to be combined and transformed into a single delta table in the Silver Lakehouse layer.
  • We need additional code that checks for new files in the bronze layer and combines them with the existing data in the silver Lakehouse layer automatically.
  • We need an indicator of what the last added data files were, to determine which files are new and need to be combined with the existing data.
  • We want to produce the result both as an external Delta table and as a managed Delta table, and explore the benefits and limitations of each.

In order to achieve this, we have to save information about the last data load and transformation somewhere. I chose to save it in a separate sales_fact_dataloads table, to keep track of past data loads and see what the previous loads were.
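As an illustration, such a tracking table could be created and appended to like this; the columns (load_timestamp, last_file_number, rows_loaded) are my own assumption of a reasonable schema, not necessarily the exact one in the notebook:

```python
# Sketch of a possible sales_fact_dataloads tracking table.
# Column names are assumed for illustration; the real notebook may differ.
from datetime import datetime
from pyspark.sql import Row

load_row = Row(load_timestamp=datetime.now(),   # when this load ran
               last_file_number=7,              # highest file number loaded so far
               rows_loaded=125000)              # how many rows were appended

spark.createDataFrame([load_row]) \
     .write.format("delta").mode("append").saveAsTable("sales_fact_dataloads")
```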

Assumptions: I have to assume a few things for this code. The first is that the file numbering in the raw layer lakehouse increments as data is added, without any gaps in the numbering. A loop that skips gaps and continues counting up (i), whilst also keeping track of the missing files count, proved to be too difficult at this time. Instead, a loop that simply finds the next missing file number and stops there was good enough.
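A sketch of that simplified loop could look like the following. It assumes the files are named sales_fact_<n>.parquet inside a Files/bronze folder (both the name pattern and the path are placeholders) and uses mssparkutils, the file system helper available in Fabric notebooks:

```python
# Sketch: walk the file numbers upward from the last loaded one and stop at
# the first number that has no matching file (the "next missing file number").
from notebookutils import mssparkutils  # available inside Fabric notebooks

def find_new_file_numbers(last_loaded: int, base_path: str = "Files/bronze") -> list:
    existing_names = {f.name for f in mssparkutils.fs.ls(base_path)}
    new_numbers = []
    i = last_loaded + 1
    # Keep counting while the next file exists; stop at the first gap.
    while f"sales_fact_{i}.parquet" in existing_names:
        new_numbers.append(i)
        i += 1
    return new_numbers
```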

Also note in the illustration that we are using an ELT process (extract, load and then transform) for the medallion architecture.

The conditions for the lakehouse notebook data load and transformation would look as follows:

This is how the data looks in the bronze lakehouse layer:

I tried the code with a few scenarios: one file missing in between the file numbering, adding a new file to the bunch with correct numbering, and adding a file that holds duplicates of previously loaded data but with the correct numbering. Do note: this code does not check the delta table location and compare it to all the files in the source folder. It orients itself using the file numbering and checks whether there is a new file.

The code in the end looks like this:


The code does the following in steps:

It checks whether the data_loads delta table exists; if there is none, it creates it.
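A minimal version of that check, assuming the tracking table is a managed table in the attached lakehouse and using the assumed columns from earlier, could be:

```python
# Sketch: create the tracking delta table if it does not exist yet.
# The schema matches the assumed columns from the earlier sketch.
if not spark.catalog.tableExists("sales_fact_dataloads"):
    spark.sql("""
        CREATE TABLE sales_fact_dataloads (
            load_timestamp   TIMESTAMP,
            last_file_number INT,
            rows_loaded      BIGINT
        ) USING DELTA
    """)
```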

Then it sets the starting variables, either from an existing data_loads table or from default values if there has been no data load before (i.e. this is the first data load).
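For example, with the assumed tracking columns, the starting point could be derived like this:

```python
# Sketch: read the most recent load record, or fall back to defaults
# when the tracking table is still empty (first ever load).
latest = (spark.table("sales_fact_dataloads")
               .orderBy("load_timestamp", ascending=False)
               .limit(1)
               .collect())

if latest:
    last_file_number = latest[0]["last_file_number"]
    is_first_load = False
else:
    last_file_number = 0   # start before the first file
    is_first_load = True
```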

Then it loops through the file numbering, counting up one step at a time and checking whether a file is missing (for example, if the last file number is 7, then 8 comes out as the missing file number and the loop stops). It combines the data it has found into a single df and performs minor transformations, for example adding an index and replacing blanks in the column names with underscores.
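Continuing from the file-discovery sketch above, the combine-and-clean step could look roughly like this (the file pattern and folder are still placeholders):

```python
# Sketch: read the newly found files, union them, underscore the column names
# and add an index column so new rows can be told apart later.
from functools import reduce
from pyspark.sql import functions as F

new_numbers = find_new_file_numbers(last_file_number)   # e.g. [8, 9]
paths = [f"Files/bronze/sales_fact_{i}.parquet" for i in new_numbers]

# (assumes at least one new file was found)
combined = reduce(lambda a, b: a.unionByName(b),
                  [spark.read.parquet(p) for p in paths])

# Minor transformations: underscore the column names, add an index column.
for col in combined.columns:
    combined = combined.withColumnRenamed(col, col.replace(" ", "_"))
combined = combined.withColumn("row_index", F.monotonically_increasing_id())
```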

It then loads the existing data (if there is any) from the sales_fact delta table into a separate df and combines it with the data gathered in the loop.

Then we check for duplicates and remove them accordingly, separate the new data from the old data using the index, remove the index in another df, and then append the new data to the existing delta table.
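A rough sketch of that merge path follows; note that to keep it short and deterministic I use a left anti-join against the existing table to drop previously loaded rows, which is a simplification of the union-and-split-by-index approach described above:

```python
# Sketch: remove duplicates within the new batch and rows already present in
# sales_fact, then append only the genuinely new rows.
existing = spark.table("sales_fact")
data_cols = existing.columns   # business columns, without the temporary index

new_rows = (combined
            .dropDuplicates(data_cols)                      # dedupe inside the new batch
            .join(existing, on=data_cols, how="left_anti")  # drop rows already loaded
            .drop("row_index"))

new_rows.write.format("delta").mode("append").saveAsTable("sales_fact")
```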

If there was no previous data, we check for and remove duplicates, create the new delta table with the data, and save the data load information in the separate data_loads delta table.
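And a sketch of that first-load branch, reusing combined and new_numbers from the sketches above and the assumed tracking columns:

```python
# Sketch: first ever load -- deduplicate, create the managed sales_fact table,
# then record the load in the tracking table.
from datetime import datetime
from pyspark.sql import Row

first_batch = combined.drop("row_index").dropDuplicates()
first_batch.write.format("delta").mode("overwrite").saveAsTable("sales_fact")

load_row = Row(load_timestamp=datetime.now(),
               last_file_number=max(new_numbers),
               rows_loaded=first_batch.count())
spark.createDataFrame([load_row]) \
     .write.format("delta").mode("append").saveAsTable("sales_fact_dataloads")
```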

This is the output of a first successful data load:

This is the output when new data containing duplicates is added, but not all of the data are duplicates:

I hope this helps when data is dumped into a location (with coordinated file naming) and you need to combine and load it. I’m sure the code can be improved, extended or simplified; I am currently learning this myself.