Simple Incremental Load Project

Simple Pipeline Project for Incremental Load

    In this blog I will build a simple pipeline project to implement an incremental load using Azure Data Factory and Azure SQL Database.

Incremental Load: Incremental load refers to the process of updating a data warehouse, database, or data repository with only the data that has changed since the last update. Instead of reloading the entire dataset, incremental loading involves adding or updating only the new or modified data.

This approach is commonly used in scenarios where the dataset is large, and it is not efficient or practical to reload the entire dataset each time there is new data. Incremental loading can help save time, resources, and processing power by focusing only on the changes.

There are different strategies for implementing incremental loads, depending on the nature of the data and the system in use. Some common methods include:

  • Timestamps or Date-Based Incremental Load: Data is extracted based on a timestamp or date field. Only records that have been added or modified since the last update are included (see the sketch after this list).
  • Change Data Capture (CDC): This technique identifies and captures changes made to the data since the last update. It may involve tracking inserts, updates, and deletes.
  • Flagging or Marking Updated Records: A flag or indicator is set for records that have been added or modified. During the incremental load, only records with this flag are processed.
  • Log-Based Incremental Load: Database logs are monitored for changes, and only the changes are extracted and loaded into the data warehouse.
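As a quick illustration of the first approach, a timestamp-based extract might look like the sketch below; the table name Orders, the LastModified column, and the stored watermark value are all hypothetical and only there to show the pattern.

    -- Hypothetical timestamp-based incremental extract (illustrative names):
    -- pull only the rows added or modified since the last successful load.
    DECLARE @LastLoadTime DATETIME2 = '2024-01-01T00:00:00';  -- watermark saved by the previous run

    SELECT *
    FROM dbo.Orders
    WHERE LastModified > @LastLoadTime;  -- new or changed rows only

After the run, the watermark would be advanced to the maximum LastModified value just loaded, so the next extract starts where this one stopped.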

    Incremental loading is crucial in scenarios where real-time or near-real-time data updates are required, and it helps in optimizing the data integration and ETL (Extract, Transform, Load) processes. It is commonly used in data warehousing, business intelligence, and other data-centric applications where keeping the data up-to-date is essential for accurate and timely analysis.

    In this project, the incremental load works by checking the most recent record in the destination table using its ID value and then loading every row from the source table whose ID value is greater than that last ID in the destination table.
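Since the source and destination tables in this project live in the same Azure SQL database, the idea can be written as a rough single-statement T-SQL sketch; the non-ID column names below are made up for illustration, and the ADF pipeline later in the post implements the same pattern with a Lookup and a Copy activity instead.

    -- Rough sketch of the ID-based incremental load in plain T-SQL
    -- (Office_Name and City are illustrative column names):
    INSERT INTO dbo.Offices_final (Offices_Id, Office_Name, City)
    SELECT s.Offices_Id, s.Office_Name, s.City
    FROM dbo.Offices AS s
    WHERE s.Offices_Id > (SELECT ISNULL(MAX(Offices_Id), 0) FROM dbo.Offices_final);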

Project Architecture

Technologies:


    Azure SQL Database is a fully managed relational database service provided by Microsoft Azure. It is a cloud-based version of Microsoft SQL Server, a popular relational database management system (RDBMS).
        I will use it as both the source and the destination database; the incremental load moves data from the Offices table to the Offices_final table within the same SQL database.

    Azure Data Factory is a cloud-based data integration service provided by Microsoft Azure. Azure Data Factory enables organizations to collect, transform, and analyze data from various sources, making it a crucial tool in the context of data integration, ETL (Extract, Transform, Load), and data transformation workflows.
        Using this, I will create a pipeline with a Lookup activity to query the data and a Copy activity to copy it.

Steps

    Create an Azure SQL database (sampledb) and its logical server (sample-server2)



    I created a connection to the Azure SQL database from my local SQL Server Management Studio (SSMS), then created the tables and inserted data.
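The exact schema is not shown in the screenshots, so the definitions below are only a guess at what the tables look like; Offices_Id is the column that matters for the incremental load, the rest is illustrative.

    -- Source table (schema is illustrative apart from Offices_Id).
    CREATE TABLE dbo.Offices (
        Offices_Id  INT PRIMARY KEY,
        Office_Name VARCHAR(100),
        City        VARCHAR(100)
    );

    -- A few sample rows; in this walkthrough the source ends up holding
    -- more rows than the destination (say, IDs 1 through 10).
    INSERT INTO dbo.Offices (Offices_Id, Office_Name, City) VALUES
    (1, 'Office 1', 'New York'),
    (2, 'Office 2', 'Chicago'),
    (3, 'Office 3', 'Boston');
    -- ... and so on up to the highest Offices_Id in the source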




I followed the same process to create Offices_final with 7 records, and the Azure SQL database now looks like this
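Assuming the same illustrative schema, the destination table can be created and seeded with those first 7 rows like this:

    -- Destination table, seeded with the 7 rows already loaded.
    CREATE TABLE dbo.Offices_final (
        Offices_Id  INT PRIMARY KEY,
        Office_Name VARCHAR(100),
        City        VARCHAR(100)
    );

    INSERT INTO dbo.Offices_final (Offices_Id, Office_Name, City)
    SELECT Offices_Id, Office_Name, City
    FROM dbo.Offices
    WHERE Offices_Id <= 7;  -- only the first 7 records exist in the destination so far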







Now I will create an Azure Data Factory pipeline
    After creating the Data Factory resource, I will add a Lookup activity to query data from the SQL database and provide the necessary conditions for the incremental load
    

    Create a connection (linked service) to the SQL database

       


Now I will query dbo.Offices_final to find the value that drives the incremental load, i.e., the most recent Offices_Id, and then copy data into it using that latest Offices_Id as the reference
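The Lookup activity's query only needs to return the highest Offices_Id already present in the destination, something along these lines (the MaxId alias is my own naming):

    -- Lookup activity query: find the current high-water mark in the destination.
    SELECT MAX(Offices_Id) AS MaxId
    FROM dbo.Offices_final;  -- returns 7 with the data seeded above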




So now I will copy the rows from dbo.Offices whose Offices_Id values are greater than 7.
To make this happen, I will create a Copy activity, define the source as dbo.Offices, and specify the condition as a query while loading the data into dbo.Offices_final.
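The Copy activity's source query can either hard-code the value found by the Lookup (7 here) or reference the Lookup output dynamically; the sketch below assumes the Lookup activity is named Lookup1 and aliases the value as MaxId, neither of which is confirmed by the screenshots.

    -- Static version: hard-code the last Offices_Id already in the destination.
    SELECT *
    FROM dbo.Offices
    WHERE Offices_Id > 7;

    -- Dynamic version, entered as ADF dynamic content in the source query box
    -- (assumes a Lookup activity named Lookup1 returning a column MaxId):
    SELECT *
    FROM dbo.Offices
    WHERE Offices_Id > @{activity('Lookup1').output.firstRow.MaxId}

The dynamic form relies on the Lookup's "First row only" option being enabled so that output.firstRow exists.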

Specifying the source and getting the data



This is the data we have in dbo.Offices; now I will apply the condition to query only the required data



With this I get only the new entries, i.e., rows with Offices_Id greater than 7

    Now I will specify the sink (destination), i.e., dbo.Offices_final, and run the pipeline (validate and debug)



I can see that my pipeline ran successfully, and when I check the data in dbo.Offices_final:
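The check is just a plain select over the destination table; the rows with Offices_Id greater than 7 should now appear alongside the original 7.

    -- Verify the incremental copy: rows with Offices_Id > 7 are now present.
    SELECT *
    FROM dbo.Offices_final
    ORDER BY Offices_Id;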


This is a simple incremental load example, just to understand how incremental load works; beyond this, a schedule or trigger can be set so that such a copy runs at regular intervals.

Thank you for your time. Feel free to comment.