End-to-End Data Engineering Project (Nobel Prize Data)

Nobel Prize Data - End-to-End Data Engineering Project

        In this session, I will explain an end-to-end data engineering project in which an ETL pipeline is built to extract data from an API source, load it into storage, and then run some analytics after transformation.

The flow of the Project

    

    In this project the data is extracted from a Web API (HTTPS) and stored in a staging area, i.e., Azure Data Lake Storage. The data is then transformed and processed using Azure Databricks, and the transformed data is loaded back into Azure Data Lake Storage, completing all the stages of the Extract-Transform-Load pipeline, which is built and managed using Azure Data Factory. Finally, the transformed data in Azure Data Lake Storage is transferred to Azure Synapse Analytics, which integrates data warehousing (formerly Azure SQL Data Warehouse) and big data analytics on a single platform, and numerous analytical operations are performed there.

The Steps: 

1. Building an Extract-Transform-Load Pipeline

Extraction - This step extracts data from the HTTP (API) source and stores it in the staging storage. Before extraction, I will:

Create a Storage Account
  • Create a Resource Group - NoblePrizesAnalytics
  • Create an Azure Data Lake Storage account - nobleprizestorage
  • Create a Container - nobleprize-container - with two directories (a quick code sketch of this container setup follows the list)
      • raw-data - where the unprocessed data from the source is kept
      • transformed-data - where the processed/transformed data is stored
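If you prefer scripting this part instead of clicking through the portal, a minimal sketch using the azure-storage-file-datalake SDK might look like the following. It assumes the storage account already exists and that the signed-in identity has rights to manage it; this is just an illustration, not the exact way I created the resources.

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Connect to the existing nobleprizestorage account
    service = DataLakeServiceClient(
        account_url="https://nobleprizestorage.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )

    # Create the container and the two directories used by the pipeline
    container = service.create_file_system(file_system="nobleprize-container")
    container.create_directory("raw-data")
    container.create_directory("transformed-data")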

Create an Azure Data Factory Account - noble-prize-dfactoy 

In this Azure Data Factory I will build a pipeline to transfer data from the HTTP source to the raw-data directory in nobleprize-container.

 Data Source

 

In the Data Factory I will create a Copy activity to copy/transfer data from the above source to the Data Lake Storage.


First, I will define the source details with a linked service and its properties, where I specify the dataset source, i.e., HTTP, and the format, i.e., CSV. I name the linked service CategoriesHTTP and specify the URL.

specify the format                    


Specifying the URL of my GitHub data source
 

Now I will specify the sink, i.e., Azure Data Lake Storage Gen2, and give the location of the directory, i.e., raw-data in nobleprize-container.

specifying the raw directory path

  

When I validate and debug the copy activity, it executes without any error and the data is stored in the raw-data directory.
Repeat the above steps until all files are transferred, so finally we will have a directory like the one below. A rough code equivalent of this Copy activity is sketched right after.
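For readers who like to see the step in code, this is roughly what the Copy activity does for one file: pull a CSV from the GitHub source over HTTP and drop it into raw-data. The URL and file name here are placeholders, not the exact ones from my pipeline.

    import requests
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder URL; in the pipeline this is the GitHub raw URL set on the HTTP linked service
    source_url = "https://raw.githubusercontent.com/<user>/<repo>/main/Categories.csv"
    csv_bytes = requests.get(source_url).content

    service = DataLakeServiceClient(
        account_url="https://nobleprizestorage.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("nobleprize-container")
    file_client = fs.get_file_client("raw-data/Categories.csv")  # placeholder file name
    file_client.upload_data(csv_bytes, overwrite=True)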
Now that the data is staged in the raw-data directory of the Data Lake Storage, we will connect it to Databricks and perform the transformation operations.

To do this we need an app registration (service principal) that authenticates and eases the connection between the Data Lake Storage and Databricks, as created below.
    
creating the app        

assigning the role 
 

2. Transformation - applying transformations, cleaning, and processing of the data

For transformation I will be using Azure Databricks.

    this code will set up the connection between Databricks and the Data Lake
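That connection code boils down to the OAuth configuration below, using the credentials of the app registration created earlier. The tenant ID, client ID, and client secret are placeholders; in practice the secret would come from a Databricks secret scope rather than being typed in plain text.

    # OAuth configuration for ABFS access via the service principal (values are placeholders)
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret": "<client-secret>",
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # Apply the settings for the nobleprizestorage account
    for key, value in configs.items():
        spark.conf.set(f"{key}.nobleprizestorage.dfs.core.windows.net", value)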


In Databricks, I did all kinds of transformations; a small idea of what that looks like is given below.
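As a flavour of the kind of transformations involved, here is a simplified sketch. The raw file name and its column names are assumptions for illustration, chosen so the output lines up with the aggregated country-level table used later in the analysis.

    from pyspark.sql import functions as F

    # Read one of the raw files staged by the Data Factory pipeline (file name assumed)
    raw = spark.read.csv(
        "abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/raw-data/Laureates.csv",
        header=True, inferSchema=True,
    )

    # Example cleanup + aggregation per country (column names assumed)
    per_country = (
        raw.dropna(subset=["Country"])
           .groupBy("Country")
           .agg(
               F.count("*").alias("Prizes_number"),
               F.avg("Age").cast("int").alias("avg_age"),
               F.sum(F.when(F.col("Gender") == "female", 1).otherwise(0)).alias("prizes_women"),
               F.sum(F.when(F.col("Gender") == "male", 1).otherwise(0)).alias("prizes_men"),
           )
           .withColumn("per_women", F.col("prizes_women") / F.col("Prizes_number"))
           .withColumn("per_men", F.col("prizes_men") / F.col("Prizes_number"))
    )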

3. Loading - Load all transformed data into the Data Lake

Now I will save all the transformed tables and data into the transformed-data directory of the Data Lake Storage using code like the following.
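A sketch of what those write cells look like, reusing the per_country DataFrame from the transformation sketch above; the output path matches the one read back later from Synapse.

    # Write the aggregated table into the transformed-data directory;
    # Spark creates a folder of part files at this path
    (per_country
        .coalesce(1)
        .write
        .mode("overwrite")
        .option("header", True)
        .csv("abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/transformed-data/Countries.csv"))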


So now the transformed data resides in my Data Lake Storage.

4. Analytics 

For analysis I will be using an Azure service called Azure Synapse Analytics - it allows us to query and analyse the data stored in Azure Data Lake Storage.
     


  • In Synapse Studio, go to the Data hub.
  • Under the Linked section, you can define an external data source that points to your Data Lake Storage account.
  • Specify the necessary details such as the data source name, the data source type (e.g., Azure Data Lake Storage), and the connection information.

Mainly, I needed a notebook where I can do my analysis.

Here I am creating a notebook

Here I am defining the role and providing the necessary permissions to access the Data Lake


this is the workspace where I did the analysis

Some of the operations I did in Synapse

Analysing using PySpark:

    df = spark.read.load(
        'abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/transformed-data/Countries.csv',
        format='csv', header=True)

    df.show()
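As one more example of the kind of operation done here, a quick ranking on that DataFrame; the column names are assumed to match the aggregated country-level table defined later in Synapse.

    from pyspark.sql import functions as F

    # Top 10 countries by number of prizes
    # (columns were read as strings, so cast the count before sorting)
    (df.withColumn("Prizes_number", F.col("Prizes_number").cast("int"))
       .orderBy(F.col("Prizes_number").desc())
       .select("Country", "Prizes_number", "per_women", "per_men")
       .show(10))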

Using T-SQL
Create External Tables:

Define external tables in Synapse Studio that reference the data stored in Data Lake Storage.

External tables provide a way to query data without physically moving it into Synapse Analytics.

You can create external tables using T-SQL (Transact-SQL) scripts or the Synapse Studio visual interface.

Example T-SQL script for creating an external table:


-- The LOCATION path and the DATA_SOURCE / FILE_FORMAT names below are placeholders
-- for the path and objects defined beforehand with CREATE EXTERNAL DATA SOURCE
-- and CREATE EXTERNAL FILE FORMAT in your workspace
CREATE EXTERNAL TABLE dbo.Categories
(
    Country varchar(100),
    Prizes_number int,
    avg_age int,
    prizes_women int,
    prizes_men int,
    per_women float,
    per_men float
)
WITH
(
    LOCATION = 'transformed-data/Categories/',
    DATA_SOURCE = NobelPrizeDataLake,
    FILE_FORMAT = ParquetFileFormat
);
Query Data:

Once the external tables are created, you can run SQL queries on them to analyze the data stored in Data Lake Storage.
Use the Synapse Studio or any other SQL client to execute your queries.


Example T-SQL query:


SELECT *
FROM dbo.Categories
WHERE Country = 'Argentina';

So this was all about the project I did; it explains all the steps involved. Thank you for reading, and I hope you understood the project. Feel free to comment and make suggestions.