End-to-End Data Engineering Project (Nobel Prize Data)

Nobel Prize Data - End-to-End Data Engineering Project

        In this session, I will explain an end-to-end data engineering project in which an ETL pipeline is built to extract data from an API source, load it into storage, and then run some analytics after transformation.

The flow of the Project

    

    In this project the data is extracted from a Web API (HTTPS) and stored in a staging area, i.e., Azure Data Lake Storage. The data is then transformed and processed using Azure Databricks, and the transformed data is loaded back into Azure Data Lake Storage, completing all the stages of the Extract-Transform-Load pipeline, which is built and managed using Azure Data Factory. Finally, the transformed data in Azure Data Lake Storage is transferred to Azure Synapse Analytics, which integrates data warehousing (formerly Azure SQL Data Warehouse) and big data analytics on a single platform, and numerous analytical operations are performed there.

The Steps: 

1. Building an Extract-Transform-Load Pipeline

Extraction - This step extracts data from the HTTP (API) source and stores it in the staging storage. Before extraction, I will:

Create a Storage Account
  • Create a Resource Group - NoblePrizesAnalytics
  • Create an Azure Data Lake Storage account - nobleprizestorage
  • Create a Container - nobleprize-container - with two directories (a quick code sketch of this container setup follows the list)
      • raw-data - where the unprocessed data from the source is kept
      • transformed-data - where the processed/transformed data is stored
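If you prefer scripting this part instead of clicking through the portal, a minimal sketch using the azure-storage-file-datalake SDK might look like the following. It assumes the storage account already exists and that the signed-in identity has rights to manage it; this is just an illustration, not the exact way I created the resources.

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Connect to the existing nobleprizestorage account
    service = DataLakeServiceClient(
        account_url="https://nobleprizestorage.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )

    # Create the container and the two directories used by the pipeline
    container = service.create_file_system(file_system="nobleprize-container")
    container.create_directory("raw-data")
    container.create_directory("transformed-data")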

Create an Azure Data Factory Account - noble-prize-dfactoy 

In this Azure Data Factory I will build a pipeline to transfer data from the HTTP source to the raw-data directory in nobleprize-container.

 Data Source

 

In the Data Factory I will create a Copy activity to copy/transfer data from the above source to the Data Lake Storage.


First, I will define the source details with a linked service and its properties, where I specify the dataset source, i.e., HTTP, and the format, i.e., CSV. I name the linked service CategoriesHTTP and specify the URL.

specify the format                    


Specifying the URL of my GitHub data source
 

Now I will specify the sink, i.e., Azure Data Lake Storage Gen2, and give the location of the directory, i.e., raw-data in nobleprize-container.

specifying the raw directory path

  

When I validate and debug the copy activity, it executes without any error and the data is stored in the raw-data directory.
Repeat the above steps until all files are transferred, so finally we will have a directory like the one below. A rough code equivalent of this Copy activity is sketched right after.
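For readers who like to see the step in code, this is roughly what the Copy activity does for one file: pull a CSV from the GitHub source over HTTP and drop it into raw-data. The URL and file name here are placeholders, not the exact ones from my pipeline.

    import requests
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder URL; in the pipeline this is the GitHub raw URL set on the HTTP linked service
    source_url = "https://raw.githubusercontent.com/<user>/<repo>/main/Categories.csv"
    csv_bytes = requests.get(source_url).content

    service = DataLakeServiceClient(
        account_url="https://nobleprizestorage.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("nobleprize-container")
    file_client = fs.get_file_client("raw-data/Categories.csv")  # placeholder file name
    file_client.upload_data(csv_bytes, overwrite=True)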
Now that the data is staged in the raw-data directory of the Data Lake Storage, we will connect it to Databricks and perform the transformation operations.

To do this we need an app registration (service principal) that authenticates and eases the connection between the Data Lake Storage and Databricks, as created below.
    
creating the app        

assigning the role 
 

2. Transformation - applying transformations, cleaning, and processing of the data

For transformation I will be using Azure Databricks.

    this code will set up the connection between Databricks and the Data Lake
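That connection code boils down to the OAuth configuration below, using the credentials of the app registration created earlier. The tenant ID, client ID, and client secret are placeholders; in practice the secret would come from a Databricks secret scope rather than being typed in plain text.

    # OAuth configuration for ABFS access via the service principal (values are placeholders)
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret": "<client-secret>",
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # Apply the settings for the nobleprizestorage account
    for key, value in configs.items():
        spark.conf.set(f"{key}.nobleprizestorage.dfs.core.windows.net", value)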


In Databricks, I did all kinds of transformations; a small idea of what that looks like is given below.
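As a flavour of the kind of transformations involved, here is a simplified sketch. The raw file name and its column names are assumptions for illustration, chosen so the output lines up with the aggregated country-level table used later in the analysis.

    from pyspark.sql import functions as F

    # Read one of the raw files staged by the Data Factory pipeline (file name assumed)
    raw = spark.read.csv(
        "abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/raw-data/Laureates.csv",
        header=True, inferSchema=True,
    )

    # Example cleanup + aggregation per country (column names assumed)
    per_country = (
        raw.dropna(subset=["Country"])
           .groupBy("Country")
           .agg(
               F.count("*").alias("Prizes_number"),
               F.avg("Age").cast("int").alias("avg_age"),
               F.sum(F.when(F.col("Gender") == "female", 1).otherwise(0)).alias("prizes_women"),
               F.sum(F.when(F.col("Gender") == "male", 1).otherwise(0)).alias("prizes_men"),
           )
           .withColumn("per_women", F.col("prizes_women") / F.col("Prizes_number"))
           .withColumn("per_men", F.col("prizes_men") / F.col("Prizes_number"))
    )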

3. Loading - Load all transformed data into the Data Lake

Now I will save all the transformed tables and data into the transformed-data directory of the Data Lake Storage using code like the following.
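A sketch of what those write cells look like, reusing the per_country DataFrame from the transformation sketch above; the output path matches the one read back later from Synapse.

    # Write the aggregated table into the transformed-data directory;
    # Spark creates a folder of part files at this path
    (per_country
        .coalesce(1)
        .write
        .mode("overwrite")
        .option("header", True)
        .csv("abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/transformed-data/Countries.csv"))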


So now the transformed data resides in my Data Lake Storage.

4. Analytics 

For analysis I will be using an Azure service called Azure Synapse Analytics - it allows us to query and analyse the data stored in Azure Data Lake Storage.
     


  • In Synapse Studio, go to the Data hub.
  • Under the Linked section, you can define an external data source that points to your Data Lake Storage account.
  • Specify the necessary details such as the data source name, the data source type (e.g., Azure Data Lake Storage), and the connection information.

Mainly, I needed a notebook where I can do my analysis.

Here I am creating a notebook

Here I am defining the role and providing the necessary permissions to access the Data Lake


this is the workspace where I did the analysis

Some of the operations I did in Synapse

Analysing using PySpark:

    df = spark.read.load(
        'abfss://nobleprize-container@nobleprizestorage.dfs.core.windows.net/transformed-data/Countries.csv',
        format='csv', header=True)

    df.show()
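As one more example of the kind of operation done here, a quick ranking on that DataFrame; the column names are assumed to match the aggregated country-level table defined later in Synapse.

    from pyspark.sql import functions as F

    # Top 10 countries by number of prizes
    # (columns were read as strings, so cast the count before sorting)
    (df.withColumn("Prizes_number", F.col("Prizes_number").cast("int"))
       .orderBy(F.col("Prizes_number").desc())
       .select("Country", "Prizes_number", "per_women", "per_men")
       .show(10))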

Using T-SQL
Create External Tables:

Define external tables in Synapse Studio that reference the data stored in Data Lake Storage.

External tables provide a way to query data without physically moving it into Synapse Analytics.

You can create external tables using T-SQL (Transact-SQL) scripts or the Synapse Studio visual interface.

Example T-SQL script for creating an external table:


-- The LOCATION path and the DATA_SOURCE / FILE_FORMAT names below are placeholders
-- for the path and objects defined beforehand with CREATE EXTERNAL DATA SOURCE
-- and CREATE EXTERNAL FILE FORMAT in your workspace
CREATE EXTERNAL TABLE dbo.Categories
(
    Country varchar(100),
    Prizes_number int,
    avg_age int,
    prizes_women int,
    prizes_men int,
    per_women float,
    per_men float
)
WITH
(
    LOCATION = 'transformed-data/Categories/',
    DATA_SOURCE = NobelPrizeDataLake,
    FILE_FORMAT = ParquetFileFormat
);
Query Data:

Once the external tables are created, you can run SQL queries on them to analyze the data stored in Data Lake Storage.
Use the Synapse Studio or any other SQL client to execute your queries.


Example T-SQL query:


SELECT *
FROM dbo.Categories
WHERE Country = 'Argentina';

So this was all about the project I did; it explains all the steps involved. Thank you for reading, and I hope you understood the project. Feel free to comment and make suggestions.