Data Engineering Tools – Segregated
|
Segment |
General |
AWS |
|
Data Ingestion |
Ingestion Tools: Apache Kafka, Apache NiFi, Amazon Kinesis, Logstash |
AWS Glue: Glue Crawlers discover and catalog metadata from various data sources; Glue ETL jobs transform and load data. Amazon Kinesis: Kinesis Data Streams for real-time data streaming; Kinesis Data Firehose for loading streaming data into data stores. AWS DataSync: Transfer data from on-premises to AWS. |
|
Data Storage |
Data Warehouses: Amazon Redshift, Google BigQuery, Snowflake. Data Lakes: Amazon S3, Azure Data Lake Storage, Google Cloud Storage. Databases: PostgreSQL, MySQL, MongoDB, Cassandra |
Amazon S3: Data lake for storing raw and processed data, with versioning and lifecycle policies for managing data. Amazon Redshift: Data warehousing and complex queries. Amazon DynamoDB: NoSQL database requirements. |
|
Data Processing |
Batch Processing: Apache Spark, Apache Flink, Hadoop MapReduce. Stream Processing: Apache Kafka Streams, Apache Storm, Apache Flink. ETL (Extract, Transform, Load): Apache Beam, Apache Airflow, Talend |
Amazon EMR (Elastic MapReduce): Big data processing using frameworks like Apache Spark and Hadoop. AWS Glue: Serverless ETL service for data transformation and preparation. AWS Lambda: Serverless event-driven processing. |
|
Data Transformation |
Data Preparation: Pandas (Python library), Apache Beam. Data Cleansing: Trifacta, OpenRefine. Data Masking/Anonymization: Google DLP, Apache NiFi |
AWS Glue: Glue jobs for ETL transformations. AWS Step Functions: Orchestrate and coordinate multiple AWS services in a serverless workflow. |
|
Analytics and Reporting |
Business Intelligence Tools: Tableau, Power BI, Looker. Analytics Platforms: Databricks, Google Analytics, Mixpanel |
Amazon QuickSight: Business intelligence service for visualizing and analyzing data. Amazon Athena: Serverless query service for analyzing data in Amazon S3. |
|
Data Orchestration |
Workflow Management: Apache Airflow, Luigi, Prefect. Job Scheduling: Cron, Apache Oozie |
Apache Airflow on Amazon MWAA (Managed Workflows for Apache Airflow): Orchestrate and schedule complex data workflows. AWS Step Functions: Serverless workflow orchestration. |
|
Monitoring and Logging |
Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk. Monitoring: Prometheus, Grafana |
Amazon CloudWatch: Monitoring AWS resources and applications. AWS CloudTrail: Logging AWS API calls. |
|
Data Quality and Governance |
Data Quality Tools: Informatica, Talend, Apache Griffin. Metadata Management: Collibra, Apache Atlas |
AWS Glue DataBrew: Data profiling, cleaning, and exploration. AWS Lake Formation: Set up and enforce security, governance, and auditing policies. |
|
Security and Access Control |
Encryption: TLS/SSL, HDFS Encryption. Access Control: Apache Ranger, AWS IAM, Google Cloud Identity and Access Management (IAM) |
AWS IAM (Identity and Access Management): Manage access to AWS resources. AWS Key Management Service (KMS): Create and manage the keys used to encrypt data at rest; encryption in transit is typically handled by TLS. |
|
Data Science Integration |
Model Deployment: TensorFlow Serving, MLflow, PMML (Predictive Model Markup Language). Notebook Environments: Jupyter Notebooks, Google Colab, Databricks Notebooks |
Amazon SageMaker: Building, training, and deploying machine learning models. |
|
Architectural Patterns |
Lambda Architecture: Combines a batch layer and a stream (speed) layer so that queries can draw on both complete historical results and low-latency recent results. Kappa Architecture: Simplifies the Lambda Architecture by using only stream processing. |
Serverless Architecture: Leverage services like Lambda, Glue, and Step Functions for serverless processing. Data Lake Architecture: Use S3 as a central data lake to store structured and unstructured data. |
|
Data Versioning and Lineage |
Version Control: Git, DVC (Data Version Control). Lineage Tracking: Apache Atlas, DataHub |
|
|
Cloud Integration |
Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP). Serverless Computing: AWS Lambda, Azure Functions, Google Cloud Functions |
AWS Direct Connect or VPN: Connect on-premises data centers to AWS. AWS SDKs and CLI: Integrate and automate AWS services using SDKs and the Command Line Interface. |
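
The Lambda Architecture entry in the table above (a batch layer plus a stream layer, merged when a query is served) can be illustrated with a small, tool-agnostic sketch. This is not code for Spark, Kinesis, or any other product in the table; the event shape `(user_id, amount)` and the names `build_batch_view`, `SpeedLayer`, and `query` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape for illustration: (user_id, amount).
Event = Tuple[str, int]


def build_batch_view(historical: Iterable[Event]) -> Counter:
    """Batch layer: recompute per-user totals from the full historical dataset."""
    view: Counter = Counter()
    for user_id, amount in historical:
        view[user_id] += amount
    return view


class SpeedLayer:
    """Speed layer: incrementally totals events not yet covered by the batch view."""

    def __init__(self) -> None:
        self.view: Counter = Counter()

    def ingest(self, event: Event) -> None:
        user_id, amount = event
        self.view[user_id] += amount


def query(batch_view: Counter, speed_view: Counter, user_id: str) -> int:
    """Serving layer: merge the batch and speed views at query time."""
    # Counter returns 0 for users missing from either view.
    return batch_view[user_id] + speed_view[user_id]


historical = [("alice", 10), ("bob", 5), ("alice", 7)]
batch_view = build_batch_view(historical)

speed = SpeedLayer()
for event in [("alice", 3), ("carol", 4)]:
    speed.ingest(event)

print(query(batch_view, speed.view, "alice"))  # 20
print(query(batch_view, speed.view, "carol"))  # 4
```

In a Kappa-style variant of this sketch, the batch function would be dropped and the historical events would simply be replayed through the same `ingest` path.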
|
Segment |
Microsoft Azure |
Google Cloud Platform |
|
Data Ingestion |
Azure Data Factory: Orchestrate and automate data workflows, with support for data movement from various sources to data lakes or warehouses. Azure Event Hubs: Ingest and process massive amounts of streaming data. |
Cloud Pub/Sub: Real-time messaging service for event-driven architectures. Cloud Storage: Object storage for batch uploads. |
|
Data Storage |
Azure Data Lake Storage: Scalable and secure data lake storage. Azure SQL Data Warehouse (now part of Azure Synapse Analytics): Enterprise-grade analytics service. Azure Cosmos DB: Globally distributed, multi-model database for operational and analytical workloads. |
BigQuery: Fully managed, serverless data warehouse for analytics. Cloud Storage: Object storage for raw data and backups. Cloud SQL: Managed relational databases. |
|
Data Processing |
Azure Databricks: Apache Spark-based analytics platform for big data and machine learning. HDInsight: Fully managed cloud service for big data analytics using Hadoop, Spark, HBase, and more. Azure Stream Analytics: Real-time analytics on streaming data. |
Dataflow: Fully managed stream and batch processing using Apache Beam. Dataprep by Trifacta: Cloud-native data preparation service. Dataproc: Managed Apache Spark and Hadoop service. |
|
Data Transformation |
Azure Data Factory: Transform and clean data using data flows and transformations. Azure HDInsight: Leverage Apache Spark or Hive for data transformation. |
Dataflow: Apache Beam for ETL pipelines. Cloud Dataprep: Visual data preparation tool. |
|
Analytics and Reporting |
Power BI: Business intelligence and visualization. Azure Synapse Studio: Integrated analytics and data exploration. |
BigQuery: Ad-hoc queries and analytics. Looker, Tableau, or Data Studio (now Looker Studio): Business intelligence and visualization tools. |
|
Data Orchestration |
Azure Data Factory: Schedule and orchestrate data workflows. Azure Logic Apps: Automate workflows and integrate services, including data services. |
Cloud Composer: Managed Apache Airflow for workflow orchestration. Cloud Scheduler: Fully managed cron job scheduler. |
|
Monitoring and Logging |
Azure Monitor: Monitor the performance and health of resources. Azure Log Analytics: Collect and analyze log data. |
Cloud Monitoring: Infrastructure and application monitoring. Cloud Logging: Centralized log management. |
|
Data Quality and Governance |
Azure Purview (now Microsoft Purview): Unified data governance service for discovering, understanding, and managing data. Azure Data Catalog: Discover, register, and manage data assets. |
Cloud Data Catalog: Fully managed and scalable metadata management service. Cloud Data Loss Prevention (DLP): Sensitive data discovery and redaction. |
|
Security and Access Control |
Azure Active Directory (AAD, now Microsoft Entra ID): Identity and access management. Azure Key Vault: Securely store and manage sensitive information like keys and secrets. |
Cloud Identity and Access Management (IAM): Access control for GCP resources. Cloud Key Management Service (KMS): Manage cryptographic keys. |
|
Data Science Integration |
Azure Machine Learning: End-to-end platform for building, training, and deploying machine learning models. |
AI Platform (since succeeded by Vertex AI): Managed services for building, training, and deploying machine learning models. Notebooks: AI Platform Notebooks or Jupyter Notebooks on AI Platform. |
|
Architectural Patterns |
Modern Data Warehouse (Azure Synapse Analytics): Combines big data and data warehousing for analytics. Event-Driven Architectures: Use Azure Event Hubs and Azure Functions for event-driven processing. |
Serverless Architecture: Use serverless services like Cloud Functions. Data Lake and Data Warehouse: Combine Cloud Storage and BigQuery for cost-effective storage and analytics. |
|
Data Versioning and Lineage |
|
Cloud Data Catalog: Track and manage data lineage. BigQuery: Keep track of changes to tables, for example through table snapshots and time travel. |
|
Cloud Integration |
Azure Functions: Serverless computing for event-driven solutions. Azure Logic Apps: Connect and automate workflows across cloud and on-premises services. |
Cloud Functions: Serverless computing for event-driven functions. Cloud Run: Fully managed compute platform for containerized applications. |
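
The orchestration tools listed in both tables (Apache Airflow, Cloud Composer, Azure Data Factory, Step Functions) share one core idea: tasks are declared with their upstream dependencies and run in dependency order. A minimal sketch of that idea follows, using only Python's standard `graphlib` module. It is not the Airflow API or any vendor SDK; the task names, the sample rows, and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical raw rows; a real pipeline would read from a source system.
    return [" 42", "17 ", "oops", "8"]


def transform(rows):
    # Trim whitespace and keep only rows that parse as non-negative integers.
    return [int(r) for r in (s.strip() for s in rows) if r.isdigit()]


def load(values):
    # Stand-in for writing to a warehouse: return a summary of what was loaded.
    return {"row_count": len(values), "total": sum(values)}


# Each task maps to the set of tasks it depends on (its upstream tasks).
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
tasks = {"extract": extract, "transform": transform, "load": load}


def run(dag, tasks):
    """Run every task after its upstream tasks, passing their results along."""
    results = {}
    order = []
    for name in TopologicalSorter(dag).static_order():
        upstream = [results[dep] for dep in sorted(dag[name])]
        results[name] = tasks[name](*upstream)
        order.append(name)
    return order, results


order, results = run(dag, tasks)
print(order)            # ['extract', 'transform', 'load']
print(results["load"])  # {'row_count': 3, 'total': 67}
```

Managed orchestrators add scheduling, retries, and monitoring on top of this ordering step, which the sketch leaves out.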