Week 2 - Data Ingestion
Data ingestion is a core data engineering skill, and I highly recommend focusing on it. You can search this handbook along with Google, GPT, and Gemini to fill in the knowledge; it's free.
Watch this video: https://youtu.be/ITRVg7cfAwY
I put the sample code here: GitHub repository
Source Systems
- Source systems: Files, Change Data Capture (CDC), OLAP & OLTP, and Logs
- Message broker models
- Basic overview of RabbitMQ
- Fundamentals of Kafka: architecture, offsets, partitions, and message-assignment mechanisms (see the producer/consumer sketch after this list)
- How to set up Kafka properly: idle consumers, rebalancing, etc.
- CDC methods and Debezium
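To make the Kafka concepts above concrete, here is a minimal producer/consumer sketch using the kafka-python client. It assumes a broker running on localhost:9092; the topic name, key, and payload are only illustrative.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP = "localhost:9092"   # assumed local broker
TOPIC = "rides"                # illustrative topic name

# Producer: the key decides which partition a message goes to,
# so messages with the same key keep their relative order.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, key="vendor-1", value={"trip_id": 42, "amount": 17.5})
producer.flush()

# Consumer: joins a consumer group; Kafka assigns partitions to the group
# members and tracks the committed offset per partition.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    group_id="ride-ingesters",
    auto_offset_reset="earliest",   # start from the oldest message if no offset is committed
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```

Run the producer once, then the consumer; if you restart the consumer with the same group_id, it continues from the committed offset, which is exactly the offset/partition bookkeeping described above.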
Data Lake (S3)
- What is a Data Lake?
- ELT vs. ETL (see the sketch after this list)
- Alternatives for each component (S3/HDFS/MinIO, Redshift, Snowflake, etc.)
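A rough way to see ELT vs. ETL in code: in ETL you transform in the pipeline and load only the curated result; in ELT you land the raw data first and transform inside the warehouse. The sketch below assumes a local trips.csv with tip_amount and total_amount columns, and uses DuckDB as a stand-in for a warehouse such as Snowflake or Redshift.

```python
import duckdb
import pandas as pd

# --- ETL: transform in the pipeline, load only the curated result ---
df = pd.read_csv("trips.csv")                      # assumed raw extract
df["tip_pct"] = df["tip_amount"] / df["total_amount"]
df.to_parquet("trips_curated.parquet")             # what lands downstream is already shaped

# --- ELT: land the raw data first, transform later inside the warehouse ---
pd.read_csv("trips.csv").to_parquet("trips_raw.parquet")    # load raw data as-is
warehouse = duckdb.connect("warehouse.duckdb")              # DuckDB standing in for Snowflake/Redshift
warehouse.execute("""
    CREATE OR REPLACE TABLE trips_curated AS
    SELECT *, tip_amount / total_amount AS tip_pct
    FROM read_parquet('trips_raw.parquet')
""")
```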
Introduction to Workflow orchestration
- What is an Orchestration Pipeline?
- What is a DAG? (a minimal example follows this list)
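Here is a minimal sketch of a DAG in Airflow, just to show the idea: tasks are the nodes, and the `>>` operator declares the directed edges between them. The dag_id, schedule, and callables are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("download the data")


def load():
    print("upload it to the lake")


# A DAG is plain Python: tasks are the nodes, ">>" declares the directed edges.
with DAG(
    dag_id="example_ingestion",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task          # extract must finish before load starts
```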
Setting up Airflow locally
- Setting up Airflow with Docker-Compose
- More information is in the airflow folder. If you want to run a lighter version of Airflow with fewer services, check this video. It’s optional.
Ingesting data to S3 with Airflow
- Extraction: Download and unpack the data
- Pre-processing: Convert the raw data to Parquet
- Upload the Parquet files from local storage to S3 (see the DAG sketch below)
- Create an external table in Postgres >> Snowflake
Create a new connection, call it for example minio_s3, set its type to Amazon AWS, and set only the Extra field to:
{
"aws_access_key_id": "your MinIO username",
"aws_secret_access_key": "your MinIO password",
"endpoint_url": "http://localhost:9000",
"region_name": "us-east-1"
}
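Putting the steps above together, here is a sketch of a TaskFlow-style DAG that downloads a file, converts it to Parquet, and uploads it to MinIO through the minio_s3 connection defined above. The source URL, file paths, and bucket name are assumptions; adjust them to your dataset, and make sure the bucket already exists in MinIO.

```python
import urllib.request
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

URL = "https://example.com/yellow_tripdata_2024-01.csv"    # illustrative source URL
CSV_FILE = "/tmp/yellow_tripdata_2024-01.csv"
PARQUET_FILE = "/tmp/yellow_tripdata_2024-01.parquet"
BUCKET = "nyc-taxi"                                        # assumed bucket, create it in MinIO first


@dag(start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False)
def ingest_to_s3():

    @task
    def extract() -> str:
        # Download the raw file to local disk
        urllib.request.urlretrieve(URL, CSV_FILE)
        return CSV_FILE

    @task
    def to_parquet(csv_path: str) -> str:
        # Convert the raw CSV to Parquet (needs pyarrow or fastparquet installed)
        pd.read_csv(csv_path).to_parquet(PARQUET_FILE)
        return PARQUET_FILE

    @task
    def upload(parquet_path: str) -> None:
        # S3Hook talks to MinIO because the minio_s3 connection overrides endpoint_url
        hook = S3Hook(aws_conn_id="minio_s3")
        hook.load_file(
            filename=parquet_path,
            key="raw/yellow_tripdata_2024-01.parquet",
            bucket_name=BUCKET,
            replace=True,
        )

    upload(to_parquet(extract()))


ingest_to_s3()
```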
Ingesting data to Local Postgres with Airflow
- Turning the Postgres ingestion script into an Airflow DAG (a sketch of the loading step follows below)
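The loading step of that script can stay plain pandas + SQLAlchemy; the DAG only needs to wrap it in a task. A minimal sketch, assuming the Postgres credentials from a typical local docker-compose setup and the Parquet file produced earlier:

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed local Postgres credentials; match them to your docker-compose setup
engine = create_engine("postgresql://airflow:airflow@localhost:5432/nyc_taxi")

df = pd.read_parquet("/tmp/yellow_tripdata_2024-01.parquet")

# Write in chunks so a large file does not sit in one huge transaction
df.to_sql(
    "yellow_tripdata",
    engine,
    if_exists="append",   # use "replace" on the first run if you want a clean table
    index=False,
    chunksize=100_000,
)
```

Inside the DAG this becomes the python_callable of a task, and the credentials can come from an Airflow connection instead of a hard-coded URI.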
Transfer service
Moving files between platforms
You will need an AWS account for this. This section is optional.
What makes you better?
- Sample End-to-End data pipeline
- Practice scheduled jobs and Change Data Capture.
- How system integration is done: SFTP, ODBC, authentication, authorization (see the SFTP sketch after this list)
- System configuration
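As a small taste of the system-integration point above, here is a sketch of an SFTP pull using paramiko. The host, credentials, and remote directory are made up; in a real integration you would use key-based authentication and keep secrets in a connection/secret manager rather than in code.

```python
import paramiko

HOST = "sftp.partner.example.com"   # illustrative partner host
PORT = 22
USERNAME = "ingest_user"            # assumed credentials; prefer key-based auth in practice
PASSWORD = "change-me"

# Open an SFTP session and pull the partner's daily export files
transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    for name in sftp.listdir("/outbound"):
        sftp.get(f"/outbound/{name}", f"/tmp/{name}")   # download to local staging
finally:
    sftp.close()
    transport.close()
```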