Week 2 - Data Ingestion

Data ingestion is a core data engineering skill, and I highly recommend you focus on it. You can search this handbook along with Google, GPT, and Gemini for the background knowledge; it's free.

Watch this video: https://youtu.be/ITRVg7cfAwY

If you cannot download the files, I put the sample code here: GitHub repository

Source Systems

  • Source systems: Files, Change Data Capture (CDC), OLAP & OLTP, and Logs
  • Message broker models
  • Basic overview of RabbitMQ
  • Fundamentals of Kafka: architecture, offsets, partitions, and message-assignment mechanisms (see the consumer sketch after this list)
  • How to set up Kafka properly: idle consumers, rebalancing, etc.
  • CDC methods and Debezium
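
For the Kafka bullets above, here is a minimal consumer sketch. It assumes the kafka-python package and a broker at localhost:9092; the topic name and group id are placeholders, not part of the course material:

    # Minimal Kafka consumer sketch (assumes kafka-python and a local broker).
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "rides",                          # topic; its partitions are split across the group
        bootstrap_servers="localhost:9092",
        group_id="rides-consumer-group",  # consumers in one group share the partitions
        auto_offset_reset="earliest",     # where to start if no committed offset exists
        enable_auto_commit=True,          # periodically commit consumed offsets
        value_deserializer=lambda v: v.decode("utf-8"),
    )

    for message in consumer:
        # Each record carries its partition and offset, which is how progress is tracked.
        print(f"partition={message.partition} offset={message.offset} value={message.value}")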

Data Lake (S3)

  • What is a Data Lake
  • ELT vs. ETL
  • Alternative components (S3/HDFS/MinIO, Redshift, Snowflake, etc.); a small upload sketch follows this list
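
To make the ELT idea concrete, here is a minimal sketch of landing a raw file in an S3-compatible data lake. It assumes boto3 and a local MinIO endpoint at http://localhost:9000; the bucket, key, and credentials are placeholders:

    # Minimal sketch: land a raw file in an S3-compatible data lake (MinIO or AWS S3).
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",        # drop this line for real AWS S3
        aws_access_key_id="your MinIO username",
        aws_secret_access_key="your MinIO password",
        region_name="us-east-1",
    )

    s3.create_bucket(Bucket="raw")                   # create the bucket once; fails if it already exists
    s3.upload_file("data/yellow_tripdata.csv", "raw", "yellow_tripdata.csv")  # ELT: load raw first, transform later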

Introduction to Workflow orchestration

  • What is an Orchestration Pipeline?
  • What is a DAG? (a minimal example follows this list)
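
To show what a DAG actually is (tasks plus dependencies on a schedule), here is a minimal Airflow sketch. It assumes Airflow 2.x; the DAG id, schedule, and task bodies are placeholders:

    # Minimal Airflow DAG sketch: two tasks with an explicit dependency.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("download the data here")

    def load():
        print("load the data here")

    with DAG(
        dag_id="example_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",        # the whole graph re-runs on this schedule
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task          # the DAG edge: extract must finish before load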

Setting up Airflow locally

  • Setting up Airflow with Docker-Compose
  • More information is in the airflow folder. If you want to run a lighter version of Airflow with fewer services, check this video (optional).

Ingesting data to S3 with Airflow

  • Extraction: Download and unpack the data
  • Pre-processing: Convert the raw data to Parquet
  • Upload the Parquet files from local to S3
  • Create an external table in Postgres >> Snowflake (the task callables are sketched after this list)
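
The steps above could look roughly like the task callables below. This is a sketch, not the course's exact code: it assumes pandas (with pyarrow), requests, the Amazon provider's S3Hook, the minio_s3 connection from the note below, and placeholder URL, path, and bucket names:

    # Sketch of the ingestion steps as Airflow task callables (wire them up with PythonOperator).
    import pandas as pd
    import requests
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    DATA_URL = "https://example.com/yellow_tripdata_2024-01.csv"   # placeholder source
    CSV_PATH = "/tmp/yellow_tripdata_2024-01.csv"
    PARQUET_PATH = "/tmp/yellow_tripdata_2024-01.parquet"

    def extract():
        """Download the raw file to local disk."""
        response = requests.get(DATA_URL, timeout=60)
        response.raise_for_status()
        with open(CSV_PATH, "wb") as f:
            f.write(response.content)

    def to_parquet():
        """Convert the raw CSV to Parquet (needs pyarrow or fastparquet installed)."""
        df = pd.read_csv(CSV_PATH)
        df.to_parquet(PARQUET_PATH, index=False)

    def upload_to_s3():
        """Upload the Parquet file through the minio_s3 connection described in the note below."""
        hook = S3Hook(aws_conn_id="minio_s3")
        hook.load_file(
            filename=PARQUET_PATH,
            key="raw/yellow_tripdata_2024-01.parquet",
            bucket_name="datalake",
            replace=True,
        )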

Note

Create a new connection, call it for example minio_s3, set its type to Amazon Web Services, and fill in only the Extra field with:

    {
        "aws_access_key_id": "your MinIO username",
        "aws_secret_access_key": "your MinIO password",
        "endpoint_url": "http://localhost:9000",
        "region_name": "us-east-1"
    }
Please note: if you're running Airflow in a KinD cluster and MinIO in Docker on the same host, you need to use host.docker.internal instead of localhost.
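
A quick way to check the connection from inside a task is to list the bucket with the S3Hook; this is a sketch, and the bucket name "datalake" is a placeholder:

    # Sanity-check the minio_s3 connection from inside an Airflow task.
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    def check_minio_connection():
        hook = S3Hook(aws_conn_id="minio_s3")
        print("bucket exists:", hook.check_for_bucket("datalake"))
        print("keys:", hook.list_keys(bucket_name="datalake", prefix="raw/"))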

Ingesting data to Local Postgres with Airflow

  • Converting the ingestion script that loads data into Postgres into an Airflow DAG (see the sketch after this list)
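
The load step inside the DAG could look roughly like the sketch below. It assumes the Postgres provider, a connection id of postgres_local, and the Parquet file from the previous section; all names are placeholders:

    # Sketch: load the Parquet output into a local Postgres table from an Airflow task.
    import pandas as pd
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def load_to_postgres():
        hook = PostgresHook(postgres_conn_id="postgres_local")
        engine = hook.get_sqlalchemy_engine()
        df = pd.read_parquet("/tmp/yellow_tripdata_2024-01.parquet")
        # pandas creates the table if it does not exist; append on later runs
        df.to_sql("yellow_tripdata", con=engine, if_exists="append", index=False)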

Transfer service

Moving files between platforms

You will need an AWS account for this. This section is optional.
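
As one small illustration of moving files between platforms, here is a sketch of a server-side copy between two S3 buckets with boto3; the bucket and key names are placeholders, and a managed transfer service would be configured separately rather than in code like this:

    # Sketch: server-side copy of an object between two S3 buckets with boto3.
    import boto3

    s3 = boto3.client("s3")

    s3.copy_object(
        CopySource={"Bucket": "source-bucket", "Key": "raw/file.parquet"},
        Bucket="destination-bucket",
        Key="raw/file.parquet",
    )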

What makes you better?

  • Sample end-to-end data pipeline
  • Practice scheduled jobs and Change Data Capture
  • How system integration is done: SFTP, ODBC, authentication, authorization (see the SFTP sketch after this list)
  • System configuration
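
For the SFTP point above, here is a minimal sketch of pulling a file from an SFTP server with paramiko; the host, credentials, and paths are placeholders, and key-based authentication is usually preferable in practice:

    # Sketch: download a file from an SFTP server with paramiko.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # for illustration only
    client.connect("sftp.example.com", username="user", password="password")

    sftp = client.open_sftp()
    sftp.get("/remote/exports/data.csv", "/tmp/data.csv")  # remote path -> local path
    sftp.close()
    client.close()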