Week 2 - Data Ingestion

Data ingestion is a core data engineering skill, and I highly recommend you focus on it. You can search this handbook along with Google, GPT, and Gemini for the background knowledge; it's free.

Watch this video: https://youtu.be/ITRVg7cfAwY

If you cannot download the files, I put the sample code here: GitHub repository

Source Systems

  • Source systems: Files, Change Data Capture (CDC), OLAP & OLTP, and Logs
  • Message broker models
  • Basic overview of RabbitMQ
  • Fundamentals of Kafka: architecture, offsets, partitions, and message-assignment mechanisms (see the consumer sketch after this list)
  • How to set up Kafka properly: idle consumers, rebalancing, etc.
  • CDC methods and Debezium
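
For the Kafka bullets above, here is a minimal consumer sketch. It assumes the kafka-python package and a broker at localhost:9092; the topic name and group id are placeholders, not part of the course material:

    # Minimal Kafka consumer sketch (assumes kafka-python and a local broker).
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "rides",                          # topic; its partitions are split across the group
        bootstrap_servers="localhost:9092",
        group_id="rides-consumer-group",  # consumers in one group share the partitions
        auto_offset_reset="earliest",     # where to start if no committed offset exists
        enable_auto_commit=True,          # periodically commit consumed offsets
        value_deserializer=lambda v: v.decode("utf-8"),
    )

    for message in consumer:
        # Each record carries its partition and offset, which is how progress is tracked.
        print(f"partition={message.partition} offset={message.offset} value={message.value}")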

Data Lake (S3)

  • What is a Data Lake
  • ELT vs. ETL
  • Alternative components (S3/HDFS/MinIO, Redshift, Snowflake, etc.); a small upload sketch follows this list
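
To make the ELT idea concrete, here is a minimal sketch of landing a raw file in an S3-compatible data lake. It assumes boto3 and a local MinIO endpoint at http://localhost:9000; the bucket, key, and credentials are placeholders:

    # Minimal sketch: land a raw file in an S3-compatible data lake (MinIO or AWS S3).
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",        # drop this line for real AWS S3
        aws_access_key_id="your MinIO username",
        aws_secret_access_key="your MinIO password",
        region_name="us-east-1",
    )

    s3.create_bucket(Bucket="raw")                   # create the bucket once; fails if it already exists
    s3.upload_file("data/yellow_tripdata.csv", "raw", "yellow_tripdata.csv")  # ELT: load raw first, transform later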

Introduction to Workflow orchestration

  • What is an Orchestration Pipeline?
  • What is a DAG? (a minimal example follows this list)
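
To show what a DAG actually is (tasks plus dependencies on a schedule), here is a minimal Airflow sketch. It assumes Airflow 2.x; the DAG id, schedule, and task bodies are placeholders:

    # Minimal Airflow DAG sketch: two tasks with an explicit dependency.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("download the data here")

    def load():
        print("load the data here")

    with DAG(
        dag_id="example_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",        # the whole graph re-runs on this schedule
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> load_task          # the DAG edge: extract must finish before load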

Setting up Airflow locally

  • Setting up Airflow with Docker-Compose
  • More information is in the airflow folder. If you want to run a lighter version of Airflow with fewer services, check this video (optional).

Ingesting data to S3 with Airflow

  • Extraction: Download and unpack the data
  • Pre-processing: Convert the raw data to Parquet
  • Upload the Parquet files from local to S3
  • Create an external table in Postgres >> Snowflake (the task callables are sketched after this list)
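
The steps above could look roughly like the task callables below. This is a sketch, not the course's exact code: it assumes pandas (with pyarrow), requests, the Amazon provider's S3Hook, the minio_s3 connection from the note below, and placeholder URL, path, and bucket names:

    # Sketch of the ingestion steps as Airflow task callables (wire them up with PythonOperator).
    import pandas as pd
    import requests
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    DATA_URL = "https://example.com/yellow_tripdata_2024-01.csv"   # placeholder source
    CSV_PATH = "/tmp/yellow_tripdata_2024-01.csv"
    PARQUET_PATH = "/tmp/yellow_tripdata_2024-01.parquet"

    def extract():
        """Download the raw file to local disk."""
        response = requests.get(DATA_URL, timeout=60)
        response.raise_for_status()
        with open(CSV_PATH, "wb") as f:
            f.write(response.content)

    def to_parquet():
        """Convert the raw CSV to Parquet (needs pyarrow or fastparquet installed)."""
        df = pd.read_csv(CSV_PATH)
        df.to_parquet(PARQUET_PATH, index=False)

    def upload_to_s3():
        """Upload the Parquet file through the minio_s3 connection described in the note below."""
        hook = S3Hook(aws_conn_id="minio_s3")
        hook.load_file(
            filename=PARQUET_PATH,
            key="raw/yellow_tripdata_2024-01.parquet",
            bucket_name="datalake",
            replace=True,
        )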

Note

Create a new connection, call it for example minio_s3, set its type to Amazon Web Services, and fill in only the Extra field with:

    {
        "aws_access_key_id": "your MinIO username",
        "aws_secret_access_key": "your MinIO password",
        "endpoint_url": "http://localhost:9000",
        "region_name": "us-east-1"
    }
Please note: if you're running Airflow in a KinD cluster and MinIO in Docker on the same host, you need to use host.docker.internal instead of localhost.
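
A quick way to check the connection from inside a task is to list the bucket with the S3Hook; this is a sketch, and the bucket name "datalake" is a placeholder:

    # Sanity-check the minio_s3 connection from inside an Airflow task.
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    def check_minio_connection():
        hook = S3Hook(aws_conn_id="minio_s3")
        print("bucket exists:", hook.check_for_bucket("datalake"))
        print("keys:", hook.list_keys(bucket_name="datalake", prefix="raw/"))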

Ingesting data to Local Postgres with Airflow

  • Converting the ingestion script that loads data into Postgres into an Airflow DAG (see the sketch after this list)
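
The load step inside the DAG could look roughly like the sketch below. It assumes the Postgres provider, a connection id of postgres_local, and the Parquet file from the previous section; all names are placeholders:

    # Sketch: load the Parquet output into a local Postgres table from an Airflow task.
    import pandas as pd
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def load_to_postgres():
        hook = PostgresHook(postgres_conn_id="postgres_local")
        engine = hook.get_sqlalchemy_engine()
        df = pd.read_parquet("/tmp/yellow_tripdata_2024-01.parquet")
        # pandas creates the table if it does not exist; append on later runs
        df.to_sql("yellow_tripdata", con=engine, if_exists="append", index=False)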

Transfer service

Moving files between platforms

You will need an AWS account for this. This section is optional.
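
As one small illustration of moving files between platforms, here is a sketch of a server-side copy between two S3 buckets with boto3; the bucket and key names are placeholders, and a managed transfer service would be configured separately rather than in code like this:

    # Sketch: server-side copy of an object between two S3 buckets with boto3.
    import boto3

    s3 = boto3.client("s3")

    s3.copy_object(
        CopySource={"Bucket": "source-bucket", "Key": "raw/file.parquet"},
        Bucket="destination-bucket",
        Key="raw/file.parquet",
    )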

What makes you better?

  • Sample end-to-end data pipeline
  • Practice scheduled jobs and Change Data Capture
  • How system integration is done: SFTP, ODBC, authentication, authorization (see the SFTP sketch after this list)
  • System configuration
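
For the SFTP point above, here is a minimal sketch of pulling a file from an SFTP server with paramiko; the host, credentials, and paths are placeholders, and key-based authentication is usually preferable in practice:

    # Sketch: download a file from an SFTP server with paramiko.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # for illustration only
    client.connect("sftp.example.com", username="user", password="password")

    sftp = client.open_sftp()
    sftp.get("/remote/exports/data.csv", "/tmp/data.csv")  # remote path -> local path
    sftp.close()
    client.close()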