Airflow Installation for Data Engineers (Part 2)

Setup (No-frills)

Airflow Setup

![No-frills Airflow installation overview](NofrillAirflowInstall.png)

  1. Create a new sub-directory called airflow in your project dir (such as the one we’re currently in)
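
For example, from the project root (a minimal sketch; adjust the path to your own layout):

```shell
mkdir airflow && cd airflow
```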

  2. Set the Airflow user:

On Linux, the quick-start needs to know your host user id and needs the group id set to 0. Otherwise the files created in dags, logs and plugins will be owned by the root user. Make sure to configure them for Docker Compose:

```shell
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" >> .env
```

On Windows you will probably also need to do this. If you use MINGW/Git Bash, execute the same command.

To get rid of the warning (“AIRFLOW_UID is not set”), you can create a .env file with this content:

```shell
AIRFLOW_UID=50000
```

  3. Docker Build:

When you want to run Airflow locally, you might want to use an extended image containing some additional dependencies, for example new Python packages or Airflow providers upgraded to a later version.

Create a Dockerfile that uses a recent Airflow release, such as apache/airflow:2.9.1 (the version used in the final Dockerfile below), as the base image,

And customize this Dockerfile by:

  • Integrating requirements.txt to install additional libraries via pip install (see the sketch below)
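
As a rough illustration of that customization, the snippet below creates a requirements.txt and test-builds the extended image by hand. The package names and the airflow-nofrills tag are assumptions for the example, not part of the official setup:

```shell
# Illustrative requirements.txt; list whatever your DAGs actually need
cat > requirements.txt <<'EOF'
apache-airflow-providers-google
pandas
pyarrow
EOF

# Optional sanity check: build the extended image once by hand
# (docker-compose will build it anyway via "build: .")
docker build -t airflow-nofrills .
```
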
  4. Copy docker-compose-nofrills.yaml, .env_example & entrypoint.sh from this repo.

The changes from the official setup are:

  • Removal of the redis (queue), worker, triggerer, flower & airflow-init services, and a switch from CeleryExecutor (multi-node) mode to LocalExecutor (single-node) mode
  • Inclusion of .env for better parametrization & flexibility
  • Inclusion of a simple entrypoint.sh in the webserver container, responsible for initializing the database and creating the admin login user
  • Updated Dockerfile to grant execute permission on scripts/entrypoint.sh
  5. .env:
  • Rebuild your .env file by adding the following items (but make sure your AIRFLOW_UID line remains):

??? tip "Configure .env"

    ```shell title="Environment Config"
    POSTGRES_USER=airflow
    POSTGRES_PASSWORD=airflow
    POSTGRES_DB=airflow

    AIRFLOW__CORE__EXECUTOR=LocalExecutor
    AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10

    AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
    AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow
    AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow

    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
    AIRFLOW__CORE__LOAD_EXAMPLES=False
    ```
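
If you want to verify what Docker Compose actually resolves from this .env, the standard config subcommand prints the fully rendered configuration. Treat this as an optional sanity check, since how env_file values are interpolated can vary between Compose versions:

```shell
# Show the compose configuration with variables from .env substituted
docker-compose -f docker-compose-nofrills.yaml config
```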

Here’s how the final versions of your Dockerfile, docker-compose-nofrills and entrypoint should look.

Note: Move entrypoint.sh into the scripts folder:

```shell
mv {URL}/entrypoint.sh ./scripts/entrypoint.sh
```
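
Because docker-compose bind-mounts ./scripts into the containers, the permissions the containers see are the host file's permissions, so it is usually worth marking the script executable on the host as well (a small assumed step, mirroring what the Dockerfile does for the baked-in copy):

```shell
# Make the entrypoint executable on the host; the bind mount carries this into the containers
chmod +x ./scripts/entrypoint.sh
```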

Execution

By following the original setup, you can create and trigger your own jobs.

Pull and Start the Containers

```shell
docker-compose -f docker-compose-nofrills.yaml up -d
```

After launching the no-frills Airflow, you will see that only the fundamental services are installed:


```
[+] Running 3/3
 ✔ Container airflow-postgres-1   Started    1.2s
 ✔ Container airflow-scheduler-1  Started    1.6s
 ✔ Container airflow-webserver-1  Started
```

List all running containers with docker-compose ps (remember the -f flag, since the compose file has a non-default name).
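
A few optional checks once the containers are up. The admin / admin credentials are the ones created by scripts/entrypoint.sh, and 8080 is the port published in the compose file:

```shell
# List the running services
docker-compose -f docker-compose-nofrills.yaml ps

# Quick liveness probe of the webserver; the UI itself is at http://localhost:8080 (login admin / admin)
curl -s http://localhost:8080/health

# When you are done, stop and remove the containers
docker-compose -f docker-compose-nofrills.yaml down
```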

Problems

If you run into problems, submit an issue to the channel, or to the Airflow GitHub repository if relevant.
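
When reporting an issue, it usually helps to attach the container logs. Assuming the compose file name used above, they can be collected with:

```shell
# Tail the most recent scheduler and webserver logs
docker-compose -f docker-compose-nofrills.yaml logs --tail=100 scheduler webserver
```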

You can get the custom docker-compose-nofrills.yaml here

docker-compose-nofrills.yaml

Base Image

```dockerfile
FROM apache/airflow:2.9.1

ENV AIRFLOW_HOME=/opt/airflow
# Give $AIRFLOW_UID a default so the final USER instruction always resolves;
# override at build time with --build-arg AIRFLOW_UID=$(id -u) if needed
ARG AIRFLOW_UID=50000

# Install OS-level tools as root
USER root
RUN apt-get update -qq && apt-get install vim -qqq unzip -qqq
# git gcc g++ -qqq

COPY requirements.txt .

# Switch back to the airflow user before installing Python dependencies
# (pip should not run as root in this image)
USER airflow
RUN pip install --no-cache-dir -r requirements.txt

# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

USER root

WORKDIR $AIRFLOW_HOME

# Bake the scripts (including entrypoint.sh) into the image and make the entrypoint executable
COPY scripts scripts
RUN chmod +x scripts/entrypoint.sh

USER $AIRFLOW_UID
```

Docker Compose

```yaml
version: '3'
services:
    postgres:
        image: postgres:13
        env_file:
            - .env
        volumes:
            - postgres-db-volume:/var/lib/postgresql/data
        healthcheck:
            test: ["CMD", "pg_isready", "-U", "airflow"]
            interval: 5s
            retries: 5
        restart: always

    scheduler:
        build: .
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ./scripts:/opt/airflow/scripts

    webserver:
        build: .
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ./scripts:/opt/airflow/scripts
        user: "${AIRFLOW_UID:-50000}:0"
        ports:
            - "8080:8080"
        healthcheck:
            test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ]
            interval: 30s
            timeout: 30s
            retries: 3

volumes:
    postgres-db-volume:
```
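
If the scheduler or webserver keeps restarting, the same pg_isready probe used in the healthcheck above can be run by hand to confirm the metadata database is accepting connections (an optional check):

```shell
# Ask the postgres container whether it is ready
docker-compose -f docker-compose-nofrills.yaml exec postgres pg_isready -U airflow
```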

Entrypoint

To execute the Airflow commands, scripts/entrypoint.sh initializes the metadata database, creates the admin login user and then starts the webserver:

```shell
#!/usr/bin/env bash

# Create the metadata database schema on first run and apply migrations on upgrades
airflow db upgrade
airflow db init

# Create the admin login user (username: admin, password: admin)
airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
# "$_AIRFLOW_WWW_USER_USERNAME" -p "$_AIRFLOW_WWW_USER_PASSWORD"

# Run the webserver in the foreground so the container stays alive
airflow webserver
```
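
Once the stack is up, an optional way to confirm the entrypoint did its job is to list the users it created from inside the webserver container:

```shell
# Should show the admin user created by entrypoint.sh
docker-compose -f docker-compose-nofrills.yaml exec webserver airflow users list
```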