Docker and SQL Setup

Notes I used for preparing the videos: link

Preparation

Make sure Docker is installed on your machine.

Create a working folder week_1_basic_n_setup

Create this docker-compose.yaml file:

version: '3'
services:
  pgdatabase:
    image: postgres:13
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root
      POSTGRES_DB: ny_taxi
    volumes:
      - "./postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8080:80"
    depends_on:
      - pgdatabase
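
Note: with docker-compose, the service name pgdatabase doubles as a hostname on the compose network, so pgAdmin can reach Postgres at host pgdatabase instead of localhost.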

Download the data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Download and open pg-test-connection.ipynb

# %%
import pandas as pd

# %% [markdown]
# ```bash
# pip install sqlalchemy psycopg2-binary
# ```

# %%
from sqlalchemy import create_engine

# %%
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')


# %% [markdown]
# If you cannot connect to ny_taxi, use `postgres` database instead
#
# ```python
# engine = create_engine('postgresql://root:root@localhost:5432/postgres')
# ```

# %%
engine.connect()

# %%
query = """
SELECT 1 as number;
"""

pd.read_sql(query, con=engine)

# %%
query = """
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
    schemaname != 'information_schema';
"""

pd.read_sql(query, con=engine)

# %%
%%bash
ls


# %%
import pandas as pd

# Read the Parquet file
df = pd.read_parquet('../data/yellow_tripdata_2024-01.parquet')

df.head()
# Save as CSV file
df.to_csv('../data/yellow_tripdata_2024-01.csv', index=False)

# %%
df = pd.read_csv('../data/yellow_tripdata_2024-01.csv', nrows=100)

# %%
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

# %%
df.to_sql(name='yellow_taxi_trips', con=engine, index=False)
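
# %% [markdown]
# Optional pattern (not required here): to create the table schema without
# inserting any rows, e.g. before a chunked load, write an empty frame first:
#
# ```python
# df.head(0).to_sql(name='yellow_taxi_trips', con=engine, if_exists='replace', index=False)
# ```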

# %%
query = """
SELECT * FROM yellow_taxi_trips LIMIT 10
"""

pd.read_sql(query, con=engine)
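
# %%
# Sanity check: count the rows just inserted (should match the 100-row sample)
query = """
SELECT COUNT(1) AS n FROM yellow_taxi_trips;
"""

pd.read_sql(query, con=engine)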

# %% [markdown]
# ```sql
# SELECT *
# FROM pg_catalog.pg_tables
# WHERE schemaname != 'pg_catalog' AND
#     schemaname != 'information_schema';
# ```
#
# Source: https://www.postgresqltutorial.com/postgresql-show-tables/

Running Postgres with Docker

Access Docker

Pulling and Running Postgres Container

Bash command for starting the Postgres container (adjust the -v host path to a folder on your machine):

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v /Users/longbuivan/Documents/Projects/de-handbook/docs/datacamping/week_1_basics_and_infrastructure/2_docker_sql/postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
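
Here -e sets environment variables inside the container, -v mounts a host folder for the database files so the data survives container restarts, and -p 5432:5432 maps the Postgres port to the host.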

Issue: If you see the following error:

docker: Error response from daemon: invalid mode: \Program Files\Git\var\lib\postgresql\data.
See 'docker run --help'.

Solution: Change the mount path (note the volume flag is -v, not -p). Replace it with the following:

-v /datacamping/...:/var/lib/postgresql/data

Issue: If postgres_data is empty after running the container, try the following:

Solution

  • Delete the folder and run Docker again (Docker will re-create the folder)
  • Adjust the permissions of the folder by running sudo chmod a+rwx postgres_data

Prepare Data

Download NY Trips Dataset:

  • https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  • https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Note: To convert the data from Parquet format to CSV, use this sample pandas snippet:

import pandas as pd

# Read the Parquet file
df = pd.read_parquet('../data/yellow_tripdata_2024-01.parquet')

df.head()
# Save as CSV file
df.to_csv('../data/yellow_tripdata_2024-01.csv', index=False)
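
Alternatively, you can skip the CSV round-trip and load the Parquet file straight into Postgres with pandas. A minimal sketch, assuming the same engine and file paths used above:

import pandas as pd
from sqlalchemy import create_engine

# Same connection string as in the notebook above
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# Read the Parquet file and write it directly to Postgres
df = pd.read_parquet('../data/yellow_tripdata_2024-01.parquet')
df.to_sql(name='yellow_taxi_trips', con=engine, if_exists='replace', index=False)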

pgAdmin

Running pgAdmin

  docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    dpage/pgadmin4
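
Note: when Postgres and pgAdmin run as separate containers like this, they cannot reach each other via localhost. Put them on a shared network first (docker network create pg-network), start both containers with --network=pg-network, and give the Postgres container a name (--name pgdatabase) that pgAdmin can use as the host.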

Test Connection
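
Open localhost:8080, log in with the email and password set above, and register a new server. Use pgdatabase as the host name (the compose service / container name), or localhost if Postgres runs directly on your machine, with user root, password root, and database ny_taxi.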

Ingest Data

  • Running locally

Download this load_data_2_postgres.py file and run the command with the following arguments:


#!/usr/bin/env python
# coding: utf-8

import os
import argparse

from time import time
import pandas as pd
from sqlalchemy import create_engine

def load_data_to_postgres(params):
    # Connect to PostgreSQL database
    local = params.local
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    csv_name = params.csv_file

    if local == 'True':
        # csv_name is a URL in this case: download it to a local working file first
        url = csv_name
        csv_name = 'output.csv'
        os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    # Load data from the CSV file in chunks

    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000, index_col=False)

    # Process each chunk
    for df in df_iter:

        t_start = time()
        # Convert columns to datetime
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

        # Load chunk into PostgreSQL database
        df.to_sql(name=table_name, con=engine, if_exists='append', index=False)  # Exclude the "index" column

        t_end = time()

        print('inserted another chunk, took %.3f second' % (t_end - t_start))


    engine.dispose()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Ingest CSV data to Postgres')

    parser.add_argument('--local', required=False, help='Check if running locally', default='False')
    parser.add_argument('--user', required=True, help='user name for postgres')
    parser.add_argument('--password', required=True, help='password for postgres')
    parser.add_argument('--host', required=True, help='host for postgres')
    parser.add_argument('--port', required=True, help='port for postgres')
    parser.add_argument('--db', required=True, help='database name for postgres')
    parser.add_argument('--table_name', required=True, help='name of the table where we will write the results to')
    parser.add_argument('--csv_file', required=True, help='path or URL of the CSV file')

    args = parser.parse_args()

    load_data_to_postgres(args)

To understand the arguments: --user, --password, --host, --port, and --db are the Postgres connection details; --table_name is the destination table; --csv_file is the path or URL of the CSV file; and --local (default 'False') controls whether the file is downloaded with wget first.

Execute the Python script with arguments (here ${URL} stands for the path or HTTP link to the CSV file):

python3 load_data_2_postgres.py \
  --user=root \
  --password=root \
  --host=localhost \
  --port=5432 \
  --db=ny_taxi \
  --table_name=yellow_taxi_trips \
  --csv_file=${URL}

Result: Data inserted into the table

inserted another chunk, took 6.863 second
inserted another chunk, took 6.878 second
inserted another chunk, took 6.534 second
inserted another chunk, took 6.806 second
inserted another chunk, took 6.670 second
inserted another chunk, took 6.743 second
  • Run the script with Docker

Build the image (a minimal Dockerfile sketch is given after the run command below):

docker build -t ny_taxi_ingest:v001 .

Then run the container:
docker run -it \
  ny_taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=modest_panini.orb.local \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --csv_file=${URL}
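
The build step above assumes a Dockerfile sitting next to the script. A minimal sketch (base image and package list are assumptions, not from the original notes):

FROM python:3.9

# wget is needed when the script downloads the CSV itself
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2-binary

WORKDIR /app
COPY load_data_2_postgres.py load_data_2_postgres.py

ENTRYPOINT ["python", "load_data_2_postgres.py"]

Note that --host cannot be localhost here: inside the container, localhost is the container itself, so use a hostname that resolves to the machine running Postgres (as in the example above).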

Result: Data inserted into the table

inserted another chunk, took 8.036 second
inserted another chunk, took 7.992 second
inserted another chunk, took 8.042 second

Run Postgres and pgAdmin together

Using the docker-compose.yaml file you downloaded above, run:

docker-compose up
# Use `docker-compose up -d` if you don't want to tail the logs

Access Postgres and pgAdmin via:

  • Postgres: localhost:5432 (no UI access)
  • pgAdmin: localhost:8080 (UI)

Run in detached mode:

docker-compose up -d

Shutting it down:

docker-compose down
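
Note: docker-compose down stops and removes the containers, but the data survives in the mounted postgres_data folder on the host.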

Note: to make the pgAdmin configuration persistent, create a folder data_pgadmin and change its permissions (the pgAdmin container runs as UID/GID 5050) via

sudo chown 5050:5050 data_pgadmin

and mount it to the /var/lib/pgadmin folder:

services:
  pgadmin:
    image: dpage/pgadmin4
    volumes:
      - ./data_pgadmin:/var/lib/pgadmin
    ...

Create analysis SQL queries

-- Query to get the total number of trips
SELECT COUNT(*) AS total_trips
FROM yellow_taxi_trips;

-- Query to get the average trip distance
SELECT AVG(trip_distance) AS avg_distance
FROM yellow_taxi_trips;

-- Query to get the maximum fare amount
SELECT MAX(fare_amount) AS max_fare
FROM yellow_taxi_trips;

-- Query to get the total revenue
SELECT SUM(fare_amount) AS total_revenue
FROM yellow_taxi_trips;
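
These queries can also be run from the notebook through pandas, reusing the engine defined earlier:

query = """
SELECT COUNT(*) AS total_trips
FROM yellow_taxi_trips;
"""

pd.read_sql(query, con=engine)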

Tear down

Keep everything running until you finish the homework. To shut down everything, use

docker stop $(docker ps -aq)

Here, docker ps -aq lists the IDs of all containers (running and stopped) and docker stop stops each of them.

Additional Scripts

-- Database: ny_taxi

-- DROP DATABASE IF EXISTS ny_taxi;

CREATE DATABASE ny_taxi
    WITH
    OWNER = root
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.utf8'
    LC_CTYPE = 'en_US.utf8'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;