Week 3 - Data Warehouse

Watch this video: https://youtu.be/8Lfoe_iYUAk?si=QKphrQfs8Dm6y6IU

Note

If you cannot download the files, I put the sample code here: Github repository

Data Architecture

  • Normalization and denormalization (normal forms)
  • Data Modeling: Relational Model –> Dimensional Model –> Super table –> Delta table
  • Data architectures: Data Fabric (an agile platform and management framework), Data Mesh (a decentralized, domain-focused architecture with a 360° view of data), and Data Vault (enterprise-scale, link-oriented modeling)
  • Data management systems: Data Warehouse, Data Mart, Data Lake, and Lakehouse
  • Data processing architectures: Lambda, Kappa, and Dataflow

Storage Pattern

  • Data Storage Systems: File Storage, Object Storage, Block Storage, Cache, HDFS and Streaming Storage
  • ACID, idempotency, and atomicity
  • Table Partitioning and Clustering
  • Data Lake table formats: Delta Lake, Iceberg, and Hudi (see the sketch after this list)
  • Advanced Schema Design, Best Practices - Video
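As a hands-on illustration of table partitioning and data lake table formats, here is a minimal sketch (not part of the official material) that writes a Delta Lake table partitioned by a date column. It assumes pyspark and delta-spark are installed; the path, table, and column names are made up.

```python
# Minimal sketch: write a partitioned Delta Lake table with PySpark.
# Assumes the `pyspark` and `delta-spark` packages are installed;
# path and column names are illustrative only.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("partitioning-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame(
    [(1, "2024-01-01", 9.99), (2, "2024-01-02", 19.99)],
    ["order_id", "order_date", "amount"],
)

# Partitioning by order_date means queries that filter on that column
# only read the matching partition directories (partition pruning).
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .save("/tmp/delta/orders"))
```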

Data Extraction and Loading

We have already covered this in the handbook, answering questions such as:

  • How do we extract structured data from raw data (regex patterns, etc.)? See the sketch after this list.
  • How do we extract delta data from sources (the delta is the subset of data that has changed)?
  • Do we need a buffer for loading?
  • etc…
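For the first question, a minimal regex-extraction sketch is shown below; the log line format and field names are invented for illustration.

```python
# Minimal sketch: pull structured fields out of raw text with a regex.
# The log format and field names here are made up for illustration.
import re

RAW_LINE = "2024-05-01 12:00:03 INFO user_id=42 action=checkout amount=19.99"

PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"user_id=(?P<user_id>\d+) "
    r"action=(?P<action>\w+) "
    r"amount=(?P<amount>[\d.]+)"
)

match = PATTERN.match(RAW_LINE)
if match:
    record = match.groupdict()
    record["user_id"] = int(record["user_id"])   # cast extracted strings
    record["amount"] = float(record["amount"])
    print(record)
```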

Depending on the data source and data availability requirements, the data will be loaded and refreshed as needed.

When ingesting data from a source, the type of load is determined by:

  • Full Load: reload the whole dataset; suited to small data (e.g., dimensions, master data)
  • Delta Load: load only what changed; suited to transactional data (delta extraction using incremental columns; see the sketch below)

On the target side, when loading data into the destination, try these engineering techniques:

  • Incremental: split the data into smaller chunks and load them progressively
  • Snapshot: take a snapshot of the data and merge it into the destination
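Here is a minimal sketch of a delta extraction with chunked loading, assuming a source table named orders with a modified_datetime incremental column; sqlite3 stands in for whatever source driver you actually use, and the names are illustrative.

```python
# Minimal sketch of a delta (incremental) extraction: pull only rows whose
# modified_datetime is newer than the last successful load, in small chunks.
# Connection details, table and column names are assumptions for illustration.
import sqlite3  # stand-in for any source database driver

CHUNK_SIZE = 10_000

def extract_delta(conn, last_watermark: str):
    """Yield changed rows in chunks, ordered by the incremental column."""
    cursor = conn.execute(
        """
        SELECT order_id, amount, modified_datetime
        FROM orders
        WHERE modified_datetime > ?
        ORDER BY modified_datetime
        """,
        (last_watermark,),
    )
    while True:
        rows = cursor.fetchmany(CHUNK_SIZE)
        if not rows:
            break
        yield rows

conn = sqlite3.connect("source.db")  # assumes an `orders` table exists here
new_watermark = "1970-01-01 00:00:00"
for chunk in extract_delta(conn, last_watermark="2024-05-01 00:00:00"):
    # load_chunk(chunk)  # write the chunk to the staging/target area
    new_watermark = max(new_watermark, max(row[2] for row in chunk))
# Persist new_watermark only after the load succeeds, so a failed run can be retried.
```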

Example

Keywords for loading data from sources and targets

Plus: which engineering columns should be added during data loading and transformation? A small sketch follows the list below.

  • From Source: which source the data came from
  • Incremental Columns: seq, auto_increment(col), modified column, created column
  • Historical Tracking Columns: version, modified_datetime, created_datetime
  • Descriptive Columns: dimension columns, natural key(col)
  • Identifier Columns: mostly primary and foreign keys (PK, FK)
  • Audit Columns: Record_ID for backfilling data and investigating data mismatches
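A small, illustrative sketch of attaching such engineering columns to each record during the load step; all column and source names here are assumptions.

```python
# Minimal sketch: attach engineering columns (source, historical tracking,
# audit) to each record during loading. Column names are illustrative.
import uuid
from datetime import datetime, timezone

def add_engineering_columns(record: dict, source_name: str, load_version: int) -> dict:
    now = datetime.now(timezone.utc).isoformat()
    return {
        **record,                        # descriptive + identifier columns from the source
        "from_source": source_name,      # where the row came from
        "version": load_version,         # historical tracking
        "modified_datetime": now,
        "created_datetime": record.get("created_datetime", now),
        "record_id": str(uuid.uuid4()),  # audit column for backfills / mismatch checks
    }

row = {"order_id": 1, "amount": 19.99}
print(add_engineering_columns(row, source_name="crm_orders", load_version=3))
```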

Distributed System

Key Focus for Distributed Systems

  • Understanding distributed computing principles and concepts

  • Familiarity with distributed system architectures (e.g., client-server, peer-to-peer)

  • Proficiency in distributed data management and synchronization techniques

  • Experience with distributed system scalability and fault tolerance strategies

  • Understanding of distributed system security and data privacy considerations

  • Proficiency in designing and implementing distributed system workflows and pipelines

  • Familiarity with distributed system monitoring and performance optimization techniques

  • Distributed System Overview - Slide

  • Data Shuffling - Video

  • Dealing with Data Issues - Video

Data Warehouse and Snowflake

Start with this Snowflake 101

Micro-partition and Data Clustering

  • In Snowflake, the concept of physical partitions (like in traditional databases) does not exist because Snowflake automatically divides the data into micro-partitions behind the scenes.

Important

Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion

  • Clustering information is maintained in order to avoid unnecessary scanning of micro-partitions during querying

(Image: ClusteringIllustration)

  • The table consists of 24 rows stored across 4 micro-partitions, with the rows divided equally between each micro-partition. Within each micro-partition, the data is sorted and stored by column, which enables Snowflake to perform the following actions for queries on the table:

    • First, prune micro-partitions that are not needed for the query.
    • Then, prune by column within the remaining micro-partitions.

Tip

Do not include unnecessary columns in a query; select only the columns you actually need.
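As a hedged sketch of how you might work with clustering from code (credentials, table, and column names are placeholders), the Snowflake Python connector can declare a clustering key and inspect clustering quality via SYSTEM$CLUSTERING_INFORMATION:

```python
# Minimal sketch: define a clustering key and inspect clustering quality
# with the Snowflake Python connector. Credentials, table and column names
# are placeholders, not the course's actual objects.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Declare a clustering key so Snowflake co-locates rows with similar
# ORDER_DATE values in the same micro-partitions.
cur.execute("ALTER TABLE ORDERS CLUSTER BY (ORDER_DATE)")

# Ask Snowflake how well the table is clustered on that column; the JSON
# result includes depth/overlap statistics for the micro-partitions.
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('ORDERS', '(ORDER_DATE)')")
print(cur.fetchone()[0])

cur.close()
conn.close()
```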

Partitioning and Clustering (Distribution)

(Image: ClusteringInformation)

If you are not using Snowflake:

  • You will need to manage table partitioning yourself
  • Plenty of open-source engines are available for communities and startups: ClickHouse, Doris, etc.
  • With Snowflake, by contrast, partition management is handled for you, so you can focus on analytics and data consumption

Info

ClickHouse Performance Check

9 rows in set. Elapsed: 0.061 sec. Processed 2.00 million rows, 18.00 MB (32.56 million rows/s., 293.06 MB/s.)

Code: set up ClickHouse in a few minutes
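If the linked code is unavailable, here is a minimal sketch against a local ClickHouse server (for example one started from the official Docker image) using the clickhouse-driver package; the table and column names are illustrative.

```python
# Minimal sketch: create a partitioned MergeTree table on a local ClickHouse
# server and insert a couple of rows. Assumes ClickHouse is running locally
# (native port 9000) and `clickhouse-driver` is installed; names are made up.
from datetime import date
from clickhouse_driver import Client

client = Client(host="localhost")

# PARTITION BY defines the physical partitions you manage yourself;
# ORDER BY is the sort key ClickHouse uses for data skipping.
client.execute("""
    CREATE TABLE IF NOT EXISTS trips (
        trip_id     UInt64,
        pickup_date Date,
        fare_amount Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(pickup_date)
    ORDER BY (pickup_date, trip_id)
""")

client.execute(
    "INSERT INTO trips (trip_id, pickup_date, fare_amount) VALUES",
    [(1, date(2024, 5, 1), 12.5), (2, date(2024, 5, 2), 7.0)],
)

print(client.execute("SELECT count() FROM trips"))
```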

Best Practices

Internals of Snowflake

Check the Internals of Snowflake summary or the official Snowflake documentation to learn more about how a data warehouse works at enterprise scale.

Advanced

ML

Snowflake Machine Learning: SQL for ML in Snowflake - Sample

Important links

Deploying ML model

  • To be Updated

Snowflake Machine Learning Deployment: steps to extract and deploy a model with Docker

(Image: snowflakeModelCreated)

Airflow and Snowflake Integration

  • Host Airflow (from Weeks 1 and 2) and connect it to Snowflake

  • Setup: copy over the airflow directory (i.e. the Dockerized setup) from week_2_data_ingestion

  • Set up local S3 storage (MinIO) for the raw/landing data; you can find the MinIO files in this book by searching for MinIO (a DAG sketch follows below)
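Below is a hedged sketch of such a DAG: it copies files landed in MinIO into Snowflake through an external stage using SnowflakeOperator. The connection id, stage, table, and file format are assumptions; adapt them to your week_2 setup.

```python
# Minimal sketch of an Airflow DAG that copies files landed in MinIO (exposed
# to Snowflake as an external stage) into a raw table. Connection id, stage,
# table, and file format are assumptions, not the handbook's exact names.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="s3_to_snowflake_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    copy_raw_orders = SnowflakeOperator(
        task_id="copy_raw_orders",
        snowflake_conn_id="snowflake_default",  # connection defined in the Airflow UI or env vars
        sql="""
            COPY INTO ANALYTICS.PUBLIC.RAW_ORDERS
            FROM @ANALYTICS.PUBLIC.MINIO_STAGE/orders/
            FILE_FORMAT = (TYPE = PARQUET)
        """,
    )
```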

(Image: minIOData)

(Image: fstoSnowflakeDL)

(Image: s3toSnowflakeDL)

Further Improvement

  • Replace Snowflake or Postgres with Delta Lake
  • Explore Lakehouses
  • Deploy to Cloud services

Summary before Jumping Ahead

List of DAGs once you have completed Week 3

(Image: closedW3withDags)

Upcoming: visualize the data, derive insights, and analyze.

(Image: GStudioBookingDashboard)

Summary:

  • Data Ingestion
  • Data Flow Orchestration
  • Data Warehousing
  • Integration of Ingestion → Warehouse → Automation

What makes you better?

  • How to manage data partitions and re-balance data in clustering?
  • Determine the impact of the CAP theorem on a distributed system
  • Data warehouse architecture: designing data loading and vacuuming
  • Design atomic and idempotent data loading
  • Separate shared-disk and shared-compute mechanisms
  • The evolution of lakehouses, and why it changes the data framework