Data Architecture
Data architecture is a critical aspect of data engineering, encompassing the design and management of data systems to ensure efficiency, scalability, and accessibility. A comprehensive understanding of data architecture involves several key components, including normalization, data modeling, storage patterns, and more.
- Normalization and Denormalization
- Data Modeling
- Data Architectures
- Data Management Systems
- Data Processing Architectures
- Source Systems
- Storage Patterns
- Data Lake (S3)
- Data Extraction and Loading
- Distributed Systems
- Data Warehouse and Snowflake
Normalization and Denormalization
Normalization and denormalization are fundamental concepts in database design:
Normalization: Normalization involves organizing a database to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, related tables and defining relationships between them. This process aims to ensure that each piece of data is stored only once, thereby reducing anomalies and inconsistencies.
Normal Forms:
- First Normal Form (1NF): Ensures that each column contains atomic (indivisible) values, and each record is unique.
- Second Normal Form (2NF): Achieved when the database is in 1NF, and all non-key attributes are fully functionally dependent on the primary key.
- Third Normal Form (3NF): Ensures that the database is in 2NF, and all the attributes are functionally dependent only on the primary key.
Denormalization: Denormalization is the process of combining tables to reduce the complexity of queries and improve performance. While it increases redundancy, it can be beneficial for read-heavy databases by minimizing the number of joins required.
Benefits:
- Improved query performance.
- Reduced need for complex joins.
Drawbacks:
- Increased data redundancy.
- More complex update operations.
Diagram: Normalization vs. Denormalization:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Normalization] --> B[1NF] B --> C[2NF] C --> D[3NF] E[Denormalization] --> F[Combined Tables] F --> G[Improved Performance] G --> H[Increased Redundancy]
Figure: Data Forms
Data Modeling
Data modeling is essential for designing the structure of databases and data warehouses. It involves creating models to represent data relationships and processes. Key models include:
Relational Model: Organizes data into tables with rows and columns. Tables can be related to each other through foreign keys.
Dimensional Model: Designed for data warehousing, it uses dimensions and facts to facilitate complex queries and reporting. Key components include:
- Fact Tables: Contain quantitative data.
- Dimension Tables: Contain descriptive attributes related to facts.
Super Table: A hybrid approach that combines features of both relational and dimensional models, often used for complex reporting needs.
Delta Table: A table that stores only the changes (deltas) since the last update, which helps in efficiently processing updates and incremental loads.
Diagram: Data Modeling:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Relational Model] --> B[Tables] B --> C[Rows] B --> D[Columns] E[Dimensional Model] --> F[Fact Tables] E --> G[Dimension Tables] H[Super Table] --> I[Hybrid Model] J[Delta Table] --> K[Incremental Changes]
Figure: Data Modeling Types
Data Architectures
Understanding various data architectures helps in designing scalable and efficient data systems:
Data Fabric: An agile data management platform that integrates and manages data across different environments. It provides a unified view of data, regardless of its location.
Data Mesh: A decentralized approach focusing on domain-oriented data ownership and management. It emphasizes the importance of domain-specific teams in managing their data.
Data Vault: An enterprise-focused approach to data warehousing that emphasizes data integration and historical tracking. It uses three key components:
- Hub: Central table for business keys.
- Link: Table for relationships between hubs.
- Satellite: Table for historical data and attributes.
Diagram: Data Architectures:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Data Fabric] --> B[Unified View] C[Data Mesh] --> D[Domain Teams] D --> E[Decentralized Management] F[Data Vault] --> G[Hub] F --> H[Link] F --> I[Satellite]
Figure: Data Vault Modeling
Data Management Systems
Data management systems are crucial for storing, managing, and processing data. Key systems include:
Data Warehouse: A centralized repository that stores integrated data from multiple sources for analysis and reporting.
Data Mart: A subset of a data warehouse, focused on a specific business area or department.
Data Lake: A storage system that holds raw data in its native format until it is needed. It supports various data types, including structured, semi-structured, and unstructured data.
Lakehouse: A hybrid data management system that combines the benefits of data lakes and data warehouses, providing both transactional and analytical capabilities.
Diagram: Data Management Systems:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Data Warehouse] --> B[Centralized Repository] C[Data Mart] --> D[Business Area Focus] E[Data Lake] --> F[Raw Data Storage] G[Lakehouse] --> H[Transactional + Analytical]
Figure: Data Management Types
Data Processing Architectures
Data processing architectures define how data is ingested, processed, and analyzed:
Lambda Architecture: Combines batch and real-time processing to handle large volumes of data. It consists of three layers:
- Batch Layer: Processes large volumes of historical data.
- Speed Layer: Handles real-time data processing.
- Serving Layer: Combines batch and real-time views for querying.
Kappa Architecture: Simplifies Lambda Architecture by using a single stream processing layer for both batch and real-time data.
DataFlow: Focuses on data pipelines and workflows to process data efficiently across different stages.
Diagram: Data Processing Architectures:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Lambda Architecture] --> B[Batch Layer] A --> C[Speed Layer] A --> D[Serving Layer] E[Kappa Architecture] --> F[Stream Processing] G[DataFlow] --> H[Data Pipelines] H --> I[Efficient Processing]
Figure: Lambda Architecture
Source Systems
Source systems are the origin points of data and include various types:
Files: Static files such as CSV, JSON, or XML that can be ingested into a data system.
Change Data Capture (CDC): Techniques for capturing and processing changes in data sources to keep data synchronized.
OLAP & OLTP:
- OLAP (Online Analytical Processing): Used for complex queries and analytics.
- OLTP (Online Transaction Processing): Used for transaction-oriented tasks.
Logs: Records of events or transactions that can be ingested for monitoring or analysis.
Diagram: Source Systems:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Files] --> B[CSV, JSON, XML] C[CDC] --> D[Change Capture Techniques] E[OLAP] --> F[Complex Queries] E[OLTP] --> G[Transaction Processing] H[Logs] --> I[Event Records]
Figure: File Sourcing System
Storage Patterns
Data storage patterns define how data is stored, accessed, and managed:
Data Storage Systems:
- File Storage: Stores data as files, typically used for document storage.
- Object Storage: Stores data as objects with metadata, ideal for unstructured data.
- Block Storage: Provides raw storage blocks, often used for databases.
- Cache: Temporary storage to speed up data access.
- HDFS: Distributed file system for large-scale data storage.
- Streaming Storage: Handles real-time data streams.
ACID Principles:
- Atomicity: Ensures transactions are fully completed or not at all.
- Consistency: Maintains data integrity during transactions.
- Isolation: Ensures transactions do not interfere with each other.
- Durability: Guarantees that completed transactions are permanently stored.
Table Partitioning and Clustering:
- Partitioning: Divides large tables into smaller, manageable pieces.
- Clustering: Groups similar data together to improve performance.
Data Lake Table Formats:
- Delta Lake: An open-source storage layer that provides ACID transactions and scalable metadata handling.
- Iceberg: A high-performance table format for large analytic datasets.
Diagram: Storage Patterns:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Data Storage Systems] --> B[File Storage] A --> C[Object Storage] A --> D[Block Storage] A --> E[Cache] A --> F[HDFS] A --> G[Streaming Storage] H[ACID Principles] --> I[Atomicity] H --> J[Consistency] H --> K[Isolation] H --> L[Durability] M[Table Partitioning] --> N[Smaller Pieces] M[Clustering] --> O[Improved Performance] P[Data Lake Table Formats] --> Q[Delta Lake] P --> R[Iceberg]
Figure: Data Storage Technique
Data Lake (S3)
What is a Data Lake?: A data lake is a centralized repository that allows you to store all your structured and unstructured data at scale. Data is stored in its raw format until it is needed for processing or analysis.
ELT vs. ETL:
- ETL (Extract, Transform, Load): Data is transformed before loading into the data warehouse.
- ELT (Extract, Load, Transform): Data is loaded into the data lake before being transformed.
Alternatives to Data Lake Components:
- S3: Scalable object storage service by AWS.
- HDFS: Hadoop Distributed File System for large-scale data.
- MinIO: Open-source object storage compatible with S3.
- Redshift: Data warehouse service by AWS.
- Snowflake: Cloud-based data warehouse service.
Diagram: Data Lake Components:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Data Lake] --> B[Raw Data Storage] C[ELT vs. ETL] --> D[ETL] C --> E[ELT] F[S3] --> G[Scalable Storage] H[HDFS] --> I[Large-Scale Data] J[MinIO] --> K[Open-Source Storage] L[Redshift] --> M[Data Warehouse] N[Snowflake] --> O[Cloud-Based Data Warehouse]
Figure: Data Lake
Data Extraction and Loading
Key Questions::
- How to Extract Data: Use tools like regex patterns to parse and extract data from raw sources.
- How to Extract Delta Data: Focus on incremental changes since the last load to efficiently process updates.
- Buffer for Loading: Determine if buffering is needed based on data volume and system requirements.
Data Load Types::
- Full Load: Load complete datasets, typically used for small or less frequently updated data.
- Delta Load: Load only the changed data, ideal for large or frequently updated datasets.
Engineering Techniques for Target Data Loading::
- Incremental Loading: Break data into smaller chunks to optimize processing.
- Snapshot Loading: Capture and merge snapshots of data into the destination system.
Key Terms::
- From Source: Origin of the data.
- Incremental Columns: Columns used to track changes, such as timestamps or auto-incremented IDs.
- Historical Tracking Columns: Columns for versioning and tracking changes over time.
- Descriptive Columns: Columns providing descriptive attributes.
- Identifier Column: Primary and foreign key columns.
- Audit Column: Columns used for auditing and backfilling data.
Diagram: Data Extraction and Loading:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Data Extraction] --> B[Regex Patterns] C[Delta Data Extraction] --> D[Incremental Changes] E[Buffering] --> F[Data Volume] G[Full Load] --> H[Complete Data Load] I[Delta Load] --> J[Changed Data] K[Incremental Loading] --> L[Smaller Chunks] M[Snapshot Loading] --> N[Data Snapshots] O[Key Terms] --> P[From Source] O --> Q[Incremental Columns] O --> R[Historical Tracking Columns] O --> S[Descriptive Columns] O --> T[Identifier Column] O --> U[Audit Column]
Figure: Data Loading Designs
Distributed Systems
Key Focus Areas::
- Distributed Computing Principles: Understanding the fundamentals of distributing computational tasks across multiple nodes.
- Distributed System Architectures: Familiarity with architectures like client-server and peer-to-peer.
- Data Management and Synchronization: Techniques for managing and synchronizing data across distributed systems.
- Scalability and Fault Tolerance: Strategies to scale systems and ensure reliability.
- Security and Privacy: Ensuring data security and privacy in distributed environments.
- Designing Workflows and Pipelines: Proficiency in creating efficient workflows and data pipelines.
- Monitoring and Performance Optimization: Techniques for monitoring and optimizing system performance.
Resources::
Diagram: Distributed Systems:
graph TD %% Add a transparent text node as a watermark style Watermark fill:none,stroke:none Watermark[Created by: LongBui] A[Distributed Systems] --> B[Distributed Computing Principles] A --> C[Architectures] C --> D[Client-Server] C --> E[Peer-to-Peer] A --> F[Data Management] A --> G[Scalability] A --> H[Fault Tolerance] A --> I[Security] A --> J[Designing Workflows] A --> K[Monitoring & Optimization]
Figure: Distribution Usage
Data Warehouse and Snowflake
Data Warehouse: A data warehouse is a centralized repository for integrated data from multiple sources, designed for query and analysis.
Snowflake: Snowflake is a cloud-based data warehouse solution known for its scalability, performance, and ease of use. It supports diverse data workloads and provides a unified platform for data storage, processing, and analysis.
Resources::
I put the sample code here: Github repository