Begin to Introduction
- As a Data Engineer
- Data Engineering Workflow
- Data Warehousing
- What is Machine Learning Workflow
- How Machine Learning Model and Data Works
- Role of Data Engineer in the Machine Learning Project
- Scope of Data Engineer
- Supporting Components for Data Engineer
- Why Machine Learning is mentioned in the Data Engineering roadmap
- Building Data Platform Team
As a Data Engineer
Data Engineers are the link between management's data strategy and the data scientists who need to work with the data. They build the platforms that make sure data is available and consumable by modern applications such as BI, LLM, and AI.
These platforms are usually used in four different ways:
- Data ingestion and storage of large amounts of data
- Algorithm creation and analysis by data scientists and data analysts
- Making data available and qualified for production use
- Data visualization and self-service for customers
To create big data platforms, the engineer needs to be an expert in specifying, setting up, and maintaining big data technologies such as the Hadoop file system, Spark processing, the HBase engine, the Cassandra distributed database, the MongoDB non-structured database, Kafka for streaming and buffering, Redis for caching, REST APIs, and more.
We also need hands-on experience in deploying systems on cloud infrastructure, such as Amazon or Google, or on on-premise hardware.
Data Engineering Workflow
Data engineering workflows (aka dataflow) are easiest to understand as engineering techniques applied to data in structures and layers.
Stage | Description | Subtasks |
---|---|---|
Loading data | Connect and load data, depending on structure and loading mechanism | - Snapshot - Delta load - Full load - Incremental load |
Extracting data | Get data from the source | - Normalize - Standardize |
Transforming data | Shape data into meaningful formats | - Engineering - Business - Validating |
Storing data | Structure data into optimal collections | - Formatting - Converting |
Modeling | Create optimal data models for consumption | - Schema Design - Partition Strategies - Cataloging and Lineaging |
Orchestration | Scheduling, triggering data pipelines | - Time-based data pipelines - Event-based data pipelines - Dependencies |
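The loading subtasks in the table can be hard to tell apart in the abstract. The following is a minimal Python sketch, not a production loader, that contrasts a full load with an incremental (delta) load driven by a watermark. The row layout, the `updated_at` field, and the function names `full_load` and `incremental_load` are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed change-tracking column.
SOURCE = [
    {"id": 1, "value": "a", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 1, 2)},
    {"id": 3, "value": "c", "updated_at": datetime(2024, 1, 3)},
]


def full_load(source):
    """Full load: copy every row, regardless of what was loaded before."""
    return list(source)


def incremental_load(source, watermark):
    """Incremental load: copy only rows changed after the last watermark.

    Returns the new rows and the advanced watermark.
    """
    new_rows = [row for row in source if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in new_rows), default=watermark)
    return new_rows, new_watermark


# Rows up to 2024-01-01 were loaded earlier, so only the later rows are picked up.
rows, watermark = incremental_load(SOURCE, datetime(2024, 1, 1))
print([row["id"] for row in rows], watermark.date())
```

Storing the returned watermark between runs is what makes the next incremental load skip rows that were already copied.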
Connecting the dataflow gives us data lineage, which shows the relationships between data objects and the movement of data through different processes. Use data lineage to monitor and schedule data jobs, manage dependencies, and automate data processing.
If you want to step back and understand the basics of data engineering concepts, you can check the videos below:
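As a rough illustration of how lineage supports scheduling and dependency management, the sketch below models lineage as a small dependency graph using Python's standard `graphlib` module (Python 3.9+). The data object names and the helpers `run_order` and `downstream` are hypothetical, chosen only for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each data object maps to the objects it is built from.
LINEAGE = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "clean_customers": {"raw_customers"},
    "sales_report": {"clean_orders", "clean_customers"},
}


def run_order(lineage):
    """Return a job order in which every object is built after its inputs."""
    return list(TopologicalSorter(lineage).static_order())


def downstream(lineage, target):
    """Return every object that directly or indirectly depends on `target`."""
    affected = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for obj, inputs in lineage.items():
            if current in inputs and obj not in affected:
                affected.add(obj)
                frontier.append(obj)
    return affected


order = run_order(LINEAGE)
print(order)
print(sorted(downstream(LINEAGE, "raw_orders")))
```

The first call gives a valid execution order for the jobs, and the second shows which objects would need to be rebuilt if `raw_orders` changed or failed.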
- Better Understanding Data Warehouse - Not only Landing/Staging/Publishing
- Motivation of Data Modeling: Relational, Dimensional, One Big table data model
Data Warehousing
What are the common keys in a data warehouse?
- Primary Key: Used to uniquely identify each row in a table. E.g.: row count, unit count, aggregation
- Surrogate Key: Used to track the history of data. E.g.: changes to the data
- Natural Key: Used to determine business uniqueness in a table. E.g.: normalizing data, creating normalized data
Auto-increment used to be the default for table keys, but it has many limitations, e.g. low performance, little control over the values, running out of index values, etc. A standard alternative is to assign the key a UUID or GUID, taking advantage of a hash function. The parameters of the hash function depend on the logic of the fields.
-- This code works on PostgreSQL (EXECUTE FUNCTION needs version 11 or later)
-- uuid_generate_v5 and uuid_ns_url are provided by the uuid-ossp extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE Users (
    primary_key   SERIAL PRIMARY KEY,
    user_name     VARCHAR(255) NOT NULL,
    phone_number  VARCHAR(20) NOT NULL,
    email         VARCHAR(255) UNIQUE NOT NULL,
    surrogate_key UUID NOT NULL,
    CONSTRAINT uq_surrogate_key UNIQUE (surrogate_key)
);

-- Creating a trigger function to generate a UUID for the surrogate key
CREATE OR REPLACE FUNCTION generate_uuid() RETURNS TRIGGER AS $$
BEGIN
    -- The '|' separator keeps ('ab', 'c') distinct from ('a', 'bc')
    NEW.surrogate_key := uuid_generate_v5(uuid_ns_url(), NEW.user_name || '|' || NEW.phone_number);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Creating the trigger to call the function before insert
CREATE TRIGGER set_uuid_before_insert
    BEFORE INSERT ON Users
    FOR EACH ROW
    EXECUTE FUNCTION generate_uuid();
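For comparison, the same deterministic-key idea can be sketched outside the database. This is a minimal Python sketch using the standard `uuid` module. The helper name `surrogate_key`, the sample values, and the `|` separator are assumptions for illustration. It only shows that equal inputs yield equal keys, which is the property the trigger above relies on.

```python
import uuid


def surrogate_key(user_name: str, phone_number: str) -> uuid.UUID:
    """Derive a deterministic UUIDv5 from the fields that define the row.

    The "|" separator (an assumption) keeps ("ab", "c") distinct from ("a", "bc").
    """
    name = f"{user_name}|{phone_number}"
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


first = surrogate_key("alice", "555-0100")
again = surrogate_key("alice", "555-0100")
other = surrogate_key("bob", "555-0101")

print(first == again)  # same inputs give the same key
print(first == other)  # different inputs give a different key
print(first.version)   # UUID version of the generated key
```

Because the key is a hash of the input fields, it can be recomputed in any system that holds those fields, without a lookup against the warehouse.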
For data life cycle management, we also need proper data backup and restore to optimize the cost and performance of the data warehouse.
What is Machine Learning Workflow
The process of development, testing, training, integration, validation, production, and retraining already exists and is known as MLOps.
Why does a Data Engineer have to know the ML workflow?
→ Understanding the collaboration between Data Scientists and Data Engineers is vital when examining the data science process and how machine learning is executed.
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'primaryTextColor': '#000000', 'primaryBorderColor': '#666666', 'lineColor': '#666666', 'secondaryColor': '#ffffff', 'tertiaryColor': '#ffffff' } }}%%
graph LR
    A[Data Sources] --> B[Data Ingestion]
    B --> C[Data Preprocessing]
    C --> D[Feature Engineering]
    D --> E[Training Data]
    E --> F[Model Training]
    F --> G[Model Evaluation]
    G --> H[Model Deployment]
    H --> I[Prediction Service]
    %% Data Engineering Stages
    subgraph sub1[Data Engineering]
        B
        C
        D
    end
    %% Machine Learning Stages
    subgraph sub2[Machine Learning]
        E
        F
        G
        H
        I
    end
Figure: Machine Learning Workflow
The machine learning process initiates with a training phase, where algorithms undergo training to generate the desired output. This phase comprises input parameters, configuring the model, and input data.
During training, the algorithm adapts the training parameters and manipulates the data to produce an output. Subsequently, an evaluation stage follows to assess if the output aligns with the desired outcome. If not, the process continues with further training iterations, which are often automated and repeated numerous times.
Upon achieving satisfactory output, the model progresses to the production phase. Here, it operates on live data, contrasting the training phase where it utilized historical data.
Transitioning from training to production, the subsequent step involves continual monitoring of the model’s output. If the output remains consistent and aligns with expectations, all is well. However, if discrepancies arise, indicating the model’s failure to meet expectations, retraining becomes necessary.
The model undergoes retraining, ensuring it adapts to the evolving data landscape. Once the output meets the expected criteria again, the updated model replaces the previous one in production, forming a cyclical process.
This entire process, encompassing development, testing, training, integration, validation, production, and retraining, constitutes what is known as MLOps or Machine Learning Operations, signifying the lifecycle of managing machine learning models.
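The train, evaluate, monitor, and retrain cycle described above can be sketched in a few lines. This is a toy Python illustration under strong assumptions: the "model" is just the mean of its training data, the error metric is mean absolute error, and the names `train`, `evaluate`, and `monitor_and_retrain`, along with the threshold value, are hypothetical.

```python
from statistics import mean


def train(history):
    """'Train' a toy model: it always predicts the mean of its training data."""
    return mean(history)


def evaluate(model, data):
    """Mean absolute error of the toy model on a batch of data."""
    return mean(abs(value - model) for value in data)


def monitor_and_retrain(model, live_batches, threshold):
    """Score each live batch; retrain on that batch when the error drifts too far."""
    retrained_on = []
    for index, batch in enumerate(live_batches):
        if evaluate(model, batch) > threshold:
            model = train(batch)  # the updated model replaces the one in production
            retrained_on.append(index)
    return model, retrained_on


# Training phase: historical data.
model = train([10, 11, 9, 10])

# Production phase: live data that shifts over time.
batches = [[10, 10, 11], [20, 21, 19], [20, 20, 21]]
model, retrained_on = monitor_and_retrain(model, batches, threshold=2.0)
print(model, retrained_on)
```

Only the second batch deviates enough from the trained model to trigger retraining. After that, the updated model fits the third batch again, which mirrors the cyclical process in the text.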
How Machine Learning Model and Data Works
Understanding these concepts is insightful.
There are two pivotal areas where data plays a crucial role.
Across the training and production phases, there are two distinct types of data:
- Firstly, the data used for training, which essentially configures the model and its hyper-parameters.
- Secondly, as you transition to the production phase, the live data streaming in from various sources such as app databases, IoT devices, or event logs.
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'primaryTextColor': '#000000', 'primaryBorderColor': '#666666', 'lineColor': '#666666', 'secondaryColor': '#ffffff', 'tertiaryColor': '#ffffff' } }}%%
graph TD
    %% Add a transparent text node as a watermark
    style Watermark fill:none,stroke:none
    Watermark[Created by: LongBui]
    A[Data Sources]
    subgraph Training Phase
        B[Training Data]
        C[Model Training]
        D[Hyper-parameter Tuning]
        B --> C
        C --> D
    end
    subgraph Production Phase
        E[Live Data]
        F[Trained Model]
        G[Real-time Data Ingestion]
        H[Predictions]
        E --> G
        G --> F
        F --> H
    end
    D -.-> F
    A --> B
    A --> E
Figure: Machine Learning Deployment Guide
An integral component to consider is the data catalog, which delineates available features and annotates diverse datasets. These varying data types lead us to the engineering aspect. I’ve elucidated the processes of Data Science and the Machine Learning workflow, prompting the question: Why is this knowledge essential?
Understanding these processes empowers you to discern where you can contribute to the data landscape. It’s crucial not to dive into coding without an awareness of these processes and workflows.
Role of Data Engineer in the Machine Learning Project
A fundamental responsibility of Data Engineers involves ensuring data accessibility. This entails making data available for both the data scientist and the machine learning process.
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'primaryTextColor': '#000000', 'primaryBorderColor': '#666666', 'lineColor': '#666666', 'secondaryColor': '#ffffff', 'tertiaryColor': '#ffffff' } }}%%
flowchart TD
    %% Add a transparent text node as a watermark
    style Watermark fill:none,stroke:none
    Watermark[Created by: LongBui];
    A[Data Engineering Role] --> B[Data Integration];
    A --> C[Data Pipeline Development];
    A --> D[Data Modeling];
    A --> Q[Data Quality Assurance];
Figure: Data Engineer Responsibility
- Data Integration: Involves ETL processes, data warehousing, and data lakes.
- Data Pipeline Development: Includes designing ETL flows, automating pipelines, and monitoring performance.
- Data Modeling: Encompasses schema design, database optimization, and implementing data structures.
- Data Quality Assurance: Responsibilities like data validation, ensuring consistency, and implementing quality frameworks.
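To make the Data Quality Assurance responsibility more concrete, here is a minimal Python sketch of a batch validation step. The function name `validate_rows`, the rule set (required fields plus one unique field), and the sample rows are all assumptions for illustration. Real quality frameworks cover far more than this.

```python
def validate_rows(rows, required_fields, unique_field):
    """Return a list of (row_index, problem) pairs; an empty list means the batch passed."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Validation: every required field must be present and non-empty.
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append((index, f"missing {field}"))
        # Consistency: the unique field must not repeat within the batch.
        key = row.get(unique_field)
        if key is not None:
            if key in seen:
                problems.append((index, f"duplicate {unique_field}"))
            seen.add(key)
    return problems


batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.com"},
]
issues = validate_rows(batch, required_fields=["id", "email"], unique_field="id")
print(issues)
```

Returning the problems instead of raising on the first one lets a pipeline decide whether to reject the whole batch or quarantine only the offending rows.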
I skip Data Transformation because machine learning models mostly look for raw data in the data lake. The data engineer's job is to extract data from the sources and populate the lake with the right strategy, which helps scientists run advanced algorithms.
In summary, the role of a Data Engineer is deeply rooted in computer science, encompassing a wide array of responsibilities crucial to the success of a machine learning project.
Scope of Data Engineer
Generally, the Data Engineer's responsibility is crucial to the data-driven business of a company or organization. Data engineers take part in data systems that effectively store, process, and analyze large amounts of data to gain valuable insights and make decision making easier for others.
- Building and maintaining data pipelines for efficient data processing and storage.
- Implementing and managing data warehouses, data lakes, and other data storage systems.
- Developing and maintaining data processing applications and tools.
- Designing and implementing data security and access control policies.
- Optimizing data retrieval and analysis performance.
- Collaborating with data analysts, data scientists, and other stakeholders to ensure data quality, integrity, and usability.
- Troubleshooting and debugging data infrastructure and systems issues.
- Staying up-to-date with emerging data technologies and industry best practices.
- Developing and maintaining documentation and training materials for data infrastructure and systems.
- Ensuring compliance with relevant data privacy and security regulations.
- Shaping and modeling data into structured formats driven by business or product needs, and organizing and optimizing data to be value-oriented.
- Possibly supporting DevOps and Security by providing details of data operations (aka DataOps) and security (aka Data Governance).
- Monitoring, responding to, and resolving data issues and problems within SLA.
Supporting Components for Data Engineer
Skill sets of data engineer
- Programming languages: Data engineers need to be proficient in at least one programming language such as Python, SQL, NodeJS, Java, or Scala.
- Big data technologies: Data engineers should be familiar with various big data technologies such as Databricks, Hadoop, Spark, Kafka, Hive, and HBase.
- Cloud services: Data engineers should be familiar with cloud services such as AWS, Azure, and Google Cloud Platform.
- Data storage: Data engineers should be familiar with different types of data storage such as relational databases, NoSQL databases, and Data Lake.
- ETL tools: Data engineers should be familiar with ETL (Extract, Transform, Load) tools such as Apache Airflow, and AWS Glue.
- Data modeling: Data engineers should be familiar with data modeling techniques such as relational modeling, dimensional modeling, and schema design.
- Data visualization: Data engineers should be familiar with data visualization tools such as Tableau, Power BI, and Looker.
- Data governance: Data engineers should be familiar with data governance principles and regulations such as GDPR, CCPA, and HIPAA.
- Communication and collaboration: Data engineers should have excellent communication and collaboration skills, as they often work with cross-functional teams including data scientists, business analysts, and software engineers.
Why Machine Learning is mentioned in the Data Engineering roadmap
Machine learning has changed data extraction and interpretation methods by replacing traditional statistical techniques with automated ones. It helps data scientists analyze very large datasets and automate the process. Knowledge of machine learning enables a data scientist to solve higher-level data science problems.
Data engineers and data scientists often work together to achieve their goals:
- Data engineers provide the data infrastructure and pipelines that data scientists use for analysis and modeling.
- Data scientists provide the feedback and requirements that data engineers use to improve the data quality and reliability.
Therefore, it is important for data engineers to have some knowledge of how machine learning and data science work, as well as for data scientists to have some understanding of data engineering concepts and tools.
Building Data Platform Team
The emphasis lies in finding well-versed data engineers and developers who seamlessly integrate into the data platform culture within a company.
Critical roles such as the data engineer or solution architect are pivotal in shaping and constructing the foundation of a data platform. Collaboration between the data scientist and the data engineer is essential, necessitating a comprehensive understanding of various big data tools.
The data platform serves as a catalyst, equipping data scientists and analysts with the necessary tools for conducting analytics. Simultaneously, it aids engineers in effectively transitioning the data scientist’s algorithms into a production environment.
Ultimately, the synergy between these roles significantly influences the efficacy of the data platform, the quality of analytical insights, and the efficiency with which the entire system transitions into a production-ready state.
Finally, when getting started with this Data Engineering Handbook (DEH), you may be unsure how the book can help and support you in your career path. I would recommend that you have a full-time job before starting anything else, whether that job is remote or onsite.