How to Use This Book?

This handbook is designed to be your springboard into the exciting world of data engineering. It’s not a traditional training manual, but rather a roadmap to equip you with the knowledge and skills to become an awesome data engineer. Here’s how to get the most out of it:

  • Identify Key Topics: The book focuses on presenting real-world problems and guiding you through the thought process of brainstorming and building solutions. This approach helps you develop logical thinking and understand how data engineering impacts businesses. As you explore these problems, you’ll discover key topics relevant to different aspects of data pipelines.

  • Deepen Your Understanding with Code: This handbook goes beyond theory. Many chapters incorporate code examples that illustrate the concepts discussed. By working through this code, you can gain a practical understanding of how these systems function in action.

  • Leverage Reference Links: Throughout the book, you’ll find reference links to external resources like documentation, tutorials, and articles. These links provide additional information and deeper dives into specific topics, helping you expand your knowledge base.

Tip

Remember: Don’t hesitate to experiment with the executable code provided. Modify it, play around with different scenarios, and see how the systems behave. This hands-on approach will solidify your understanding and make you more confident in building your own data pipelines.

By combining this problem-solving approach with code examples and reference links, you’ll gain a comprehensive understanding of data engineering concepts and be well equipped to tackle real-world challenges.

Throughout this book, I present problems and give you the full picture: how to brainstorm, how to build up the logical thinking behind a solution, how the problem impacts the business, and the right way to resolve it.

The Power of Foundational Learning

The goal of this book is to dive into the Data Platform Blueprint and describe each component of the platform. In it, you will find tools that fit into each key area of a data platform (Connect, Buffer, Processing Framework, Store, Visualize).

Then we go back to the basics of a data platform, organized into five tiers:

  1. Source: Connect, Integration, Ingestion
  2. Backend: Buffer, Processing
  3. Storage: Data Lake, Data Warehouse, Lakehouse, OLAP, Cube
  4. Semantic: API, Cache, Memory, Access Control
  5. Frontend: Visualization, Exporting, Reverse ETL, Application

Select a few tools you are interested in, then research and work with them; see the sketch below for how the tiers fit together.
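To make the five tiers concrete, here is a minimal sketch that lays the blueprint out as a plain data structure, mapping each tier to a few example open-source tools. The tool names and the structure are illustrative assumptions on my part, not an official list from this book; swap in whatever fits your stack.

```python
# A minimal sketch of the five-tier blueprint as a plain data structure.
# The example tools are illustrative assumptions, not an official list.
DATA_PLATFORM_TIERS = {
    "source":   {"areas": ["connect", "integration", "ingestion"],
                 "example_tools": ["Airbyte", "Kafka Connect"]},
    "backend":  {"areas": ["buffer", "processing"],
                 "example_tools": ["Kafka", "Spark", "Flink"]},
    "storage":  {"areas": ["data lake", "data warehouse", "lakehouse", "OLAP"],
                 "example_tools": ["Apache Iceberg", "ClickHouse"]},
    "semantic": {"areas": ["API", "cache", "memory", "access control"],
                 "example_tools": ["Redis", "GraphQL"]},
    "frontend": {"areas": ["visualization", "exporting", "reverse ETL"],
                 "example_tools": ["Superset", "Metabase"]},
}

# Pick one or two tools per tier and go deep on them:
for tier, spec in DATA_PLATFORM_TIERS.items():
    print(f"{tier:>8}: {', '.join(spec['example_tools'])}")
```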

I have also created a repo, Setup Data Dev Environment; fork it and start customizing it, and please create a PR if you find any interesting tools/services.

I keep maintaining this book as technology changes, as it always does, but the fundamental concepts and knowledge will stay here. I know people are learning things fast, and it is not easy to understand everything.

Along the way, they might lose the basics of software engineering and computer science. But I will keep those basics here.

Quote

Step back and move forward!

What if you are working on a legacy system?

As a freelance and outsourcing consultant, I worked with SQL Server, Oracle, IBM, and Informatica Power Center. At first it felt uncomfortable and painful, and I saw only weaknesses. Then I decided to view it from a different angle, and I realized I could learn a lot from these systems: the basics of how they were designed and built.

Key learnings:

  • Legacy technologies like Informatica, SSIS, and SQL Server can provide valuable learning opportunities.
  • By understanding and learning from these technologies, you can gain a deeper understanding of data management and ETL processes, which are still crucial in today’s data-driven world.
  • Build solid foundational knowledge of how the basic things work before jumping into other, “easier” tools.
  • I say “easier” because the improvement and evolution of tools and platforms means end users have less to manage themselves.

I use a second brain to navigate my thoughts, mapping and taking notes as I go; it helps me structure and categorize knowledge in a better format. Check the Mapping of Contents.

If you prefer to listen to the podcast, check my YouTube channel Long Bui /@longdatadevlog and subscribe to get new episodes as they are posted.

This book is a work in progress!

As you can see, technology always changes, but the fundamental things have not changed over time. I’m constantly adding new material and recording videos for these topics. But obviously, because I do this as a hobby, my time is limited. You can help make this book even better.

Tell me your thoughts, what you value, what you think should be included, or correct me where I am wrong.

But don’t worry about shortcuts making the book itself obsolete, because this book focuses on fundamental knowledge and is meant for people who want to learn the foundations.

Everything is being changed by AI and ML, but those concepts were published back in the 1900s or early 2000s, and only now have things advanced enough to make a huge difference.

So I think we should not be afraid of this: stay focused, keep going forward, and, last but not least, keep these rules:

flowchart BT
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui]
r1["Start from small"] --> r2["Keep Discipline"] & r3["Be Consistent"] --> r4["Never-end"] --> r1

Figure: How to start things

Using Code Examples

Even though this book focuses on the ideas of how things were initially designed and created, it also gives you enough to practice coding and programming.

I prefer to use dotfiles to set up everything a Data Engineer or Data Architect needs to develop and create data applications.

Warning

I highly recommend using the open-source data stack for development. Each of the five layers has its own open-source options.

You can use the code in this book as a reference and reproduce it in your environment with the dotfile setup, which has been customized and designed for data people.

Deep Work

A few words I want to say: we have limited time, so focus on the essential things, the things that matter to you.

In today’s fast-paced world, it’s crucial to remember that our time is limited and we must prioritize what truly matters. By focusing on essential tasks and avoiding distractions, we can achieve high-quality work and make the most of our time.

Important

Note 1: High Quality Work = Time Spent * Productivity

Note 2: Productivity (P) = Deep Focus / Interruption

I want to emphasize that the quality of our work is directly related to the time and effort we invest, coupled with our ability to maintain deep focus. Reducing interruptions and distractions is key to boosting productivity and, ultimately, achieving better results in work and in life.
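As a toy illustration of the two notes above, here is a small sketch; the function names, numbers, and units are made up purely to show the relationship between focus, interruptions, and output.

```python
# Toy illustration of the two notes above; the numbers and units are made up.
# Note 1: high_quality_work = time_spent * productivity
# Note 2: productivity      = deep_focus / interruptions

def productivity(deep_focus_hours: float, interruptions: int) -> float:
    """Note 2: more deep focus and fewer interruptions -> higher productivity."""
    return deep_focus_hours / max(interruptions, 1)  # guard against zero

def high_quality_work(time_spent_hours: float, p: float) -> float:
    """Note 1: the same time spent yields more output at higher productivity."""
    return time_spent_hours * p

# The same 8-hour day with 4 focused hours, but 2 interruptions vs. 10:
print(high_quality_work(8, productivity(4, 2)))   # 16.0
print(high_quality_work(8, productivity(4, 10)))  # 3.2
```

Cutting interruptions from ten to two, with everything else equal, multiplies the output of the same day by five in this toy model.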

%%{init: {'theme':'base'}}%%
flowchart BT
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui]

  subgraph quality["Formula 1: Quality is IMPORTANT"]
      direction LR
      E>"Quality > Quantity"]
      E --> |to| F
      F[↑ Quality]
  end
  subgraph highqualitywork["Formula 2: Measure Quality"]
      A[↓ Time Spent]
      B[↑ Productivity]
      C[↑ High Quality Work]

      A --> C
      B --> C
  end

  subgraph Productivity["Formula 3: Calculate Productivity"]
      D[↑ Deep Work]
      I[↓ Interruption]
      D --> |divided by| I
  end
  B -.-> |composed of| Productivity -.-> |enhances| B
  F -.-> |measured by| highqualitywork

Figure: What we should do for Deep Work

Glossary

Term | Description
---- | -----------
SFDE | Solid Foundations for Data Engineer
DEC | Data Engineering Camping
DEH | Data Engineering Handbook
DF | Data Foundation
KB | Knowledge Base
BP | Best Practices
UC | Use Case
CS | Case Study
DD | Design Pattern
TD | Technical Debt
xDV | x Data Driven