Week 6 - Stream Processing

What is Kafka?

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. Originally developed by LinkedIn, Kafka has become a vital component in modern data engineering for processing, storing, and analyzing large volumes of data in real time. It is designed to handle high throughput and low-latency data pipelines, making it suitable for a wide range of use cases.

Kafka Ecosystem Overview: The Kafka ecosystem includes several components that work together to build robust and scalable data pipelines.

Introduction to Kafka Ecosystem

This diagram shows the full picture of the Kafka ecosystem and how data is processed with Kafka.

flowchart BT
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui]
    A[Kafka Ecosystem] -->|Messaging Platform| B[Kafka Broker]
    A -->|Monitoring| C[Kafka Manager]
    A -->|Data Ingestion| D[Kafka Producers]
    A -->|Data Consumption| E[Kafka Consumers]
    A -->|Storage| F[Kafka Topics]
    A -->|Coordination| G[Zookeeper]
    A -->|Stream Processing| H[Kafka Streams]
    A -->|Connector| I[Kafka Connect]
    A -->|Schema Management| J[Kafka Schema Registry]
    A -->|SQL Querying| K[Kafka KSQL]

    D -->|Send Messages| F
    E -->|Read Messages| F
    H -->|Streams Data| F
    I -->|Connects Data Sources| F
    J -->|Manages Schemas| F
    K -->|Performs SQL Operations| F

    F -->|Stores Persistent Data| G
    G -->|Provides Coordination| B

Figure: Kafka Ecosystem Overview

Where is Kafka used in Data Engineering Projects?

Apache Kafka has become an indispensable tool in the data engineering landscape, enabling organizations to build real-time, scalable, and resilient data pipelines. With its rich ecosystem and robust capabilities, Kafka supports a wide range of use cases, from real-time analytics to microservices communication, making it a cornerstone of modern data architectures.

graph LR
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui];
    A[Kafka Ecosystem] --> B[Data Ingestion]
    A --> C[Stream Processing]
    A --> D[Data Integration]
    A --> E[Microservices Communication]
    A --> F[Data Warehousing & ETL]
    A --> G[Log Aggregation]
    A --> H[Event Sourcing]

    B --> I[Real-Time Data Collection]
    B --> J[Event-Driven Architectures]

    C --> K[Real-Time Analytics]
    C --> L[Fraud Detection]
    C --> M[Monitoring & Alerting]

    D --> N[Database Integration]
    D --> O[Cloud Services Integration]
    D --> P[Data Lake Integration]

    E --> Q[Asynchronous Messaging]
    E --> R[Event-Driven Workflows]

    F --> S[ETL Pipelines]
    F --> T[Near Real-Time Data Updates]

    G --> U[Centralized Log Aggregation]
    G --> V[Operational Visibility]

    H --> W[Event Sourcing Patterns]
    H --> X[State Reconstruction]

Figure: Applied Kafka in Data Project

Introduction to Kafka

  • Overview: Apache Kafka is a distributed streaming platform designed for high-throughput, low-latency data processing.
  • Key Components:
    • Producers: Publish data to Kafka topics.
    • Consumers: Subscribe to topics to process data.
    • Brokers: Kafka servers that store data and serve client requests.
    • Topics: Categories to which records are published.
    • Partitions: Subdivisions of topics for parallelism.
  • Type: Batch query workloads (measured in messages per second) and stream query workloads (measured in transactions per second).
flowchart BT
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui]
    subgraph broker
    direction LR
    Topic --> Partition
    end
A[Producer] --> |write| broker --> |read| C[Consumer]

Figure: Kafka Process Flow

This week focuses on problems and practices related to Kafka in data engineering.

Managing and Evolving Data Schemas

Apache Avro is a data serialization framework designed for efficient and compact data serialization.

mindmap
  root((Managing and Evolving Data Schemas))
    Challenges
      Complex schema management at large scale
      Schema Versioning
    Benefits
      Serialized data improves processing speed
      Backward and forward compatibility without disruption
    Schema Registry
      Centralized schemas
      Validate schemas during read & write
      Data consistency

Figure: How Kafka manages schemas

Serialized Data

Challenges and Pain Points:

  • Complex Schema Management: Handling schema evolution can be complex, especially with large data systems.
  • Schema Compatibility: Ensuring compatibility between different versions of schemas can be challenging.

Use Cases

  • Efficient Data Serialization: Useful for high-throughput systems requiring compact and fast serialization.
  • Schema Evolution: Ideal for systems where schemas evolve over time, requiring backward and forward compatibility.

Benefits

  • Compact and Fast: Provides efficient binary format which minimizes data size and maximizes processing speed.
  • Schema Evolution: Supports backward and forward compatibility, allowing changes to schemas without disrupting existing processes.

Schema Registry

  • Stores Avro schemas centrally and validates schemas during read and write operations.
  • Ensures data integrity and consistency across systems.

Building Real-Time Stream Processing Applications

Kafka Streams is a client library for building real-time applications and microservices that process data stored in Kafka clusters.

mindmap
  root((Building Real-Time Stream Processing Applications))
    Challenges
      Complex Stream Processing
      Stateful operations (e.g., reduce) are risky for real-time analytics
    Key Features
      DSL API
      KStreams and KTables
    Common Operations
      Filtering
      Mapping
      Joining

Figure: Using Kafka Streams to build a real-time pipeline

Challenges and Pain Points:

  • Complex Stream Processing: Building and maintaining stream processing applications can be complex.
  • State Management: Managing stateful operations in streams requires careful planning and resources.

Use Cases

  • Real-Time Data Processing: Ideal for applications needing real-time analytics and processing.
  • Event-Driven Applications: Useful for building systems that react to streaming events.

Key Features

  • DSL API: Allows defining stream processing logic using a Domain-Specific Language.
  • KStreams and KTables: Core abstractions for processing data streams and tables.

Common Operations

  • Filtering: Exclude records based on specific criteria.
  • Mapping: Transform records into different formats or structures.
  • Joining: Combine multiple streams into a single unified stream.

Integrating Data with Kafka Connect and Querying with KSQL

Kafka Connect is a tool for scalably and reliably streaming data between Kafka and other systems.

mindmap
  root((Integrating Data with Kafka Connect and Querying with KSQL))
    Challenges
      Complex connector configuration
      Integrating with various sources and sinks
    Use Cases
      Connect Sources and Sinks
      SQL engine for Kafka
      Real-time insight

Figure: Querying streaming data with KSQL

Challenges and Pain Points:

  • Connector Configuration: Setting up and configuring connectors can be complex.
  • Data Integration: Integrating with various data sources and sinks may require extensive configuration and customization.

Use Cases

  • Data Integration: Simplifies integration with external systems and databases.
  • ETL Pipelines: Streamlines ETL processes by connecting data sources and sinks.

Key Features

  • Connectors: Plugins to integrate with a variety of data sources and sinks.
  • KSQL: A streaming SQL engine for Kafka, allowing SQL-like queries for stream processing.

Challenges and Pain Points:

  • Query Performance: Complex queries may impact performance and require optimization.
  • Limited Functionality: Some advanced stream processing features may not be supported.

Use Cases

  • Stream Processing: Perform data filtering, transformations, and aggregations using SQL-like queries.
  • Real-Time Analytics: Enable real-time insights and analytics on streaming data.

Processing Streaming Data

Kafka is used for building real-time data pipelines and event-driven architectures.

mindmap
  root((Processing Streaming Data))
    Challenges
      Real-Time Data Processing
      Scalability with increased data volume and complexity
    Use Cases
      "Event-Driven Architectures"
      Real-Time Analytics
      Centralize Logging System
      "Distributing Alert & Monitoring"
      Communication Channels
    Tools and Techniques
      Kafka Streams
      Third-Party Tools

Figure: Process Kafka Data

Challenges and Pain Points:

  • Real-Time Data Processing: Handling high-volume, real-time data streams can be challenging.
  • Scalability: Ensuring the system scales effectively with increased data volume and complexity.

Use Cases

  • Event-Driven Architectures: Systems that respond to events in real-time.
  • Real-Time Analytics: Applications requiring continuous data analysis and processing.

Tools and Techniques

  • Kafka Streams: For stream processing within the Kafka ecosystem.
  • Third-Party Tools: Integration with tools like Apache Flink, Apache Spark, etc.

Handling Late Data

Handling late-arriving data involves ensuring accurate and complete data processing despite delays.

mindmap
  root((Handling Late Data))
    Challenges
      Process late-arriving data
      Complex handling of ordering, idempotency, and duplication
    Techniques
      Windowing groups events into time windows
      Watermarks track event-time progress
      Reprocessing re-computes results
    Use Cases
      Event Time Management
      Reprocessing

Figure: Handling late-arriving data

Challenges and Pain Points:

  • Data Accuracy: Ensuring data accuracy when processing late-arriving data.
  • Complexity: Managing late data requires complex techniques and handling.

Use Cases

  • Event Time Management: Process data based on event time, even when data arrives late.
  • Reprocessing: Recompute results or adjust data based on late-arriving data.

Techniques

  • Windowing: Group events into time windows for processing.
  • Watermarks: Track event time progress to handle late data.
  • Reprocessing: Recompute results considering late-arriving data.

Common use cases for late-arriving data in real-time processing

Each technique can be illustrated with a sequence diagram.

Windowing (Total Profit over T Days)

sequenceDiagram
    participant User as User
    participant PurchaseEvent as Purchase Event System
    participant Windowing as Windowing System
    participant Aggregation as Aggregation Engine

    User->>PurchaseEvent: Make purchase at timestamp t1
    PurchaseEvent->>Windowing: Send purchase event at timestamp t1
    Windowing->>Windowing: Add purchase event to the 1-hour window
    Windowing->>Windowing: Time passes
    Windowing->>Aggregation: Close 1-hour window at t2
    Aggregation->>Aggregation: Aggregate purchase data
    Aggregation->>User: Emit aggregated purchase data for the past hour

Figure: Calculating total profit for the window from t1 to t2

Watermarks (Profit from Recently Sold Items)

sequenceDiagram
    participant User as User
    participant PurchaseEvent as Purchase Event System
    participant Watermark as Watermark Generator
    participant Processing as Sales Processing System

    User->>PurchaseEvent: Make purchase at timestamp t1
    PurchaseEvent->>Watermark: Send event timestamp t1
    Watermark->>Processing: Generate watermark for timestamp t1
    Processing->>Processing: Process sales data up to watermark
    User->>PurchaseEvent: Make late purchase at timestamp t2
    PurchaseEvent->>Processing: Send late purchase event for timestamp t2
    Processing->>Processing: Adjust daily sales total for late purchase
    Processing->>Watermark: Update watermark for timestamp t2

Figure: Calculating profit for items sold at t1 and t2, adjusted for the late event

Reprocessing (Last T Days' Profit)

sequenceDiagram
    participant User as User
    participant PurchaseEvent as Purchase Event System
    participant Processing as Initial Sales Processing
    participant LateData as Late Data Handler
    participant Aggregation as Sales Aggregation Engine

    User->>PurchaseEvent: Make purchase for Day 1
    PurchaseEvent->>Processing: Send purchase data for Day 1
    Processing->>Aggregation: Calculate initial daily sales for Day 1
    Aggregation->>User: Emit daily sales report for Day 1
    User->>PurchaseEvent: Make late purchase for Day 1
    PurchaseEvent->>LateData: Send late purchase data
    LateData->>Processing: Reprocess sales data for Day 1
    Processing->>Aggregation: Recalculate sales totals including late data
    Aggregation->>User: Emit updated daily sales report for Day 1

Figure: Recalculating total profit for the last T days after late data arrives

Starting Kafka 101

Download the starter code for the Kafka ecosystem, including the Docker Compose file and validation code.

Starting the cluster

docker-compose up

Command line for Kafka

Open a shell in the broker container using docker exec -it <broker-name> bash

Ensure the Kafka CLI is available inside the container by checking its version:

root@broker:/# kafka-topics --version
5.4.0-ce (Commit:a2eae1986f4792d4)

Create a topic

kafka-topics --create --topic streamPayment --bootstrap-server localhost:9092 --partitions 2

Check that the topic was created successfully:

root@broker:/# kafka-topics --list --bootstrap-server localhost:9092 | grep Payment

streamPayment

Then run the python3 producer.py and python3 consumer.py commands.

Figure: Kafka streaming messages between the producer and consumer

Calculating Capacity to Consume Real-Time Data

To calculate the number of topics, partitions, and consumers required for two streaming processes with the given metrics, break it down as follows:

Definitions

  • MPS: Messages Per Second
  • TPS: Transactions Per Second
  • Topic: A category or feed name to which records are stored and published.
  • Partition: A way to split a topic’s data across multiple brokers in Kafka.
  • Consumer: An application that reads data from topics.

Assumptions

  • Message Size: Assume the average size of a message (in bytes). We’ll consider a typical size, e.g., 1 KB per message.
  • Throughput Capacity: Assume the maximum throughput capacity for a consumer (e.g., 1 MBps per consumer).
  • Redundancy & Replication Factor: Assume a replication factor for data reliability (e.g., 3).
flowchart BT
%% Add a transparent text node as a watermark
style Watermark fill:none,stroke:none
Watermark[Created by: LongBui]
   subgraph Query_Batch_Process
        direction TB
        Topic1[Topic: Query Batch\n10 MPS\nMessage Size: ~1KB] --> Partition1[Partition 1\n10 MPS]
        Partition1 --> Consumer1[Consumer 1\n10 MPS]
    end

    subgraph Real_Time_Process
        direction TB
        Topic2[Topic: Real-time\n10k TPS\nMessage Size: ~1KB] --> Partition2[Partition 2\n1k TPS]
        Topic2 --> Partition3[Partition 3\n1k TPS]
        Topic2 --> Partition4[Partition 4\n1k TPS]
        Topic2 --> Partition5[Partition 5\n1k TPS]
        Topic2 --> Partition6[Partition 6\n1k TPS]
        Topic2 --> Partition7[Partition 7\n1k TPS]
        Topic2 --> Partition8[Partition 8\n1k TPS]
        Topic2 --> Partition9[Partition 9\n1k TPS]
        Topic2 --> Partition10[Partition 10\n1k TPS]
        Topic2 --> Partition11[Partition 11\n1k TPS]

        Partition2 --> Consumer2[Consumer 2\n1k TPS]
        Partition3 --> Consumer3[Consumer 3\n1k TPS]
        Partition4 --> Consumer4[Consumer 4\n1k TPS]
        Partition5 --> Consumer5[Consumer 5\n1k TPS]
        Partition6 --> Consumer6[Consumer 6\n1k TPS]
        Partition7 --> Consumer7[Consumer 7\n1k TPS]
        Partition8 --> Consumer8[Consumer 8\n1k TPS]
        Partition9 --> Consumer9[Consumer 9\n1k TPS]
        Partition10 --> Consumer10[Consumer 10\n1k TPS]
        Partition11 --> Consumer11[Consumer 11\n1k TPS]
    end

Figure: Design Partition for Process Kafka Streaming

Given

  • Query Batch Process: 10 MPS (Messages Per Second)
  • Real-time Process: 10k TPS (Transactions Per Second)

Here’s the structured table for the calculations:

| Process             | Metric             | Calculation                                                    | Result                   |
| ------------------- | ------------------ | -------------------------------------------------------------- | ------------------------ |
| Query Batch Process | Total Messages     | 10 MPS                                                         | 10 MPS                   |
|                     | Message Size       | Assuming 1 KB per message                                      | 1 KB per message         |
|                     | Throughput         | 10 MPS * 1 KB                                                  | 10 KBps                  |
|                     | Partitions Needed  | Each partition handles 1 MBps (1024 KBps); 10 KBps / 1024 KBps | 1 partition              |
|                     | Consumers Needed   | 1 consumer per partition                                       | 1 consumer               |
| Real-time Process   | Total Transactions | 10,000 TPS                                                     | 10,000 TPS               |
|                     | Message Size       | Assuming 1 KB per message                                      | 1 KB per message         |
|                     | Throughput         | 10,000 TPS * 1 KB                                              | 10 MBps                  |
|                     | Partitions Needed  | Each partition handles 1 MBps; 10 MBps / 1 MBps                | 10 partitions            |
|                     | Consumers Needed   | 1 consumer per partition                                       | 10 consumers             |

Each process gets its own topic, so 2 topics are needed (1 per process).

  • Query Batch Process: Total Partitions: 1, Total Consumers: 1.

  • Real-time Process: Total Partitions: 10, Total Consumers: 10.

Additional Considerations

  • Replication Factor: If the replication factor is 3, you’ll need to multiply the storage and bandwidth by 3 for redundancy, but this does not impact the number of partitions or consumers directly.
  • Scaling: If message sizes increase or the processing speed needs to be faster, consider increasing the number of partitions and consumers accordingly.