How Our Team Uses Kafka


🗒️ Note: This article is an English translation of "우리 팀은 카프카를 어떻게 사용하고 있을까" (How Our Team Uses Kafka), published in May 2024, prepared for our global affiliates and readers.


This article is for those who already know about Kafka. Rather than focusing on technical implementations, it provides a quick and broad overview of Kafka-based technical concepts and shares how our team applies them in practice. If you’re interested in more technical details, please check the official documentation or other technical blogs. Relevant links are included in the “References” section at the end.

The following keywords will be discussed. Each concept is explained briefly in its section, but some prior knowledge will help you grasp the concepts more easily.

  • Kafka
  • Transactional Outbox Pattern
  • Event Bus
  • Kafka Streams

First, let’s take a brief look at Kafka. Kafka is a distributed streaming platform used to handle large volumes of data and transmit them in real time. All data is recorded in a log format on the file system. Here, "log" refers to an append-only, fully time-ordered sequence of records, allowing for centralized processing of data flow. This setup enables the collection of large-scale data and real-time streaming consumption.

Messages (records) are recorded in the order sent by the producer, ensuring sequence integrity. An offset denotes the position up to which a consumer group has consumed messages in a partition. Each consumer group maintains its own offsets, allowing multiple consumer groups to independently consume data from the same source. This enables multiple consumers to access messages from a single origin without loss or alteration, and data analysis can be performed based on the source data.
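To make the offset and consumer-group behavior concrete, here is a minimal consumer sketch. It is only an illustration, not code from our services: the topic name and group ID below are assumptions. A second application configured with a different group.id would read the same records independently, because each group tracks its own offsets.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DeliveryEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analysis-consumers"); // each group keeps its own offsets
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("delivery")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The offset identifies this record's position within its partition.
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}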

Key Terms:

Below are some basic Kafka terms and concepts you’ll need to understand this article.

  • Topic: A topic represents a category of data, with logs organized by name. When sending a message, a specific topic is specified.
  • Partition: A partition refers to a sequence of ordered logs. A topic can be divided into one or more partitions. Partitions support parallel processing and manage data distribution and replication.
  • Record: A record is the basic data unit, consisting of a key and a value (a key-value pair).
  • Offset: An offset is an identifier that indicates the position of a record within a partition.
  • Producer: A producer is responsible for sending data to topics, generating messages and sending them to specified topics.
  • Consumer: A consumer reads data from topics, polling and processing messages from specific topics. Consumer groups are made up of multiple consumer instances that share partitions of a specific topic, enabling parallel data processing and increased throughput.
  • Kafka Connector: A connector, built on the Kafka Connect framework, facilitates integration between Kafka and external systems such as MySQL and S3.
    • Source Connector: Brings data from an external system into Kafka (the publishing side).
    • Sink Connector: Sends data from Kafka to an external system (the consuming side).

Our Delivery Service Team plays a key role in relaying over 1 million Baemin Delivery (a service where deliveries are made by Baedal Minjok’s own fleet) orders generated daily. The team receives orders from various order services (such as Baemin Delivery, BMart, and Baemin Store), then distributes them across multiple delivery services, overseeing and managing the entire delivery process. The team uses a distributed event-driven architecture for order and delivery processing, with Kafka as a core technology.

Below is a brief overview of our team’s distributed system structure.

The system consists of order-delivery servers that manage the delivery process by receiving order events, and analysis servers that process events for data analysis. As with many services, each server group is made up of multiple servers to enhance throughput and performance.

[1] Process Orders and Deliveries Reliably

Preview

– Kafka is used as an event broker to ensure the order of domain events.
– The Transactional Outbox Pattern is applied using the MySQL source connector to manage data and message transmission within a single transaction in a distributed system, ensuring data consistency.

The primary goal is to handle each order reliably, from placement until it is delivered to the customer. To manage this process without confusion, the sequence of order and delivery events is crucial, and no events may be missed. Kafka is used as the event broker to guarantee event order, while the Transactional Outbox Pattern prevents event loss and ensures that failed events are retried in the correct sequence.

Guaranteeing Event Sequences

The delivery process is illustrated briefly below. During a delivery, multiple events are triggered, some of which change the delivery state. The delivery status progresses through a defined sequence. Certain events may occur almost simultaneously and trigger changes in the delivery state. To ensure that the delivery process runs smoothly without confusion, it is essential to guarantee the sequence of delivery events.

For example, a "dispatched" event and a "pickup preparation requested" event can occur almost simultaneously. The producer may send the “pickup preparation requested” event after the “dispatched” event, but due to network issues, the consumer might receive the “pickup preparation requested” event before the “dispatched” event. If the order is not guaranteed, the consumer may be confused about which event occurred first, causing issues in business logic processing.

Kafka ensures ordering by letting consumers consume messages in the sequence they were published within a partition. Topics such as order, delivery, and analysis can be structured by purpose within a Kafka cluster, and each topic can have multiple partitions for parallel processing. Kafka guarantees the sequence of messages sent by the producer within the same partition. If messages have the same key, they are assigned to the same partition, and only one consumer in a group processes that partition. By using keys such as the order ID or delivery ID for messages whose sequence must be preserved, the order of events is guaranteed. This also reduces the concurrency issues that can arise when messages are published at almost the same time, since the partition preserves the order in which the producer sent them.
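As a rough illustration of keyed publishing (not our production code; the topic name, key, and payloads below are assumptions), using the delivery ID as the record key keeps all events of one delivery in the same partition, and therefore in order:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeliveryEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String deliveryId = "delivery-12345"; // hypothetical key: the delivery ID

            // Both events share the same key, so they are routed to the same partition
            // and will be consumed in the order they were sent.
            producer.send(new ProducerRecord<>("delivery", deliveryId, "DISPATCHED"));
            producer.send(new ProducerRecord<>("delivery", deliveryId, "PICKUP_PREPARATION_REQUESTED"));
        }
    }
}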

Data Consistency

We store the data required for business logic in a MySQL database and publish events to Kafka. If there is an issue with Kafka, the updated delivery state may be stored in the database, but the event might not be successfully published. Let’s consider the case of a canceled order. When an order is canceled, the delivery is also canceled, and the database records that the delivery has been canceled. However, if event publication fails, the consumer (i.e., the other system processing the delivery) may not receive the cancellation message and the delivery can still proceed. Therefore, it is necessary to manage data and message publication as a single transaction to ensure data consistency.

Due to infrastructure or connection issues, or problems like timeouts, adding messages to the topic may fail. Retries are needed to prevent loss and ensure consistency. During the retry process, we wanted to maintain the order of messages while minimizing the impact on other business event processing. To handle failures in event publishing, we applied the Transactional Outbox Pattern so that retries can be attempted with both order and impact in mind.

The Transactional Outbox Pattern is a strategy used in distributed systems to ensure data consistency and the atomicity of message publishing by combining database transactions with a message queue. Consider the case where an event needs to be sent after a transaction completes in a distributed system. If the transaction fails, the data will be rolled back, but the event may still be sent. Likewise, if an issue occurs during message transmission, atomicity between the data change and the published message is not guaranteed. The core idea of the pattern for addressing these cases is as follows:

  1. Changes are recorded in an Outbox table in the same database as part of the business transaction.
  2. Messages are sent whenever new records are added to the Outbox table.


• Source: Pattern: Transaction log tailing
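As a sketch of the write side of this pattern (the class, repository, and helper names are hypothetical, not our team’s actual code), the business change and the outbox row are saved in the same database transaction, and the connector later publishes whatever is committed to the Outbox table:

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class DeliveryCancelService {

    private final DeliveryRepository deliveryRepository;     // hypothetical JPA repository
    private final DeliveryOutboxRepository outboxRepository; // hypothetical repository for the Outbox table

    public DeliveryCancelService(DeliveryRepository deliveryRepository,
                                 DeliveryOutboxRepository outboxRepository) {
        this.deliveryRepository = deliveryRepository;
        this.outboxRepository = outboxRepository;
    }

    @Transactional
    public void cancel(Long deliveryId) {
        Delivery delivery = deliveryRepository.findById(deliveryId).orElseThrow();
        delivery.cancel(); // 1. the state change, inside the transaction

        // 2. the event row, written in the same transaction; if anything rolls back,
        //    the outbox row disappears too and no cancellation event is ever published.
        outboxRepository.save(new DeliveryOutbox(
                delivery.getId(), "DELIVERY_CANCELED", delivery.toPayloadJson())); // toPayloadJson(): hypothetical serializer
    }
}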

To implement the pattern described above, we use the MySQL Kafka connector provided by the Debezium library. Debezium is an open-source library that detects changes in the database and transforms them into an event stream. It uses log tailing (Change Data Capture) to read changes from the binlog and sends them to the configured topics. MySQL records committed transactions in the binlog, and Debezium reads these logs sequentially. If the business transaction is rolled back, the record in the Outbox table is rolled back with it, so no event is published for uncommitted changes and data consistency is maintained within a single transaction. The MySQL source connector in Debezium enforces the use of a single task, guaranteeing message order within a single connector.

If a single task operates more slowly than the rate at which data accumulates in the table, messages may be delayed. To increase throughput, we split the Outbox tables into groups by topic. Each group consists of multiple Outbox tables distinguished by identifiers, such as delivery-outbox1, delivery-outbox2, and delivery-outbox3. Each table is connected to its own connector, distributing the workload and improving throughput. Events with the same key are always stored in the same Outbox table and published in the order they were stored; since each table is read by a single connector, the order of messages with the same key is preserved.
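A minimal sketch of how such a split could be keyed (the table count and routing logic are illustrative assumptions, not our actual implementation): rows are routed to a table by hashing the key, so the same delivery ID always lands in the same table and keeps its order.

public final class OutboxTableRouter {

    private static final int OUTBOX_TABLE_COUNT = 3; // assumed number of split tables

    private OutboxTableRouter() {}

    // Same key -> same table, so the single-task connector attached to that table
    // preserves per-key ordering.
    public static String tableFor(String deliveryId) {
        int index = Math.floorMod(deliveryId.hashCode(), OUTBOX_TABLE_COUNT) + 1;
        return "delivery-outbox" + index; // e.g. delivery-outbox2
    }
}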

[2] Use Kafka as an Event Bus

Preview

– Use Kafka as an event bus to notify the distributed system.

In a distributed system, multiple servers form a server group. When a value changes on one server, all servers in the group must learn the updated value. The delivery server determines to which delivery service a delivery should be distributed by managing distribution rules in memory. Without a broadcast mechanism, only the delivery server that consumed the distribution rule change would update its rules, while all other delivery servers would keep the old ones. So when an operator changes these rules, all delivery servers need to be informed of the update. Kafka is used as an event bus to notify the servers of the changes so that they can be applied everywhere.

Kafka is used as the event bus through Spring Cloud’s RemoteApplicationEvent. A dedicated topic (e.g., event-bus) is used, and each server on the bus is given a unique ID in the format ${serverName}:${identifier}. Events that extend RemoteApplicationEvent are defined with the desired destination (server group), and Spring Cloud delivers each published application event to the servers whose ID matches that destination. By subscribing to these defined events, the destination servers can receive and reflect the updated values. This ensures that in-memory values are consistently managed across the distributed system.

The DeliveryServiceRemoteApplicationEvent class, which extends RemoteApplicationEvent, is defined as an abstract class. Concrete implementations are created based on specific characteristics, and events are published wherever needed. This is used when all servers in a server group need to initialize or change values stored in memory. Below is example code defining a RemoteApplicationEvent for changing distribution rules.

public abstract class DeliveryServiceRemoteApplicationEvent extends RemoteApplicationEvent {

    protected DeliveryServiceRemoteApplicationEvent(String destination) {
        super(SOURCE, ORIGIN, DESTINATION_FACTORY.getDestination(destination));
    }
}

// CustomRemoteEvent of delivery distribution rule
public class RouteRuleRemoteEvent extends DeliveryServiceRemoteApplicationEvent {

    public RouteRuleRemoteEvent() {
        super("delivery"); // destination: delivery server
    }
}

Below is an example code that publishes and consumes a RemoteApplicationEvent. When the distribution rule changes and needs to be updated, the method is executed, and the updated rules are reflected across the entire delivery server group.

public void load() {
    routeRuleSetStore.load();
    remoteApplicationEventPublisher.publishEvent(new RouteRuleRemoteEvent());
}

@EventListener
public void handle(RouteRuleRemoteEvent event) {
    routeRuleSetStore.load();
}

The Delivery Service Team manages the event bus topic with a single partition. This is because high throughput is not needed for configuration changes, and all servers in the same group should receive the same updates. Since a partition’s offset is shared within a consumer group, each server must have a different consumer group ID. By using different consumer group IDs, multiple servers can update their settings from a single event publication. If a consumer group name is not specified, Spring Cloud assigns a random consumer group ID prefixed with "anonymous." Thus, for each connected server, a consumer group ID in the format anonymous.{identifier} is created, which we filter out when monitoring our systems.

[3] Analyze Data for Better Delivery

Preview

– Provide data in a suitable form for analysis.
– Use Kafka Streams to aggregate real-time delivery information to monitor delivery state.

Although batch processing can be used to provide data for analysis, batch jobs are run at regular intervals, making it difficult to reflect real-time data. Our team wanted to monitor delivery state by retrieving data in real time or near-real time and reflect it in our service. We are using Kafka Streams as the technology to meet these requirements.

Kafka Streams is a library that allows for event-based data (record) processing within Kafka. In simple terms, Kafka Streams is a real-time aggregation and analytics system, widely used as a platform for real-time data streaming and analysis. A Kafka Streams application processes the flow of data through preprocessing and stream connections, handling data by creating new streams and producing results accordingly.

Provide Processed Data for Analysis

After receiving delivery events, the analysis server processes them through a separate stream, reformats the data into a form better suited for analysis, and republishes it to an analysis topic. The original topics and the analysis topics are kept distinct because they serve different purposes. Since they require different throughput and resources, the topics and servers are configured separately so that resources can be allocated and adjusted as appropriate. This separation keeps the service topics used in key business logic apart from the analysis topics, isolating the impact in the event of an issue.
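A minimal Kafka Streams sketch of this republishing step, with illustrative topic names and an assumed reformat() helper (not our team’s actual topology):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class AnalysisTopologyExample {

    public static KafkaStreams build(java.util.Properties streamsConfig) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("delivery", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(AnalysisTopologyExample::reformat) // reshape the event for analysis
               .to("delivery-analysis", Produced.with(Serdes.String(), Serdes.String()));

        return new KafkaStreams(builder.build(), streamsConfig);
    }

    // Hypothetical transformation into the analysis-friendly schema.
    private static String reformat(String originalEvent) {
        return originalEvent;
    }
}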

Deliveries progress in sequence, from creation, dispatch, and pickup to completion, with an event published for each action. For analysis, it’s often more useful to view aggregated delivery information at the delivery level rather than the individual events of a delivery. In other words, there are cases where key information about a single delivery, including when it was created, dispatched, and completed, needs to be checked in an aggregated manner. To address such needs, preprocessing is used to aggregate the multiple events related to a single delivery into a summarized report. Here, Redis serves as temporary storage for this aggregated data. As delivery events arrive on the original delivery topic, key order and delivery information is stored in Redis. Whenever events published during the course of a delivery are received, key information such as the time of each delivery event is updated accordingly. Upon completion, the delivery data is deleted from Redis, and a new integrated delivery event consisting of the meaningful information is published to the analysis topic to provide aggregated data for each delivery.
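Below is a rough sketch of this aggregation step under the description above; the listener, topic, field names, and the DeliveryEvent type are assumptions, not our actual implementation. Each event updates a per-delivery hash in Redis, and on completion the accumulated fields are published as one integrated event to the analysis topic.

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class DeliveryAggregator {

    private final StringRedisTemplate redis;
    private final KafkaTemplate<String, String> kafka;

    public DeliveryAggregator(StringRedisTemplate redis, KafkaTemplate<String, String> kafka) {
        this.redis = redis;
        this.kafka = kafka;
    }

    @KafkaListener(topics = "delivery", groupId = "analysis")
    public void on(ConsumerRecord<String, String> record) {
        DeliveryEvent event = DeliveryEvent.parse(record.value()); // hypothetical event type/parser
        String redisKey = "delivery:" + event.deliveryId();

        // Record when this step happened, e.g. createdAt, dispatchedAt, completedAt.
        redis.opsForHash().put(redisKey, event.type() + "At", event.occurredAt().toString());

        if ("DELIVERY_COMPLETED".equals(event.type())) {
            Map<Object, Object> summary = redis.opsForHash().entries(redisKey);
            kafka.send("delivery-analysis", event.deliveryId(), summary.toString()); // integrated event
            redis.delete(redisKey);                                                  // temporary state no longer needed
        }
    }
}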

Using the S3 Sink Connector, events from the analysis topic are sent to AWS S3, an object storage service, for permanent storage. This separation of event storage and business logic processing minimizes the mutual impact of analysis services and business services. Data stored in S3 can be queried via AWS Athena, allowing access to historical records without overloading the business service repository. Analytics tools can integrate with this data so that business or operations teams can analyze past deliveries on a monthly basis or use them for settlement purposes.

The diagram below illustrates the flow of an integrated delivery event and how real-time data is provided afterward.

Provide Real-Time Data

Since we want to understand real-time delivery data, such as how many deliveries are waiting for dispatch, how long it takes for a delivery to be dispatched, and the inflow volume of orders by service, we use Kafka Streams applications to aggregate real-time delivery data. This aggregated data is visualized through a Grafana dashboard, enabling us to respond to operational situations and analyze the delivery infrastructure level.

An example of real-time aggregation is the counting of deliveries by state. A state store ("statestore") called latest-delivery is built from the flow of records (stream) in the analysis delivery topic. A statestore temporarily stores keys and values. The latest-delivery statestore determines the latest state of each delivery based on record time, using delivery identifiers as keys and the corresponding data as values, allowing us to aggregate delivery counts by state. A separate state store (count-per-status) stores the delivery state as the key and the count for each state as the value, which enables us to quickly view aggregated data for each delivery state in real time. Grafana gauges are then used to visualize these counts on a dashboard, enabling us to monitor in real time whether too many deliveries are accumulating in certain states or whether potential issues are emerging in the delivery process. Shown below is an example of a dashboard in a beta environment.
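Complementing the dashboard, the statestore wiring described above could look roughly like the following Kafka Streams sketch. Only the store names latest-delivery and count-per-status come from the description; the topic name, the deliverySerde, and the DeliveryEvent type (with deliveryId(), status(), and an epoch-millisecond timestamp()) are assumptions.

// A hedged fragment, not our actual topology; assumes the Kafka Streams DSL imports
// (StreamsBuilder, KTable, KeyValue, Consumed, Grouped, Materialized, Serdes).
StreamsBuilder builder = new StreamsBuilder();

KTable<String, DeliveryEvent> latestDelivery = builder
        .stream("delivery-analysis", Consumed.with(Serdes.String(), deliverySerde))
        .groupByKey()
        .reduce((previous, current) ->
                        current.timestamp() >= previous.timestamp() ? current : previous,
                Materialized.as("latest-delivery"));          // latest event per delivery ID

KTable<String, Long> countPerStatus = latestDelivery
        .groupBy((deliveryId, event) -> KeyValue.pair(event.status(), event),
                 Grouped.with(Serdes.String(), deliverySerde))
        .count(Materialized.as("count-per-status"));          // delivery count per state

// When a delivery's latest state changes, the count for the old state goes down
// and the count for the new state goes up; the store can be queried for the dashboard.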

Real-time aggregated data helps monitor and respond to current delivery situations. This data can be leveraged to provide additional useful functionalities, such as automating anomaly detection, sending alerts for potential issues, and minimizing the impact of disturbances by reducing order inflow before they escalate into larger problems.

So far, I have briefly explained the concept and characteristics of Kafka and how our team uses it. As of the time I’m writing this post, I have been working as a developer for just over two years and have not yet had the experience of leading the design or implementation of the systems used by my team. For those interested in more technical explanations, I have compiled detailed resources provided by experts and official documentation at the end.

In this article, I focused more on providing an overview of how our team uses Kafka, rather than delving into the technical details of Kafka itself. I wanted to better understand and use the technology myself and, at the same time, to share examples and learn more together with others. I have included links related to this topic in the References section. If you are interested in more detailed information, I recommend checking out the WOOWACON videos as well.

References

  • Book
  • Link
  • WOOWACON 2023 Video (in Korean)
