Platforms like Kafka were built to abstract away hard parts of distributed computing (replication, consistency) while storing and processing logs, yet so many of us miss out on the amazing engineering that went into building the different parts of such a platform.
4 years later, I'm hungry enough to reverse engineer one of my favourite distributed platforms - Kafka!
What did Zookeeper do?
To quote the first Kafka paper, from 2011:
Kafka uses Zookeeper for the following tasks:
(1) detecting the addition and the removal of brokers and consumers,
(2) triggering a rebalance process in each consumer when the above events happen,
(3) maintaining the consumption relationship and keeping track of the consumed offset of each partition
So Zookeeper was essentially a record keeper for brokers and consumers, and it maintained multiple registries.
Broker registry - the broker's metadata (host name, port) and the set of topics and partitions the broker holds.
Consumer registry - information about the consumer, the topics it has subscribed to and the consumer group it belongs to.
Ownership registry - a record of which consumer owns which partition. Each consumer group has an ownership registry associated with it, mapping every partition to the consumer consuming it.
Offset registry - the last consumed offset of each partition, per consumer group.
Revision note: within a consumer group, a partition's messages are consumed by only one consumer. A consumer group subscribing to a topic consumes all messages across the topic's partitions by assigning ownership of each partition to exactly one of its consumers.
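To make the registries concrete, here's a toy sketch of the znode layout as a plain Python dict. The paths, names and values are illustrative approximations loosely based on the layout described in the paper, not the exact schema Kafka used.

```python
# Illustrative only: a plain dict standing in for the znode tree that Zookeeper
# kept in the pre-KRaft design. Paths and values are simplified approximations.
zookeeper_registries = {
    # Broker registry: ephemeral znodes, one per live broker
    "/brokers/ids/0": {"host": "kafka-0.example.com", "port": 9092},
    "/brokers/ids/1": {"host": "kafka-1.example.com", "port": 9092},

    # Consumer registry: each consumer, its subscriptions and its group
    "/consumers/analytics-group/ids/consumer-a": {"subscriptions": ["clicks"]},
    "/consumers/analytics-group/ids/consumer-b": {"subscriptions": ["clicks"]},

    # Ownership registry: exactly one consumer in the group owns each partition
    "/consumers/analytics-group/owners/clicks/0": "consumer-a",
    "/consumers/analytics-group/owners/clicks/1": "consumer-b",

    # Offset registry: last consumed offset per partition, per group (the only
    # part that needed to survive a rebalance)
    "/consumers/analytics-group/offsets/clicks/0": 41823,
    "/consumers/analytics-group/offsets/clicks/1": 40917,
}
```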
With these registries in mind, Zookeeper would help rebalance the consumers whenever a new consumer joined the consumer group or an existing one left or shut down. Since Kafka guarantees at-least-once delivery of messages, persisting just the offset registry in Zookeeper helped minimize the number of messages re-delivered to a consumer after a rebalance. All other registries were deleted as soon as the corresponding consumer or broker disconnected.
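Here's a minimal sketch of the rebalancing idea: every consumer deterministically recomputes the partition-to-consumer assignment whenever group membership changes. This is a simplified range-style assignment, not Kafka's exact algorithm, and all the names are made up.

```python
# A minimal sketch (not Kafka's actual implementation) of a consumer-group
# rebalance: recompute partition -> consumer ownership deterministically
# whenever the set of live consumers changes.

def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[int, str]:
    """Range-style assignment: sort both sides, give each consumer a contiguous chunk."""
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per_consumer + (1 if i < extra else 0)
        for p in partitions[start:start + count]:
            assignment[p] = consumer
        start += count
    return assignment

# Before the rebalance: two consumers share four partitions.
print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"]))
# After "consumer-c" joins, everyone recomputes and ownership shifts,
# but each partition still has exactly one owner.
print(assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b", "consumer-c"]))
```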
In addition to being a source of truth for the consumers, Zookeeper would facilitate controller election among the brokers. One broker would be elected as a leader that could then coordinate replication among the other brokers. When that leader broker (the controller) crashed, Zookeeper would notify the remaining brokers in the cluster, which would then race to register an ephemeral znode (a data node that is deleted when the session that created it disconnects) as the new leader. Zookeeper would simply make the first requesting broker the controller.
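The election itself boils down to "first writer wins" on an ephemeral znode. The toy model below captures just that behaviour; there's no real Zookeeper client involved, and the class, path and broker names are all made up.

```python
# Toy model of controller election: whoever creates the ephemeral /controller
# znode first becomes the controller; when its session dies the znode vanishes
# and the surviving brokers race to create it again.

class ToyZookeeper:
    def __init__(self):
        self.ephemeral = {}   # path -> owning session

    def create_ephemeral(self, path: str, owner: str) -> bool:
        """Creation succeeds only if the znode doesn't exist yet (first writer wins)."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = owner
        return True

    def session_expired(self, owner: str):
        """Ephemeral znodes are removed when the creating session disconnects."""
        self.ephemeral = {p: o for p, o in self.ephemeral.items() if o != owner}

zk = ToyZookeeper()

# Initial election: broker-1 happens to ask first and wins.
for broker in ["broker-1", "broker-2", "broker-3"]:
    if zk.create_ephemeral("/controller", broker):
        print(f"{broker} is now the controller")

# broker-1 crashes -> its ephemeral znode is deleted, the others get notified
# (watches omitted here) and race to create /controller again.
zk.session_expired("broker-1")
for broker in ["broker-2", "broker-3"]:
    if zk.create_ephemeral("/controller", broker):
        print(f"{broker} took over as the controller")
```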
Bottlenecks in using Zookeeper
As you may have guessed, the amount of coordination traffic among consumers grew linearly as the number of partitions, topics and consumers increased.
Even in the case of a controller failure, the new leader would have to retrieve all the metadata from Zookeeper and update it with its own host name and port, which contributed to an unavailability window.
While none of these issues were deal-breakers, the designers preferred to simplify the architecture and build an even more resilient system.
The KRaft Protocol
The shift away from Zookeeper didn't happen in a single release. Changes arrived across several updates, such as moving the offset metadata into an internal Kafka topic (__consumer_offsets) instead of having Zookeeper store it.
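Conceptually, storing offsets in a topic just means appending commit records keyed by (group, topic, partition) and letting log compaction keep the latest value per key. The sketch below models that idea in a few lines; the real __consumer_offsets record format is richer than this.

```python
# Conceptual sketch: offset commits as records in a compacted log.
offsets_log = []  # append-only log of offset-commit records

def commit_offset(group: str, topic: str, partition: int, offset: int):
    offsets_log.append({"key": (group, topic, partition), "value": offset})

def compacted_view(log):
    """Log compaction: for every key, only the most recent value survives."""
    latest = {}
    for record in log:
        latest[record["key"]] = record["value"]
    return latest

commit_offset("analytics-group", "clicks", 0, 41823)
commit_offset("analytics-group", "clicks", 0, 41950)   # newer commit for the same key
commit_offset("analytics-group", "clicks", 1, 40917)

print(compacted_view(offsets_log))
# {('analytics-group', 'clicks', 0): 41950, ('analytics-group', 'clicks', 1): 40917}
```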
Eventually, all of this was moved to a controller, which can be set up on the same nodes as the brokers or on separate ones. In either deployment, there isn't just one controller maintaining the metadata anymore, but a quorum of controllers.
Source: Data Engineering Minds (YT)
The quorum of controllers uses the Raft consensus algorithm to replicate and manage cluster metadata. The advantage? In the event of a controller failure, the newly elected leader controller doesn't have to fetch all of the metadata, since it already holds a replicated copy.
This design ensures that all controllers have a synchronized copy of the metadata at all times, and removes the need for a new leader to fetch metadata during failover.
Hence, by moving metadata management into Kafka itself (the metadata now lives in an internal, Raft-replicated log), Zookeeper was made redundant.
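The sketch below is a heavily simplified model of that idea: real Raft involves terms, voting and commit rules that are all omitted here, and the record shapes are made up, but it shows why a surviving controller can take over without a bulk metadata fetch.

```python
# Heavily simplified model of the KRaft idea: the active controller appends
# metadata records, the other controllers replicate the same log, so whichever
# one takes over after a failure already holds the metadata.

class Controller:
    def __init__(self, name: str):
        self.name = name
        self.metadata_log = []   # replicated metadata records

class ControllerQuorum:
    def __init__(self, names):
        self.controllers = [Controller(n) for n in names]
        self.leader = self.controllers[0]

    def append_metadata(self, record: dict):
        """Leader appends; every follower replicates the same record."""
        for c in self.controllers:
            c.metadata_log.append(record)

    def fail_over(self):
        """Leader dies; any survivor can take over without re-reading metadata."""
        survivors = [c for c in self.controllers if c is not self.leader]
        self.leader = survivors[0]
        return self.leader

quorum = ControllerQuorum(["controller-1", "controller-2", "controller-3"])
quorum.append_metadata({"type": "TopicRecord", "name": "clicks", "partitions": 4})
quorum.append_metadata({"type": "BrokerRegistration", "broker_id": 0})

new_leader = quorum.fail_over()
print(new_leader.name, "already has", len(new_leader.metadata_log), "metadata records")
```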
I wish I could present a few metrics showing exactly how much compute the new Kafka saves compared to the old one, now that a separate Zookeeper ensemble no longer has to be set up, but I couldn't find such specific case studies. However, there are a few Confluent blogs that come close to quantifying the optimization: Blog 1, Blog 2.
Hope you liked this post, my first in a long time! I'm always learning and would love to hear any feedback or opinions you may have. Don't be shy!
If you found this blog insightful, do think of me the next time your org/team is looking to hire a software engineer! And in any case, feel free to reach out to me via LinkedIn or email. Have a great week!