EventStoreDB vs Kafka

Introduction

There is quite a lot of confusion in the community with regard to EventStoreDB and Kafka, especially when it comes to event sourcing. Developers not familiar with both products have a hard time deciding which technology they should use, how the two compare, and what trade-offs they will have to make.

In this article, you will learn what the two solutions offer, how to use them effectively, and when to use one over the other. To do that, we first need to agree on a definition of event sourcing and on the requirements that a solution should meet.

Event sourcing

Event sourcing is an architectural pattern built around the idea of storing events, as the source of truth, in an append-only log. In this article, we won't be going into the details of the pattern itself.

On a high level, we can summarize the logical flow of operations in an event-sourced system with the following set of steps.

1. Receive a command

Commands are requests to mutate the state of a system, and they can be rejected if some business invariant isn't satisfied. Let's consider the example of a startup implementing a credit card platform. When a customer makes a transaction at a store, the credit card network sends the platform an Authorise command, which can be either approved or declined (for example, if the transaction would take the customer over their credit limit).

In the above example, the command was handled synchronously - the merchant needs to know the answer immediately. The alternative is asynchronous transaction processing, where the caller doesn't wait for the result, but we still have to make sure the request is processed. An example here could be the large number of Settle commands sent by a credit card network daily. We could ingest these, put them in some sort of message queue, and then process them at our leisure.
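As a minimal sketch (the names and types below are hypothetical, not from any particular framework), a command can be modelled as a small immutable object, handled either synchronously with an immediate approve/decline, or simply enqueued for asynchronous processing:

```python
from dataclasses import dataclass
from decimal import Decimal
from queue import Queue


@dataclass(frozen=True)
class Authorise:
    """Command: a request to authorise a card transaction (it may be declined)."""
    account_number: str
    amount: Decimal


# Synchronous handling: the merchant waits for an immediate approve/decline.
def handle_sync(command: Authorise, balance: Decimal, credit_limit: Decimal) -> str:
    # Reject the command if it violates a business invariant.
    if balance + command.amount > credit_limit:
        return "declined"
    return "approved"


# Asynchronous handling: e.g. the daily batch of Settle commands is queued
# and processed later; the caller only needs an acknowledgement.
settle_queue: Queue = Queue()


def handle_async(command: Authorise) -> None:
    settle_queue.put(command)  # acknowledged once enqueued (in-memory here)
```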

2. Fetch all events for a fine-grained stream

During this part of the processing, we need to fetch all the events already persisted for a given fine-grained stream ID. In our example, a fine-grained stream ID could be a credit card account number (assuming our aggregate models the whole credit card account). We need to be able to fetch all these events quickly, because if we are processing a live authorisation request the time budget might be quite tight. We also want to process queued asynchronous commands as soon as possible, and waiting on reads would hurt throughput.
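Conceptually, step 2 is a single keyed lookup. A tiny sketch against a hypothetical in-memory store (keyed by the fine-grained stream ID) illustrates the shape of the operation:

```python
from collections import defaultdict

# Hypothetical in-memory event store: one append-only list per stream ID.
streams: dict[str, list[dict]] = defaultdict(list)


def read_stream(stream_id: str) -> list[dict]:
    """Return all events persisted for a single fine-grained stream."""
    return list(streams[stream_id])


# The stream ID encodes the aggregate identity, e.g. one credit card account.
events = read_stream("creditcard-4111111111111111")
```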

3. Reconstitute current state of an aggregate and handle the command

In this step, using the result of the previous one, we calculate the current state of the aggregate (based on the history of events) and let it handle the command. If everything went well, the outcome of the processing is a list of events to be persisted (there can be one or many).
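A sketch of the reconstitute-and-handle step, with hypothetical event and state types for the credit card example: the current state is a left fold over the stream's history, and the command handler returns the new events (or a decline):

```python
from dataclasses import dataclass, replace
from decimal import Decimal


@dataclass(frozen=True)
class Account:
    balance: Decimal = Decimal("0")
    credit_limit: Decimal = Decimal("1000")


def apply(state: Account, event: dict) -> Account:
    """Evolve the aggregate state with a single historical event."""
    if event["type"] == "TransactionAuthorised":
        return replace(state, balance=state.balance + event["amount"])
    if event["type"] == "PaymentReceived":
        return replace(state, balance=state.balance - event["amount"])
    return state


def reconstitute(history: list[dict]) -> Account:
    """Current state is a left fold of all past events."""
    state = Account()
    for event in history:
        state = apply(state, event)
    return state


def handle_authorise(history: list[dict], amount: Decimal) -> list[dict]:
    """Handle the command against the reconstituted state; return new events."""
    state = reconstitute(history)
    if state.balance + amount > state.credit_limit:
        return [{"type": "TransactionDeclined", "amount": amount}]
    return [{"type": "TransactionAuthorised", "amount": amount}]
```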

4. Append newly-created events to the stream

In order to persist the events to the fine-grained stream, we need to make sure that nothing was written to that stream in the meantime. In event-sourced systems, this is usually achieved by means of Optimistic Concurrency Control. We can do it by attaching the version of the stream we fetched initially (in step 2) to the request that appends the new data. If the version used to reconstitute the state and the current version of the fine-grained stream match, the write succeeds. If not (because another client wrote to the stream concurrently), we can fetch the events again and retry the command. It might seem like a trivial check, but it is a critical part of the flow that ensures data consistency.
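Continuing the in-memory sketch from above, the optimistic concurrency check boils down to comparing the expected version with the stream's current version before appending (again, the names are illustrative only):

```python
class WrongExpectedVersion(Exception):
    """Raised when the stream has moved on since we read it (a concurrent writer)."""


def append_to_stream(streams: dict[str, list[dict]], stream_id: str,
                     expected_version: int, new_events: list[dict]) -> None:
    stream = streams.setdefault(stream_id, [])
    current_version = len(stream) - 1  # -1 means the stream is still empty
    if current_version != expected_version:
        # Someone else appended in the meantime: re-read (step 2) and retry.
        raise WrongExpectedVersion(f"expected {expected_version}, got {current_version}")
    stream.extend(new_events)
```

On a conflict, the caller simply goes back to step 2, re-reads the stream, re-handles the command, and tries the append again.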

5. Distribute newly appended events to consumers

The last part of the flow is to make sure that consumers of our events - for example, projections - get notified about the new entries appended at the end of the log. If we can write to a fine-grained stream and then read the stream of all events from the same storage, this is quite a simple job (for example, by periodic polling). If the infrastructure doesn't allow that, it becomes more challenging, as it is likely we would have to rely on a two-phase commit.
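Assuming the store exposes one global, ordered log of everything appended, a projection can simply keep a checkpoint position and poll for anything newer; a rough sketch (the `read_all_from` and `project` callables are hypothetical placeholders):

```python
import time


def poll_global_log(read_all_from, project, checkpoint: int = 0, interval: float = 1.0):
    """Periodically read the global log past our checkpoint and feed a projection.

    `read_all_from(position)` yields (position, event) pairs appended after
    `position`; `project(event)` updates a read model or triggers a handler.
    """
    while True:
        for position, event in read_all_from(checkpoint):
            project(event)
            checkpoint = position  # persist this in real life, so restarts resume
        time.sleep(interval)
```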

Now that we have a description of what we want to be able to achieve, let’s see how EventStoreDB and Kafka can help us.

EventStoreDB

EventStoreDB is a database that lets the user persist events into and read them from fine-grained streams, as well as read all events or a subset of them. It is a product built with event sourcing in mind, and the above flow was the primary concern of its designers.

Let’s now consider how EventStoreDB can help us with handling the flow of operations. If the commands are handled asynchronously (step 1), we could ingest them directly into EventStoreDB (write to one or more streams) and then poll for newly written commands. The only limiting factor on the ingress would be the roughly 15,000 writes per second that EventStoreDB can handle.

Next in our flow (step 2) is reading the existing events from the fine-grained stream. This operation is well supported by EventStoreDB, as it indexes events by stream ID and this index allows for very fast queries. If the stream is big, or we only need the latest n events (for example when snapshots are used), EventStoreDB is able to return a slice of the stream.

After the application logic is executed (step 3), it is time to persist the new events to the stream (step 4). EventStoreDB allows us to send an ExpectedVersion, which is used for optimistic concurrency control. If the version matches, the data is appended; if not, the operation can be retried.
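To make steps 2 and 4 concrete, here is a rough sketch using the open-source esdbclient Python package (method and parameter names are written from memory and may differ between client versions):

```python
from esdbclient import EventStoreDBClient, NewEvent
from esdbclient.exceptions import WrongCurrentVersion

client = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")
stream = "creditcard-4111111111111111"

# Step 2: fetch the full history of one fine-grained stream.
history = client.get_stream(stream)
current_version = len(history) - 1

# ... step 3: reconstitute state and handle the command, producing new events ...
new_event = NewEvent(type="TransactionAuthorised", data=b'{"amount": "42.00"}')

# Step 4: append, telling the server which version our decision was based on.
try:
    client.append_to_stream(stream, current_version=current_version, events=[new_event])
except WrongCurrentVersion:
    pass  # a concurrent writer got there first: re-read the stream and retry
```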

The last step (5) is to distribute events to the consumers interested in them. Depending on the needs, this can be done using a live-only subscription, a catch-up subscription, a persistent subscription, or by reading events from the Atom feed.
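For example, a catch-up subscription first replays the existing events and then keeps delivering new ones as they arrive; a short sketch with the same esdbclient caveats as above:

```python
from esdbclient import EventStoreDBClient

client = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")

# Catch-up subscription to everything appended to the database: replays the
# existing events first, then stays live and pushes new ones as they arrive.
for event in client.subscribe_to_all():
    print(event.stream_name, event.type)  # feed projections, process managers, ...
```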

Kafka

Kafka is a distributed event streaming platform that has received a lot of attention over the last couple of years because of its ability to handle large amounts of data and its durable storage. At first glance, it seems like a good fit for event sourcing (data stored as an immutable sequence of events).

In our credit card example, Kafka would easily be able to ingest the large volume of requests being sent to the system for later processing. We could design a system where hundreds of thousands of messages per second are sent to a commands topic (step 1) and then consumed at a pace that suits the business.
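A minimal sketch of that ingestion with the confluent-kafka Python client (the topic and field names are illustrative): commands are keyed by account number so that all commands for one account land on the same partition, in order.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

command = {"type": "Settle", "account_number": "4111111111111111", "amount": "42.00"}

# Keying by account keeps all commands for one account on the same partition,
# preserving their relative order for the downstream consumer.
producer.produce(
    "card-commands",
    key=command["account_number"],
    value=json.dumps(command).encode("utf-8"),
)
producer.flush()  # block until the broker acknowledges the write
```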

The next thing that needs to happen during command processing is fetching events from a fine-grained stream (step 2). An easy choice would be to have an events topic and a partition per stream, so that the data can be queried easily. That, unfortunately, isn't going to work, as there is a limit to the number of partitions Kafka can realistically handle in a single cluster.

An alternative would be to consistently partition events using a stream ID and always read from and write to a single partition. This could work where we have enough time to do a full scan of a partition and select only the events we are interested in. Unfortunately, as the number of events written grows, the read takes more and more time, and the system quickly becomes unresponsive. It would also mean that we have to store events forever, which is often not the case (more on that later).

Confluent (the company founded by Kafka's creators) proposed an alternative model in which a Kafka Streams topology caches the events, or a rolled-up state, either in the application or in a separate database. This could help to mitigate the issue; however, it means introducing another data store just to work around the fact that Kafka itself isn't able to quickly fetch a single stream of events (it's not a database with indexes, after all). Moreover, due to the propagation lag, we cannot say that the cached state is always up to date, which violates the latest-state guarantee required by event sourcing.

Once the events are fetched, the aggregate reconstituted, and the command handled, we need to persist the new events (step 4). Publishing a single message to a topic is easy enough, and if we need to write an atomic batch we can use a transactional producer. The one problem, and a show-stopper here, is the lack of optimistic concurrency control: the broker isn't able to reject the write when new events have been written to the stream since the aggregate was reconstituted. We could try to work around this with a single-threaded-writer pattern, which has its own challenges (how do you ensure that synchronous clients get a timely response? how do you ensure the availability of the writers?). That really exposes the underlying problem: we are trying to use a message broker as a database.
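To illustrate, here is a sketch of an atomic batch write with a confluent-kafka transactional producer; note that the API has no equivalent of an expected version, so a concurrent writer to the same logical stream cannot be detected (topic and names are illustrative):

```python
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "card-command-handler-1",
})
producer.init_transactions()

new_events = [{"type": "TransactionAuthorised", "amount": "42.00"}]

producer.begin_transaction()
for event in new_events:
    producer.produce(
        "card-events",
        key="4111111111111111",
        value=json.dumps(event).encode("utf-8"),
    )
# The batch is all-or-nothing, but there is no "expected version" to send:
# a concurrent writer appending to the same logical stream goes unnoticed.
producer.commit_transaction()
```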

The last step (5) of the flow is to distribute events to the consumers: create read models, feed data to process managers, and so on. Kafka is well suited to this task and, depending on the needs, can provide various delivery guarantees as well as handle consumer load balancing and data partitioning.
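A sketch of a consuming projection with confluent-kafka: running several instances with the same group.id lets the broker balance partitions across them (topic and group names are illustrative):

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "balance-projection",   # all instances with this id share the work
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["card-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # ... update the read model for this event ...
    consumer.commit(msg)  # commit the offset only after the projection is updated
```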

Other use cases

The flow we’ve discussed so far is very basic; in most event-sourced systems we need to be able to do more than that.

Projection replay

One of the strengths of using events as the source of truth is the ability to project them into any read model we might need in the application. This doesn’t mean, though, that such a read model has to be present from the very beginning. We could add it at a later date, feed it with all events from the beginning, and then let it carry on with newly appended events. A similar use case is when we discover a bug in a read model: fixing the bug and replaying all the events trues it up.

With EventStoreDB, a replay is a natural operation and can be implemented simply by re-reading all events from the very beginning of the log. Once the replay has finished, the consumer carries on reading new events from the stream as they arrive.

With Kafka, the replay can be just as easy, as the model is much the same - assuming the retention policy keeps events in the topic forever. To repopulate a read model, we would reset the consumer group's offsets (or rewind the consumer) and read the messages from the beginning.
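One way to do this programmatically with confluent-kafka is to rewind the assigned partitions to the beginning as they are assigned (the group and topic names are illustrative); the same effect can be achieved with Kafka's consumer group offset reset tooling:

```python
from confluent_kafka import Consumer, OFFSET_BEGINNING


def rewind(consumer, partitions):
    # On assignment, force every partition back to the very first retained offset.
    for partition in partitions:
        partition.offset = OFFSET_BEGINNING
    consumer.assign(partitions)


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "balance-projection-rebuild",
})
consumer.subscribe(["card-events"], on_assign=rewind)
# ... poll as usual; the read model is rebuilt from the first retained message ...
```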

Data archiving policy

Most medium and large systems have a way of archiving data to reduce operational costs and keep the system performant. An example in a credit card platform might be that each transaction that was fully paid off can be archived 12 months after the last payment. One way of handling that in an event-sourced system could be to read all events, filter out the ones that are no longer needed based on the business rules, persist the result to a new stream, and delete the old one.

Neither Kafka nor EventStoreDB supports deleting specific messages from a topic or stream. With Kafka, we can send a message with a specific partition key and a null payload, which effectively marks all messages with that key for deletion, provided the topic uses log compaction. With EventStoreDB, we can delete a whole fine-grained stream - one of the basic operations the database supports.
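A tombstone is simply a record with the key to remove and a null value; on a topic with log compaction enabled, older records with that key eventually get dropped. A sketch with confluent-kafka:

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# A null value is a tombstone: on a compacted topic, older records with the
# same key become eligible for removal during log compaction.
producer.produce("card-events", key="4111111111111111", value=None)
producer.flush()
```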

Conclusions: EventStoreDB vs Kafka?

So, based on what was discussed in this article, which technology wins when it comes to event sourcing? If we talk about reading, writing, and deleting data, EventStoreDB seems to be the clear winner. Then we have Kafka, with its great throughput and scalability: it can accept many more writes and reads per second and makes autoscaling of competing consumers very easy.

From my point of view, they really don’t need to be seen as competitors at all. The two solutions co-exist in a similar space of message storage and processing, but their strengths are different. I think we should take advantage of that and use the right tool for the job, rather than trying to fit every use case into one tool and ending up with a lot of accidental complexity.

One way of making them both work in one system could be to ingest commands directly into Kafka if we need to handle large volumes of incoming asynchronous messages and do some initial transformations on them (treating Kafka as a message queue). Alternatively, we could put commands directly into a stream in EventStoreDB if the traffic patterns don’t require us to ingest more than a few thousand commands per second. If we don’t have a requirement to handle asynchronous commands, this step can be skipped entirely. Either way, once the command is received we can move to the next step, which is handling it.

Then, during command handling, we would read existing events from and write new ones to EventStoreDB, as it gives us proper guarantees around reading the latest state and detecting conflicts during writes. To distribute events to projections and other handlers, we then have two options. If the throughput isn’t very high, we can subscribe to events directly from EventStoreDB, which is the easiest option. If we need to handle tens of thousands of events per second (or more), we could have a connector that reads them from EventStoreDB and pushes them to Kafka. That way they can be distributed to listeners that are scaled out to handle the load and configured with the ordering and delivery guarantees appropriate for each type of consumer.
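As a closing sketch of that connector idea (with the same esdbclient caveats as before, and illustrative topic names): subscribe to everything appended to EventStoreDB and republish each event to Kafka, keyed by its stream name so per-stream ordering is preserved within a partition.

```python
from confluent_kafka import Producer
from esdbclient import EventStoreDBClient

esdb = EventStoreDBClient(uri="esdb://localhost:2113?Tls=false")
kafka = Producer({"bootstrap.servers": "localhost:9092"})

# Catch-up subscription to everything appended to EventStoreDB.
for event in esdb.subscribe_to_all():
    kafka.produce(
        "card-events",
        key=event.stream_name,  # per-stream ordering preserved within a partition
        value=event.data,
    )
    kafka.poll(0)  # serve delivery callbacks and keep the producer queue moving
```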