Event Sourcing: Snapshotting

In an Event Sourced system, the current state of an aggregate is usually reconstituted from the full history of events. This means that before handling a command we need to do a full read of a single fine-grained stream and transport the events over the network. For a well-designed aggregate this is usually not a problem, as its lifecycle is bounded to a specific period and the number of events doesn’t grow indefinitely. But what if our design isn’t optimal, or we have some outliers that require thousands of events to be transported every time we want to handle a command?

What is snapshotting

Snapshotting is an optimisation that reduces the time spent on reading events from the event store. If, for example, a stream contains thousands of events, and we need to read all of them every time, then the time the system takes to handle a command will be noticeable. What we can do instead is create a snapshot of the aggregate state and save it. Then, before a command is handled, we can load the latest snapshot and read only the events recorded since the snapshot was created.

A generalised flow of using snapshots looks as follows (a sketch in code follows the list):

  • A new command is received

  • The latest snapshot of the aggregate is read from the snapshot store

  • If a snapshot was found: the aggregate state is set from the snapshot, and the aggregate version is set to the snapshot version

  • All remaining events are read from the event store, starting after the current aggregate version

  • The state is updated with the remaining events (if any)

  • The command is handled as usual
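
A minimal sketch of this flow in Python, assuming hypothetical snapshot_store and event_store interfaces and the usual evolve(state, event) function of an event-sourced aggregate (none of these come from a specific library):

    def load_aggregate(aggregate_id):
        # Try to fast-forward from the latest snapshot, if one exists.
        snapshot = snapshot_store.load_latest(aggregate_id)  # hypothetical API
        if snapshot is not None:
            state, version = snapshot.state, snapshot.version
        else:
            state, version = initial_state(), 0  # assumed constructor

        # Read only the events recorded after the snapshot version.
        for event in event_store.read_events(aggregate_id, from_version=version + 1):
            state = evolve(state, event)
            version += 1

        return state, version

A command handler would then call load_aggregate, run its business logic against the returned state, and append the resulting events at version + 1.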

When to implement snapshotting

As we can see, implementing snapshots in practice isn’t very complicated, but you might wonder: when is it a good time to start thinking about them? To make an informed decision, I recommend having some data in place and basing the decision on that data. If, for example, we agreed with the stakeholders that 95% of commands will be handled with a latency of 200ms or less, then we can put appropriate metrics in place and measure how long it takes to read, write and return a result to the client. If we start noticing that the reads from the event store are taking too long, and it’s indeed correlated with the number of events, then it might be a good time to implement snapshots.

I avoid recommending an arbitrary number of events (in a stream) after which you should start snapshotting, as it will depend on a number of factors. These factors include the size of events, the database backing the event store, the load on the database, the network, the number of large streams, the latencies agreed with the business and so on. In practice, I’ve implemented snapshotting because command handling became too slow once we had streams with a few thousand events or more.

When to persist snapshots

There are multiple ways in which snapshots can be implemented, and the specific strategy will depend on your needs. You can write a new snapshot:

  • Right after new events are written to the store, and before returning to the client (if command handling is synchronous). It’s one of the simplest strategies but might incur a latency penalty.

  • After writing new events and returning to the client. This strategy handles the creation of the snapshot in the background, so the client will not see increased latency (see the sketch after this list).

  • If command handling is asynchronous (e.g. reading commands from a message queue), the snapshot can be created as part of, or after, command handling. Depending on how time-sensitive command handling is, it can slow down the throughput of the consumer.

  • In a separate chaser process that monitors streams and creates a snapshot when a threshold is reached.
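
As an illustration of the second strategy, the snapshot write can be pushed onto a background worker after the events have been committed; the decide function, the store calls and maybe_snapshot (the threshold check sketched in the next section) are assumptions of this sketch, not a prescribed design:

    from concurrent.futures import ThreadPoolExecutor

    snapshot_executor = ThreadPoolExecutor(max_workers=1)

    def handle_command(command):
        state, version = load_aggregate(command.aggregate_id)
        new_events = decide(state, command)  # business logic
        event_store.append(command.aggregate_id, new_events,
                           expected_version=version)
        # Return to the client first; the snapshot is written in the background.
        snapshot_executor.submit(maybe_snapshot, command.aggregate_id)
        return new_events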

Usually, it doesn’t make sense to create a snapshot every time we write new events, as the time it takes to write a snapshot would wipe out some of the gains we are hoping for. For that reason, snapshots are typically created once a threshold of new events is reached. Such a threshold will depend on the specific application and its characteristics, but in my experience creating a snapshot every few hundred events worked well.

For example, if our threshold is set to 500 events, our latest snapshot was taken at version 2000, and we just wrote new events with sequence numbers 2499, 2500 and 2501, then it’s time to create a new snapshot.
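
One way to express that check, reusing the hypothetical stores and load_aggregate from the earlier sketch:

    SNAPSHOT_THRESHOLD = 500

    def maybe_snapshot(aggregate_id):
        snapshot = snapshot_store.load_latest(aggregate_id)
        snapshot_version = snapshot.version if snapshot else 0
        state, current_version = load_aggregate(aggregate_id)

        # e.g. 2501 - 2000 >= 500 in the example above, so a snapshot is due.
        if current_version - snapshot_version >= SNAPSHOT_THRESHOLD:
            snapshot_store.save(aggregate_id, state, current_version)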

How to persist snapshots

There are a number of ways in which snapshots can be persisted. The specific solution will likely depend on the set of technologies you have available. The list of options includes storing snapshots in:

  • A filesystem

  • A database (relational/document)

  • A separate snapshot stream in the event store

In practice, each snapshot will have a number of fields persisted with it (a sketch of such a record follows the list):

  • Aggregate id

  • Aggregate version (at which the snapshot was taken)

  • Snapshot data

  • Snapshot metadata (snapshot schema version, creation date etc.)
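
A plausible shape for such a record, written as a Python dataclass; the exact field names and types are illustrative assumptions:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class Snapshot:
        aggregate_id: str
        version: int          # aggregate version at which the snapshot was taken
        data: bytes           # serialised aggregate state, e.g. JSON
        schema_version: int   # bumped when the aggregate's shape changes
        created_at: datetime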

One of the fields worth paying special attention to is the snapshot schema version. As we evolve our aggregate design over time, we might need our aggregates to be aware of more details than we had originally persisted. In such a situation we can increment the schema version when the changes are made, and discard the snapshot if its version is older than what we expect. Then we can fetch all events for the given stream and create a new snapshot after the command is handled. This is done because we want to minimise the risk of making incorrect business decisions, or causing an error in the application flow, because of a stale snapshot.
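
Loading could then guard against stale snapshots along these lines (a sketch under the same assumptions as before; the schema version constant is illustrative):

    CURRENT_SCHEMA_VERSION = 3  # illustrative value

    def load_usable_snapshot(aggregate_id):
        snapshot = snapshot_store.load_latest(aggregate_id)
        if snapshot is None:
            return None
        if snapshot.schema_version < CURRENT_SCHEMA_VERSION:
            # The snapshot predates the current aggregate shape: ignore it
            # and rebuild from the full event stream; a fresh snapshot can
            # be written after the command is handled.
            return None
        return snapshot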

Summary

Snapshotting is a fairly simple optimisation to implement that is sometimes needed in event-sourced systems. It reduces the time it takes to read events from the event store, at the cost of storing extra data in a snapshot store. Snapshots should be implemented only if the time it takes to process commands is unsatisfactory and we have metrics in place that confirm it’s related to the time it takes to read events. In some cases snapshots can be avoided altogether by improving the design of the aggregates.