This is the first post in a multi-part series describing the Raft distributed
consensus algorithm and its complete implementation in Go. Here is a complete
Raft is a relatively new algorithm (2014), but it’s already being used quite a
bit in industry. The best known example is probably Kubernetes, which relies on
Raft through the etcd distributed key-value
The goal of this series of posts is to describe a fully functioning and
rigorously tested implementation of Raft, and
provide some intuition for how Raft works along the way. It is not meant as the
sole resource for learning Raft. I assume you’ve read the Raft paper at least once; in addition it’s highly
recommended to spend some time perusing the resources on the Raft website – watch a talk or two by its creators, play with the
visualization, skim Ongaro’s PhD thesis for more details, etc.
Don’t expect fully grasping Raft in a single day. Even though it was designed
to be easier to understand than Paxos, Raft is still pretty complicated. The
problem it’s solving – distributed consensus – is a hard problem, so there’s a
natural lower limit on the complexity of the solution.
Replicated State Machines
Distributed consensus algorithms can be seen as solving the problem of
replicating a deterministic state machine across multiple servers. The term
state machine is used to represent an arbitrary service; after all, state
machines are one of the foundations of computer science and everything can be
represented by them. Databases, file servers, lock servers etc. can all be
thought of as complex state machines.
Consider some service represented by a state machine. Multiple clients can
connect to it and issue requests, expecting responses back:
This system works well as long as the server executing the state machine is
reliable. If the server crashes, our service becomes unavailable, which
may not be acceptable. In general, our system is only as reliable as the single
server running it.
A common way to increase the reliability of a service is by means of
replication. We can run several instances of the service on different
servers. This creates a cluster of servers that works in unison to provide the
service, and any one server crashing shouldn’t bring the service down. Isolating
the servers from each other  further increases reliability by removing
common modes of failure that would affect multiple servers simultaneously.
Instead of contacting a single server to perform a service, clients will
contact the whole cluster. In addition, the service replicas that make up a
cluster have to communicate between themselves to properly replicate the state:
Each state machine in this diagram is a replica of the service. The idea is that
all state machines execute in lockstep, taking the same inputs from client
requests and performing the same state transitions. This ensures that they also
return the same results to clients, even if some of the servers fail. Raft is
an algorithm that implements this.
This is the right place to clarify some terminology which this series of posts
will be using repeatedly:
Service: is the logical task of the distributed system we’re implementing.
For example, a key-value database.
Server or Replica: one instance of Raft-enabled service running on
an isolated machine with a network connection to other replicas and to
Cluster: a set of Raft servers collaborating to implement a
distributed service. Typical cluster sizes are 3 or 5.
Consensus module and Raft log
Now it’s time to peek inside one of those state machines shown in the diagram
above. Raft, being a generic algorithm, doesn’t dictate how services
are implemented in terms of state machines. All it aims to achieve is the
ability to reliably and deterministically record and reproduce the sequence of
inputs (also called commands in Raft parlance) to a state machine. Given an
initial state and all the inputs, it’s possible to replay a state machine with
full fidelity. An alternative way to think about it: if we take two separate
replicas of the same state machine and feed them the same sequence of inputs
starting from the same initial state, the state machines will end up in the same
state and will produce the same outputs along the way.
This is the structure of a generic service using Raft:
Some more details about these components:
The state machine is the same one as we’ve seen above. It represents some
arbitrary service; a key-value store is a common example when presenting Raft.
The Log is where all the commands (inputs) issued by clients are stored.
The commands are not applied to the state machine directly; rather, Raft
applies them when they have been successfully replicated to a majority of
servers. Moreover, this log is persistent – it’s saved on stable storage that
survives crashes, and can be used to replay the state machine after a crash.
The Consensus Module is the heart of the Raft algorithm; it’s accepting
commands from clients, makes sure to save them in the log, replicate them
with other Raft replicas in the cluster (the same green arrow as in the
previous diagram) and commit them to the state machine when satisfied that it’s
safe. Committing to the state machine results in notifying clients of
the actual change.
If this seems vague for now, don’t worry. This is what the rest of the posts in
the series are going to explain in detail!
Leaders and followers
Raft uses a strong leadership model, wherein one of the replicas in the
cluster serves as a leader and others serve as followers. The leader is
responsible for acting upon client requests, replicating commands to followers
and returning responses to clients.
During normal operation, the goal of followers is to simply replicate the
leader’s log. In case of a leader failure or network partition, one follower
can take over leadership, so the service remains available.
This model has its pros and cons. A significant advantage is simplicity. Data
always flows from a leader to followers, and only a leader answers client
requests. This makes Raft clusters easier to analyze, test and debug. A
disadvantage is performance – since only a single server in the cluster talks to
clients, this can become a bottleneck in case of a spike of client
activity. The answer to this is typically that Raft shouldn’t be used for
traffic-heavy services. It’s more suitable for low-traffic scenarios where
consistency is critical, at the possible expense of availability – we’ll get
back to this point in the section on fault tolerance.
Earlier, I said “Instead of contacting a single server to perform a service,
clients will contact the whole cluster”; but what does this mean? A cluster
is just a group of servers connected over the network, so how do you contact
“the whole cluster”?
The answer is simple:
When working with a Raft cluster, a client knows the network addresses of
the cluster’s replicas. How it knows this (e.g. by using some sort of
service discovery mechanism) it out of scope for this post.
A client initially sends a request to an arbitrary replica. If this replica
is the leader, it acknowledges the request immediately and the client will
wait for a full response. After that, the client remembers that this replica
is the leader and won’t have to search for it again (until some failure,
like leader crash).
If the replica says it’s not the leader, the client will try another replica.
A possible optimization here is that a follower replica can tell the client
which other replica is the leader. Since replicas communicate continuously
among themselves, typically it knows the right answer. This can save the
client a couple of guesses.
Another case in which the client may realize the replica it contacted is
not the leader is if it’s request is not committed within some timeout.
This may mean the replica it contacted is not actually the leader (even if
it still thinks it is) – it may have been partitioned from the other Raft
servers. When the timeout elapses, the client will go on searching for a
different leader again.
The optimization mentioned in the third bullet point is not necessary in most
cases. In general, it’s useful to distinguish between “normal
operation” and “fault scenario” in Raft. A typical service will spend over
99.9% of its time in “normal operation”, where clients know who the leader is
because they have this information cached from back when they first contacted
the service. Fault scenarios – which we’ll discuss in more detail in the next
section – definitely muddle the waters, but only for a short while. As we’ll
learn in detail in the next posts, a Raft cluster will recover from a temporary
server failure or network partition very quickly – in well under a second in
most scenarios. There will be a short blip of unavailability while the new
leader asserts its leadership and clients find which server it is, but
afterwards it will go back to the “normal operation mode”.
Fault tolerance in Raft and the CAP theorem
Let’s look a back at the diagram of three Raft replicas, without connected
clients this time:
What kinds of failures can we expect in this cluster?
Every component in a modern computer can fail, but to make the discussion a bit
easier let’s treat a server running a Raft instance as an atomic unit. This
leaves us with two primary failure kinds:
A server crash, wherein one of the servers stops responding to all network
traffic for a while. A crashed server is typically restarted and may come
back online after a brief break.
A network partition, wherein one or more servers gets disconnected from other
servers and/or from clients, due to a problem with the networking equipment
or transmission medium.
From the point of view of server A communicating with server B, a crash in B
is indistinguishable from a partition between A and B. They both manifest in
the same way – A stops receiving any messages or responses from B. On a
system-level view, network partitions are quite a bit more insidious, though,
because they affect multiple servers at the same time. We’ll be going through
some tricky scenarios resulting from partitions in the next parts of the
To be able to handle arbitrary network partitions and server crashes gracefully,
Raft requires a majority of servers in the cluster to be up and available
to the leader at any given moment to make progress. With 3 servers, Raft will
tolerate a single server failure. With 5 servers, it will tolerate 2; with
2N+1 servers, it will tolerate N failures.
This brings us to the CAP theorem, the practical consequence of
which is that in the presence of network partitions (which are an unavoidable
part of life), we have to carefully balance availability vs. consistency.
In this balance, Raft is firmly in the consistency camp. Its invariants are
designed to prevent cases where the cluster may reach an inconsistent state,
where different clients get different answer. For this to work, Raft sacrifices
As I mentioned briefly before, Raft wasn’t designed for high-throughput,
fine-grained services. Every client request triggers quite a bit of work –
communication between Raft replicas to replicate it to a majority, as well as
persist it; all before a client gets a response.
So you wouldn’t design a replicated database where all client requests go
through Raft, for example. It would just be too slow. Raft is much better suited
for coarse-grained distributed primitives – like implementing a lock server,
electing leaders for higher-level protocols, replicating critical configuration
data in a distributed system, and so on.
The Raft implementation presented in this series is written in Go. From my point
of view, Go has three strong advantages that make it a promising implementation
language for this series, and for networked services in general:
Concurrency: algorithms like Raft and profoundly concurrent in nature. Each
replica performs ongoing operations, runs timers for timed events, and has
to respond to asynchronous requests from other replicas and clients.
I’ve already written why I consider Go the perfect language for such code.
Standard library: Go has a powerful and industrial-strength standard library
which makes it easy to write sophisticated network servers without importing
and learning any third-party libraries. Particularly in the case of Raft, the
first question one has to answer is “how to send messages between replicas?”,
and many folks get mired in the details of designing a protocol and some
serialization, or using heavy third-party libraries. Go just has net/rpc,
which is a good-enough solution for such tasks, is very quick to set up and
requires no imports.
Simplicity: implementing distributed consensus is complicated enough even
before we start considering the implementation language. It’s possible to
write clear and simple code in any language, but in Go it’s the default
idiom, and the language opposes complexity on every possible level.
Thank you for reading so far! Please let me know if there’s anything I could’ve
written better. Raft may be deceptively simple on the conceptual level, but
there are many gotchas once we get into the code. Future parts in the series
will provide much more details on every facet of the algorithm.
Now you should be ready to move on to Part 1,
where we start implementing Raft.
For example, by keeping them in different racks and/or connected to
different power supplies, or even in different buildings. The really
critical services by large companies are typically replicated at
planetary scale, with replicas in different geographic regions.