Motivation
- Goal: shared key-value store (state machine)
- Host it on a single machine attached to the network
- Pros: easy, consistent
- Cons: prone to failure
- With Raft, keep consistency while tolerating failures
What Is Consensus?
- Agreement on shared state (single system image)
- Recovers from server failures autonomously
- Minority of servers fail: no problem
- Majority fail: lose availability, retain consistency
- Key to building consistent storage systems
Replicated State Machines
Typical architecture for consensus systems
- Replicated log ⇒ replicated state machine (sketched below)
- All servers execute same commands in same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers up
- Failure model: fail-stop (not Byzantine), delayed/lost msgs
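A minimal sketch in Go (not from the talk) of the state-machine side of this picture: each server applies the committed prefix of the replicated log, in order, to an in-memory key-value map, so every server reaches the same state. The Command and StateMachine types are hypothetical.

package main

import "fmt"

// Command is one entry in the replicated log (hypothetical shape:
// a simple key-value assignment).
type Command struct {
	Key, Value string
}

// StateMachine applies committed log entries, in log order, to an
// in-memory key-value map.
type StateMachine struct {
	store       map[string]string
	lastApplied int // number of log entries applied so far
}

func NewStateMachine() *StateMachine {
	return &StateMachine{store: make(map[string]string)}
}

// Apply consumes the committed prefix of the log. Because every server
// applies the same commands in the same order, every server ends up in
// the same state.
func (sm *StateMachine) Apply(log []Command, commitIndex int) {
	for sm.lastApplied < commitIndex {
		cmd := log[sm.lastApplied]
		sm.store[cmd.Key] = cmd.Value
		sm.lastApplied++
	}
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	sm := NewStateMachine()
	sm.Apply(log, len(log)) // all three entries are committed
	fmt.Println(sm.store)   // map[x:3 y:2]
}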
How Is Consensus Used?
- Top-level system configuration
- Replicate entire database state
Paxos Protocol
- Leslie Lamport, 1989
- Nearly synonymous with consensus
“The dirty little secret of the NSDI community is that at
most five people really, truly understand every part of
Paxos ;-).” —NSDI reviewer
“There are significant gaps between the description of
the Paxos algorithm and the needs of a real-world
system...the final system will be based on an unproven
protocol.” —Chubby authors
Raft
- Algorithm for implementing a replicated log
- System makes progress as long as any majority of servers up
- Failure model: fail-stop (not Byzantine), delayed/lost msgs
- Designed for understandability
Raft's Design for Understandability
We wanted an algorithm optimized for building real systems
- Must be correct, complete, and perform well
- Must also be understandable
“What would be easier to understand or explain?”
- Fundamentally different decomposition than Paxos
- Less complexity in state space
- Less mechanism
Raft Overview
Leader election
- Select one of the servers to act as cluster leader
- Detect crashes, choose new leader
Log replication (normal operation)
- Leader takes commands from clients, appends to its log
- Leader replicates its log to other servers (overwriting inconsistencies)
Safety
- Only a server with an up-to-date log can become leader
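A hedged sketch of that "up-to-date log" rule, assuming the standard comparison on the term and index of each log's last entry; the function and parameter names are made up for illustration.

package main

import "fmt"

// logUpToDate reports whether a candidate's log, described by the term and
// index of its last entry, is at least as up to date as the voter's own log.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // later term wins
	}
	return candLastIndex >= myLastIndex // same term: longer log wins
}

func main() {
	// Voter's log ends at index 5 in term 3.
	fmt.Println(logUpToDate(4, 2, 3, 5)) // true: candidate has a later last term
	fmt.Println(logUpToDate(3, 4, 3, 5)) // false: same term, shorter log
}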
Core Raft Review
Leader election
- Heartbeats and timeouts to detect crashes
- Randomized timeouts to avoid split votes
- Majority voting to guarantee at most one leader per term
Log replication (normal operation)
- Leader takes commands from clients, appends to its log
- Leader replicates its log to other servers (overwriting inconsistencies)
- Built-in consistency check simplifies how logs may differ (sketched below)
Safety
- Only elect leaders with all committed entries in their logs
- New leader defers committing entries from prior terms
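A simplified sketch of that consistency check (not the talk's code): a follower accepts new entries only if its log matches the leader's at the preceding index and term, otherwise the leader backs up and retries; for brevity this version always overwrites the follower's suffix once the check passes.

package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// appendEntries is the follower-side handling of one replication RPC.
// Indices here are 0-based positions into the slice for simplicity
// (real Raft uses 1-based log indices).
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	// Consistency check: the entry just before the new ones must match.
	if prevIndex >= 0 {
		if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
			return log, false // leader will retry with an earlier prevIndex
		}
	}
	// Drop any conflicting suffix and append the leader's entries.
	log = append(log[:prevIndex+1], entries...)
	return log, true
}

func main() {
	follower := []Entry{{1, "x=1"}, {1, "y=2"}, {2, "x=9"}} // diverged in term 2
	// Leader says: the entry at index 1 has term 1; then append {3, "x=3"}.
	follower, ok := appendEntries(follower, 1, 1, []Entry{{3, "x=3"}})
	fmt.Println(ok, follower) // true [{1 x=1} {1 y=2} {3 x=3}]
}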
Randomized Timeouts
- How much randomization is needed to avoid split votes?
- Conservatively, use random range ~10x network latency
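A small sketch of that rule of thumb, with assumed numbers: pick each election timeout uniformly at random from a window roughly 10x the network latency (for example, 150-300 ms when round trips take around 15-30 ms).

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a uniform random timeout in [base, base+spread).
// Spreading candidates over the range makes it likely that one server
// times out, campaigns, and wins before any other server starts an election.
func electionTimeout(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)))
}

func main() {
	// Assumed numbers following the ~10x rule of thumb.
	for i := 0; i < 3; i++ {
		fmt.Println(electionTimeout(150*time.Millisecond, 150*time.Millisecond))
	}
}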
Raft Implementations
Name | Primary Authors | Language | License
etcd/raft | Xiang Li and Yicheng Qin | Go | Apache 2.0
RethinkDB/clustering | (not listed) | C++ | AGPL
TiKV | Jay, ngaut, siddontang, tiancaiamao | Rust | Apache 2.0
hashicorp/raft | Armon Dadgar | Go | MPL-2.0
copycat | Jordan Halterman | Java | Apache 2.0
LogCabin | Diego Ongaro | C++ | ISC
Kudu | David Alves, Todd Lipcon, Mike Percy | C++ | Apache 2.0
rafter | Andrew Stone | Erlang | Apache 2.0
akka-raft | Konrad Malawski | Scala | Apache 2.0
OpenDaylight | Moiz Raja, Kamal Rameshan, Robert Varga, Tom Pantelis | Java | Eclipse
zraft_lib | Gunin Alexander | Erlang | Apache 2.0
willemt/raft | Willem-Hendrik Thiart | C | BSD
peterbourgon/raft | Peter Bourgon | Go | Simplified BSD
hoverbear/raft | Andrew Hobden, Dan Burkert | Rust | MIT
raftos | Alexander Zhebrak | Python | MIT
jgroups-raft | Bela Ban | Java | Apache 2.0
...
Copied from the Raft website; probably stale.
LogCabin
- Started as research platform for Raft at Stanford
- Developed into production system at Scale Computing
- Network service running Raft replicated state machine
- Data model: hierarchical key-value store, kept in memory
- Written in C++ (the C++0x subset supported by gcc 4.4)
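Purely to illustrate the data-model bullet above, a hypothetical sketch of an in-memory hierarchical key-value store (a tree of directories holding string values); this is not LogCabin's actual API.

package main

import (
	"fmt"
	"strings"
)

// node is one directory in the tree: child directories plus string values.
type node struct {
	children map[string]*node
	values   map[string]string
}

func newNode() *node {
	return &node{children: map[string]*node{}, values: map[string]string{}}
}

// write stores value under a slash-separated path like "/config/name",
// creating intermediate directories as needed.
func (n *node) write(path, value string) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	cur := n
	for _, dir := range parts[:len(parts)-1] {
		if cur.children[dir] == nil {
			cur.children[dir] = newNode()
		}
		cur = cur.children[dir]
	}
	cur.values[parts[len(parts)-1]] = value
}

func main() {
	root := newNode()
	root.write("/config/name", "example")
	fmt.Println(root.children["config"].values["name"]) // example
}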
Conclusion
- Consensus useful for both coordination and data management
- Consensus widely regarded as difficult
- Raft designed for understandability
- Easier to teach in classrooms
- Better foundation for building practical systems
- Pieces needed for a practical system:
- Cluster membership changes (simpler in dissertation)
- Log compaction (expanded tech report/dissertation)
- Client interaction (expanded tech report/dissertation)
- Evaluation (dissertation: understandability, correctness, leader election & replication performance)
The Raft Consensus Algorithm
Diego Ongaro and John Ousterhout
January 2017
Source code available at https://github.com/ongardie/raft-talk.
Unless otherwise noted, this work is:
© 2012-2017 Diego Ongaro,
© 2012-2014 John Ousterhout.
Licensed under the Creative Commons Attribution 4.0 International License.