Motivation
- Goal: shared key-value store (state machine)
- Host it on a single machine attached to the network
- Pros: easy, consistent
- Cons: prone to failure
- With Raft, keep consistency while tolerating failures
What Is Consensus?
- Agreement on shared state (single system image)
- Recovers from server failures autonomously
- Minority of servers fail: no problem
- Majority fail: lose availability, retain consistency
- Key to building consistent storage systems
Replicated State Machines
Typical architecture for consensus systems
- Replicated log ⇒ replicated state machine (sketched below)
- All servers execute same commands in same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers up
- Failure model: fail-stop (not Byzantine), delayed/lost msgs
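A minimal sketch in Go (not from the talk) of the state-machine side of this picture: each server applies the committed prefix of the replicated log, in order, to an in-memory key-value map, so every server reaches the same state. The Command and StateMachine types are hypothetical.

package main

import "fmt"

// Command is one entry in the replicated log (hypothetical shape:
// a simple key-value assignment).
type Command struct {
	Key, Value string
}

// StateMachine applies committed log entries, in log order, to an
// in-memory key-value map.
type StateMachine struct {
	store       map[string]string
	lastApplied int // number of log entries applied so far
}

func NewStateMachine() *StateMachine {
	return &StateMachine{store: make(map[string]string)}
}

// Apply consumes the committed prefix of the log. Because every server
// applies the same commands in the same order, every server ends up in
// the same state.
func (sm *StateMachine) Apply(log []Command, commitIndex int) {
	for sm.lastApplied < commitIndex {
		cmd := log[sm.lastApplied]
		sm.store[cmd.Key] = cmd.Value
		sm.lastApplied++
	}
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	sm := NewStateMachine()
	sm.Apply(log, len(log)) // all three entries are committed
	fmt.Println(sm.store)   // map[x:3 y:2]
}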
How Is Consensus Used?
- Top-level system configuration
- Replicate entire database state
Paxos Protocol
- Leslie Lamport, 1989
- Nearly synonymous with consensus
“The dirty little secret of the NSDI community is that at
most five people really, truly understand every part of
Paxos ;-).” —NSDI reviewer
“There are significant gaps between the description of
the Paxos algorithm and the needs of a real-world
system...the final system will be based on an unproven
protocol.” —Chubby authors
Raft
- Algorithm for implementing a replicated log
- System makes progress as long as any majority of servers up
- Failure model: fail-stop (not Byzantine), delayed/lost msgs
- Designed for understandability
Raft's Design for Understandability
We wanted an algorithm optimized for building real systems
- Must be correct, complete, and perform well
- Must also be understandable
“What would be easier to understand or explain?”
- Fundamentally different decomposition than Paxos
- Less complexity in state space
- Less mechanism
Raft Overview
Leader election
- Select one of the servers to act as cluster leader
- Detect crashes, choose new leader
Log replication (normal operation)
- Leader takes commands from clients, appends to its log
- Leader replicates its log to other servers (overwriting inconsistencies)
Safety
- Only a server with an up-to-date log can become leader
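A hedged sketch of that "up-to-date log" rule, assuming the standard comparison on the term and index of each log's last entry; the function and parameter names are made up for illustration.

package main

import "fmt"

// logUpToDate reports whether a candidate's log, described by the term and
// index of its last entry, is at least as up to date as the voter's own log.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // later term wins
	}
	return candLastIndex >= myLastIndex // same term: longer log wins
}

func main() {
	// Voter's log ends at index 5 in term 3.
	fmt.Println(logUpToDate(4, 2, 3, 5)) // true: candidate has a later last term
	fmt.Println(logUpToDate(3, 4, 3, 5)) // false: same term, shorter log
}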
Core Raft Review
Leader election
- Heartbeats and timeouts to detect crashes
- Randomized timeouts to avoid split votes
- Majority voting to guarantee at most one leader per term
Log replication (normal operation)
- Leader takes commands from clients, appends to its log
- Leader replicates its log to other servers (overwriting inconsistencies)
- Built-in consistency check simplifies how logs may differ (sketched below)
Safety
- Only elect leaders with all committed entries in their logs
- New leader defers committing entries from prior terms
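A simplified sketch of that consistency check (not the talk's code): a follower accepts new entries only if its log matches the leader's at the preceding index and term, otherwise the leader backs up and retries; for brevity this version always overwrites the follower's suffix once the check passes.

package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// appendEntries is the follower-side handling of one replication RPC.
// Indices here are 0-based positions into the slice for simplicity
// (real Raft uses 1-based log indices).
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	// Consistency check: the entry just before the new ones must match.
	if prevIndex >= 0 {
		if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
			return log, false // leader will retry with an earlier prevIndex
		}
	}
	// Drop any conflicting suffix and append the leader's entries.
	log = append(log[:prevIndex+1], entries...)
	return log, true
}

func main() {
	follower := []Entry{{1, "x=1"}, {1, "y=2"}, {2, "x=9"}} // diverged in term 2
	// Leader says: the entry at index 1 has term 1; then append {3, "x=3"}.
	follower, ok := appendEntries(follower, 1, 1, []Entry{{3, "x=3"}})
	fmt.Println(ok, follower) // true [{1 x=1} {1 y=2} {3 x=3}]
}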
Randomized Timeouts
- How much randomization is needed to avoid split votes?
- Conservatively, use random range ~10x network latency
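A small sketch of that rule of thumb, with assumed numbers: pick each election timeout uniformly at random from a window roughly 10x the network latency (for example, 150-300 ms when round trips take around 15-30 ms).

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a uniform random timeout in [base, base+spread).
// Spreading candidates over the range makes it likely that one server
// times out, campaigns, and wins before any other server starts an election.
func electionTimeout(base, spread time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(spread)))
}

func main() {
	// Assumed numbers following the ~10x rule of thumb.
	for i := 0; i < 3; i++ {
		fmt.Println(electionTimeout(150*time.Millisecond, 150*time.Millisecond))
	}
}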
Raft Implementations
Name | Primary Authors | Language | License
etcd/raft | Xiang Li and Yicheng Qin | Go | Apache 2.0
RethinkDB/clustering | (not listed) | C++ | AGPL
TiKV | Jay, ngaut, siddontang, tiancaiamao | Rust | Apache 2.0
hashicorp/raft | Armon Dadgar | Go | MPL-2.0
copycat | Jordan Halterman | Java | Apache 2.0
LogCabin | Diego Ongaro | C++ | ISC
Kudu | David Alves, Todd Lipcon, Mike Percy | C++ | Apache 2.0
rafter | Andrew Stone | Erlang | Apache 2.0
akka-raft | Konrad Malawski | Scala | Apache 2.0
OpenDaylight | Moiz Raja, Kamal Rameshan, Robert Varga, Tom Pantelis | Java | Eclipse
zraft_lib | Gunin Alexander | Erlang | Apache 2.0
willemt/raft | Willem-Hendrik Thiart | C | BSD
peterbourgon/raft | Peter Bourgon | Go | Simplified BSD
hoverbear/raft | Andrew Hobden, Dan Burkert | Rust | MIT
raftos | Alexander Zhebrak | Python | MIT
jgroups-raft | Bela Ban | Java | Apache 2.0
...
Copied from the Raft website; probably stale.
LogCabin
- Started as research platform for Raft at Stanford
- Developed into production system at Scale Computing
- Network service running Raft replicated state machine
- Data model: hierarchical key-value store, kept in memory
- Written in C++ (the C++0x subset supported by gcc 4.4)
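Purely to illustrate the data-model bullet above, a hypothetical sketch of an in-memory hierarchical key-value store (a tree of directories holding string values); this is not LogCabin's actual API.

package main

import (
	"fmt"
	"strings"
)

// node is one directory in the tree: child directories plus string values.
type node struct {
	children map[string]*node
	values   map[string]string
}

func newNode() *node {
	return &node{children: map[string]*node{}, values: map[string]string{}}
}

// write stores value under a slash-separated path like "/config/name",
// creating intermediate directories as needed.
func (n *node) write(path, value string) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	cur := n
	for _, dir := range parts[:len(parts)-1] {
		if cur.children[dir] == nil {
			cur.children[dir] = newNode()
		}
		cur = cur.children[dir]
	}
	cur.values[parts[len(parts)-1]] = value
}

func main() {
	root := newNode()
	root.write("/config/name", "example")
	fmt.Println(root.children["config"].values["name"]) // example
}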
Conclusion
- Consensus useful for both coordination and data management
- Consensus widely regarded as difficult
- Raft designed for understandability
- Easier to teach in classrooms
- Better foundation for building practical systems
- Pieces needed for a practical system:
- Cluster membership changes (simpler in dissertation)
- Log compaction (expanded tech report/dissertation)
- Client interaction (expanded tech report/dissertation)
- Evaluation (dissertation: understandability, correctness, leader election & replication performance)
The Raft Consensus Algorithm
Diego Ongaro and John Ousterhout
January 2017
Source code available at https://github.com/ongardie/raft-talk.
Unless otherwise noted, this work is:
© 2012-2017 Diego Ongaro,
© 2012-2014 John Ousterhout.
Licensed under the Creative Commons Attribution 4.0 International License.