
The Raft Consensus Algorithm

Diego Ongaro and John Ousterhout

January 2017

Source code available at https://github.com/ongardie/raft-talk. Unless otherwise noted, this work is: © 2012-2017 Diego Ongaro, © 2012-2014 John Ousterhout. Licensed under the Creative Commons Attribution 4.0 International License.

Motivation

  • Goal: shared key-value store (state machine)
  • Host it on a single machine attached to the network
    • Pros: easy, consistent
    • Cons: prone to failure
  • With Raft, keep consistency yet deal with failures

What Is Consensus?

  • Agreement on shared state (single system image)
  • Recovers from server failures autonomously
    • Minority of servers fail: no problem
    • Majority fail: lose availability, retain consistency
  • Key to building consistent storage systems
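
To make those two bullets concrete, a small Go sketch of the quorum arithmetic:

    // Majority arithmetic behind these bullets: a cluster of n servers needs
    // floor(n/2)+1 votes (a quorum) and so tolerates floor((n-1)/2) failures.
    package main

    import "fmt"

    func main() {
        for _, n := range []int{3, 5, 7} {
            quorum := n/2 + 1        // votes needed for any decision
            tolerated := (n - 1) / 2 // servers that can fail without losing availability
            fmt.Printf("n=%d: quorum=%d, tolerates %d failures\n", n, quorum, tolerated)
        }
    }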

Replicated State Machines

Typical architecture for consensus systems

  • Replicated log ⇒ replicated state machine
    • All servers execute same commands in same order (see the sketch after this list)
  • Consensus module ensures proper log replication
  • System makes progress as long as any majority of servers up
  • Failure model: fail-stop (not Byzantine), delayed/lost msgs
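
A minimal Go sketch of the state-machine half of this picture (type and function names are illustrative, not from any particular implementation): every server applies the same log of commands, in order, to a local key-value store, so all replicas end up in the same state.

    // Minimal sketch of a key-value state machine driven by a replicated log.
    package main

    import "fmt"

    // Command is the operation carried by one log entry.
    type Command struct {
        Key, Value string
    }

    // KVStateMachine must be deterministic: same commands, same order,
    // same resulting state on every server.
    type KVStateMachine struct {
        data map[string]string
    }

    func (sm *KVStateMachine) Apply(cmd Command) {
        sm.data[cmd.Key] = cmd.Value
    }

    func main() {
        log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
        sm := &KVStateMachine{data: make(map[string]string)}
        for _, cmd := range log { // every server applies in the same order
            sm.Apply(cmd)
        }
        fmt.Println(sm.data["x"]) // "3" on every replica
    }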

How Is Consensus Used?

  • Top-level system configuration
  • Replicate entire database state

Paxos Protocol

  • Leslie Lamport, 1989
  • Nearly synonymous with consensus
“The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).”
    —NSDI reviewer

“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system...the final system will be based on an unproven protocol.”
    —Chubby authors

Raft

  • Algorithm for implementing a replicated log
  • System makes progress as long as any majority of servers up
  • Failure model: fail-stop (not Byzantine), delayed/lost msgs
  • Designed for understandability

Raft's Design for Understandability

We wanted an algorithm optimized for building real systems

  • Must be correct, complete, and perform well
  • Must also be understandable

“What would be easier to understand or explain?”

  • Fundamentally different decomposition than Paxos
  • Less complexity in state space
  • Less mechanism

Raft User Study

Raft Overview

Leader election
  • Select one of the servers to act as cluster leader
  • Detect crashes, choose new leader (sketched below)
Log replication (normal operation)
  • Leader takes commands from clients, appends to its log
  • Leader replicates its log to other servers (overwriting inconsistencies)
Safety
  • Only a server with an up-to-date log can become leader
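
A minimal Go sketch of the crash-detection side of leader election (names are illustrative): a follower that hears no heartbeat within its election timeout becomes a candidate and starts an election for the next term.

    // Hedged sketch of crash detection for leader election.
    package main

    import "time"

    type Role int

    const (
        Follower Role = iota
        Candidate
        Leader
    )

    type Server struct {
        role        Role
        currentTerm int
        lastHeard   time.Time     // time of last heartbeat from the leader
        timeout     time.Duration // randomized election timeout
    }

    // Tick runs periodically on every server. A follower that has not heard
    // from a leader for a full election timeout assumes the leader crashed,
    // becomes a candidate, and starts an election for the next term.
    func (s *Server) Tick(now time.Time) {
        if s.role != Leader && now.Sub(s.lastHeard) >= s.timeout {
            s.role = Candidate
            s.currentTerm++ // then: vote for self, request votes from peers
        }
    }

    func main() {}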

RaftScope Visualization

Demo: https://raft.github.io/raftscope-replay/

Core Raft Review

Leader election
  • Heartbeats and timeouts to detect crashes
  • Randomized timeouts to avoid split votes
  • Majority voting to guarantee at most one leader per term
Log replication (normal operation)
  • Leader takes commands from clients, appends to its log
  • Leader replicates its log to other servers (overwriting inconsistencies)
  • Built-in consistency check simplifies how logs may differ (see the sketch below)
Safety
  • Only elect leaders with all committed entries in their logs
  • New leader defers committing entries from prior terms
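
A hedged Go sketch of that built-in consistency check from the follower's side (illustrative; a real implementation also checks terms, tracks commit indexes, and persists state): the follower accepts new entries only if its log matches the leader's at prevLogIndex/prevLogTerm, so the leader can walk backwards to the point of divergence and overwrite the follower's conflicting suffix.

    // Hedged sketch of the AppendEntries consistency check (follower side).
    // Entries are 1-indexed as in the Raft paper; log[0] is an unused sentinel.
    package main

    import "fmt"

    type LogEntry struct {
        Term    int
        Command string
    }

    type Follower struct {
        log []LogEntry
    }

    // AppendEntries returns false when the follower's log has no entry
    // matching (prevLogIndex, prevLogTerm); the leader then retries with an
    // earlier prevLogIndex until the logs agree.
    func (f *Follower) AppendEntries(prevLogIndex, prevLogTerm int, entries []LogEntry) bool {
        if prevLogIndex >= len(f.log) || f.log[prevLogIndex].Term != prevLogTerm {
            return false // logs diverge at or before prevLogIndex
        }
        // Matching prefix found: drop any conflicting suffix, append leader's entries.
        f.log = append(f.log[:prevLogIndex+1], entries...)
        return true
    }

    func main() {
        f := &Follower{log: []LogEntry{{}}} // sentinel at index 0
        ok := f.AppendEntries(0, 0, []LogEntry{{Term: 1, Command: "x=1"}})
        fmt.Println(ok, len(f.log)-1) // true 1: one real entry stored
    }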

Randomized Timeouts

  • How much randomization is needed to avoid split votes?

  • Conservatively, use random range ~10x network latency
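
For instance, a hedged Go sketch (the concrete range is illustrative; the Raft paper suggests election timeouts around 150–300 ms for typical datacenter latencies):

    // Hedged sketch: drawing a randomized election timeout. Tune the range
    // to roughly 10x your network latency.
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    func electionTimeout() time.Duration {
        const min = 150 * time.Millisecond
        const max = 300 * time.Millisecond
        return min + time.Duration(rand.Int63n(int64(max-min)))
    }

    func main() {
        for i := 0; i < 3; i++ {
            fmt.Println(electionTimeout()) // each server draws independently
        }
    }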

Raft Implementations

Name                  Primary Authors                                         Language  License
etcd/raft             Xiang Li and Yicheng Qin                                Go        Apache 2.0
RethinkDB/clustering                                                          C++       AGPL
TiKV                  Jay, ngaut, siddontang, tiancaiamao                     Rust      Apache 2.0
hashicorp/raft        Armon Dadgar                                            Go        MPL-2.0
copycat               Jordan Halterman                                        Java      Apache 2.0
LogCabin              Diego Ongaro                                            C++       ISC
Kudu                  David Alves, Todd Lipcon, Mike Percy                    C++       Apache 2.0
rafter                Andrew Stone                                            Erlang    Apache 2.0
akka-raft             Konrad Malawski                                         Scala     Apache 2.0
OpenDaylight          Moiz Raja, Kamal Rameshan, Robert Varga, Tom Pantelis   Java      Eclipse
zraft_lib             Gunin Alexander                                         Erlang    Apache 2.0
willemt/raft          Willem-Hendrik Thiart                                   C         BSD
peterbourgon/raft     Peter Bourgon                                           Go        Simplified BSD
hoverbear/raft        Andrew Hobden, Dan Burkert                              Rust      MIT
raftos                Alexander Zhebrak                                       Python    MIT
jgroups-raft          Bela Ban                                                Java      Apache 2.0
...                   ...                                                     ...       ...

Copied from Raft website, probably stale.

LogCabin

  • Started as research platform for Raft at Stanford
  • Developed into production system at Scale Computing
  • Network service running Raft replicated state machine
  • Data model: hierarchical key-value store, kept in memory
  • Written in C++ (gcc 4.4's C++0x)

Conclusion

  • Consensus useful for both coordination and data management
  • Consensus widely regarded as difficult
  • Raft designed for understandability
    • Easier to teach in classrooms
    • Better foundation for building practical systems
  • Pieces needed for a practical system:
    • Cluster membership changes (simpler in dissertation)
    • Log compaction (expanded tech report/dissertation)
    • Client interaction (expanded tech report/dissertation)
    • Evaluation (dissertation: understandability, correctness, leader election & replication performance)

Questions

raft.github.io

raft-dev mailing list
