In Search of an Understandable Consensus Algorithm (Extended Version)



Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.



Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [15, 16] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.


Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.


After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.


The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.


Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [29, 22]), but it has several novel features:

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.



We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.


The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–8), evaluates Raft (Section 9), and discusses related work (Section 10).


2. Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines [37]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [8], HDFS [38], and RAMCloud [33], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [11].


Replicated state machine architecture

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.


Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
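
The determinism argument above can be sketched in a few lines. The `KVStateMachine` class and the `(op, key, value)` command format are hypothetical illustrations, not part of the paper:

```python
# Minimal sketch: a deterministic key-value state machine. Replicas that
# apply the same command log in the same order compute the same state.

class KVStateMachine:
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Commands are (op, key, value) tuples; execution is deterministic.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        return self.state.get(key)

# Two replicas fed the same log end up with identical state.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]
a, b = KVStateMachine(), KVStateMachine()
for entry in log:
    a.apply(entry)
    b.apply(entry)
assert a.state == b.state == {"x": 3, "y": 2}
```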


Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.


Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
  • They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
  • In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
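
The majority arithmetic behind the second property can be made concrete; this is a minimal illustration under the paper's fail-stop assumption, with function names chosen here for clarity:

```python
# Sketch: a cluster of n servers needs floor(n/2) + 1 responses for a
# majority, so it stays available with up to n - (floor(n/2) + 1) failures.

def quorum(n):
    """Smallest number of servers forming a majority of n."""
    return n // 2 + 1

def max_failures(n):
    """Failures tolerated while a majority can still be reached."""
    return n - quorum(n)

# The typical five-server cluster from the text tolerates two failures.
assert quorum(5) == 3
assert max_failures(5) == 2
```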



3. What's wrong with Paxos?

Over the last ten years, Leslie Lamport’s Paxos protocol [15] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.


Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.


We hypothesize that Paxos' opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.


The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport's descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [26], [39], and [13], but these differ from each other and from Lamport's sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.


Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.


As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos' formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:

  • There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system… the final system will be based on an unproven protocol [4].



Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.


4. Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.


There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?


We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.


Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.


5. The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.


[Figure 2 summary boxes: State, RequestVote RPC, AppendEntries RPC, Rules for Servers]

Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the upper-left box is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [31] describes the algorithm more precisely.


Raft Properties

Figure 3: Raft guarantees that each of these properties is true at all times. The section numbers indicate where each property is discussed.


Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.


Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

  • Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).
  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
  • Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.



5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.


Raft Server States

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.


Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.


Raft Terms

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.


Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
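
The term-exchange rules in the paragraph above can be sketched as a pure function. The names and return shape here are hypothetical, chosen only to make the three cases explicit:

```python
# Sketch: how a server reacts to another server's current term.
# Returns (new_term, new_role, reject_request).

def observe_term(current_term, role, other_term):
    if other_term > current_term:
        # Our term is out of date: adopt the larger term and, if we were
        # a candidate or leader, revert to follower state.
        return other_term, "follower", False
    if other_term < current_term:
        # The incoming request carries a stale term: reject it.
        return current_term, role, True
    return current_term, role, False

# A leader that sees a larger term steps down immediately.
assert observe_term(3, "leader", 5) == (5, "follower", False)
# A request from a stale term is rejected.
assert observe_term(3, "follower", 2) == (3, "follower", True)
```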


Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.


5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.


To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.


A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
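
The one-vote-per-term, first-come-first-served rule can be sketched as a vote handler. This ignores the log-based restriction that Section 5.4 adds, and the function and field names are hypothetical:

```python
# Sketch: granting votes, at most one per term, first come first served.
# voted_for records whom this server voted for in current_term (None if nobody).
# Returns (new_term, new_voted_for, vote_granted).

def handle_request_vote(current_term, voted_for, candidate_id, candidate_term):
    if candidate_term < current_term:
        return current_term, voted_for, False       # stale candidate
    if candidate_term > current_term:
        current_term, voted_for = candidate_term, None  # new term resets the vote
    if voted_for in (None, candidate_id):
        return current_term, candidate_id, True
    return current_term, voted_for, False           # already voted for someone else

term, voted, ok = handle_request_vote(1, None, "S2", 2)
assert ok and voted == "S2"
# A second candidate in the same term is refused:
_, _, ok2 = handle_request_vote(term, voted, "S3", 2)
assert not ok2
```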


While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.


The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.


Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
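
Drawing the timeout from a fixed interval, as described above, can be sketched as follows; the function name and the use of Python's `random` module are illustrative only:

```python
import random

# Sketch: randomized election timeouts drawn from a fixed interval
# (150-300 ms, as in the text). Spreading the timeouts out means that in
# most cases a single server times out first and wins the election before
# the others start their own.

def election_timeout(rng, low_ms=150, high_ms=300):
    return rng.uniform(low_ms, high_ms)

rng = random.Random(42)
timeouts = [election_timeout(rng) for _ in range(5)]
assert all(150 <= t <= 300 for t in timeouts)
# The server with the earliest timeout is the one likely to start the election.
first = min(range(5), key=lambda i: timeouts[i])
assert 0 <= first < 5
```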


Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.


5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.


Raft Logs

Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.


Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.


The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
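
The "replicated on a majority" rule can be sketched by computing the commit index from the leader's view of each server's log. The `match_index` name follows common Raft implementations but is not defined in the text above, so treat it as a hypothetical bookkeeping structure:

```python
# Sketch: the leader commits the highest log index replicated on a majority.
# match_index[i] = highest log index known to be stored on server i
# (including the leader's own log). Sorting descending and taking the
# majority-th element gives the highest majority-replicated index.

def commit_index(match_index):
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    return ranked[majority - 1]

# 5 servers: entries up to index 7 are stored on 3 of 5 servers,
# so index 7 (and everything before it) is committed.
assert commit_index([9, 8, 7, 5, 5]) == 7
```

(As Section 5.4 explains, a real leader only applies this rule directly to entries from its own term.)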


We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

  • If two entries in different logs have the same index and term, then they store the same command.
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.


  • 如果不同日志中的两个条目具有相同的索引和任期号,则它们存储着相同的命令。
  • 如果不同日志中的两个条目具有相同的索引和任期号,则这两个日志在该条目之前的所有条目都完全相同。
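这两条性质合起来可以用下面的 Python 草图来表达(仅为示意性的检查函数;日志用字典列表表示、索引从 1 开始,均为本文的假设):给定相同的(索引,任期号),两份日志在该位置及之前应当完全一致。

```python
def log_matching_holds(log_a, log_b, index):
    """检查 Log Matching Property:若两份日志在 index 处条目的任期号相同,
    则它们在 index 及之前的所有条目(命令与任期号)都应完全相同。"""
    if index > len(log_a) or index > len(log_b):
        return False
    if log_a[index - 1]["term"] != log_b[index - 1]["term"]:
        return False  # 前提不成立:该位置任期号不同
    return log_a[:index] == log_b[:index]
```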

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

第一个特性源于这样的事实:领导者在给定任期内、给定的日志索引位置至多只会创建一个日志条目,并且日志条目永远不会改变它在日志中的位置。第二个特性由AppendEntries执行的简单一致性检查来保证。在发送AppendEntries RPCs时,领导者会把紧邻新条目之前的那个条目的索引(prevLogIndex)及任期号(prevLogTerm)包含进去。如果跟随者在其日志中找不到具有相同索引和任期号的条目,则拒绝这些新条目。这个一致性检查相当于一个归纳步骤:初始的空日志满足日志匹配属性(Log Matching Property),而每次日志扩充时,一致性检查都会保持Log Matching Property。因此,每当AppendEntries成功返回时,领导者就知道跟随者的日志在新条目及其之前的部分与自己的日志完全一致。
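跟随者一侧的一致性检查可以用如下 Python 草图来示意(日志索引从 1 开始;数据结构与字段名是本文为说明而假设的):

```python
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """AppendEntries 的一致性检查与追加逻辑草图:
    若 prevLogIndex 处不存在任期号为 prevLogTerm 的条目则拒绝;
    否则删除冲突条目,并追加尚不存在的新条目(重复送达时天然幂等)。"""
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1]["term"] != prev_log_term:
            return False  # 一致性检查失败,领导者将回退 nextIndex 后重试
    i = prev_log_index
    for e in entries:
        if i < len(log) and log[i]["term"] != e["term"]:
            del log[i:]  # 发现冲突:删除该条目及其后的所有条目
        if i >= len(log):
            log.append(e)
        i += 1
    return True
```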

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

正常情况下,领导者和跟随者的日志能够保持一致,因此一致性检查不会失败。但是,当领导者宕机后,就会出现日志不一致的情况(旧的领导者可能还会有一部分日志没来得及成功的复制给跟随者)。日志的不一致会随着一系列的领导者和跟随者的宕机而变得更加严重。图7 中展示了跟随者和新的领导者日志的不同之处。跟随者可能会缺少一些领导者中存在的日志条目,也有可能拥有一些领导者中不存在的日志条目,或者这两种情况都存在。日志中丢失的和无关的条目可能跨越多个任期。

Diff Logs

Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a–f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a–b), may have extra uncommitted entries (c–d), or both (e–f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.


In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.


To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

为了使跟随者的日志与自己的日志保持一致,领导者必须找到两个日志一致的最新日志条目,删除该点之后跟随者日志中的所有条目,并将领导者在该点之后的所有条目发送给跟随者。所有这些操作都是在响应AppendEntries RPCs的一致性检查时发生的。领导者为每个跟随者都维护了一个nextIndex变量,它是领导者将要发送给该跟随者的下一个日志条目的索引。当一个领导者刚掌权时,它会将所有跟随者的nextIndex初始化为自己最后一个日志条目的下一个索引(图7中的11)。如果跟随者与领导者的日志不一致,AppendEntries的一致性检查就会在下一次的AppendEntries RPC中失败。遭到拒绝后,领导者就会将该跟随者的nextIndex减1,然后重试AppendEntries RPC,如此往复,直到nextIndex到达领导者和跟随者日志匹配的位置。此时,AppendEntries将成功执行,它会删除跟随者日志中所有冲突的条目,并追加领导者日志中的条目(如果有的话)。一旦AppendEntries成功,跟随者的日志就与领导者的日志一致了,并且在该任期余下的时间内都将保持这种状态。
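这一回退重试的过程可以用下面的 Python 草图模拟(为了便于演示,这里把领导者与跟随者放在同一进程里直接操作日志,省略了真实的 RPC;索引从 1 开始,函数名均为本文的假设):

```python
def force_follower_to_match(leader_log, follower_log):
    """模拟领导者回退 nextIndex:从领导者日志末尾之后开始,
    一致性检查每失败一次就将 nextIndex 减一,
    直到找到匹配点,然后删除跟随者其后的冲突条目并补齐领导者的条目。
    返回最终的匹配位置 nextIndex。"""
    next_index = len(leader_log) + 1
    while True:
        prev = next_index - 1
        matches = prev == 0 or (
            prev <= len(follower_log)
            and follower_log[prev - 1]["term"] == leader_log[prev - 1]["term"]
        )
        if matches:
            del follower_log[prev:]                 # 删除匹配点之后的冲突条目
            follower_log.extend(leader_log[prev:])  # 追加领导者的条目
            return next_index
        next_index -= 1                             # 被拒绝:回退一格重试
```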

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

如果需要的话,可以对协议进行优化,以减少被拒绝的AppendEntries RPCs的数量。例如,在拒绝AppendEntries请求时,跟随者可以把冲突条目的任期号以及它存储的该任期的第一个索引一并返回。有了这些信息,领导者就可以直接将nextIndex回退到越过该任期内所有冲突条目的位置;这样,每个存在冲突条目的任期只需要一次AppendEntries RPC,而不是每个条目一次RPC。不过在实践中,我们怀疑这种优化是否必要,因为故障很少发生,也不太可能出现大量不一致的条目。
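论文描述的这一优化可以粗略写成下面的草图(跟随者返回冲突任期及其首个索引;领导者如何利用这对信息属于实现细节,下面的处理方式只是常见的一种假设写法):

```python
def conflict_hint(follower_log, prev_log_index):
    """跟随者拒绝时返回:冲突条目的任期号,以及它存储的该任期的第一个索引。"""
    term = follower_log[prev_log_index - 1]["term"]
    first = prev_log_index
    while first > 1 and follower_log[first - 2]["term"] == term:
        first -= 1
    return term, first

def next_index_from_hint(leader_log, conflict_term, conflict_first_index):
    """领导者据此一次跳过整个冲突任期:若自己的日志中也有该任期,
    则回退到该任期最后一个条目之后;否则直接回退到冲突任期的第一个索引。"""
    for i in range(len(leader_log), 0, -1):
        if leader_log[i - 1]["term"] == conflict_term:
            return i + 1
    return conflict_first_index
```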

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

通过这种机制,当一个领导者掌权时,不需要采取任何额外的措施来恢复日志一致性。它只需要执行正常的操作,日志就会随着AppendEntries的一致性检查自动收敛。领导者永远不会覆盖或删除自己日志中的条目(图3中的Leader Append Only属性)。

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.



5.4、Safety(安全性)

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.


This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

本节通过对哪些服务器可以当选领导者增加一项限制,来补全Raft算法。这项限制确保任意给定任期的领导者都包含之前任期内提交的全部日志条目(图3中的Leader Completeness属性)。在给出选举限制之后,我们将进一步细化提交规则。最后,我们给出Leader Completeness属性的证明草图,并展示它如何保证复制状态机的正确行为。

5.4.1、Election restriction(选举限制)

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

在任何基于领导者的共识算法(leader-based consensus algorithm)中,领导者最终都必须保存所有已提交的日志条目。在一些共识算法中,比如 Viewstamped Replication[22],即使一开始没有包含所有已提交的条目,也可以选出一个领导者。这些算法包含额外的机制来识别丢失的条目,并在选举过程中或之后不久将其传送给新的领导者。不幸的是,这带来了相当多的额外机制和复杂性。Raft使用了一种更简单的方法:它保证从当选的那一刻起,之前任期内所有已提交的日志条目就已经完整地呈现在每个新领导者上,而不需要把这些条目传送给领导者。这意味着日志条目只会沿一个方向流动,即从领导者流向跟随者,并且领导者永远不会覆盖其日志中已有的条目。

Time Sequence

Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.

图 8:该时间序列展示了为什么领导者不能用旧任期的日志条目来判定提交(determine commitment)。(a) S1 是领导者,部分复制了索引 2 处的日志条目。(b) S1 崩溃;S5 凭借 S3、S4 和它自己的投票被选为任期 3 的领导者,并在日志索引 2 处接受了一个不同的条目。(c) S5 宕机;S1 重新启动,被选为领导者,并继续复制。此时,任期 2 的日志条目已在大多数服务器上复制,但尚未提交。(d) 如果 S1 再次宕机,S5 可以被选为领导者(凭借 S2、S3 和 S4 的投票),并用它自己任期 3 的条目覆盖该条目。但是,如果 S1 在崩溃之前将其当前任期的一个条目复制到了大多数服务器上,如 (e) 所示,那么该条目就被提交了(S5 无法赢得选举)。此时,日志中所有前面的条目也都被提交了。

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft通过投票过程来阻止日志中不包含所有已提交条目的候选者赢得选举。候选者必须联系集群中的大多数成员才能当选,这意味着每个已提交的条目必须至少出现在这些服务器中的一个上。如果候选者的日志至少和该大多数中任何其他服务器的日志一样新(“最新”的精确定义见下文),那么它就持有所有已提交的条目。RequestVote RPC实现了这个限制:RPC中包含了关于候选者日志的信息,如果投票者自己的日志比候选者的日志更新,投票者就拒绝投票。

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Raft通过比较两份日志最后一个条目的索引和任期号来判断哪份日志更新。如果两份日志最后条目的任期号不同,那么任期号更大的日志更新;如果任期号相同,则更长的日志更新。
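这一“最新”比较等价于对(最后任期号,最后索引)做字典序比较,可以写成如下草图(函数名为本文假设):

```python
def at_least_as_up_to_date(cand_last_term, cand_last_index,
                           voter_last_term, voter_last_index):
    """候选者的日志至少和投票者的一样新,投票者才会投票:
    先比较最后一个条目的任期号,任期号相同时再比较日志长度(索引)。"""
    return (cand_last_term, cand_last_index) >= (voter_last_term, voter_last_index)
```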


5.4.2、Committing entries from previous terms(提交前置任期的条目)

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

如第 5.3 节所述,一旦当前任期内的某个条目存储到了大多数服务器上,领导者就知道该条目已被提交。如果领导者在提交某个条目之前崩溃,后续的领导者将尝试完成该条目的复制。然而,对于前一任期的条目,即使它已经存储在大多数服务器上,领导者也不能立即断定它已被提交。图 8 说明了这样一种情况:一个旧的日志条目虽然已经存储在大多数服务器上,但仍然可能被未来的领导者覆盖。

Overlay Log

Figure 9: If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.

图 9:如果 S1(任期 T 的领导者)在其任期中提交了一个新的日志条目,并且 S5 被选为以后任期 U 的领导者,那么必须至少有一个服务器(S3)接受该日志条目并投票给 S5。

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

为了消除图 8 中的这种问题,Raft 从不通过计算副本数来提交之前任期的日志条目。只有领导者当前任期内的日志条目才通过计算副本数来提交;一旦当前任期内的一个日志条目以这种方式被提交了,那么由于 Log Matching Property,之前的所有日志条目也就间接地被提交了。在某些情况下,领导者本可以安全地断定某个旧的日志条目已被提交(例如,该条目已存储在每一台服务器上),但 Raft 为了简洁起见,采取了更保守的做法。
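结合 5.3 节的复制计数,这条更保守的提交规则可以写成下面的草图(索引从 1 开始;数据结构与函数名为本文假设):只有条目属于领导者当前任期、且已被大多数复制时,commitIndex 才能推进到该位置。

```python
def can_advance_commit_to(n, log, match_indexes, current_term):
    """仅当索引 n 处的条目属于当前任期,且已被大多数服务器复制时,
    领导者才把 commitIndex 推进到 n;n 之前的条目随之被间接提交。"""
    majority = len(match_indexes) // 2 + 1
    replicated = sum(1 for m in match_indexes if m >= n)
    return log[n - 1]["term"] == current_term and replicated >= majority
```

对应图 8 的场景:任期 2 的条目即使已被大多数复制,在任期 4 的领导者看来也不能直接提交;而一旦任期 4 的条目满足该条件,它之前的条目便一并提交。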

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

Raft 之所以在提交规则中引入这种额外的复杂性,是因为当领导者复制之前任期的条目时,日志条目会保留其原始任期号。在其他共识算法中,如果一个新的领导者重新复制之前“任期”中的条目,它必须使用新的“任期号”。Raft 的方法使得对日志条目的推理更加容易,因为它们随着时间推移、在不同日志之间都保持着相同的任期号。此外,与其他算法相比,Raft 中的新领导者需要发送的之前任期的日志条目更少(其他算法必须发送冗余的日志条目来重新编号,然后才能提交它们)。

5.4.3、Safety argument(安全论证)

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

  1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).
  2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.
  3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).
  4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.
  5. The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.
  6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
  7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
  8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
  9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

给出了完整的 Raft 算法之后,我们现在可以更精确地论证 Leader Completeness Property 成立(该论证基于安全性证明,参见第 9.2 节)。我们采用反证法:假设 Leader Completeness Property 不成立,然后推出矛盾。假设任期 T 的领导者(LeaderT)提交了其任期内的一个日志条目,但该日志条目没有被之后某个任期的领导者存储。考虑大于 T 的最小任期 U,其领导者(LeaderU)没有存储该条目。

  1. 在LeaderU当选时,它的日志里面肯定没有这个已经被提交的日志条目。(领导者永远不会删除或覆盖条目)。
  2. LeaderT 已经将该日志条目复制给了集群中的大多数成员,而 LeaderU 在选举时收到了集群大多数成员的投票。因此,至少有一个服务器(“投票者”)既接受了来自 LeaderT 的日志条目,又投票给了 LeaderU,如图 9 所示。这个投票者是推出矛盾的关键。
  3. 投票者必然在投票给 LeaderU 之前就接受了来自 LeaderT 的这个已提交的日志条目;否则它会拒绝来自 LeaderT 的 AppendEntries 请求(因为那时它的当前任期已高于 T)。
  4. 投票者在投票给 LeaderU 时仍然存储着该条目,因为(按假设)中间的每个领导者都包含该条目,领导者从不删除条目,而跟随者只有在条目与领导者冲突时才会删除条目。
  5. 投票者把选票投给了 LeaderU,因此 LeaderU 的日志必须至少和投票者的一样新。这导致了以下两个矛盾之一。
  6. 首先,如果投票者和 LeaderU 的最后一个日志任期相同,那么 LeaderU 的日志必须至少和投票者的一样长,所以它的日志包含投票者日志中的每一个条目。这是一个矛盾,因为投票者包含那个已提交的条目,而按假设 LeaderU 并不包含。
  7. 否则,LeaderU 的最后一个日志任期必须大于投票者的。进一步说,它大于 T,因为投票者的最后一个日志任期至少是 T(它包含来自任期 T 的已提交条目)。创建 LeaderU 最后一个日志条目的那个更早的领导者,(按假设)其日志中必定包含这个已提交的条目。那么,根据 Log Matching Property,LeaderU 的日志也必须包含该已提交的条目,这是一个矛盾。
  8. 这就完成了矛盾。 因此,所有任期大于 T 的领导者必须包含任期 T 中已经提交的所有日志条目。
  9. Log Matching Property保证未来的领导者也将包含间接提交的条目,例如图 8(d) 中的索引 2。

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

依据领导者完备性属性(Leader Completeness Property),我们可以证明图 3 中的状态机安全属性(State Machine Safety Property):如果某个服务器已将给定索引处的日志条目应用到其状态机,那么其他服务器永远不会在同一索引处应用不同的日志条目。服务器将某个日志条目应用到状态机时,它的日志中直到该条目为止的部分必须与领导者的日志完全相同,并且该条目必须已被提交。现在考虑任一服务器应用某个给定日志索引处条目的最小任期;Leader Completeness Property 保证所有更高任期的领导者都将存储同一个日志条目,因此在之后任期中应用该索引处条目的服务器会应用相同的值。因此,状态机安全属性(State Machine Safety Property)成立。

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

最后,Raft 要求服务器按日志索引顺序应用条目。 结合状态机安全属性(State Machine Safety Property),这意味着所有服务器都将以相同的顺序将完全相同的日志条目集应用于其状态机。

5.5、Follower and candidate crashes(跟随者和候选者宕机)

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

到目前为止,我们关注的都是领导者的故障。跟随者和候选者的崩溃比领导者崩溃更容易处理,并且二者的处理方式相同。如果跟随者或候选者崩溃了,那么之后发送给它的 RequestVote 和 AppendEntries RPCs 都会失败。Raft 通过无限重试来处理这些失败;如果崩溃的服务器重新启动,RPC 就会成功完成。如果服务器在完成 RPC 之后、响应之前崩溃,那么它会在重新启动后再次收到相同的 RPC。Raft 的 RPC 是幂等的,所以这不会造成任何问题。例如,如果一个跟随者收到的 AppendEntries 请求中包含其日志中已经存在的日志条目,它会忽略新请求中的这些条目。

5.6、Timing and availability(时间和可用性)

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.


Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

  • broadcastTime ≪ electionTimeout ≪ MTBF

领导者选举是 Raft 中时机最关键的方面。 只要系统满足以下时序要求,Raft 将能够选举和维护一个稳定的领导者:

  • 广播时间 << 选举超时时间 << 平均故障间隔(MTBF,Mean Time Between Failures)

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

在这个不等式中,广播时间(broadcastTime)是服务器向集群中的每台服务器并行发送 RPC 并收到响应所花费的平均时间;选举超时时间(electionTimeout)是第 5.2 节中描述的选举超时时间;平均故障间隔(MTBF)是单台服务器两次故障之间的平均时间。广播时间应该比选举超时时间低一个数量级,这样领导者才能可靠地发送心跳消息,从而阻止跟随者发起新的选举;考虑到选举超时时间采用了随机化的方法,这个不等式也使得选票分裂(split votes)不太可能发生。选举超时时间应该比 MTBF 低几个数量级,这样系统才能稳定地运转。当领导者崩溃时,系统大约会在一个选举超时时间内不可用;我们希望这只占总时间的很小一部分。

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

广播时间(Broadcast Time)和 平均故障时间(MTBF,Mean Time Between Failures)是底层系统的属性,而选举超时时间是需要我们自己进行设置的。 Raft 的 RPCs 通常需要接收方将信息持久化到稳定的存储中,因此广播时间可能在 0.5 毫秒到 20 毫秒之间,具体取决于存储技术。 因此,选举超时很可能在 10 毫秒到 500 毫秒之间。 典型的服务器 MTBF 为几个月或更长时间,完全满足系统的时间因素要求。
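按照这一不等式挑选随机化选举超时的做法可以示意如下(其中“取广播时间的 10 到 20 倍,并截断到论文建议的 10 至 500 毫秒区间”是本文的假设,并非论文规定):

```python
import random

def pick_election_timeout(broadcast_time_ms, rng=None):
    """依据 broadcastTime << electionTimeout 选取一个随机化的选举超时(毫秒)。
    随机化使得各服务器的超时彼此错开,从而降低选票分裂的概率。"""
    rng = rng or random.Random()
    low = max(10.0, broadcast_time_ms * 10)   # 至少比广播时间高一个数量级
    high = max(low, min(500.0, broadcast_time_ms * 20))
    return rng.uniform(low, high)
```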

6、Cluster membership changes(集群成员变更)

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

到目前为止,我们假设集群配置(参与共识算法的服务器集)是固定的。 在实践中,有时需要更改配置,例如在服务器出现故障时更换服务器或更改复制的程度。 虽然这可以通过使整个集群下线、更改配置,然后重新启动集群来完成,但这会使集群在切换期间不可用。 另外,人为操作的因素也更容易引发系统错误。 为了避免这些问题,我们决定实现配置变更的自动化,并将其融合进共识算法中。

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).

为了保证配置变更机制的安全,在变更过程中不能存在某个时间点,使得同一任期内可以选出两个领导者。不幸的是,任何让服务器直接从旧配置切换到新配置的方法都是不安全的。不可能一次性原子地切换所有服务器,因此在转换过程中,集群有可能分裂成两个独立的大多数(参见图 10)。

Cluster Membership Changes

Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (Cold) and another with a majority of the new configuration (Cnew).

图 10:由于不同的服务器会在不同的时间进行切换,直接从一种配置切换到另一种配置是不安全的。在本例中,集群从三台服务器增加到五台。不幸的是,存在某个时间点,同一个任期内可以选举出两个不同的领导者,一个由旧配置(Cold)的大多数选出,另一个由新配置(Cnew)的大多数选出。(在图例中出现问题的时刻,Server1 可以凭借自身和 Server2 的投票拿到旧配置 3 台中 2 台的多数而赢得选举;与此同时,Server5 可以凭借自身、Server3 和 Server4 的投票拿到新配置 5 台中 3 台的多数而赢得选举,最终同时存在两个领导者。)

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [22]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

  • Log entries are replicated to all servers in both configurations.

  • Any server from either configuration may serve as leader

  • Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.

为了确保安全,配置更改必须使用两阶段法。 有多种方法可以实现这两个阶段。 例如,某些系统(例如 [22]) 在第一阶段禁用旧配置,使其无法处理客户端请求; 然后在第二阶段启用新配置。 在 Raft 中,集群的配置会首先进入到我们称之为联合共识(joint consensus)的过渡配置; 一旦达成了联合共识,系统就会转换到新的配置。 联合共识中结合了新旧配置:

  • 日志条目会被复制到集群中两种配置下的所有服务器上。
  • 任一配置中的任何服务器都可以作为领导者。
  • 选举和日志条目的提交需要分别获得旧配置和新配置中各自大多数的同意。
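联合共识下的“同意”判定可以写成下面的草图(用集合表示两种配置的成员;函数名为本文假设):任何决策都必须同时在 Cold 和 Cnew 中分别获得大多数。

```python
def joint_agreement(approvals, c_old, c_new):
    """approvals:投赞成票(或已复制该条目)的服务器集合。
    联合共识要求在旧配置和新配置中分别达到大多数,缺一不可。"""
    def majority(cfg):
        return len(approvals & cfg) > len(cfg) // 2
    return majority(c_old) and majority(c_new)
```

以图 10 的集群为例(Cold 为三台,Cnew 为五台):仅靠 {S1, S2} 满足 Cold 的大多数却不满足 Cnew 的,因此无法单方面做出决策。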
Timeline For A Configuration Change

Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the Cold,new configuration entry in its log and commits it to Cold,new (a majority of Cold and a majority of Cnew). Then it creates the Cnew entry and commits it to a majority of Cnew. There is no point in time in which Cold and Cnew can both make decisions independently.

图 11:配置更改的时间表。 虚线代表已创建但未提交的配置条目,实线代表最新提交的配置条目。 领导者首先在其日志中创建 Cold,new 配置条目并将其提交到 Cold,new(大多数 Cold 和大多数 Cnew)。 然后它创建 Cnew 条目并将其提交给大多数 Cnew。 Cold 和 Cnew 没有时间窗口可以独立做出决定。

The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.

联合共识( joint consensus)允许单个服务器在不影响安全性的基础上,在不同的特定时刻进行不同配置的转换。 此外,联合共识允许集群在整个配置更改期间继续为客户端请求提供服务。

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from Cold to Cnew, it stores the configuration for joint consensus (Cold,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of Cold,new to determine when the log entry for Cold,new is committed. If the leader crashes, a new leader may be chosen under either Cold or Cold,new, depending on whether the winning candidate has received Cold,new. In any case, Cnew cannot make unilateral decisions during this period.

集群配置通过复制日志中的特殊条目来存储和传播;图 11 展示了配置变更的过程。当领导者收到将配置从 Cold 更改为 Cnew 的请求时,它会把用于联合共识的配置(图中的 Cold,new)作为一个日志条目存储下来,并用前面描述的机制将该条目复制出去。一旦某个服务器将这个新的配置条目添加到自身的日志中,它之后所有的决策都将以该配置为依据(服务器总是使用其日志中最新的配置,无论该条目是否已提交)。这意味着领导者将使用 Cold,new 的规则来判定 Cold,new 这个日志条目何时被提交。如果领导者崩溃了,新的领导者可能在 Cold 或 Cold,new 之下被选出,这取决于获胜的候选者是否已经收到了 Cold,new。无论哪种情况,Cnew 在此期间都不能单方面做出决策。

Once Cold,new has been committed, neitherCold norCnew can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the Cold,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing Cnew and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of Cnew, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when Cold and Cnew can both make unilateral decisions; this guarantees safety.

一旦 Cold,new 被提交,Cold 和 Cnew 都无法在未经对方同意的情况下单独做出决策,并且 Leader Completeness Property 确保只有拥有 Cold,new 日志条目的服务器才能被选举为领导者。此时,领导者可以安全地创建一个描述 Cnew 的日志条目并将其复制到集群中。同样,每台服务器一收到这个配置就会立即采用它。当新配置在 Cnew 的规则下被提交后,旧配置就无关紧要了,不在新配置中的服务器可以被关闭。如图 11 所示,不存在 Cold 和 Cnew 都能单方面做出决策的时刻;这保证了安全性。

There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

重新配置还有三个问题需要解决。第一个问题是,新服务器在初始时可能不存储任何日志条目。如果它们以这种状态加入集群,可能需要相当长的时间才能追上进度,在此期间集群可能无法提交新的日志条目。为了避免可用性断层,Raft 在配置变更之前引入了一个额外的阶段:新服务器先以无投票权成员(non-voting member)的身份加入集群(领导者会把日志条目复制给它们,但它们不被计入大多数)。一旦新服务器追上了集群中的其他服务器,就可以按上面描述的方式进行重新配置了。

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry. This means that there will be a period of time (while it is committingCnew) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when Cnew is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from Cnew). Before this point, it may be the case that only a server from Cold can be elected leader.

第二个问题是,集群的领导者可能不在新配置中。在这种情况下,领导者一旦提交了 Cnew 日志条目就会退位(回到跟随者状态)。这意味着会有一段时间(提交 Cnew 期间),领导者管理着一个不包含它自己的集群;它复制日志条目,但不把自己计入大多数。领导者的转换发生在 Cnew 被提交时,因为这是新配置可以独立运作的第一个时间点(之后总是能从 Cnew 中选出领导者)。在此之前,可能只有来自 Cold 的服务器才能被选举为领导者。

The third issue is that removed servers (those not in Cnew) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

第三个问题是,被移除的服务器(不在 Cnew 中的服务器)可能会扰乱集群。这些服务器不会再收到心跳,所以它们会超时并发起新的选举。它们随后会发送带有新任期号的 RequestVote RPC,导致当前领导者转换回跟随者状态。虽然最终会选举出新的领导者,但被移除的服务器会再次超时,如此循环往复,导致系统可用性大大降低。

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

为了避免这个问题,服务器在认为当前领导者存在时会忽略 RequestVote RPC。具体来说,如果服务器在收到当前领导者消息后的最小选举超时时间内又收到了 RequestVote RPC,它不会更新自己的任期,也不会投票。这不会影响正常的选举:每个服务器在发起选举之前都至少会等待一个最小选举超时时间。然而,这有助于避免被移除的服务器造成的扰乱:只要领导者能够向集群发送心跳,它就不会被更大的任期号废黜。
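
"在最小选举超时内忽略 RequestVote"的判断逻辑大致如下(示意性草图:常量与函数名为译者自拟,且为简洁起见省略了对候选人日志新旧程度的检查):

```python
# 示意:在"最小选举超时"内收到 RequestVote 时,既不更新任期也不投票。
MIN_ELECTION_TIMEOUT = 0.150  # 秒,对应论文建议的 150ms 下限

def should_grant_vote(now, last_leader_contact, candidate_term, current_term):
    if now - last_leader_contact < MIN_ELECTION_TIMEOUT:
        return False  # 刚收到领导者的消息:很可能是被移除的服务器在捣乱
    return candidate_term >= current_term  # 正常路径:退回普通的任期比较
```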

7、Log compaction(日志压缩)

Raft’s log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.

Raft 的日志在正常运行期间会随着客户端请求的增多而不断增长,但在实际系统中,它不能无限制地增长。日志越长,占用的空间就越多,重放所需的时间也越长。如果没有某种机制来丢弃日志中积累的过时信息,最终将导致可用性问题。


Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

快照是最简单的压缩方法。通过快照将某一时刻系统的当前状态写入快照文件,保存到磁盘,然后将这一时刻之前的所有日志丢弃。 Chubby 和 ZooKeeper 都使用了快照技术,本节的其余部分描述 Raft 中的快照。

Incremental approaches to compaction, such as log cleaning [36] and log-structured merge trees [30, 5], are also possible. These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.

渐进式压缩方法也是可行的,例如日志清理 [36] 和日志结构合并树(LSM 树)[30, 5]。它们每次只对一小部分数据进行操作,因此可以随着时间的推移更均匀地分摊压缩负载。它们首先选择一个已经积累了大量被删除和被覆盖对象的数据区域,然后更紧凑地重写该区域中的存活对象并释放该区域。与快照相比,这需要大量额外的机制和复杂性;快照总是对整个数据集进行操作,从而简化了问题。日志清理需要对 Raft 进行修改,但状态机可以使用与快照相同的接口来实现 LSM 树。

Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Section 6), the snapshot also includes the latest configuration in the log as of last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.

图 12 展示了 Raft 中快照的基本思想。每个服务器独立地创建快照,快照只覆盖其日志中已提交的条目。主要工作是状态机将其当前状态写入快照。Raft 还在快照中保留了少量元数据:last included index 是快照所替换的最后一个日志条目的索引(即状态机最后应用的条目),last included term 是该条目的任期。保留它们是为了支持对快照之后第一个日志条目的 AppendEntries 一致性检查,因为该检查需要前一个条目的索引和任期(即 prevLogIndex 和 prevLogTerm)。为了支持集群成员变更(第 6 节),快照中还包含了截至 last included index 时日志中最新的配置。一旦服务器完成快照的写入,它就可以删除 last included index 及之前的所有日志条目,以及所有先前的快照。
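
图 12 描述的快照元数据可以用如下草图表示(字段名沿用论文术语,数据结构本身仅作示意,并非论文实现):

```python
from dataclasses import dataclass

# 图 12 所述快照元数据的草图。
@dataclass
class Snapshot:
    last_included_index: int  # 快照所替换的最后一个条目的索引
    last_included_term: int   # 该条目的任期,用于 AppendEntries 一致性检查
    configuration: list       # 截至 last_included_index 的最新集群配置
    state: dict               # 状态机序列化后的状态

def take_snapshot(log, first_index, applied_index, state, config):
    """丢弃 applied_index 及之前的条目,保留一致性检查所需的元数据。"""
    offset = applied_index - first_index
    snap = Snapshot(applied_index, log[offset]["term"], list(config), dict(state))
    return snap, log[offset + 1:]  # 快照之后的条目留在日志中
```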

Log Compaction

Figure 12: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state (variables x and y in this example). The snapshot’s last included index and term serve to position the snapshot in the log preceding entry 6.

图 12:服务器用新的快照替换其日志中已提交的条目(索引 1 到 5),该快照只存储当前状态(本例中为变量 x 和 y)。快照的 last included index 和 last included term 用于将快照定位到条目 6 之前的日志位置。

Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.

尽管服务器通常独立地创建快照,但领导者偶尔必须向落后的跟随者发送快照。这种情况发生在领导者已经丢弃了下一条需要发送给该跟随者的日志条目时。幸运的是,这种情况在正常运行中不太可能发生:与领导者保持同步的跟随者早已拥有该条目。但是,异常缓慢的跟随者或新加入集群的服务器(第 6 节)则并非如此。让这类跟随者跟上进度的方法,就是由领导者通过网络向它发送快照。

InstallSnapshot RPC

Figure 13: A summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; this gives the follower a sign of life with each chunk, so it can reset its election timer.

图 13:InstallSnapshot RPC 的摘要。快照被分成块进行传输;每个块都能让跟随者感知到领导者仍然存活,从而重置其选举计时器。

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind; see Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient’s log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.

领导者使用一个名为 InstallSnapshot 的新 RPC 向落后太多的跟随者发送快照;参见图 13。当跟随者通过此 RPC 收到快照时,它必须决定如何处理自己已有的日志条目。通常,快照会包含接收者日志中尚未包含的新信息。在这种情况下,跟随者会丢弃其整个日志:日志全部被快照取代,并且其中可能含有与快照冲突的未提交条目。相反,如果跟随者收到的快照只描述了其日志的一个前缀(由于重传或错误),那么被快照覆盖的日志条目会被删除,但快照之后的条目仍然有效,必须保留。
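
跟随者在 InstallSnapshot 中"保留后缀还是整体丢弃"的判断可以写成如下草图(译者自拟的示意,日志条目的形状仅作举例):

```python
# 草图:跟随者处理 InstallSnapshot 时决定保留哪些日志条目。
def retained_entries(log, first_index, snap_index, snap_term):
    offset = snap_index - first_index
    if 0 <= offset < len(log) and log[offset]["term"] == snap_term:
        # 快照只描述了本地日志的一个前缀:保留快照之后的条目。
        return log[offset + 1:]
    # 快照取代整个日志(其中可能含有与快照冲突的未提交条目)。
    return []
```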

This snapshotting approach departs from Raft’s strong leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions in reaching consensus, consensus has already been reached when snapshotting, so no decisions conflict. Data still only flows from leaders to followers, just followers can now reorganize their data.

这种快照方法与 Raft 的强领导原则背道而驰,因为跟随者可以在领导者不知情的情况下拍摄快照。 然而,我们认为这种背道而驰是合理的。 虽然拥有领导者有助于在达成共识时避免冲突的决策,但在快照时已经达成共识,因此没有决策冲突。 数据仍然只从领导者流向跟随者,现在只是跟随者可以重组他们自己的数据。

We considered an alternative leader-based approach in which only the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than it is to send and receive one over the network. Second, the leader’s implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

我们考虑了另一种基于领导者的方法(leader-based approach):只有领导者创建快照,然后将快照发送给它的每个跟随者。但这种方法有两个缺点。首先,将快照发送给每个跟随者会浪费网络带宽并拖慢快照过程。每个跟随者已经拥有生成自己快照所需的信息,而且服务器从本地状态生成快照通常比通过网络发送和接收快照成本低得多。其次,领导者的实现会更加复杂。例如,领导者需要在向跟随者复制新日志条目的同时并行地发送快照,以免阻塞新的客户端请求。

There are two more issues that impact snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.

还有两个影响快照性能的问题。首先,服务器必须决定何时创建快照。如果快照过于频繁,会浪费磁盘带宽和能源;如果快照过于稀疏,则有耗尽存储容量的风险,并且会增加重启时重放日志所需的时间。一种简单的策略是在日志达到固定的字节大小时创建快照。如果将该大小设置为明显大于快照的预期大小,那么快照带来的磁盘带宽开销就会很小。
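 
"日志达到固定字节数就创建快照"这一策略可以写成一行判断(其中 4 倍的阈值系数是译者为举例所设的假定值,论文只要求阈值"明显大于快照的预期大小"):

```python
# 草图:当日志字节数达到预期快照大小的若干倍时触发快照。
def should_snapshot(log_bytes, expected_snapshot_bytes, multiplier=4):
    return log_bytes >= multiplier * expected_snapshot_bytes
```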

The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (our implementation uses this approach).

第二个性能问题是,写入快照可能需要大量时间,我们不希望这拖延正常操作。解决方案是使用写时复制(copy-on-write)技术,这样就可以在不影响正在写入的快照的情况下接受新的更新。例如,使用函数式数据结构(functional data structures)构建的状态机天然支持这一点。或者,也可以使用操作系统的写时复制支持(例如 Linux 上的 fork)来创建整个状态机的内存快照(我们的实现采用了这种方法)。
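
基于 fork 的写时复制快照可以用如下 Python 草图示意(仅适用于类 Unix 系统;路径与 JSON 序列化方式均为译者举例,并非论文的实现):子进程继承 fork 时刻的内存视图并负责写盘,父进程立即返回、继续接受更新而不影响快照内容。

```python
import json
import os

# 草图:利用 fork 的写时复制语义做内存快照(仅适用于类 Unix 系统)。
def snapshot_with_fork(state, path):
    pid = os.fork()
    if pid == 0:
        # 子进程:看到的是 fork 时刻的状态副本,安心写盘后退出。
        with open(path, "w") as f:
            json.dump(state, f)
        os._exit(0)
    return pid  # 父进程:立即返回,继续接受新的更新
```

父进程随后可用 os.waitpid(pid, 0) 回收子进程;fork 之后父进程对 state 的修改不会出现在快照文件里。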

8、Client interaction(客户端交互)

This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [10]. These issues apply to all consensus-based systems, and Raft’s solutions are similar to other systems.

本节描述客户端如何与 Raft 交互,包括客户端如何找到集群领导者以及 Raft 如何支持线性化语义(linearizable semantics)[10]。 这些问题适用于所有基于共识算法的系统,Raft 的解决方案与其他系统大体相同。

Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly-chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly-chosen servers.

Raft 的客户端将所有请求发送给领导者。客户端第一次启动时,会连接到一个随机选择的服务器。如果客户端第一次连接的不是领导者,该服务器会拒绝客户端的请求,并提供它最近获知的领导者的信息(AppendEntries 请求中包含领导者的网络地址)。如果领导者宕机,客户端的请求会超时;客户端随后会换一个随机选择的服务器重试。

Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

我们对 Raft 的目标是实现可线性化语义(每个操作看起来都在其调用和响应之间的某个时刻瞬间执行,且恰好执行一次)。但是,按照目前的描述,Raft 可能会多次执行同一条命令:例如,如果领导者在提交日志条目之后、响应客户端之前宕机,客户端会向新的领导者重试同一条命令,导致它被执行第二次。解决方案是让客户端为每条命令分配一个唯一的序列号。然后,状态机为每个客户端记录其处理过的最新序列号以及对应的响应。如果收到一条序列号已被执行过的命令,状态机会立即返回之前的响应,而不会重新执行该请求。
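
状态机中"按客户端记录最新序列号与响应"的去重表可以写成如下草图(类名与接口为译者自拟,仅示意论文描述的机制):

```python
# 草图:状态机按客户端记录最新序列号与响应,用于过滤重试的重复命令。
class SessionTable:
    def __init__(self):
        self.latest = {}  # client_id -> (serial, response)

    def apply(self, client_id, serial, command, execute):
        seen = self.latest.get(client_id)
        if seen is not None and seen[0] == serial:
            return seen[1]  # 重复命令:直接返回旧响应,不再执行
        response = execute(command)
        self.latest[client_id] = (serial, response)
        return response
```

客户端重试同一序列号的命令时会命中记录,命令不会被执行第二次。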

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).

只读操作可以直接处理而无需向日志写入任何内容。但是,如果不采取额外措施,这样做有返回过期数据的风险,因为响应请求的领导者可能已经被它尚不知晓的新领导者取代。线性化读取绝不能返回过期数据,Raft 需要两个额外的预防措施,在不使用日志的情况下保证这一点。第一,领导者必须掌握"哪些条目已被提交"的最新信息。Leader Completeness Property 保证领导者拥有所有已提交的条目,但在任期开始时,它可能不知道哪些条目是已提交的。为了弄清这一点,它需要提交一个本任期内的条目。为此,Raft 让每个领导者在任期开始时向日志提交一个空的 no-op 条目。第二,领导者在处理只读请求之前,必须检查自己是否已被废黜(如果已经选出了更新的领导者,它的信息可能已经过期)。Raft 通过让领导者在响应只读请求之前与集群的大多数服务器交换心跳消息来解决这个问题。或者,领导者也可以依靠心跳机制提供一种租约(lease)[9],但这需要依赖时序来保证安全性(它假设时钟偏差是有界的)。
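
上述两个预防措施(提交本任期的 no-op、用心跳确认领导地位)合在一起,可以写成一个示意性的检查函数(名称与参数均为译者自拟,仅表达判断条件本身):

```python
# 草图:领导者响应只读请求前的两项检查。
def can_serve_read(terms_with_committed_entry, current_term,
                   heartbeat_acks, cluster_size):
    # 检查一:本任期已提交过条目(任期开始时提交的 no-op 即满足此条件)。
    committed_own_term = current_term in terms_with_committed_entry
    # 检查二:最近一轮心跳得到大多数确认(领导者自己也算一票)。
    still_leader = (1 + heartbeat_acks) > cluster_size // 2
    return committed_own_term and still_leader
```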

9、Implementation and evaluation(实现和评估)

We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud [33] and assists in failover of the RAMCloud coordinator. The Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available [23]. There are also about 25 independent third-party open source implementations [34] of Raft in various stages of development, based on drafts of this paper. Also, various companies are deploying Raft-based systems [34].

我们已经实现了 Raft,并将其作为复制状态机的一部分,用于存储 RAMCloud [33] 的配置信息并协助 RAMCloud 协调器的故障转移。该 Raft 实现包含大约 2000 行 C++ 代码,不包括测试、注释和空行。源代码可自由获取 [23]。基于本文的草稿,还有大约 25 个处于不同开发阶段的独立第三方 Raft 开源实现 [34]。此外,多家公司正在部署基于 Raft 的系统 [34]。

The remainder of this section evaluates Raft using three criteria: understandability, correctness, and performance.

本节的其余部分使用三个标准评估 Raft:可理解性、正确性和性能。


To measure Raft’s understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper except for log compaction; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.

为了衡量 Raft 相对于 Paxos 的可理解性,我们对斯坦福大学(Stanford University)高级操作系统课程和加州大学伯克利分校(U.C. Berkeley)分布式计算课程的高年级本科生和研究生进行了实验研究。我们分别录制了 Raft 和 Paxos 的视频讲座,并创建了相应的测验。Raft 讲座涵盖了本文除日志压缩(log compaction)之外的内容;Paxos 讲座涵盖了构建等效复制状态机所需的足够材料,包括单决策 Paxos、多决策 Paxos、重新配置,以及一些实践中需要的优化(例如领导者选举)。测验考察了对算法的基本理解,还要求学生对极端情况进行推理。每个学生观看第一个视频,参加相应的测验,再观看第二个视频,参加第二个测验。大约一半的参与者先做 Paxos 部分,另一半先做 Raft 部分,以兼顾个体表现差异以及从研究第一部分获得的经验所带来的影响。我们比较了参与者在两个测验中的分数,以确定他们是否对 Raft 表现出了更好的理解。

Raft and Paxos

Figure 14: A scatter plot comparing 43 participants’ performance on the Raft and Paxos quizzes. Points above the diagonal (33) represent participants who scored higher for Raft.

图 14:一个散点图,比较了 43 名参与者在 Raft 和 Paxos 测验中的表现。对角线以上的点(33 个)代表 Raft 得分较高的参与者。

We tried to make the comparison between Paxos and Raft as fair as possible. The experiment favored Paxos in two ways: 15 of the 43 participants reported having some prior experience with Paxos, and the Paxos video is 14% longer than the Raft video. As summarized in Table 1, we have taken steps to mitigate potential sources of bias. All of our materials are available for review [28, 31].

我们试图让 Paxos 和 Raft 之间的比较尽可能公平。 该实验在两个方面对 Paxos 有利:43 名参与者中有 15 人报告说有一些 Paxos 的先前经验,Paxos 视频比 Raft 视频长 14%。 如表 1 所述,我们已采取措施减轻潜在的偏见来源。 我们所有的材料都可供审查 [28, 31]。

On average, participants scored 4.9 points higher on the Raft quiz than on the Paxos quiz (out of a possible 60 points, the mean Raft score was 25.7 and the mean Paxos score was 20.8); Figure 14 shows their individual scores. A paired t-test states that, with 95% confidence, the true distribution of Raft scores has a mean at least 2.5 points larger than the true distribution of Paxos scores.

平均而言,参与者在 Raft 测验中的得分比 Paxos 测验高 4.9 分(满分 60 分,Raft 平均得分为 25.7,Paxos 平均得分为 20.8);图 14 显示了他们的个人得分。配对 t 检验表明,在 95% 的置信度下,Raft 分数的真实分布的均值至少比 Paxos 分数的真实分布高 2.5 分。

We also created a linear regression model that predicts a new student’s quiz scores based on three factors: which quiz they took, their degree of prior Paxos experience, and the order in which they learned the algorithms. The model predicts that the choice of quiz produces a 12.5-point difference in favor of Raft. This is significantly higher than the observed difference of 4.9 points, because many of the actual students had prior Paxos experience, which helped Paxos considerably, whereas it helped Raft slightly less. Curiously, the model also predicts scores 6.3 points lower on Raft for people that have already taken the Paxos quiz; although we don’t know why, this does appear to be statistically significant.

我们还创建了一个线性回归模型,该模型根据三个因素预测新生的测验分数:他们参加了哪个测验、他们先前的 Paxos 经验程度以及他们学习算法的顺序。 该模型预测测验的选择会产生 12.5 分的差异,这有利于 Raft。 这明显高于观察到的 4.9 分的差异,因为许多实际的学生之前都有 Paxos 经验,这对 Paxos 有很大帮助,而对 Raft 的帮助略小。 奇怪的是,该模型还预测已经参加 Paxos 测验的人在 Raft 上的得分低 6.3 分; 虽然我们不知道为什么,但这似乎在统计上是显著的。

5 Point

Figure 15: Using a 5-point scale, participants were asked (left) which algorithm they felt would be easier to implement in a functioning, correct, and efficient system, and (right) which would be easier to explain to a CS graduate student.

图 15:使用 5 分制,参与者被问及(左)他们认为哪种算法在功能正常、正确且高效的系统中更容易实现,(右)哪种算法更容易向 CS 研究生解释。

We also surveyed participants after their quizzes to see which algorithm they felt would be easier to implement or explain; these results are shown in Figure 15. An overwhelming majority of participants reported Raft would be easier to implement and explain (33 of 41 for each question). However, these self-reported feelings may be less reliable than participants’ quiz scores, and participants may have been biased by knowledge of our hypothesis that Raft is easier to understand.

我们还在测验后对参与者进行了调查,了解他们认为哪种算法更容易实现或解释;结果如图 15 所示。绝大多数参与者报告 Raft 更容易实现和解释(两个问题各为 41 人中的 33 人)。然而,这些自我报告的感受可能不如参与者的测验分数可靠,而且参与者可能因为知道我们"Raft 更容易理解"这一假设而产生偏见。

A detailed discussion of the Raft user study is available at [31].

Raft 用户研究的详细讨论可在 [31] 中找到。


We have developed a formal specification and a proof of safety for the consensus mechanism described in Section 5. The formal specification [31] makes the information summarized in Figure 2 completely precise using the TLA+ specification language [17]. It is about 400 lines long and serves as the subject of the proof. It is also useful on its own for anyone implementing Raft. We have mechanically proven the Leader Completeness Property using the TLA proof system [7]. However, this proof relies on invariants that have not been mechanically checked (for example, we have not proven the type safety of the specification). Furthermore, we have written an informal proof [31] of the State Machine Safety property which is complete (it relies on the specification alone) and relatively precise (it is about 3500 words long).

我们已经为第 5 节描述的共识机制撰写了形式化规范和安全性证明。该形式化规范 [31] 使用 TLA+ 规范语言 [17] 将图 2 中总结的信息完全精确化。它大约有 400 行长,是证明的对象。它本身对任何实现 Raft 的人也很有用。我们已经使用 TLA 证明系统 [7] 机械地证明了 Leader Completeness Property。然而,这个证明依赖于一些尚未经过机械检查的不变量(例如,我们还没有证明该规范的类型安全性)。此外,我们还写了一份 State Machine Safety 属性的非正式证明 [31],它是完整的(仅依赖于规范)且相对精确(大约 3500 字长)。


Table 1: Concerns of possible bias against Paxos in the study, steps taken to counter each, and additional materials available.

表 1:研究中对 Paxos 可能存在偏见的担忧、针对每种偏见采取的措施以及可用的其他材料。

The time to detect and replace a crashed leader

Figure 16: The time to detect and replace a crashed leader. The top graph varies the amount of randomness in election timeouts, and the bottom graph scales the minimum election timeout. Each line represents 1000 trials (except for 100 trials for “150–150ms”) and corresponds to a particular choice of election timeouts; for example, “150–155ms” means that election timeouts were chosen randomly and uniformly between 150ms and 155ms. The measurements were taken on a cluster of five servers with a broadcast time of roughly 15ms. Results for a cluster of nine servers are similar.

图 16:检测并替换宕机领导者所需的时间。上图改变选举超时的随机程度,下图缩放最小选举超时。每条线代表 1000 次试验("150–150ms"为 100 次试验),对应一种特定的选举超时选择;例如,"150–155ms"表示选举超时在 150ms 到 155ms 之间均匀随机选取。测量在五台服务器组成的集群上进行,广播时间约为 15ms。九台服务器集群的结果与之类似。


Raft’s performance is similar to other consensus algorithms such as Paxos. The most important case for performance is when an established leader is replicating new log entries. Raft achieves this using the minimal number of messages (a single round-trip from the leader to half the cluster). It is also possible to further improve Raft’s performance. For example, it easily supports batching and pipelining requests for higher throughput and lower latency. Various optimizations have been proposed in the literature for other algorithms; many of these could be applied to Raft, but we leave this to future work.

Raft 的性能类似于 Paxos 等其他共识算法。 性能最重要的情况是当已建立的领导者正在复制新的日志条目时。 Raft 使用最少数量的消息(从领导者到一半集群的单次往返)实现了这一点。 还可以进一步提高 Raft 的性能。 例如,它可以轻松支持批处理和流水线请求,以实现更高的吞吐量和更低的延迟。 在其他算法的文献中已经提出了各种优化; 其中许多可以应用于 Raft,但我们将其留给未来的工作。

We used our Raft implementation to measure the performance of Raft’s leader election algorithm and answer two questions. First, does the election process converge quickly? Second, what is the minimum downtime that can be achieved after leader crashes?

我们使用自己的 Raft 实现来测量 Raft 领导者选举算法的性能,并回答两个问题:第一,选举过程能否快速收敛?第二,领导者宕机后能够达到的最短停机时间是多少?

To measure leader election, we repeatedly crashed the leader of a cluster of five servers and timed how long it took to detect the crash and elect a new leader (see Figure 16). To generate a worst-case scenario, the servers in each trial had different log lengths, so some candidates were not eligible to become leader. Furthermore, to encourage split votes, our test script triggered a synchronized broadcast of heartbeat RPCs from the leader before terminating its process (this approximates the behavior of the leader replicating a new log entry prior to crashing). The leader was crashed uniformly randomly within its heartbeat interval, which was half of the minimum election timeout for all tests. Thus, the smallest possible downtime was about half of the minimum election timeout.

为了测量领导者选举,我们反复让由五台服务器组成的集群的领导者崩溃,并对检测到崩溃并选举出新领导者所需的时间进行计时(见图 16)。为了构造最坏情况,每次试验中各服务器的日志长度都不相同,因此有些候选人没有资格成为领导者。此外,为了诱发分裂投票,我们的测试脚本在终止领导者进程之前,会触发它同步广播一轮心跳 RPC(这近似于领导者在崩溃前复制一条新日志条目的行为)。领导者在其心跳间隔内的均匀随机时刻被终止,而心跳间隔是所有测试中最小选举超时的一半。因此,可能的最短停机时间约为最小选举超时的一半。

The top graph in Figure 16 shows that a small amount of randomization in the election timeout is enough to avoid split votes in elections. In the absence of randomness, leader election consistently took longer than 10 seconds in our tests due to many split votes. Adding just 5ms of randomness helps significantly, resulting in a median downtime of 287ms. Using more randomness improves worst-case behavior: with 50ms of randomness the worst-case completion time (over 1000 trials) was 513ms.

图 16 上方的图表显示,在选举超时中加入少量随机性就足以避免选举中的分裂投票。在没有随机性的情况下,由于出现大量分裂投票,我们测试中的领导者选举始终耗时超过 10 秒。仅仅加入 5ms 的随机性就有显著帮助,中位停机时间降至 287ms。更多的随机性可以改善最坏情况:随机范围为 50ms 时,最坏情况的完成时间(1000 次试验中)为 513ms。
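
实验中"在一个小区间内均匀随机选取选举超时"的做法可以简单示意如下(区间端点仅为举例,随机性的宽度就是打破候选人之间平局的关键):

```python
import random

# 示意:每个服务器从一个区间内均匀随机选取自己的选举超时(单位:毫秒)。
def pick_election_timeout(low_ms=150, high_ms=155, rng=random):
    return rng.uniform(low_ms, high_ms)
```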

The bottom graph in Figure 16 shows that downtime can be reduced by reducing the election timeout. With an election timeout of 12–24ms, it takes only 35ms on average to elect a leader (the longest trial took 152ms). However, lowering the timeouts beyond this point violates Raft’s timing requirement: leaders have difficulty broadcasting heartbeats before other servers start new elections. This can cause unnecessary leader changes and lower overall system availability. We recommend using a conservative election timeout such as 150–300ms; such timeouts are unlikely to cause unnecessary leader changes and will still provide good availability.

图 16 下方的图表显示,可以通过降低选举超时来减少停机时间。当选举超时为 12–24ms 时,平均只需 35ms 就能选出领导者(最长的一次试验耗时 152ms)。然而,将超时降低到这个程度以下会违反 Raft 的时序要求:领导者很难在其他服务器发起新选举之前广播完心跳。这会导致不必要的领导者变更并降低整体系统可用性。我们建议使用较为保守的选举超时,例如 150–300ms;这样的超时不太可能导致不必要的领导者变更,同时仍能提供良好的可用性。

10、Related work(相关工作)

There have been numerous publications related to consensus algorithms, many of which fall into one of the following categories:

  • Lamport’s original description of Paxos [15], and attempts to explain it more clearly [16, 20, 21].
  • Elaborations of Paxos, which fill in missing details and modify the algorithm to provide a better foundation for implementation [26, 39, 13].
  • Systems that implement consensus algorithms, such as Chubby [2, 4], ZooKeeper [11, 12], and Spanner [6]. The algorithms for Chubby and Spanner have not been published in detail, though both claim to be based on Paxos. ZooKeeper’s algorithm has been published in more detail, but it is quite different from Paxos.
  • Performance optimizations that can be applied to Paxos [18, 19, 3, 25, 1, 27].
  • Oki and Liskov’s Viewstamped Replication (VR), an alternative approach to consensus developed around the same time as Paxos. The original description [29] was intertwined with a protocol for distributed transactions, but the core consensus protocol has been separated in a recent update [22]. VR uses a leader-based approach with many similarities to Raft.


  • Lamport 对 Paxos 的原始描述 [15],以及一些试图更清楚地解释它的工作 [16, 20, 21]。
  • Paxos 的详细说明,填补缺失的细节并修改算法,为实现提供更好的基础 [26, 39, 13]。
  • 实现共识算法的系统,例如 Chubby [2, 4]、ZooKeeper [11, 12] 和 Spanner [6]。 Chubby 和 Spanner 的算法尚未详细发布,但都声称基于 Paxos。 ZooKeeper 的算法已经更详细的公布了,但是和 Paxos 有很大的不同。
  • 可应用于 Paxos [18, 19, 3, 25, 1, 27] 的性能优化。
  • Oki 和 Liskov 的 Viewstamped Replication(VR),一种与 Paxos 大约同时期发展出来的共识替代方法。最初的描述 [29] 与分布式事务协议交织在一起,但在最近的更新 [22] 中,核心共识协议已被分离出来。VR 使用基于领导者的方法,与 Raft 有许多相似之处。

The greatest difference between Raft and Paxos is Raft’s strong leadership: Raft uses leader election as an essential part of the consensus protocol, and it concentrates as much functionality as possible in the leader. This approach results in a simpler algorithm that is easier to understand. For example, in Paxos, leader election is orthogonal to the basic consensus protocol: it serves only as a performance optimization and is not required for achieving consensus. However, this results in additional mechanism: Paxos includes both a two-phase protocol for basic consensus and a separate mechanism for leader election. In contrast, Raft incorporates leader election directly into the consensus algorithm and uses it as the first of the two phases of consensus. This results in less mechanism than in Paxos.

Raft 和 Paxos 最大的区别在于 Raft 的强领导特性:Raft 将领导者选举作为共识协议不可或缺的一部分,并将尽可能多的功能集中在领导者身上。这种方法产生了更简单、更容易理解的算法。例如,在 Paxos 中,领导者选举与基本共识协议是正交的:它仅作为一种性能优化,并不是达成共识所必需的。然而,这带来了额外的机制:Paxos 既包含用于基本共识的两阶段协议,又包含单独的领导者选举机制。相比之下,Raft 将领导者选举直接纳入共识算法,并将其作为共识两个阶段中的第一个。这样,它的机制就比 Paxos 更少。

Like Raft, VR and ZooKeeper are leader-based and therefore share many of Raft’s advantages over Paxos. However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity. The published description of ZooKeeper also transfers log entries both to and from the leader, but the implementation is apparently more like Raft [35].

与 Raft 一样,VR 和 ZooKeeper 也是基于领导者的,因此相对于 Paxos 拥有许多与 Raft 相同的优势。然而,Raft 的机制比 VR 或 ZooKeeper 更少,因为它将非领导者的功能最小化了。例如,Raft 中的日志条目只沿一个方向流动:通过 AppendEntries RPC 从领导者向外流出。在 VR 中,日志条目是双向流动的(领导者可以在选举过程中接收日志条目),这带来了额外的机制和复杂性。已发表的 ZooKeeper 描述中,日志条目既会流向领导者,也会从领导者流出,但其实现显然更像 Raft [35]。

Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consensus and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses). Raft’s messages are a bit more dense than the other algorithms’, but they are simpler collectively. In addition, VR and ZooKeeper are described in terms of transmitting entire logs during leader changes; additional message types will be required to optimize these mechanisms so that they are practical.

在我们所知的基于共识的日志复制算法中,Raft 的消息类型最少。例如,我们统计了 VR 和 ZooKeeper 用于基本共识和成员变更的消息类型数(不包括日志压缩和客户端交互,因为这些几乎独立于算法本身)。VR 和 ZooKeeper 各定义了 10 种不同的消息类型,而 Raft 只有 4 种(两种 RPC 请求及其响应)。Raft 的消息比其他算法的消息信息密度稍高,但整体上更简单。此外,VR 和 ZooKeeper 的描述中都要在领导者变更时传输整个日志;要让这些机制变得实用,还需要额外的消息类型来优化它们。

Raft’s strong leadership approach simplifies the algorithm, but it precludes some performance optimizations. For example, Egalitarian Paxos (EPaxos) can achieve higher performance under some conditions with a leaderless approach [27]. EPaxos exploits commutativity in state machine commands. Any server can commit a command with just one round of communication as long as other commands that are proposed concurrently commute with it. However, if commands that are proposed concurrently do not commute with each other, EPaxos requires an additional round of communication. Because any server may commit commands, EPaxos balances load well between servers and is able to achieve lower latency than Raft in WAN settings. However, it adds significant complexity to Paxos.

Raft 的强领导者方法简化了算法,但也排除了一些性能优化。例如,Egalitarian Paxos(EPaxos)采用无领导者的方法,在某些条件下可以达到更高的性能 [27]。EPaxos 利用了状态机命令之间的可交换性:只要与某条命令并发提出的其他命令都与它可交换,任何服务器都可以只用一轮通信就提交该命令。但是,如果并发提出的命令彼此不可交换,EPaxos 就需要额外的一轮通信。由于任何服务器都可以提交命令,EPaxos 能够很好地在服务器之间均衡负载,并且在广域网(WAN)环境下能够获得比 Raft 更低的延迟。但它也给 Paxos 增加了相当大的复杂性。

Several different approaches for cluster membership changes have been proposed or implemented in other work, including Lamport’s original proposal [15], VR [22], and SMART [24]. We chose the joint consensus approach for Raft because it leverages the rest of the consensus protocol, so that very little additional mechanism is required for membership changes. Lamport’s α-based approach was not an option for Raft because it assumes consensus can be reached without a leader. In comparison to VR and SMART, Raft’s reconfiguration algorithm has the advantage that membership changes can occur without limiting the processing of normal requests; in contrast, VR stops all normal processing during configuration changes, and SMART imposes an α-like limit on the number of outstanding requests. Raft’s approach also adds less mechanism than either VR or SMART.

其他工作中已经提出或实现了几种不同的集群成员变更方法,包括 Lamport 最初的提议 [15]、VR [22] 和 SMART [24]。我们为 Raft 选择了联合共识(joint consensus)方法,因为它充分利用了共识协议的其余部分,使得成员变更只需要很少的额外机制。Lamport 的基于 α 的方法不适用于 Raft,因为它假定不需要领导人也能达成共识。与 VR 和 SMART 相比,Raft 的重配置算法的优势在于成员变更可以在不限制正常请求处理的情况下进行;相比之下,VR 在配置变更期间会停止所有的正常处理,而 SMART 对未完成请求的数量施加了类似 α 的限制。此外,Raft 的方法比 VR 或 SMART 引入的额外机制都更少。
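The overlapping-majority rule at the heart of joint consensus can be sketched in a few lines: while the cluster is in the transitional configuration C_old,new, agreement (for both log commitment and elections) requires separate majorities of the old and the new configurations. A minimal sketch, assuming configurations are sets of server ids:

```python
# Sketch of the joint-consensus quorum rule. A set of voters forms a
# quorum for C_old,new only if it contains a majority of BOTH the old
# and the new configurations; any two such quorums therefore overlap.

def has_majority(voters, config):
    ayes = sum(1 for server in config if server in voters)
    return ayes > len(config) // 2

def joint_quorum(voters, old_config, new_config):
    return has_majority(voters, old_config) and has_majority(voters, new_config)
```

For example, moving from servers `{1, 2, 3}` to `{3, 4, 5}`: the voters `{2, 3, 4}` form a majority of both configurations and so constitute a joint quorum, while `{1, 2}` is a majority of the old configuration only and does not.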


Algorithms are often designed with correctness, efficiency, and/or conciseness as the primary goals. Although these are all worthy goals, we believe that understandability is just as important. None of the other goals can be achieved until developers render the algorithm into a practical implementation, which will inevitably deviate from and expand upon the published form. Unless developers have a deep understanding of the algorithm and can create intuitions about it, it will be difficult for them to retain its desirable properties in their implementation.

算法的设计通常以正确性、效率和/或简洁性为主要目标。尽管这些都是有价值的目标,但我们认为可理解性同样重要。只有当开发人员把算法转化为实际的实现之后,其他目标才可能达成,而实现不可避免地会偏离并扩展算法已发表的形式。除非开发人员对算法有深刻的理解并能对它建立直觉,否则他们将很难在实现中保留算法的理想特性。

In this paper we addressed the issue of distributed consensus, where a widely accepted but impenetrable algorithm, Paxos, has challenged students and developers for many years. We developed a new algorithm, Raft, which we have shown to be more understandable than Paxos. We also believe that Raft provides a better foundation for system building. Using understandability as the primary design goal changed the way we approached the design of Raft; as the design progressed we found ourselves reusing a few techniques repeatedly, such as decomposing the problem and simplifying the state space. These techniques not only improved the understandability of Raft but also made it easier to convince ourselves of its correctness.

在本文中,我们研究了分布式共识问题;在这一领域,一个被广泛接受但难以理解的算法 Paxos 多年来一直困扰着学生和开发人员。我们开发了一种新算法 Raft,并且已经证明它比 Paxos 更容易理解。我们也相信 Raft 为系统构建提供了更好的基础。把可理解性作为主要设计目标改变了我们设计 Raft 的方式;随着设计的进展,我们发现自己反复使用了一些技术,例如分解问题和简化状态空间。这些技术不仅提高了 Raft 的可理解性,也使我们更容易确信它的正确性。


The user study would not have been possible without the support of Ali Ghodsi, David Mazieres, and the students of CS 294-91 at Berkeley and CS 240 at Stanford. Scott Klemmer helped us design the user study, and Nelson Ray advised us on statistical analysis. The Paxos slides for the user study borrowed heavily from a slide deck originally created by Lorenzo Alvisi. Special thanks go to David Mazieres and Ezra Hoch for finding subtle bugs in Raft. Many people provided helpful feedback on the paper and user study materials, including Ed Bugnion, Michael Chan, Hugues Evrard, Daniel Giffin, Arjun Gopalan, Jon Howell, Vimalkumar Jeyakumar, Ankita Kejriwal, Aleksandar Kracun, Amit Levy, Joel Martin, Satoshi Matsushita, Oleg Pesok, David Ramos, Robbert van Renesse, Mendel Rosenblum, Nicolas Schiper, Deian Stefan, Andrew Stone, Ryan Stutsman, David Terei, Stephen Yang, Matei Zaharia, 24 anonymous conference reviewers (with duplicates), and especially our shepherd Eddie Kohler. Werner Vogels tweeted a link to an earlier draft, which gave Raft significant exposure. This work was supported by the Gigascale Systems Research Center and the Multiscale Systems Center, two of six research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, by the National Science Foundation under Grant No. 0963859, and by grants from Facebook, Google, Mellanox, NEC, NetApp, SAP, and Samsung. Diego Ongaro is supported by The Junglee Corporation Stanford Graduate Fellowship.

如果没有 Ali Ghodsi、David Mazieres 以及伯克利 CS 294-91 和斯坦福大学 CS 240 课程学生们的支持,本文的用户研究就不可能完成。Scott Klemmer 帮助我们设计了用户研究,Nelson Ray 为我们提供了统计分析方面的建议。用于用户研究的 Paxos 幻灯片大量借用了最初由 Lorenzo Alvisi 创建的幻灯片。特别感谢 David Mazieres 和 Ezra Hoch 发现了 Raft 中的细微错误。许多人对论文和用户研究材料提供了有用的反馈,包括 Ed Bugnion、Michael Chan、Hugues Evrard、Daniel Giffin、Arjun Gopalan、Jon Howell、Vimalkumar Jeyakumar、Ankita Kejriwal、Aleksandar Kracun、Amit Levy、Joel Martin、Satoshi Matsushita、Oleg Pesok、David Ramos、Robbert van Renesse、Mendel Rosenblum、Nicolas Schiper、Deian Stefan、Andrew Stone、Ryan Stutsman、David Terei、Stephen Yang、Matei Zaharia、24 位匿名会议评审人(含重复),尤其是我们的论文指导人(shepherd)Eddie Kohler。Werner Vogels 在推特上发布了一个指向早期草案的链接,这让 Raft 获得了大量曝光。这项工作得到了千兆系统研究中心(Gigascale Systems Research Center)和多尺度系统中心(Multiscale Systems Center)的支持,它们是焦点中心研究计划(Focus Center Research Program,一个半导体研究公司计划)资助的六个研究中心中的两个;还得到了 STARnet(一个由 MARCO 和 DARPA 赞助的半导体研究公司计划)、编号为 0963859 的美国国家科学基金会基金,以及来自 Facebook、Google、Mellanox、NEC、NetApp、SAP 和三星的资助。Diego Ongaro 得到 Junglee Corporation 斯坦福大学研究生奖学金的支持。


[1] BOLOSKY, W. J., BRADSHAW, D., HAAGENS, R. B., KUSTERS, N. P., AND LI, P. Paxos replicated state machines as the basis of a high-performance data store. In Proc. NSDI’11, USENIX Conference on Networked Systems Design and Implementation (2011), USENIX, pp. 141–154.

[2] BURROWS, M. The Chubby lock service for loosely-coupled distributed systems. In Proc. OSDI’06, Symposium on Operating Systems Design and Implementation (2006), USENIX, pp. 335–350.

[3] CAMARGOS, L. J., SCHMIDT, R. M., AND PEDONE, F. Multicoordinated Paxos. In Proc. PODC’07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 316–317.

[4] CHANDRA, T. D., GRIESEMER, R., AND REDSTONE, J. Paxos made live: an engineering perspective. In Proc. PODC’07, ACM Symposium on Principles of Distributed Computing (2007), ACM, pp. 398–407.

[5] CHANG, F., DEAN, J., GHEMAWAT, S., HSIEH, W. C., WALLACH, D. A., BURROWS, M., CHANDRA, T., FIKES, A., AND GRUBER, R. E. Bigtable: a distributed storage system for structured data. In Proc. OSDI’06, USENIX Symposium on Operating Systems Design and Implementation (2006), USENIX, pp. 205–218.

[6] CORBETT, J. C., DEAN, J., EPSTEIN, M., FIKES, A., FROST, C., FURMAN, J. J., GHEMAWAT, S., GUBAREV, A., HEISER, C., HOCHSCHILD, P., HSIEH, W., KANTHAK, S., KOGAN, E., LI, H., LLOYD, A., MELNIK, S., MWAURA, D., NAGLE, D., QUINLAN, S., RAO, R., ROLIG, L., SAITO, Y., SZYMANIAK, M., TAYLOR, C., WANG, R., AND WOODFORD, D. Spanner: Google’s globally-distributed database. In Proc. OSDI’12, USENIX Conference on Operating Systems Design and Implementation (2012), USENIX, pp. 251–264.

[7] COUSINEAU, D., DOLIGEZ, D., LAMPORT, L., MERZ, S., RICKETTS, D., AND VANZETTO, H. TLA+ proofs. In Proc. FM’12, Symposium on Formal Methods (2012), D. Giannakopoulou and D. Méry, Eds., vol. 7436 of Lecture Notes in Computer Science, Springer, pp. 147–154.

[8] GHEMAWAT, S., GOBIOFF, H., AND LEUNG, S.-T. The Google file system. In Proc. SOSP’03, ACM Symposium on Operating Systems Principles (2003), ACM, pp. 29–43.

[9] GRAY, C., AND CHERITON, D. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (1989), pp. 202–210.

[10] HERLIHY, M. P., AND WING, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (July 1990), 463–492.

[11] HUNT, P., KONAR, M., JUNQUEIRA, F. P., AND REED, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proc. ATC’10, USENIX Annual Technical Conference (2010), USENIX, pp. 145–158.

[12] JUNQUEIRA, F. P., REED, B. C., AND SERAFINI, M. Zab: High-performance broadcast for primary-backup systems. In Proc. DSN’11, IEEE/IFIP Int’l Conf. on Dependable Systems & Networks (2011), IEEE Computer Society, pp. 245–256.

[13] KIRSCH, J., AND AMIR, Y. Paxos for system builders. Tech. Rep. CNDS-2008-2, Johns Hopkins University, 2008.

[14] LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7 (July 1978), 558–565.

[15] LAMPORT, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.

[16] LAMPORT, L. Paxos made simple. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.

[17] LAMPORT, L. Specifying Systems, The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley, 2002.

[18] LAMPORT, L. Generalized consensus and Paxos. Tech. Rep. MSR-TR-2005-33, Microsoft Research, 2005.

[19] LAMPORT, L. Fast Paxos. Distributed Computing 19, 2 (2006), 79–103.

[20] LAMPSON, B. W. How to build a highly available system using consensus. In Distributed Algorithms, O. Babaoglu and K. Marzullo, Eds. Springer-Verlag, 1996, pp. 1–17.

[21] LAMPSON, B. W. The ABCD’s of Paxos. In Proc. PODC’01, ACM Symposium on Principles of Distributed Computing (2001), ACM, pp. 13–13.

[22] LISKOV, B., AND COWLING, J. Viewstamped replication revisited. Tech. Rep. MIT-CSAIL-TR-2012-021, MIT, July 2012.

[23] LogCabin source code. http://github.com/logcabin/logcabin.

[24] LORCH, J. R., ADYA, A., BOLOSKY, W. J., CHAIKEN, R., DOUCEUR, J. R., AND HOWELL, J. The SMART way to migrate replicated stateful services. In Proc. EuroSys’06, ACM SIGOPS/EuroSys European Conference on Computer Systems (2006), ACM, pp. 103–115.

[25] MAO, Y., JUNQUEIRA, F. P., AND MARZULLO, K. Mencius: building efficient replicated state machines for WANs. In Proc. OSDI’08, USENIX Conference on Operating Systems Design and Implementation (2008), USENIX, pp. 369–384.

[26] MAZIERES, D. Paxos made practical. http://www.scs.stanford.edu/~dm/home/papers/paxos.pdf, Jan. 2007.

[27] MORARU, I., ANDERSEN, D. G., AND KAMINSKY, M. There is more consensus in egalitarian parliaments. In Proc. SOSP’13, ACM Symposium on Operating System Principles (2013), ACM.

[28] Raft user study. http://ramcloud.stanford.edu/~ongaro/userstudy/.

[29] OKI, B. M., AND LISKOV, B. H. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proc. PODC’88, ACM Symposium on Principles of Distributed Computing (1988), ACM, pp. 8–17.

[30] O’NEIL, P., CHENG, E., GAWLICK, D., AND O’NEIL, E. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385.

[31] ONGARO, D. Consensus: Bridging Theory and Practice. PhD thesis, Stanford University, 2014 (work in progress). http://ramcloud.stanford.edu/~ongaro/thesis.pdf.

[32] ONGARO, D., AND OUSTERHOUT, J. In search of an understandable consensus algorithm. In Proc. ATC’14, USENIX Annual Technical Conference (2014), USENIX.

[33] OUSTERHOUT, J., AGRAWAL, P., ERICKSON, D., KOZYRAKIS, C., LEVERICH, J., MAZIERES, D., MITRA, S., NARAYANAN, A., ONGARO, D., PARULKAR, G., ROSENBLUM, M., RUMBLE, S. M., STRATMANN, E., AND STUTSMAN, R. The case for RAMCloud. Communications of the ACM 54 (July 2011), 121–130.

[34] Raft consensus algorithm website. http://raftconsensus.github.io

[35] REED, B. Personal communications, May 17, 2013.

[36] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10 (February 1992), 26–52.

[37] SCHNEIDER, F. B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys 22, 4 (Dec. 1990), 299–319.

[38] SHVACHKO, K., KUANG, H., RADIA, S., AND CHANSLER, R. The Hadoop distributed file system. In Proc. MSST’10, Symposium on Mass Storage Systems and Technologies (2010), IEEE Computer Society, pp. 1–10.

[39] VAN RENESSE, R. Paxos made moderately complex. Tech. rep., Cornell University, 2012.

Author: bugwz
Link: https://bugwz.com/2021/05/01/raft/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.