译 - The Google File System

《The Google File System》 是由 Google 公司开发的分布式文件系统,旨在解决存储海量数据的问题。GFS 采用了一些独特的设计,如基于大块的文件存储、多副本存储和自动故障恢复等。GFS 能够支持高并发、高吞吐量的数据访问,并且具有良好的扩展性和可靠性。GFS 的设计思想已经被广泛应用于其他分布式存储系统的开发中,是分布式存储领域的重要里程碑之一。


We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

我们设计并实现了 Google File System,这是⼀种可扩展的分布式⽂件系统,适⽤于⼤型分布式数据密集型应⽤程序。它支持在廉价的商品硬件上运⾏时提供容错能⼒,并能够为⼤量客户端提供⾼性能的访问。

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.


The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

GFS 已经成功地满足了我们的存储需求。它在 Google 内部被广泛部署,作为我们服务所使用的数据的生成和处理的存储平台,以及用于需要大型数据集的研究和开发工作。迄今为止,最大的 GFS 集群在一千多台机器上的数千个磁盘上提供了数百 TB 的存储,并被数百个客户同时访问。

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.



D [4]: 3—Distributed file systems

D [4]: 3—分布式文件系统


Design, reliability, performance, measurement



Fault tolerance, scalability, data storage, clustered storage



We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

为了满足 Google 快速增长的数据处理需求,我们设计并实现了 Google File System(GFS)。GFS 与以前的分布式系统有很多相同的目标,比如性能、可扩展性、可靠性和可用性。然而,它的设计来源于我们对应用负载和技术环境的观察和预期,这与以前的文件系统表现出了完全不同的猜想与假设。因此,我们重新考虑了传统的选择,并探索了完全不同的设计。

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

第一,组件故障应该是常态而非例外。GFS 的存储节点由成百上千台廉价设备所构建而成,并且给数量众多的客户端提供访问服务。设备的数量和质量决定了几乎在任何时间上都会有部分组件无法正常工作,甚至于部分组件将无法从故障中恢复。我们已经看到了应用程序错误、操作系统错误、人为错误以及硬盘、内存、连接器、网络和电源造成的问题。因此,系统必须具备持续监控、错误检测、容错以及自动恢复的能力。

Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and blocksizes have to be revisited.

第二,文件相比于传统的标准来说更大。数 GB 大小的文件是十分常见的。每个文件通常包含很多应用程序对象,例如 Web 文档等。因为我们的数据集由数十亿个总量 TB 的对象组成,且这个数字还在快速增长,即使操作系统支持管理数十亿个几 KG 大小的文件的,这也是非常不明智的。因此,我们需要重新考虑像 IO 操作和块大小等的设计和参数。

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.


Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.

第四,共同设计应用程序和文件系统 API 有助于提高整个系统的灵活性。例如,我们放宽了 GFS 的一致性模型,极大的简化了文件系统,减少了应用程序的负担。我们还引入一种原子追加的操作,以便于多个客户端可以同时追加到同一个文件,而无需在他们之间进行额外的同步。这些将在本⽂后⾯进⾏更详细的讨论。

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of diskstorage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

⽬前部署了多个 GFS 集群⽤于不同的⽬的。其中最大的集群拥有超过 1000 个存储节点,超过 300 TB 的磁盘存储,并被不同机器上的数百个客户端连续不断的大量访问。



In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more details.


  • The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
  • The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go backand forth.
  • The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producerconsumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
  • High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulkat a high rate, while few have stringent response time requirements for an individual read or write.

  • 系统由许多可能经常发生故障的廉价的组件构成。它必须不断地自我检测、定期检测、容错组件故障并能够迅速的进行恢复。
  • 系统存储一定数量的大文件。 我们期望能够存储几百万个大小为 100MB 甚至更大的文件。系统中经常有一些 GB 级别的问题,且这些文件需要被高效的进行管理。系统同样也必须支持小文件,但我们不需要为它们进行优化。
  • 工作负载主要由两种读组成:大规模的流式读取和小规模的随机读取。 在大规模的流式读取中,每次读取通常会读取数百 KB,1MB 甚至于更多。来自同一个客户端的连续操作通常会读取文件的一个连续区域。 小规模的随机读取通常会在文件的某个任意偏移处读取几 KB。 性能敏感型的应用程序通常会对小规模的读取进行批处理和排序,这样可以顺序地遍历文件,而不是来回遍历。
  • 工作负载还来自很多对文件的大规模追加写入。一般来说,写入的规模与读取的规模相似。文件一旦被写入就几乎不会被再次修改。系统同样支持小规模的随机写入,但并不一定要高效地执行。
  • 系统必须能很好的定义并实现多个客户端并发向同一个文件追加数据的语义。我们的文件通常在生产者-消费者队列或多路归并中使用。来自不同机器的数百个生产者可能会并发地向同一个文件中追加写入数据。因此,最小化原子性的同步开销是必不可少的。文件在被生产后可能同时或稍后就会被消费者读取。
  • 持续的高吞吐比低延迟更重要。我们的大多数应用程序都非常重视以高速率来批量处理数据,而很少有应用程序对单个读取或写入的响应时间有严格的要求。


GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

GFS 提供了一个熟悉的文件系统接口,尽管它没有实现如 POSIX 的标准 API 。 文件在目录中分层组织并由路径名标识。 我们支持创建、删除、打开、关闭、读取和写入文件的常规操作。

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producerconsumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.

此外,GFS 支持快照(Snapshot)和记录追加(Record Append)操作。 快照能以低成本来创建文件或目录树的副本。 记录追加(Record Append)允许多个客户端同时将数据追加到同一个文件,同时保证每个客户端追加的原子性。 它对于实现多路合并结果和生产者-消费者队列很有用,许多客户端可以同时追加到这些队列而不需要进行加锁。 我们发现这些类型的文件在构建大型分布式应用程序时具有无可估量的价值。 快照和记录追加将分别在 3.4 和 3.3 节中进一步讨论。


A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

GFS 集群由一个主服务器(Master)和多个块服务器(ChunkServer)组成,并被多个客户端访问,如图 1 所示。每个客户端通常都是运行用户级服务器进程的商用 Linux 机器。 在同一台机器上运行块服务器和客户端是很容易的,前提是机器资源允许,并且能够接受运行不稳定的应用程序所导致的低可靠性。

图1: GFS架构

Files are divided into fixed-size chunks. Each chunkis identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunkcreation. Chunkservers store chunks on local disks as Linux files and read or write chunkdata specified by a chunkhandle and byte range. For reliability, each chunkis replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

文件被分成固定大小的块(Chunk)。 每个块在创建时都会由主服务器分配一个不可变且全局唯一的 64 位的句柄。块服务器将块作为 Linux 文件存储在本地磁盘上,并通过块具柄和字节范围来确定要读取和写入的数据范围。为了可靠性,每个块都被复制到多个块服务器上。 默认情况下,我们存储三个副本,用户也可以为文件命名空间的不同区域指定不同的复制级别。

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunklease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

主服务器维护所有文件系统元数据。 这包括命名空间、访问控制信息、从文件到块的映射以及块的当前位置。 它还控制系统范围的活动,例如块的租约管理、孤儿块的垃圾回收以及块服务器之间的块迁移。 主服务器周期性地与每个块服务器进行心跳通信(HeartBeat)来下发指令并采集它的状态。

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

应用程序通过使用 GFS 的客户端代码来实现文件系统的 API,并借助于它来实现在主服务器和块服务器之间进行读写操作。客户端访问主服务器进行元数据的操作,访问块服务器进行实际数据的操作。我们不提供 POSIX API ,因此不需要连接到 Linux vnode 层。

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.

客户端和块服务器都不缓存文件数据。 客户端缓存几乎没有什么好处,因为大多数应用程序都是流式的传输大量文件或者因为工作集太大以至于没有办法缓存。没有缓存也就不需要解决缓存一致性的问题,从而简化客户端和整个系统。(然而,客户端会缓存一些元数据。)块服务器不需要缓存文件数据,因为块存储是本地文件,Linux 的缓冲区缓存会将经常访问的数据保存到内存中。


Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

单一的主服务器极大的简化了我们的设计,并使得主服务器能够使用全局的信息实现复杂的块放置以及复制决策。但是,我们必须尽量减少它对读写的参与,以免它成为瓶颈。客户端永远不会通过主服务器读写文件数据。 但是,客户端会通过询问主服务器来了解它应该访问哪些块服务器。并且客户端会在有限的一段时间内缓存这些信息,在后续的很多操作中它将直接与块服务器交互。

Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunksize, the client translates the file name and byte offset specified by the application into a chunkindex within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunkindex as the key.

让我们参考图 1 介绍一次简单的读操作的交互情况。首先,在使用固定块大小的前提下,客户端将应用程序指定的文件名和字节偏移量转换为文件中的块索引。然后,客户端向主服务器发送一个包含文件名和块索引的请求。主服务器回复给客户端对应的块句柄和副本的位置。客户端使用文件名和块索引作为键来缓存这些信息。

The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunkrequire no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.

然后客户端向其中一个副本发送请求,最有可能是最近的副本。 这个请求指定块句柄和该块内的字节范围。在缓存信息过期或文件被重新打开之前,对同一块的下一步读取操作不再需要客户端与主服务器的交互。事实上,客户端通常会在单次请求中请求多个块信息,而主服务器也可以在需要请求的信息后添加之后的块的信息。这些额外的信息无需消耗额外的成本就可以避免未来客户端和主服务器的交互。


Chunksize is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system blocksizes. Each chunkreplica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunksize.

块大小是关键的设计参数之一。 我们选择了 64 MB,这比典型的文件系统块大小大得多。 每个块副本(ChunkReplica)都作为普通的 Linux 文件存储在块服务器上,并且仅在需要时进行扩展。 惰性的空间分配避免了由于内部碎片造成的空间浪费,这可能是对如此大的块大小的最大反对意见。

A large chunksize offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunkrequire only one initial request to the master for chunklocation information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunklocation information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persis tent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

一个大的块大小提供了几个重要的优点。首先,因为在同一个块上读取和写入只需要向主服务器发送一个初始请求来获取块位置信息,所以它减少了客户端和主服务器的交互需求。由于很多应用程序大多按顺序读取和写入大文件,因此这种减少对于我们的工作负载来说十分重要。即使对于小的随机读取,客户端也可以轻松缓存数 TB 数据集的所有块位置的信息。其次,客户端更有可能在较大的块大小上执行更多的操作,它可以延长与块服务器的 TCP 连接时间来减少后续的网络连接开销。再次,较大的块大小可以减少主服务器上元数据的大小。这就使得我们能够将元数据保存在内存中,这也带来了其他的一些优点,后续我们将在第 2.6.1 节中进行讨论。

On the other hand, a large chunksize, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunkfiles sequentially.


However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunkfile and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batchqueue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

然而,当 GFS 首次被批处理队列系统使用时,热点确实出现了:一个可执行文件作为一个单个块文件写入 GFS,然后同时在数百台机器上启动。 存储这个可执行文件的少数块服务器在处理数百个并发请求时出现了超载。我们可以通过设置更高的复制系数(Replication Factor)来存储这种可执行文件,并使批处理队列系统错开应用程序的启动时间来解决这个问题。一个可行的长期的解决方案是允许客户端在这种情况下能够从其他的客户端中读取数据。


The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local diskand replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.

主服务器主要存储三种类型的元数据:文件(File)和块命名空间(Chunk Namespace),文件(File)到块(Chunk)的映射,以及每个块(Chunk)的副本位置。 所有的元数据都保存在主服务器的内存中。前两种类型(文件和块命名空间,文件到块的映射)通过将变动记录到存储到主服务器本地磁盘上,并将其(变动记录)复制到远程机器的操作日志中,来保证数据的持久化。使用日志可以让我们简单、可靠地更新主服务器的状态,并且不会担心主服务器崩溃时数据不一致的问题。主服务器不会持久化存储块的位置信息。相反,它会在主服务器启动和一个块服务器加入集群时,向每个块服务器询问它们的块信息。


Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and diskspace usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.

由于元数据存储在内存中,主服务器的操作速度十分快。 此外,主服务器在后台定期扫描其整个状态变得既简单又高效。这种定期扫描可用于实现块的垃圾收集,在块服务器故障时进行数据的重新复制,以及进行块迁移来平衡块服务器之间的负载和磁盘空间。4.3 和 4.4 节将进一步讨论这些特性。

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less then 64 bytes per file because it stores file names compactly using prefix compression.

这种仅使用内存的方法存在的一个潜在问题是:块的数量以及整个系统的容量受限于主服务器所拥有的内存的容量。实践中发现这并不是一个严重的限制。主服务器需要使用不超过 64 字节的元数据来管理每个 64MB 大小的块。由于大多数文件包含很多块,所以在大多数的块都被完全填充,只有最后一个块可能被部分填充。同样地,由于文件命名空间数据(File Namespace Data)使用前缀压缩的紧凑编码,所以每个文件中这部分的数据通常会小于 64 个字节。

If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.



The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.


We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.


Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.



The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.

操作日志包含了关键元数据变更的历史记录。它是 GFS 的核心所在。它不仅是元数据的唯一持久记录,而且还作为逻辑时间线,定义并发操作的顺序。文件、块以及它们的版本(详见 4.5 节)都可以通过它们被创建的逻辑时间唯一且永久地进行标识。

Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely.


The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput. The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.

主服务器通过将多个日志记录打包成一起进行的刷写,以此来降低刷写和复制对整个系统吞吐量的影响。主服务器通过回放操作日志来恢复其文件系统状态。为了最小化启动时间,我们必须保持日志的小巧。主服务器会在操作日志增长到一定大小时生成检查点,以便在需要恢复时从本地磁盘加载最新检查点,并且只需重放检查点之后的有限数量的操作日志记录即可完成恢复。检查点采用紧凑的类 B 树结构,可以直接映射到内存中,并用于命名空间查找而无需额外解析。这进一步提高了恢复速度并改善了可用性。

Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to diskboth locally and remotely.

由于创建检查点可能需要一些时间,因此主服务器的内部状态的结构支持能够在不延迟传入变化的情况下创建新的检查点。主服务器会在一个单独的线程中切换到一个新的日志文件,并创建一个新的检查点。新的检查点包括切换之前的所有变化。对于一个拥有数百万个文件的集群,它可以在一分钟左右创建完成。当完成后,它会在本地和远程写入磁盘持久化。这种结构能够快速创建新的检查点,这使得 GFS 系统的可用性更高。

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.



GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS’s guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

GFS 采用松散的一致性模型,能够很好地支持我们高度分布式的应用程序,并且相对简单高效。现在我们讨论一下 GFS 的保证(承诺),以及它们对应用程序的意义。我们还将强调 GFS 如何确保这些保证(承诺),但具体的细节将在本论文的其他部分进行讲解。

2.7.1、GFS 的保证

File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1); the master’s operation log defines a global total order of these operations (Section 2.6.3).

文件命名空间的变动(例如:文件创建)是原子性的。它们由主服务器独立处理:命名空间的锁保证了原子性和正确性(第 4.1 节);主服务器的操作日志定义了这些操作的全局序列(第 2.6.3 节)。

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.

数据变更后的文件区域状态取决于变更类型、成功或失败以及是否存在并发变更。表1 总结了结果。如果所有客户端无论从哪个副本读取数据都将看到相同的数据,那么文件区域就是一致的。如果在文件数据变更后,区域定义了,那么它是一致的,并且客户端将看到变更所写的全部内容。当一个变更成功且没有受到并发写入的干扰时,受影响的区域被定义为一致的(暗示着区域一直是定义的):所有客户端将始终看到变更所写的内容。并发成功的变更将区域定义为未定义但一致的,但可能不反映任何一个变更所写的内容。通常,它由多个变更的混合片段组成。失败的变更会使区域不一致(因此也未定义):不同的客户端在不同的时间可能会看到不同的数据。我们将在下面介绍如何区分应用程序中的定义区域和未定义区域。应用程序不需要进一步区分不同类型的未定义区域。

表 1:变更后的文件区域状态

Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.

数据变更可能是写操作或记录追加操作。写操作会导致数据被写入应用程序指定的文件偏移量处。记录追加操作会导致数据(即 “记录” )至少被追加一次,即使在并发变更的情况下,追加的位置也是由 GFS 选择的(请参阅第 3.3 节)。 (相比之下,“常规” 追加仅是在客户端认为是当前文件末尾的偏移处进行写入。)偏移量会被返回给客户端,并标记包含记录的已定义区域的开头。此外, GFS 可能会在这些区域之间插入填充或记录副本。它们占据被视为不一致的区域,但通常远远小于用户的数据量。

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunkin the same order on all its replicas (Section 3.1), and (b) using chunkversion numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.

在一系列成功的变更之后,经过变更的文件区域保证是已定义的,并包含由最后一次变更写入的数据。 GFS 通过以下方式实现这一点:(a)按照相同顺序将变更应用于所有副本中的块(第 3.1 节);(b)使用块版本号检测任何副本上的数据是否因为其块服务器停机,错过了数据变更而较老(第 4.5 节)。陈旧的副本将永远不会参与变更,也不会被提供给向主服务器请求块位置的客户端。它们会尽快地被垃圾回收。

Since clients cache chunklocations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunkinformation for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunkrather than outdated data. When a reader retries and contacts the master, it will immediately get current chunklocations.


Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.

在成功变更很久之后,组件故障仍然可能会破坏或销毁数据。GFS 通过主服务器和所有块服务器之间的定期握手来识别失败的块服务器,并通过校验和来检测数据损坏(第 5.2 节)。一旦出现问题,数据将尽快从有效副本中恢复(第 4.3 节)。只有当所有副本在 GFS 能够及时处理之前(通常在几分钟内)全部丢失了数据时,块才会被永久丢失。即使在这种情况下,它也是会变得不可用,而不是损坏:应用程序会收到明确的错误消息,而不是损坏的数据。


GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.

GFS 应用程序可以通过一些简单的技术来适应松散一致性模型,这些技术已经用于其他目的:依赖于追加而不是覆盖写、进行检查点操作,并编写自我验证、自我识别的记录。

Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.

几乎所有的 GFS 应用程序都使用追加而不是覆盖的方式进行文件的修改。在一个典型的应用中,写操作将文件从开头一直写到结尾。在所有数据写入后,它会将文件原子重命名为永久名称,或者定期检查已经成功写入了多少数据。检查点还可以包括应用程序级别的校验和。读者只验证和处理文件区域,直到最后一个检查点,这被认为处于已定义的状态。不管一致性和并发问题如何,这种方法都为我们服务得很好。追加写比随机写更有效率,而且更能够抵御应用程序的故障。检查点允许编写者进行增量重启,并防止读者处理应用程序仍未完成的已成功写入的文件数据。

In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.

在另一种典型的用法中,许多写入者并发地将数据追加到一个文件中,以进行合并结果或作为生产者-消费者队列。至少追加一次的记录追加语义保留了每个写入者的输出。读取器根据以下方式处理偶尔出现的填充和重复项。每个写入者准备的记录都包含额外的信息,例如校验和,以便可以验证其有效性。读取器可以使用校验和标识并丢弃额外的填充和记录片段。如果读取器无法容忍偶尔的重复项(例如,如果它们会触发非幂等操作),则可以使用记录中的唯一标识符将其过滤掉,这通常也需要为应用程序实体(如 Web 文档)命名。这些用于记录 I/O 的功能(除了重复项删除)在我们的应用程序中共享的库代码中,并适用于 Google 中的其他文件接口实现。因此,相同的记录序列加上极少出现的重复项总是会被传递给记录读取器。


We designed the system to minimize the master’s involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.



A mutation is an operation that changes the contents or metadata of a chunksuch as a write or an append operation. Each mutation is performed at all the chunk’s replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

突变(变更)是一种更改块的内容或元数据的操作,例如写入或追加操作。 每个变更都在块的所有副本上执行。 我们使用租约来维护副本之间一致的变更顺序。 主服务器给其中一个副本授予某个数据块的租约,我们把这个副本称为 “主副本”。主副本为数据块中的所有变更选择了一个序列顺序,所有副本都按照这个顺序应用变更。因此,全局变更顺序首先由主服务器选择的租约授予顺序定义,然后在租约内部由主副本分配的序列号定义。

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunkis being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

租约机制旨在最大程度地减少主服务器的管理开销。租约的初始超时时间为 60 秒。然而,只要数据块正在变更,主副本就可以无限期地向主服务器发送请求来续约。这些续约请求和授予都会被存储在主服务器和所有块服务器之间定期交换的心跳消息中。有时,主服务器可能会在租约到期之前尝试撤销租约(例如,当主服务器想要禁用正在重命名的文件上的变更时)。即使主服务器失去了与主副本的通信,它仍然可以在旧租约过期后安全地向另一个副本授予新的租约。

In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.

在图 2 中,我们通过按照这些编号步骤来跟踪写入控制流程来说明这个过程。

图 2:写控制和数据流

  1. The client asks the master which chunkserver holds the current lease for the chunkand the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
  2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
  3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the networktopology regardless of which chunkserver is the primary. Section 3.2 discusses this further.
  4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.
  5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
  6. The secondaries all reply to the primary indicating that they have completed the operation.
  7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling backto a retry from the beginning of the write.

  1. 客户端向主服务器询问哪个块服务器持有该数据块的租约以及其他副本的位置。如果没有任何服务器持有该租约,主服务器将向它选择的某个副本授予该租约(未显示)。
  2. 主服务器回复客户端,提供主副本的身份和其他(次要)副本的位置。客户端会缓存这些数据以备将来的变更使用。只有当主副本无法访问或在回复表明它不再持有租约时,客户端才需要再次联系主服务器。
  3. 客户端将数据推送到所有副本。客户端可以以任何顺序这样做。每个块服务器都将数据存储在内部 LRU 缓存中,直到数据被使用或过期为止。通过将数据流与控制流分离,我们可以根据网络拓扑安排开销较大的数据流,而不管哪个块服务器是主副本,从而提高性能。第 3.2 节进一步讨论了这一点。
  4. 一旦所有副本已确认接收数据,客户端将向主副本发送写入请求。该请求标识先前推送到所有副本的数据。主副本为其接收到的所有变更分配连续的序列号,这些变更可能来自多个客户端,因此这种方式提供了必要的序列化。它按序列号顺序将变更应用于自己的本地状态。
  5. 主副本将写入请求转发给所有次要副本。每个次要副本按照主副本分配的相同序列号顺序应用变更。
  6. 所有的次要副本都会回复主副本,表示它们已经完成了该操作。
  7. 主副本会回复客户端。在任何副本中遇到的错误都将报告给客户端。如果发生错误,则可能已经在主副本和任意数量的次要副本中成功写入。 (如果在主副本中失败,则不会被分配序列号并转发。)客户端请求被视为失败,并且修改的区域处于不一致状态。我们的客户端代码通过重试失败的变更来处理此类错误。在回退到写入开头的重试之前,它将尝试步骤(3)到(7)几次。

If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in consistent but undefined state as noted in Section 2.7.

如果应用程序的写入大小很大或跨越了块边界, GFS 客户端代码则会将其拆分为多个写操作。它们都遵循上述控制流程,但可能会与其他客户端的并发操作交错和覆盖。因此,共享文件区域可能包含来自不同客户端的片段,尽管副本将是相同的,因为所有操作都按照相同的顺序在所有副本上成功完成。如第 2.7 节所述,这将使文件区域处于一种一致但未定义的状态。


We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.


To fully utilize each machine’s networkbandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). Thus, each machine’s full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.


To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” machine in the network topology that has not received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to the closest chunkserver S2 through S4 closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.

为了尽可能避免网络瓶颈和高延迟链接(例如,交换机之间的链接通常都是这样),每台机器将数据转发到网络拓扑中最近的未接收数据的机器。假设客户端正在向 S1 到 S4 的块服务器推送数据。它将数据发送到最近的块服务器(例如 S1 ),然后 S1 将其转发到最靠近 S1 的 S2 到 S4 中的最近的块服务器(例如 S2 )。类似地, S2 将其转发到 S3 或 S4 ,取决于 S2 更接近哪个块服务器,以此类推。我们的网络拓扑足够简单,可以从 IP 地址准确地估算出 “距离”。

Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL where T is the network throughput and L is latency to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.

最后,我们通过在 TCP 连接上进行流水线化的数据传输来最小化延迟。一旦块服务器接收到一些数据,它就立即开始转发。流水线化对我们特别有帮助,因为我们使用具有全双工链路的交换网络。立即发送数据不会降低接收速率。在没有网络拥塞的情况下,将 B 字节传输到 R 个副本的理想经过时间是 B/T + RL,其中 T 是网络吞吐量,L 是两台机器之间传输字节的延迟。我们的网络链接通常是 100 Mbps(T),而 L 远低于 1 毫秒。因此,理想情况下,1 MB 可以在大约 80 毫秒内分发。


GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently.

GFS 提供了一种称为 “记录追加” 的原子追加操作。在传统的写操作中,客户端需要指定要写入数据的偏移量。同时进行的写操作可能无法保证序列化的顺序:文件区域可能会包含来自多个客户端的数据片段。但是,在记录追加中,客户端只需要指定数据内容, GFS 会将其原子地追加到文件中,至少追加一次(即一系列连续的字节),并将该偏移量返回给客户端。这类似于在 Unix 中以 O_APPEND 模式打开文件进行写操作,但避免了多个写操作同时进行时的竞态条件。

Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. Clients would need additional complicated and expensive synchronization, for example through a distributed lockmanager, if they do so with traditional writes. In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.


Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. The client pushes the data to all replicas of the last chunkof the file Then, it sends its request to the primary. The primary checks to see if appending the record to the current chunkwould cause the chunkto exceed the maximum size (64 MB). If so, it pads the chunkto the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be at most one-fourth of the maximum chunksize to keep worstcase fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.

记录追加是一种变更(变异)操作,并遵循第 3.1 节中的控制流程,只需在主服务器上添加一些额外的逻辑。客户端将数据推送到文件的最后一个块的所有副本,然后将其请求发送到主服务器。主服务器检查将记录追加到当前块是否会导致该块超过最大大小( 64 MB)。如果是,它会将该块填充到最大大小,告诉副本节点执行同样的操作,并向客户端回复,指示应在下一个块上重试操作。(为保持最坏情况下的分片水平在可接受范围内,记录追加被限制为最多为最大块大小的四分之一。)如果记录适合最大大小(这是常见情况),主服务器将数据追加到其副本,告诉副本节点在确切的偏移量处写入数据,最后向客户端回复成功。

If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunkmay contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunkeven if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.

如果记录追加在任何副本上失败,客户端将重试该操作。因此,同一块的副本可能包含不同的数据,可能包括完整或部分重复的相同记录。 GFS 不能保证所有副本是按字节完全相同的。它只保证数据作为一个原子单位至少被写入一次。这个属性很容易从简单的观察中得出,即为了报告成功,数据必须已经在某个块的所有副本上的相同偏移量处被写入。此外,在此之后,所有副本至少与记录的结束位置一样长,因此,即使稍后的不同副本成为主副本,任何未来的记录也将被分配更高的偏移量或不同的块。就我们的一致性保证而言,成功记录追加操作写入其数据的区域被定义(因此是一致的),而介于区域之间的区域是不一致的(因此未定义的)。正如我们在第 2.7.2 节中讨论的那样,我们的应用程序可以处理不一致的区域。


The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled backeasily.


Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.

与 AFS [5] 类似,我们使用标准的写时复制技术来实现快照。当主服务器接收到快照请求时,它首先撤销任何正在使用待快照文件中块的租约。这确保任何后续对这些块的写操作都需要与主服务器进行交互以查找租约持有者。这将为主服务器创建块的新副本提供机会。

After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.


The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C’. It then asks each chunkserver that has a current replica of C to create a new chunkcalled C’. By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network(our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C’ and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.

当客户端在快照操作后首次想要写入块 C 时,它会向主服务器发送请求以查找当前的租赁持有者。主服务器注意到块 C 的引用计数大于 1,它将延迟回复客户端请求并选择一个新的块句柄 C’。然后它要求每个当前拥有块 C 副本的块服务器创建一个新的块,称为 C’。通过在原始块的相同块服务器上创建新块,我们确保数据可以在本地复制,而不是通过网络传输(我们的磁盘速度约为我们的 100 Mb 以太网连接的三倍)。从此时开始,请求处理与任何块的处理方式没有区别:主服务器向新块 C’ 的一个副本授予租赁,然后回复给客户端,客户端可以正常写入该块,不知道它刚刚是从现有块创建的。

4、Master 操作

The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. We now discuss each of these topics.



Many master operations can take a long time: for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. We do not want to delay other master operations while they are running. Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.


Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e, hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be efficiently represented in memory. Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.

与许多传统文件系统不同, GFS 没有列出目录中所有文件的每个目录数据结构。它也不支持同一文件或目录的别名(即 Unix 术语中的硬链接或符号链接)。 GFS 将其命名空间逻辑上表示为将完整路径名映射到元数据的查找表。通过前缀压缩,该表可以在内存中高效地表示。命名空间树中的每个节点(绝对文件名或绝对目录名)都有一个关联的读写锁。

Each master operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/…/dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, …, /d1/d2/…/dn, and either a read lockor a write lockon the full pathname /d1/d2/…/dn/leaf. Note that leaf may be a file or directory depending on the operation.

每个主服务器操作在运行前会获取一组锁。通常,如果它涉及到路径 /d1/d2/…/dn/leaf,它将获取目录名 /d1,/d1/d2,…,/d1/d2/…/dn 的读锁,并且在完整路径名 /d1/d2/…/dn/leaf 上获取读锁或写锁,具体取决于操作所涉及的是文件还是目录。

We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read lock s on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lockon /home/user/foo. The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification. The read lockon the name is sufficient to protect the parent directory from deletion.

我们现在举例说明这种锁定机制如何防止在将 /home/user 快照为 /save/user 时创建文件 /home/user/foo。快照操作在 /home 和 /save 上获取读锁,并在 /home/user 和 /save/user 上获取写锁。文件创建在 /home 和 /home/user 上获取读锁,以及在 /home/user/foo 上获取写锁。由于它们试图在 /home/user 上获取冲突的锁,因此这两个操作将正确地进行序列化。文件创建不需要在父目录上获取写锁,因为没有需要保护免受修改的 “目录” 或 “inode” 类似的数据结构。名称上的读锁足以保护父目录免受删除。

One nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lockon the directory name and a write lockon the file name. The read lockon the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.


Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.



A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rackmay be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.

GFS 集群在多个级别上是高度分布式的。它通常由数百个块服务器组成并分布在许多机架上。这些块服务器又可以从来自同一机架或不同机架的数百个客户端访问。两个位于不同机架上的计算机之间的通信可能会跨越一个或多个网络交换机。此外,机架进出带宽可能小于机架内所有机器的总带宽。多级分布为可扩展性、可靠性和可用性的数据分发提供了独特的挑战。

The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against diskor machine failures and fully utilizes each machine’s network bandwidth. We must also spread chunkreplicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rackis damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunkcan exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.



Chunkreplicas are created for three reasons: chunkcreation, re-replication, and rebalancing.


When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors. (1) We want to place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of “recent” creations on each chunkserver. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we want to spread replicas of a chunk across racks.

当主服务器创建一个块时,它会选择在哪里放置最初为空的副本,并考虑几个因素。(1) 我们希望将新的副本放置在磁盘空间利用率低于平均值的块服务器上。随着时间的推移,这将使磁盘利用率在块服务器之间平衡。(2) 我们希望限制每个块服务器上 “最近” 创建的块数。尽管创建块本身开销很低,但由于块是在有写入需求的时候被创建的,所以它能够可靠地预测到即将出现的大量的写入流量。并且在我们的一次追加写入和多次读取的工作负载中,一旦块被完全写入(写满),它们通常变得几乎只读。(3) 正如上面讨论的那样,我们希望跨机架部署块的副本。

The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This could happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.

主服务器会在可用副本数量低于用户的指定值时立即重新复制一个块。这种情况可能发生的原因有多种:一个块服务器不可用,它报告其副本可能已经损坏,其中一个磁盘因错误而被禁用,或者增加了复制目标。需要重新复制的每个块都根据几个因素进行优先排序。其中一个因素是它距离复制目标有多远。例如,我们会优先处理失去两个副本的块,而不是只失去一个副本的块。此外,我们更喜欢先重新复制实时文件的块,而不是属于最近删除的文件的块(请参见第 4.4 节)。最后,为了最小化故障对正在运行的应用程序的影响,我们会提高阻塞客户端进度的任何块的优先级。

The master picks the highest priority chunk and “clones” it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing diskspace utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both for the cluster and for each chunkserver. Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.

主服务器会选择优先级最高的数据块,并通过指示某个块服务器直接从现有的有效副本中复制数据来 “克隆” 该数据块。新副本的放置目标与创建时类似:均衡磁盘空间利用率,限制单个块服务器上活动克隆操作的数量,并在机架间分布副本。为了避免克隆流量压倒客户端流量,主服务器限制了集群和每个块服务器上活跃的克隆操作的数量。此外,每个块服务器通过限制从源块服务器发出的读请求的传输速率来限制其在每个克隆操作上花费的带宽。

Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better diskspace and load balancing. Also through this process, the master gradually fills up a new chunkserver rather than instantly swamps it with new chunks and the heavy write traffic that comes with them. The placement criteria for the new replica are similar to those discussed above. In addition, the master must also choose which existing replica to remove. In general, it prefers to remove those on chunkservers with below-average free space so as to equalize diskspace usage.



After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. We find that this approach makes the system much simpler and more reliable.

在文件被删除之后, GFS 并不会立即回收可用的物理存储空间,而是在文件和块级别的常规垃圾回收期间才会进行惰性回收。我们发现这种方法使得系统更加简单和可靠。


When a file is deleted by the application, the master logs the deletion immediately just like other changes. However instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal. When the hidden file is removed from the namespace, its inmemory metadata is erased. This effectively severs its links to all its chunks.


In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master’s metadata. The chunkserver is free to delete its replicas of such chunks.



Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. We can easily identify all references to chunks: they are in the fileto-chunk mappings maintained exclusively by the master. We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver. Any such replica not known to the master is “garbage.”

虽然在编程语言的上下文中,分布式的垃圾回收是一个需要复杂解决方案的难题,但在我们的情况下却非常简单。我们可以轻松地识别所有对块的引用:它们都在由主服务器专门维护的文件到块的映射中。我们也可以轻松地识别所有块的副本:它们是每个块服务器上指定目录下的 Linux 文件。任何不为主服务器所知的复制副本都是 “垃圾”。

The garbage collection approach to storage reclamation offers several advantages over eager deletion. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s. Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. Thus, it is done in batches and the cost is amortized. Moreover, it is done only when the master is relatively free. The master can respond more promptly to client requests that demand timely attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.


In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine tune usage when storage is tight. Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.



Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.


Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-todate replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. If the master sees a version number greater than the one in its records, the master assumes that it failed when granting the lease and so takes the higher version to be up-to-date.


The master removes stale replicas in its regular garbage collection. Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.



One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

我们在设计系统时面临的最大挑战之一是处理频繁的组件故障。 组件的质量和数量共同使这些问题成为常态而不是例外:我们不能完全信任机器,也不能完全信任磁盘。 组件故障可能导致系统不可用,或者糟糕到数据损坏。 我们讨论了我们如何应对这些挑战,以及我们在系统中内置的工具,以便在问题不可避免地发生时对其进行诊断。


Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

在 GFS 集群中的数百台服务器中,有些服务器可能在任何特定的时间上不可用。我们通过两个简单而有效的策略保持整个系统的高可用性:快速恢复和复制。


Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. Section 6.2.2 reports observed startup times.

主服务器和块服务器都被设计为:无论它们发生了什么异常,都能在几秒内启动并恢复它们的状态。实际上,我们不区分正常终止和异常终止; 服务器通常只是通过终止进程来关闭。 客户端和其他服务器在未完成的请求超时、重新连接到重启的服务器并重试时会遇到轻微的问题。 第 6.2.2 节展示了我们观察到的启动时间信息。


As discussed earlier, each chunkis replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing readonly storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.

如前面所说,每个块都被复制到不同机架上的多个块服务器上。 用户可以为文件命名空间的不同部分指定不同的复制级别。 默认值为 3 。 主服务器根据需要克隆现有副本,当块服务器离线或者通过校验和验证检测损坏的副本时保持每个块完全复制(参见第 5.2 节)。 尽管复制对我们很有帮助,但我们正在探索其他形式的跨服务器冗余,例如奇偶校验或纠删码,以满足我们不断增加的只读存储需求。 我们预计在我们非常松散耦合的系统中实施这些更复杂,具有挑战性但易于管理的冗余方案,因为我们的流量主要是追加写和读取,而不是小的随机写入。


The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disklocally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally. When it fails, it can restart almost instantly. If its machine or diskfails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.

复制主服务器的状态来确保可靠性。 它的操作日志和检查点被复制到多台机器上。 只有在其日志记录已刷新到磁盘本地和所有主副本上后,才认为对状态的更改已提交。 为简单起见,一个主服务器进程仍然负责所有变更以及后台活动,例如在内部更改系统的垃圾收集。 当它失败时,它几乎可以立即重新启动。 如果它的机器或磁盘出现故障,GFS 外部的监控基础设施会在别处启动一个新的主服务器进程,并使用复制的操作日志。 客户端仅使用主服务器的规范名称(例如 gfs-test),这是一个 DNS 别名,如果主服务器被重新定位到另一台机器则可以改名。

Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be stale within short windows is file metadata, like directory contents or access control information.

此外,主服务的 “影子“ 提供了对文件系统的只读访问,即使在主要的主服务器(primary master)关闭时也是如此。 它们是影子,而不是镜子,因为它们可能会稍微滞后于主服务器,通常是几分之一秒。 它们增强了未被主动改变的文件或不介意获得略微过时结果的应用程序的读取可用性。 事实上,由于文件内容是从块服务器读取的,应用程序也不会看到老的文件内容。 短窗口内可能过时的是文件元数据,如目录内容或访问控制信息。

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.

为了让自己了解情况,主服务器的影子读取不断增长的操作日志的副本,并将与主服务器完全相同的更改序列应用于其数据结构。 与主服务器一样,它在启动时(之后很少)轮询块服务器以定位块副本并与它们交换频繁的握手消息以监视它们的状态。 它仅依赖来获取由主要的主服务器(primary master)创建和删除副本的决定所导致的副本位置更新。


Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.

每个块服务器使用校验和来检测存储数据的损坏。 鉴于 GFS 集群通常在数百台机器上有数千个磁盘,它经常会遇到磁盘故障,导致读取和写入路径上的数据损坏或丢失。 (一个原因参见第 7 节。)我们可以使用其他块副本从损坏中恢复,但是通过跨块服务器比较副本来检测损坏是不切实际的。 此外,不同的副本可能是合法的:GFS 变更(突变)的语义,特别是前面讨论的原子记录追加,不保证副本的相同。 因此,每个块服务器必须通过维护校验和来独立验证自己副本的完整性。

A chunkis broken up into 64 KB blocks. Each has a corresponding 32 bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.

块被分成 64 KB 的块。 每个都有相应的 32 位校验和。 与其他元数据一样,校验和保存在内存中并与日志记录一起永久存储,与用户数据分开。

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Therefore chunkservers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.

对于读取操作,块服务器在将任何数据返回给请求者之前验证与读取范围重叠的数据块的校验和,无论是客户端还是另一个块服务器。 因此块服务器不会将损坏传播到其他机器。 如果块与记录的校验和不匹配,块服务器将错误返回给请求者并将不匹配报告给主服务器。 作为响应,请求者将从其他副本读取,而主服务器将从另一个副本克隆块。 在一个有效的新副本就位后,主服务器指示报告不匹配的块服务器删除其副本。

Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.

由于多种原因,校验和对读取性能几乎没有影响。 由于我们的大部分读取至少跨越几个块,因此我们只需要读取和校验和校验相对少量的额外数据以进行验证。 GFS 客户端代码通过尝试在校验和块边界对齐读取进一步减少了这种开销。 此外,块服务器上的校验和查找和比较是在没有任何 I/O 的情况下完成的,并且校验和的计算通常可以与 I/O 重叠。

Checksum computation is heavily optimized for writes that append to the end of a chunk(as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the blockis next read.

校验和的计算对于块末尾的追加写入(与覆盖现有数据的写入相反)进行了高度优化,因为它们在我们的工作负载中占主导地位。 我们只是增量地更新最后一个部分校验和块的校验和,并为追加填充的任何全新校验和块计算新的校验和。 即使最后一个部分校验和块已经损坏并且我们现在无法检测到它,新的校验和值也不会与存储的数据匹配,并且在下次读取块时会像往常一样检测到损坏。

In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.

相反,如果写入覆盖了块的现有范围,我们必须读取并验证被覆盖范围的第一个和最后一个块,然后执行写入,最后计算并记录新的校验和。 如果我们在部分覆盖之前不验证第一个和最后一个块,新的校验和可能会隐藏未被覆盖区域中存在的损坏。

During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.

在空闲期间,块服务器可以扫描并验证非活动块的内容。 这使我们能够检测很少读取的块中的损坏。 一旦检测到损坏,主服务器就可以创建一个新的未损坏的副本并删除损坏的副本。 这可以防止不活动但已损坏的块副本欺骗主服务器,使其认为它具有足够的块有效副本。


Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep these logs around as far as space permits.

广泛而详细的诊断日志记录在问题隔离、调试和性能分析方面提供了不可估量的帮助,同时仅需要极低的成本开销。 没有日志,就很难理解机器之间短暂的、不可重复的交互现象。 GFS 服务器生成诊断日志,记录许多重要事件(例如块服务器的启动和关闭)和所有 RPC 请求和回复。 这些诊断日志可以随意删除而不影响系统的正确性。 但是,我们尽量在空间允许的范围内保留这些日志。

The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis.

RPC 日志包括在线路上发送的确切请求和响应,正在读取或写入的文件数据除外。 通过匹配请求与回复并整理不同机器上的 RPC 记录,我们可以重建整个交互历史来诊断问题。 日志还用作负载测试和性能分析的跟踪。

The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. The most recent events are also kept in memory and available for continuous online monitoring.

日志记录对性能的影响很小(并且远远超过其好处),因为这些日志是按顺序和异步写入的。 最近的事件也保存在内存中,可用于连续在线监控。


In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

在本节中,我们提供了一些微基准来说明 GFS 体系结构和实现中的固有的瓶颈,以及来自 Google 使用的真实集群的一些数字。


We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunkservers and hundreds of clients.

我们在一个包含 1 个主服务器、 2 个服务器副本、 16 个块服务器和 16 个客户端的 GFS 集群上进行了性能测试。请注意,此配置是为了方便测试而设置的。典型的集群通常具有数百个块服务器和数百个客户端。

All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.

所有机器都配置有 2 个 1.4 GHz 的 PIII 处理器、2 GB 内存、2 个 80 GB 的 5400 转速(Revolutions Per Minute , 转每分)的磁盘以及 100 Mbps 全双工以太网,并连接到 1 个 HP 2524 交换机。所有 19 台 GFS 服务器机器都连接到一个交换机,而所有 16 台客户机连接到另一个交换机。这两个交换机通过 1 Gbps 的链路连接。


N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.

N 个客户端同时从文件系统读取。每个客户端从 320 GB 的文件集中随机选择一个 4 MB 的区域进行读取。这个过程重复 256 次,以便每个客户端最终能读取 1 GB 的数据。所有块服务器仅有 32 GB 的内存,因此我们预计 Linux 缓存中最多只有 10% 的命中率。我们的结果应该接近冷启动缓存的结果。

Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s linklimit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunkserver.

图 3 (a) 显示了 N 个客户端的聚合读取速率及其理论极限。当两个交换机之间的 1 Gbps 链路被打满时,聚合读取速率的极限峰值为 125 MB/s ,或者当 100 Mbps 的网络接口被打满时,每个客户端的极限峰值为 12.5 MB/s 。当只有一个客户端读取时,观察到的读取速率为 10 MB/s ,即每个客户端极限峰值的 80% 。当有 16 个读取者时,聚合读取速率达到 94 MB/s ,约为 125 MB/s 链路限制的 75% ,即每个客户端的读取速率为 6 MB/s 。效率从 80% 下降到 75% ,因为随着读取者数量的增加,多个读取者同时从同一块服务器读取的概率也增加了。


N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunk servers, each with a 12.5 MB/s input connection.

N 个客户端同时向 N 个不同的文件写数据。每个客户端以连续写 1 MB 的方式将 1 GB 的数据写入一个新文件。图 3 (b) 显示了聚合写入速率及其理论极限。极限值稳定在 67 MB/s 左右,因为我们写每个字节的时候都需要将他们写入到 16 个块服务器中的 3 个中,所以每个服务器都会有 12.5 MB/s 的输入流量。

The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate.

一个客户端的写入速率为 6.3 MB/s ,约为极限速率的一半。网络堆栈是造成这种情况的主要原因。它不能很好的与我们的将数据推送给块副本的流水线的方案相结合。将数据从一个副本传播到另一个副本的延迟会降低整体写入速率。

Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.

当有 16 个客户端时,聚合写入速率会达到 35 MB/s (每个客户端 2.2 MB/s ),约为理论极限的一半。与读取情况一样,随着客户端数量的增加,多个客户端同时向同一块服务器写入的概率越来越大。由于每个写入操作会涉及到三个不同的副本,所以,与 16 个读取者相比,16 个写入者发生冲突的概率会更大。

Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.



Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients.

图 3 (c) 显示了记录追加性能。 N 个客户端同时追加写数据到一个单独的文件中。性能受限于存储文件的最后一个块的块服务器的网络带宽,与客户端数量无关。对于一个客户端,性能从 6.0 MB/s 开始,对于 16 个客户端则降至 4.8 MB/s ,主要是由于拥塞和不同客户端感知的网络传输速率的差异。

Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.

我们的应用程序往往会同时产生多个这样的文件。换句话说, N 个客户端同时追加写数据到 M 个共享文件中,其中 N 和 M 都是几十或几百个。因此,在实践中,我们实验中的块服务器网络拥塞并不是一个重大问题,因为客户端可以在为另一个文件的块服务器忙碌时继续写入一个文件。


We now examine two clusters in use within Google that are representative of several others like them. Cluster A is used regularly for research and development by over a hundred engineers. A typical taskis initiated by a human user and runs up to several hours. It reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. Cluster B is primarily used for production data processing. The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. In both cases, a single “task” consists of many processes on many machines reading and writing many files simultaneously.

我们现在来看两个 Google 内部具有代表性集群的实际使用案例。集群 A 通常由 100 多名工程师用于研究和开发。一个典型的任务会由一个用户发起,并持续几个小时。它会读取数 MB 到数 TB 的数据,并对数据进行转换或分析,然后将结果写回集群。集群 B 主要用于生产数据处理。任务持续时间更长,会连续生成和处理数 TB 的数据集,并只在偶尔的时候才需要人来干预。在这两种情况下,单个 “任务” 都需要包括多个进程,并在许多机器上同时读写许多文件。

表 2:两个 GFS 集群的特征


As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. “Used space” includes all chunk replicas. Virtually all files are replicated three times. Therefore, the clusters store 18 TB and 52 TB of file data respectively.

正如表格中的前五个条目所展示的,这两个集群都拥有数百个块服务器,并支撑着数 TB 的磁盘空间,这些磁盘空间的使用率都挺高。 “已使用空间” 包括所有块的副本。几乎所有文件都有三个副本。因此,这两个集群分别存储了 18TB 和 52TB 的文件数据。

The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files which were deleted or replaced by a new version but whose storage have not yet been reclaimed. It also has more chunks because its files tend to be larger.

这两个集群拥有类似数量的文件,但 B 集群有更高比例的 “死文件”(无用文件),即那些已经被删除或者被新版本替换但其存储空间还没有被回收的文件。 B 集群还有更多的块,因为它的文件往往较大。


The chunkservers in aggregate store tens of GBs of metadata, mostly the checksums for 64 KB blocks of user data. The only other metadata kept at the chunkservers is the chunk version number discussed in Section 4.5.

总体而言,这些块服务器存储了数十 GB 的元数据,主要是用户数据的 64 KB 块的校验和。在块服务器中保留的唯一其他元数据是第 4.5 节中讨论的块版本号。

The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file on average. This agrees with our assumption that the size of the master’s memory does not limit the system’s capacity in practice. Most of the per-file metadata is the file names stored in a prefix-compressed form. Other metadata includes file ownership and permissions, mapping from files to chunks, and each chunk’s current version. In addition, for each chunk we store the current replica locations and a reference count for implementing copy-on-write.

在主服务器中保存的元数据要小得多,只有数十 MB ,平均每个文件约占用 100 个字节。这与我们的假设相符,即主服务器中内存的大小实际上并不会限制整个系统的容量。单个文件的大部分元数据就是文件名,这里使用了前缀压缩的方式来存储它们。其他元数据有:文件所有权和权限,文件到块的映射以及每个块的当前版本。此外,对于每个块,我们还存储当前副本位置和引用计数,以便于实现写时复制(COW)。

Each individual server, both chunkservers and the master, has only 50 to 100 MB of metadata. Therefore recovery is fast: it takes only a few seconds to read this metadata from disk before the server is able to answer queries. However, the master is somewhat hobbled for a period – typically 30 to 60 seconds – until it has fetched chunk location information from all chunkservers.

每个单独的服务器,包括块服务器和主服务器,只有 50MB 到 100MB 的元数据。因此,恢复速度很快:在服务器能够相应查询之前,仅需几秒钟的时间就可以从磁盘上读取元数据。但是,主服务器在一段时间内会受到一些限制,这通常是30到60秒,这是因为主服务器需要从所有的块服务器中获取块的位置信息。


Table 3 shows read and write rates for various time periods. Both clusters had been up for about one week when these measurements were taken. (The clusters had been restarted recently to upgrade to a new version of GFS.)

表 3 显示了各个时间段的读写速率。在进行这些测量时,这两个集群已经运行了大约一周。(这两个集群最近已重新启动,以升级到 GFS 的新版本。)

The average write rate was less than 30 MB/s since the restart. When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.

自重新启动以来,平均写入速率不到 30MB/s 。在进行这些测量时, B 正在进行一次写入活动,每秒生成大约 100MB 的数据,由于写入会传播到三个副本,所以这会产生 300MB/s 的网络负载。

图 3:聚合吞吐量

Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.

顶部曲线显示了我们的网络拓扑施加的理论限制。 底部曲线显示测量的吞吐量。 他们有显示 95% 置信区间的误差,在某些情况下由于测量方差小而难以辨认。

表 3:两个 GFS 集群的性能指标

The read rates were much higher than the write rates. The total workload consists of more reads than writes as we have assumed. Both clusters were in the middle of heavy read activity. In particular, A had been sustaining a read rate of 580 MB/s for the preceding week. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peakread rates of 1300 MB/s, but its applications were using just 380 MB/s.

读取速率比写入速率高得多。我们假设总的工作负载中含有较多的读取操作和较少的写入操作。这两个集群都处于大量读取操作的中间阶段。特别地, A 在前一周一直保持着 580MB/s 的读取速率。它的网络配置可以支持 750MB/s ,因此它有效地利用了自己的资源。集群 B 可以支持最大读取速率为 1300MB/s ,但它的应用程序只使用了 380MB/s 。

6.2.4、Master 负载

Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. The master can easily keep up with this rate, and therefore is not a bottleneckfor these workloads.

表 3 还显示,发送到主服务器的操作速率约为每秒 200 到 500 个。主服务器可以轻松应对这个速率,因此对于这些工作负载来说主服务器并没有瓶颈。

In an earlier version of GFS, the master was occasionally a bottleneckfor some workloads. It spent most of its time sequentially scanning through large directories (which contained hundreds of thousands of files) looking for particular files. We have since changed the master data structures to allow efficient binary searches through the namespace. It can now easily support many thousands of file accesses per second. If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.

在 GFS 的早期版本中,主服务器偶尔会成为某些工作负载的瓶颈。为了寻找特定的文件,它的大部分时间都在顺序扫描包含数十万个文件的大目录。我们后来改变了主服务器的数据结构,使其能够通过命名空间进行高效的二分查找。现在它可以轻松应对每秒数千次的文件访问。如果需要,我们可以通过在命名空间数据结构前面放置名称查找缓存来进一步加快速度。


After a chunkserver fails, some chunks will become underreplicated and must be cloned to restore their replication levels. The time it takes to restore all such chunks depends on the amount of resources. In one experiment, we killed a single chunkserver in cluster B. The chunkserver had about 15,000 chunks containing 600 GB of data. To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.

当一个块服务器异常后,一些块就会缺少一些备份数量,因此必须进行克隆以恢复它们的备份数量。恢复所有这样的块所需的时间取决于可用资源。在一个实验中,我们在 B 集群中杀死了一个块服务器。这个块服务器大约有 15,000 个块,其中包含了 600 GB 的数据。为了给调度决策提供一些资源,并尽量减少对正在运行的应用程序的影响,我们采用默认的参数并将这个集群限制为 91 个并发克隆( 40% 的块服务器数量),每个克隆操作允许消耗最多 6.25 MB/s( 50 Mbps )。所有的块在 23.2 分钟内恢复,有效的复制速率为 440 MB/s 。

In another experiment, we killed two chunkservers each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to having a single replica. These 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.

在另一个实验中,我们关闭了两个块服务器,每个块服务器中大约有 16,000 个块和 660GB 的数据。这次两个块服务器的故障导致了 266 个块只有一个副本。这 266 个块将会以更高的优先级被克隆,所有这些块在 2 分钟内都被恢复到至少 2 个副本,从而使集群处于可以容忍即使另一个块服务器故障但也不会丢失数据的状态。

6.3、 工作负载分解

In this section, we present a detailed breakdown of the workloads on two GFS clusters comparable but not identical to those in Section 6.2. Cluster X is for research and development while cluster Y is for production data processing.

在本节中,我们将详细介绍两个与 6.2 节中的类似但不完全相同的 GFS 集群的工作负载。 X 集群用于研究和开发,而 Y 集群用于生产数据处理。


These results include only client originated requests so that they reflect the workload generated by our applications for the file system as a whole. They do not include interserver requests to carry out client requests or internal background activities, such as forwarded writes or rebalancing.


Statistics on I/O operations are based on information heuristically reconstructed from actual RPC requests logged by GFS servers. For example, GFS client code may breaka read into multiple RPCs to increase parallelism, from which we infer the original read. Since our access patterns are highly stylized, we expect any error to be in the noise. Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running clients to do so and cumbersome to collect the results from as many machines.

I/O 操作的统计数据是基于从 GFS 服务器记录的实际 RPC 请求中启发式重建(heuristically reconstructed)的信息。例如, GFS 客户端代码可能会将读取操作分解为多个 RPC 以增加并行性,从中我们推断出原始读取操作。鉴于我们的访问模式高度规范化,我们预计任何错误都在可接受范围内。虽然应用程序的显式日志记录可能会提供略微更准确的数据,但重新编译和重启数千个正在运行的客户端来执行这种日志记录在逻辑上是不可能的,并且从这么多机器上收集结果也是很麻烦的。

One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

人们应该注意不要过度泛化我们的工作量。由于 Google 能够完全控制 GFS 和其应用程序,这些应用程序往往会针对于 GFS 进行调优,反之亦然, GFS 也是为这些应用程序而设计的。类似的相互影响也可能存在于通用应用程序和文件系统之间,但在我们的情况下,影响可能更加明显。



For reads, the size is the amount of data actually read and transferred, rather than the amount requested.


Table 4 shows the distribution of operations by size. Read sizes exhibit a bimodal distribution. The small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. The large reads (over 512 KB) come from long sequential reads through entire files.

表格4展示了操作按大小的分布情况。读取操作的大小呈双峰分布。小读取(小于 64 KB )来自于寻址密集型客户端,这些客户端在大文件中查找小数据块。大读取(大于 512 KB )来自于对整个文件的长时间连续读取操作。

A significant number of reads return no data at all in cluster Y. Our applications, especially those in the production systems, often use files as producer-consumer queues. Producers append concurrently to a file while a consumer reads the end of file. Occasionally, no data is returned when the consumer outpaces the producers. Cluster X shows this less often because it is usually used for short-lived data analysis tasks rather than long-lived distributed applications.

在集群 Y 中,有相当数量的读取操作返回的数据为空。我们的应用程序,特别是生产系统中的应用程序,通常将文件用作生产者-消费者队列。生产者同时向文件追加数据,而消费者则读取文件末尾的数据。偶尔,当消费者的速度超过生产者时,可能会返回空数据。与长期运行的分布式应用程序不同,集群 X 通常用于短期数据分析任务,因此这种情况在集群 X 中发生的概率较小。

Write sizes also exhibit a bimodal distribution. The large writes (over 256 KB) typically result from significant buffering within the writers. Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the smaller writes (under 64 KB).

写入操作的大小同样呈双峰分布。大写入操作(大于 256 KB )通常是由写入程序中的大量缓存引起的。缓存数据较少、更频繁进行检查点或同步、或者生成的数据量较少的写入程序导致了小写入操作(小于 64 KB )。

As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does because our production systems, which use cluster Y, are more aggressively tuned for GFS.

关于记录追加操作,我们看到集群 Y 中大记录的追加比例比集群 X 高得多,因为我们使用集群 Y 的生产系统更积极地针对 GFS 进行了调优。

Table 5 shows the total amount of data transferred in operations of various sizes. For all kinds of operations, the larger operations (over 256 KB) generally account for most of the bytes transferred. Small reads (under 64 KB) do transfer a small but significant portion of the read data because of the random seek workload.

表 5 显示了各种大小的操作中传输的总数据量。对于所有类型的操作来说,较大的操作(大于 256 KB )通常占据了大部分传输的字节数。由于随机查找工作负载的存在,小型的读取操作(小于 64 KB )确实会传输一小部分但却很重要的读取数据。


Record appends are heavily used especially in our production systems. For cluster X, the ratio of writes to record appends is 108:1 by bytes transferred and 8:1 by operation counts. For cluster Y, used by the production systems, the ratios are 3.7:1 and 2.5:1 respectively. Moreover, these ratios suggest that for both clusters record appends tend to be larger than writes. For cluster X, however, the overall usage of record append during the measured period is fairly low and so the results are likely skewed by one or two applications with particular buffer size choices.

记录追加操作在我们的生产系统中得到了广泛使用。对于集群 X ,按传输的字节数计算,写操作与记录追加操作的比率为 108:1 ,按操作计数计算,比率为 8:1 。对于生产系统使用的集群 Y ,这些比率分别为 3.7:1 和 2.5:1 。此外,这些比率表明,对于这两个集群,记录追加操作往往比写操作更大。然而,对于集群 X ,在测量期间记录追加操作的整体使用率相对较低,因此结果可能会受到一两个具有特定缓冲区大小选择的应用程序的影响。

As expected, our data mutation workload is dominated by appending rather than overwriting. We measured the amount of data overwritten on primary replicas. This approximates the case where a client deliberately overwrites previous written data rather than appends new data. For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% of mutation operations. For cluster Y, the ratios are both 0.05%. Although this is minute, it is still higher than we expected. It turns out that most of these overwrites came from client retries due to errors or timeouts. They are not part of the workload per se but a consequence of the retry mechanism.

正如预期的那样,我们的数据变更(变异)的⼯作负载主要是追加⽽不是覆盖。我们测量了主副本上被覆盖的数据量。这近似于客户端有意覆盖之前写入的数据而不是追加新数据的情况。对于 X 集群,覆盖的字节占变更(变异)字节的不到 0.0001% ,变更(变异)操作的比例不到 0.0003% 。对于 Y 集群,这些比例都是 0.05% 。尽管这很小,但它仍然比我们预期的要高。事实证明,大多数这些覆盖都来自于客户端重试由于错误或超时而导致的情况。它们不是工作负载本身的一部分,而是重试机制的后果。

表 5:按操作大小划分的传输字节数细分

For reads, the size is the amount of data actually read and transferred, rather than the amount requested. The two may differ if the read attempts to read beyond end of file, which by design is not uncommon in our workloads.

对于读取,大小是实际读取和传输的数据量,而不是请求的数据量。 如果读取尝试读取文件末尾以外的内容,这两者可能会有所不同,这在我们的工作负载中并不少见。

表 6:按类型划分的主服务器的请求

6.3.4、Master 工作负载

Table 6 shows the breakdown by type of requests to the master. Most requests ask for chunk locations (FindLocation) for reads and lease holder information (FindLeaseLocker) for data mutations.

表格 6 显示了对主服务器请求类型的分类。大多数请求是为了读取数据而请求块位置(FindLocation),而租约持有者信息(FindLeaseLocker)主要是用于进行数据变更(变异)。

Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. Some of this difference is further hidden in the difference in Open requests because an old version of a file may be implicitly deleted by being opened for write from scratch (mode “w” in Unix open terminology).

集群 X 和 Y 的删除请求数量明显不同,因为集群 Y 存储的生产数据集定期被新版本替换和重生成。其中一些差异进一步隐藏在打开请求的差异中,因为通过从头开始写入( Unix 打开术语中的 “w” 模式)打开文件的旧版本可能会被隐式删除。

FindMatchingFiles is a pattern matching request that supports “ls” and similar file system operations. Unlike other requests for the master, it may process a large part of the namespace and so may be expensive. Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file system to understand global application state. In contrast, cluster X’s applications are under more explicit user control and usually know the names of all needed files in advance.

FindMatchingFiles 是一种支持 “ls” 和类似文件系统操作的模式匹配请求。与对主服务器的其他请求不同,它可能处理命名空间的大部分,因此可能很昂贵。集群 Y 更经常看到这个请求,因为自动化数据处理任务倾向于检查文件系统的部分以了解全局应用程序状态。相比之下,集群 X 的应用程序处于更明确的用户控制之下,并且通常提前知道所有需要的⽂件的名称。


In the process of building and deploying GFS, we have experienced a variety of issues, some operational and some technical.

在构建和部署 GFS 的过程中,我们遇到了各种问题,有些是操作上的问题,有些是技术上的问题。

Initially, GFS was conceived as the backend file system for our production systems. Over time, the usage evolved to include research and development tasks. It started with little support for things like permissions and quotas but now includes rudimentary forms of these. While production systems are well disciplined and controlled, users sometimes are not. More infrastructure is required to keep users from interfering with one another.

最初, GFS 被设想为我们生产系统的后端文件系统。随着时间的推移,使用方式发展为包括研究和开发任务。它最初对权限和配额等方面的支持很少,但现在包括这些方面的基本形式。虽然生产系统受到良好的纪律和控制,但用户有时却没有。需要更多的基础设施来防止用户相互干扰。

Some of our biggest problems were disk and Linux related. Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked, but occasionally the mismatches would cause the drive and the kernel to disagree about the drive’s state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption, while concurrently we modified the kernel to handle these protocol mismatches.

我们遇到的一些最大的问题与磁盘和 Linux 有关。我们的许多磁盘声称它们支持一系列 IDE 协议版本,但实际上只对较新的版本做出可靠响应。由于协议版本非常相似,这些驱动器大多可以正常工作,但偶尔不匹配会导致驱动器和内核对驱动器的状态产生分歧。由于内核中的问题,这会无声地损坏数据。这个问题促使我们使用校验和来检测数据损坏,同时我们修改了内核以处理这些协议不匹配。

Earlier we had some problems with Linux 2.2 kernels due to the cost of fsync(). Its cost is proportional to the size of the file rather than the size of the modified portion. This was a problem for our large operation logs especially before we implemented checkpointing. We worked around this for a time by using synchronous writes and eventually migrated to Linux 2.4.

早期,我们由于 fsync() 的成本与 Linux 2.2 内核存在一些问题。fsync() 的成本与文件的大小成正比,而不是修改部分的大小。这对于我们的大型操作日志来说是一个问题,特别是在我们实施检查点之前。我们通过使用同步写入来解决了这个问题,并最终迁移到了 Linux 2.4 。

Another Linux problem was a single reader-writer lock which any thread in an address space must hold when it pages in from disk(reader lock) or modifies the address space in an mmap() call (writer lock). We saw transient timeouts in our system under light load and looked hard for resource bottlenecks or sporadic hardware failures. Eventually, we found that this single lock blocked the primary network thread from mapping new data into memory while the disk threads were paging in previously mapped data. Since we are mainly limited by the network interface rather than by memory copy bandwidth, we worked around this by replacing mmap() with pread() at the cost of an extra copy.

另一个 Linux 问题是单个读写锁,任何一个地址空间中的线程在从磁盘页面调入(读取锁)或修改地址空间中的 mmap() 调用(写入锁)时必须持有该锁。我们在低负载下看到了系统中的瞬时超时,并且努力寻找资源瓶颈或零星的硬件故障。最终,我们发现这个单一的锁会阻止主网络线程将新数据映射到内存中,而磁盘线程正在调入先前映射的数据。由于我们主要受到网络接口的限制而不是内存复制带宽的限制,我们通过将 mmap() 替换为 pread() 来解决这个问题,代价是多了一次复制。

Despite occasional problems, the availability of Linux code has helped us time and again to explore and understand system behavior. When appropriate, we improve the kernel and share the changes with the open source community.

尽管偶尔会遇到问题, Linux 代码的可用性一次又一次地帮助我们探索和理解系统行为。在适当的情况下,我们会改进内核并与开源社区分享这些变化。


Like other large distributed file systems such as AFS [5], GFS provides a location independent namespace which enables data to be moved transparently for load balance or fault tolerance. Unlike AFS, GFS spreads a file’s data across storage servers in a way more akin to xFS [1] and Swift [3] in order to deliver aggregate performance and increased fault tolerance.

与其他大型分布式文件系统(如 AFS [5] )一样, GFS 提供了一个位置独立的命名空间,可以实现数据的透明移动以实现负载平衡或容错。但是,与 AFS 不同的是, GFS 通过一种类似于 xFS [1] 和 Swift [3] 的方式将文件的数据分散到存储服务器上,以提供聚合性能和增加容错性。

As disks are relatively cheap and replication is simpler than more sophisticated RAID [9] approaches, GFS currently uses only replication for redundancy and so consumes more raw storage than xFS or Swift.

由于磁盘相对便宜,而复制比更复杂的 RAID [9] 方法更简单,因此 GFS 目前仅使用复制来实现冗余,因此消耗的原始存储空间比 xFS 或 Swift 更多。

In contrast to systems like AFS, xFS, Frangipani [12], and Intermezzo [6], GFS does not provide any caching below the file system interface. Our target workloads have little reuse within a single application run because they either stream through a large data set or randomly seek within it and read small amounts of data each time.

与 AFS 、 xFS 、 Frangipani 和 Intermezzo 等系统不同, GFS 在文件系统接口以下不提供任何缓存。我们的目标工作负载在单个应用程序运行期间很少重用,因为它们要么通过大型数据集进行流式传输,要么随机搜索其中并每次读取少量数据。

Some distributed file systems like Frangipani, xFS, Minnesota’s GFS[11] and GPFS [10] remove the centralized server and rely on distributed algorithms for consistency and management. We opt for the centralized approach in order to simplify the design, increase its reliability, and gain flexibility. In particular, a centralized master makes it much easier to implement sophisticated chunk placement and replication policies since the master already has most of the relevant information and controls how it changes.We address fault tolerance by keeping the master state small and fully replicated on other machines. Scalability and high availability (for reads) are currently provided by our shadow master mechanism. Updates to the master state are made persistent by appending to a write-ahead log. Therefore we could adapt a primary-copy scheme like the one in Harp [7] to provide high availability with stronger consistency guarantees than our current scheme.

一些分布式文件系统,比如 Frangipani、xFS、明尼苏达州的 GFS[11] 和 GPFS [10],取消了中心化的服务器,依靠分布式算法来保证一致性和管理。我们选择集中式的方法以简化设计、提高可靠性和增加灵活性。特别地,一个集中式的主服务器使得实现复杂的块放置和复制策略更加容易,因为主服务器已经拥有大部分相关信息并控制它如何变化。我们通过将主服务器状态保持小并完全复制到其他机器来解决容错问题。可扩展性和高可用性(对于读操作)目前通过我们的影子主服务器机制来提供。主服务器状态的更新通过追加到预写日志来实现持久性。因此,我们可以采用类似 Harp[7] 中的主副本方案来提供高可用性,并具有比我们当前方案更强的一致性保证。

We are addressing a problem similar to Lustre [8] in terms of delivering aggregate performance to a large number of clients. However, we have simplified the problem significantly by focusing on the needs of our applications rather than building a POSIX-compliant file system. Additionally, GFS assumes large number of unreliable components and so fault tolerance is central to our design.

我们在解决问题上类似于 Lustre[8],即向大量客户端提供聚合性能的问题。然而,我们通过关注我们应用程序的需求而不是构建符合 POSIX 的文件系统,显著简化了这个问题。此外,GFS 假设存在大量不可靠的组件,因此容错性是我们设计的核心。

GFS most closely resembles the NASD architecture [4]. While the NASD architecture is based on network-attached diskdrives, GFS uses commodity machines as chunkservers, as done in the NASD prototype. Unlike the NASD work, our chunkservers use lazily allocated fixed-size chunks rather than variable-length objects. Additionally, GFS implements features such as rebalancing, replication, and recovery that are required in a production environment.

GFS 最接近的是 NASD 架构[4]。虽然 NASD 架构基于网络附加磁盘驱动器,但 GFS 使用像 NASD 原型中一样的通用机器作为块服务器。与 NASD 的工作不同,我们的块服务器使用懒惰分配的固定大小块,而不是可变长度对象。此外,GFS 实现了在生产环境中所需的重新平衡、复制和恢复等功能。

Unlike Minnesota’s GFS and NASD, we do not seek to alter the model of the storage device. We focus on addressing day-to-day data processing needs for complicated distributed systems with existing commodity components.

与明尼苏达州的 GFS 和 NASD 不同,我们不试图改变存储设备的模型。我们专注于利用现有的通用组件,解决复杂分布式系统的日常数据处理需求。

The producer-consumer queues enabled by atomic record appends address a similar problem as the distributed queues in River [2]. While River uses memory-based queues distributed across machines and careful data flow control, GFS uses a persistent file that can be appended to concurrently by many producers. The River model supports m-to-n distributed queues but lacks the fault tolerance that comes with persistent storage, while GFS only supports m-to-1 queues efficiently. Multiple consumers can read the same file, but they must coordinate to partition the incoming load.

原子记录追加所实现的生产者-消费者队列解决了与 River[2] 中的分布式队列类似的问题。River 使用基于内存的分布式队列和精细的数据流控制,而 GFS 使用可以由多个生产者并发追加的持久性文件。River 模型支持 m 对 n 的分布式队列,但缺乏持久存储所带来的容错性,而 GFS 只有效地支持 m 对 1 的队列。多个消费者可以读取相同的文件,但必须协调来分配进入的负载。


The Google File System demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. While some design decisions are specific to our unique setting, many may apply to data processing tasks of a similar magnitude and cost consciousness. We started by reexamining traditional file system assumptions in light of our current and anticipated application workloads and technological environment.

Google File System 展示了支持通用硬件上大规模数据处理工作负载所必需的特性。虽然一些设计决策是针对我们独特的环境做出的,但许多决策可能适用于类似规模和成本意识的数据处理任务。我们从重新审视传统文件系统的假设开始,结合我们当前和预期的应用工作负载和技术环境。

Our observations have led to radically different points in the design space. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.


Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery. Chunk replication allows us to tolerate chunkserver failures. The frequency of these failures motivated a novel online repair mechanism that regularly and transparently repairs the damage and compensates for lost replicas as soon as possible. Additionally, we use checksumming to detect data corruption at the diskor IDE subsystem level, which becomes all too common given the number of disks in the system.

我们的系统通过不断监控、复制重要数据以及快速自动恢复来提供容错功能。块复制使我们能够容忍块服务器故障。这些故障的频率激发了一种新颖的在线修复机制,定期并透明地修复损坏,并尽快补偿丢失的副本。此外,我们使用校验和来检测磁盘或 IDE 子系统级别的数据损坏,这在系统中的磁盘数量很多时变得司空见惯。

Our design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. We achieve this by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. Master involvement in common operations is minimized by a large chunk size and by chunkleases, which delegates authority to primary replicas in data mutations. This makes possible a simple, centralized master that does not become a bottleneck. We believe that improvements in our networking stack will lift the current limitation on the write throughput seen by an individual client.


GFS has successfully met our storage needs and is widely used within Google as the storage platform for research and development as well as production data processing. It is an important tool that enables us to continue to innovate and attack problems on the scale of the entire web.

GFS 已成功满足我们的存储需求,并在 Google 内广泛应用作为研究开发和生产数据处理的存储平台。它是一个重要的工具,使我们能够在整个网络的规模上继续创新和解决问题。


We wish to thankthe following people for their contributions to the system or the paper. Brain Bershad (our shepherd) and the anonymous reviewers gave us valuable comments and suggestions. Anurag Acharya, Jeff Dean, and David desJardins contributed to the early design. Fay Chang worked on comparison of replicas across chunkservers. Guy Edjlali worked on storage quota. Markus Gutschke worked on a testing frameworkand security enhancements. David Kramer worked on performance enhancements. Fay Chang, Urs Hoelzle, Max Ibel, Sharon Perl, Rob Pike, and Debby Wallach commented on earlier drafts of the paper. Many of our colleagues at Google bravely trusted their data to a new file system and gave us useful feedback. Yoshka helped with early testing.

我们要感谢以下人员对系统或论文的贡献。 Brain Bershad(我们的牧羊人)和匿名审稿人给了我们宝贵的意见和建议。 Anurag Acharya、Jeff Dean 和 David desJardins 为早期设计做出了贡献。 Fay Chang 致力于跨 chunkservers 的副本比较。 Guy Edjlali 负责存储配额。 Markus Gutschke 致力于测试框架和安全增强。 David Kramer 致力于性能增强。 Fay Chang、Urs Hoelzle、Max Ibel、Sharon Perl、Rob Pike 和 Debby Wallach 对本文的早期草稿进行了评论。 我们在谷歌的许多同事勇敢地将他们的数据托付给了一个新的文件系统,并给了我们有用的反馈。 Yoshka 帮助进行了早期测试。


[1]. Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Randolph Wang. Serverless networkfile systems. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 109–126, Copper Mountain Resort, Colorado, December 1995.

[2]. Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the fast case common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS ’99), pages 10–22, Atlanta, Georgia, May 1999.

[3]. Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed diskstriping to provide high I/O data rates. Computer Systems, 4(4):405–436, 1991.

[4]. Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, ErikRiedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th Architectural Support for Programming Languages and Operating Systems, pages 92–103, San Jose, California, October 1998.

[5]. John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan, Robert Sidebotham, and Michael West. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51–81, February 1988.

[6]. InterMezzo. http://www.inter-mezzo.org, 2003.

[7]. Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, and Michael Williams. Replication in the Harp file system. In 13th Symposium on Operating System Principles, pages 226–238, Pacific Grove, CA, October 1991.

[8]. Lustre. http://www.lustreorg, 2003.

[9]. David A. Patterson, Garth A. Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, Chicago, Illinois, September 1988.

[10]. FrankSchmuckand Roger Haskin. GPFS: A shared-diskfile system for large computing clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231–244, Monterey, California, January 2002.

[11]. Steven R. Soltis, Thomas M. Ruwart, and Matthew T. O’Keefe. The Gobal File System. In Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, College Park, Maryland, September 1996.

[12]. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224–237, Saint-Malo, France, October 1997.

作者: bugwz
链接: https://bugwz.com/2023/01/10/gfs/
声明: 本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 咕咕