Translation - The rsync algorithm

"The rsync algorithm", a paper published in 1996, introduces an incremental synchronization algorithm, also called rsync, for quickly bringing two copies of a file tree into sync. The algorithm exploits the similarity between the copies: it computes a weak rolling checksum and a strong per-block checksum to find the regions the two files already share, and then transfers only the differences. It is efficient, reliable, and robust, and is widely used in practice.

0. Abstract

This report presents an algorithm for updating a file on one machine to be identical to a file on another machine. We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link. The algorithm identifies parts of the source file which are identical to some part of the destination file, and only sends those parts which cannot be matched in this way. Effectively, the algorithm computes a set of differences without having both files on the same machine. The algorithm works best when the files are similar, but will also function correctly and reasonably efficiently when the files are quite different.

1. The problem

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn’t need the differences). This is the problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file match some part of an existing destination file. These parts need not be sent across the link; all that is needed is a reference to the part of the destination file. Only parts of the source file which are not matched in this way need to be sent verbatim. The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any of a range of common compression algorithms, for further speed improvements.

2. The rsync algorithm

Suppose we have two general purpose computers a and b. Computer a has access to a file A and b has access to file B, where A and B are “similar”. There is a slow communications link between a and b.

The rsync algorithm consists of the following steps:

  1. b splits the file B into a series of non-overlapping fixed-sized blocks of size S bytes [1] . The last block may be shorter than S bytes.
  2. For each of these blocks b calculates two checksums: a weak “rolling” 32-bit checksum (described below) and a strong 128-bit MD4 checksum.
  3. b sends these checksums to a.
  4. a searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of B. This can be done in a single pass very quickly using a special property of the rolling checksum described below.
  5. a sends b a sequence of instructions for constructing a copy of A. Each instruction is either a reference to a block of B, or literal data. Literal data is sent only for those sections of A which did not match any of the blocks of B.


The end result is that b gets a copy of A, but only the pieces of A that are not found in B (plus a small amount of data for checksums and block indexes) are sent over the link. The algorithm also only requires one round trip, which minimises the impact of the link latency.

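Correspondingly, a minimal sketch of how b might apply the instruction stream of step 5, assuming it has been decoded into ("match", block_index) and ("literal", bytes) tuples (a hypothetical representation, not the actual wire format):

```python
def reconstruct(instructions, b_data: bytes, block_size: int = 700) -> bytes:
    """Rebuild a copy of A from references into the local file B plus
    the literal data that had no match."""
    out = bytearray()
    for kind, arg in instructions:
        if kind == "match":                    # reference to a block of B
            start = arg * block_size
            out += b_data[start:start + block_size]
        else:                                  # "literal": verbatim data from a
            out += arg
    return bytes(out)
```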

The most important details of the algorithm are the rolling checksum and the associated multi-alternate search mechanism which allows the all-offsets checksum search to proceed very quickly. These will be discussed in greater detail below.

3. Rolling checksum

The weak rolling checksum used in the rsync algorithm needs to have the property that it is very cheap to calculate the checksum of a buffer X(2)..X(n+1) given the checksum of buffer X(1)..X(n) and the values of the bytes X(1) and X(n+1).

The weak checksum algorithm we used in our implementation was inspired by Mark Adler’s adler-32 checksum. Our checksum is defined by:

$$
\begin{aligned}
a(k,l) &= \Big(\sum_{i=k}^{l} X_i\Big) \bmod M \\
b(k,l) &= \Big(\sum_{i=k}^{l} (l-i+1)\,X_i\Big) \bmod M \\
s(k,l) &= a(k,l) + 2^{16}\,b(k,l)
\end{aligned}
$$

where s(k, l) is the rolling checksum of the bytes X(k)..X(l). For simplicity and speed, we use M = 2^16 .

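Written out in code (a sketch using 0-based indices rather than the paper's 1-based X(k)..X(l)):

```python
M = 1 << 16  # M = 2^16, as chosen for simplicity and speed

def weak_checksum(block: bytes) -> int:
    """Compute s(k, l) = a(k, l) + 2^16 * b(k, l) over a whole buffer."""
    a = b = 0
    n = len(block)
    for i, x in enumerate(block):
        a = (a + x) % M            # a(k, l): plain sum of the bytes
        b = (b + (n - i) * x) % M  # b(k, l): sum weighted by position
    return a + (b << 16)
```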

The important property of this checksum is that successive values can be computed very efficiently using the recurrence relation

$$
\begin{aligned}
a(k+1,l+1) &= \big(a(k,l) - X_k + X_{l+1}\big) \bmod M \\
b(k+1,l+1) &= \big(b(k,l) - (l-k+1)\,X_k + a(k+1,l+1)\big) \bmod M
\end{aligned}
$$

Thus the checksum can be calculated for blocks of length S at all possible offsets within a file in a “rolling” fashion, with very little computation at each point.

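A sketch of the recurrence in code: given a and b for the window X(k)..X(l), sliding the window one byte forward touches only the outgoing and incoming bytes. The sanity check at the end confirms the rolled values agree with direct recomputation.

```python
M = 1 << 16

def roll(a: int, b: int, out_byte: int, in_byte: int, length: int):
    """Advance from X(k)..X(l) to X(k+1)..X(l+1); length is l - k + 1."""
    a = (a - out_byte + in_byte) % M
    b = (b - length * out_byte + a) % M
    return a, b

if __name__ == "__main__":
    data, S = b"rolling checksums roll", 8
    # direct sums for the first window
    a = sum(data[:S]) % M
    b = sum((S - i) * x for i, x in enumerate(data[:S])) % M
    for k in range(len(data) - S):
        a, b = roll(a, b, data[k], data[k + S], S)
        assert a == sum(data[k + 1:k + 1 + S]) % M  # matches recomputation
```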

Despite its simplicity, this checksum was found to be quite adequate as a first level check for a match of two file blocks. We have found in practice that the probability of this checksum matching when the blocks are not equal is quite low. This is important because the much more expensive strong checksum must be calculated for each block where the weak checksum matches.

4. Checksum searching

Once a has received the list of checksums of the blocks of B, it must search A for any blocks at any offset that match the checksum of some block of B. The basic strategy is to compute the 32-bit rolling checksum for a block of length S starting at each byte of A in turn, and for each checksum, search the list for a match. To do this our implementation uses a simple 3 level searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and a 2^16 entry hash table. The list of checksum values (i.e., the checksums from the blocks of B) is sorted according to the 16-bit hash of the 32-bit rolling checksum. Each entry in the hash table points to the first element of the list for that hash value, or contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit hash are calculated. If the hash table entry for that hash value is not a null value, the second level check is invoked.

The second level check involves scanning the sorted checksum list starting with the entry pointed to by the hash table entry, looking for an entry whose 32-bit rolling checksum matches the current value. The scan terminates when it reaches an entry whose 16-bit hash differs. If this search finds a match, the third level check is invoked.

The third level check involves calculating the strong checksum for the current offset in the file and comparing it with the strong checksum value in the current list entry. If the two strong checksums match, we assume that we have found a block of A which matches a block of B. In fact the blocks could be different, but the probability of this is microscopic, and in practice this is a reasonable assumption.

When a match is found, a sends b the data in A between the current file offset and the end of the previous match, followed by the index of the block in B that matched. This data is sent immediately a match is found, which allows us to overlap the communication with further computation.

If no match is found at a given offset in the file, the rolling checksum is updated to the next offset and the search proceeds. If a match is found, the search is restarted at the end of the matched block. This strategy saves a considerable amount of computation for the common case where the two files are nearly identical. In addition, it would be a simple matter to encode the block indexes as runs, for the common case where a portion of A matches a series of blocks of B in order.

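Putting sections 3 and 4 together, here is a compressed sketch of the search loop on a's side, reusing weak_checksum from the section 3 sketch and MD5 again standing in for MD4. hash16 is a hypothetical 16-bit hash, and the instruction tuples match the reconstruct sketch above, not rsync's actual wire format; the real implementation also rolls the weak checksum forward instead of recomputing it at every offset.

```python
import hashlib
from collections import defaultdict

def hash16(weak: int) -> int:
    """Level 1 key: a 16-bit hash of the 32-bit rolling checksum."""
    return (weak ^ (weak >> 16)) & 0xFFFF

def build_table(checksums):
    """Bucket b's (weak, strong) pairs by 16-bit hash; this stands in
    for the paper's sorted list plus 2^16-entry pointer table."""
    table = defaultdict(list)
    for index, (weak, strong) in enumerate(checksums):
        table[hash16(weak)].append((weak, strong, index))
    return table

def delta(data: bytes, table, S: int):
    """Scan A at every offset, emitting ("match", index) for blocks
    found in B and ("literal", bytes) for everything in between."""
    out, lit_start, k = [], 0, 0
    while k + S <= len(data):
        block = data[k:k + S]
        weak = weak_checksum(block)      # rolled, not recomputed, in rsync
        match = None
        for w, strong, idx in table.get(hash16(weak), ()):           # level 2
            if w == weak and hashlib.md5(block).digest() == strong:  # level 3
                match = idx
                break
        if match is not None:
            if lit_start < k:
                out.append(("literal", data[lit_start:k]))
            out.append(("match", match))
            k += S                 # restart search at end of matched block
            lit_start = k
        else:
            k += 1                 # no match: roll on to the next offset
    if lit_start < len(data):
        out.append(("literal", data[lit_start:]))  # trailing unmatched data
    return out
```

Note that the expensive strong checksum runs only on the rare offsets where both the 16-bit hash and the 32-bit weak checksum already agree, which is exactly the property the results in section 6 quantify.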

5. Pipelining

The above sections describe the process for constructing a copy of one file on a remote system. If we have several files to copy, we can gain a considerable latency advantage by pipelining the process.

This involves b initiating two independent processes. One of the processes generates and sends the checksums to a while the other receives the difference information from a and reconstructs the files.

If the communications link is buffered then these two processes can proceed independently and the link should be kept fully utilised in both directions for most of the time.

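As a rough sketch of that arrangement, assuming generate_checksums and reconstruct from the earlier sketches, with threads standing in for the paper's two independent processes and a pair of queues standing in for the buffered duplex link:

```python
import threading
import queue

def run_b(files, to_a: queue.Queue, from_a: queue.Queue, results: dict):
    """b's side of the pipeline: one thread streams checksum lists for
    every file while the other applies difference information from a as
    it arrives, so the link stays busy in both directions."""
    originals = dict(files)

    def send_checksums():
        for name, data in files:
            to_a.put((name, generate_checksums(data)))

    def apply_deltas():
        for _ in files:
            name, instructions = from_a.get()
            results[name] = reconstruct(instructions, originals[name])

    send = threading.Thread(target=send_checksums)
    recv = threading.Thread(target=apply_deltas)
    send.start(); recv.start()
    send.join(); recv.join()
```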

6. Results

To test the algorithm, tar files were created of the Linux kernel sources for two versions of the kernel. The two kernel versions were 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in the 2.0.0 release, 19 files had been removed and 25 files had been added.

A “diff” of the two tar files using the standard GNU diff utility produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files with a varying block size. [2]

In each case, the CPU time taken was less than the time it takes to run “diff” on the two files. [3]

The columns in the table are as follows:

  • block size : The size in bytes of the checksummed blocks.
  • matches : The number of times a block of B was found in A.
  • tag hits : The number of times the 16 bit hash of the rolling checksum matched a hash of one of the checksums from B.
  • false alarms : The number of times the 32 bit rolling checksum matched but the strong checksum didn’t.
  • data : The amount of file data transferred verbatim, in bytes.
  • written : The total number of bytes written by a including protocol overheads. This is almost all file data.
  • read : The total number of bytes read by a including protocol overheads. This is almost all checksum information.

The results demonstrate that for block sizes above 300 bytes, only a small fraction (around 5%) of the file was transferred. The amount transferred was also considerably less than the size of the diff file that would have been transferred if the diff/patch method of updating a remote file was used.

The checksums themselves took up a considerable amount of space, although much less than the size of the data transferred in each case. Each pair of checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes for the 128-bit MD4 checksum.

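As a rough worked example with the figures above (24 MB files, 20 bytes per block pair), at the 300-byte block size mentioned earlier:

$$
\frac{24\,\mathrm{MB}}{300\ \mathrm{bytes/block}} \approx 80{,}000\ \mathrm{blocks},
\qquad
80{,}000 \times 20\ \mathrm{bytes} \approx 1.6\,\mathrm{MB}\ \text{of checksum data},
$$

and doubling the block size halves that overhead.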

The number of false alarms was less than 1/1000 of the number of true matches, indicating that the 32 bit rolling checksum is quite good at screening out false matches.

The number of tag hits indicates that the second level of the checksum search algorithm was invoked about once every 50 characters. This is quite high because the total number of blocks in the file is a large fraction of the size of the tag hash table. For smaller files we would expect the tag hit rate to be much closer to the number of matches. For extremely large files, we should probably increase the size of the hash table.

The next table shows similar results for a much smaller set of files. In this case the files were not packed into a tar file first. Rather, rsync was invoked with an option to recursively descend the directory tree. The files used were from two source releases of another software package called Samba. The total source code size is 1.7 MB and the diff between the two releases is 4155 lines long totalling 120 kB.

7. Availability

An implementation of rsync which provides a convenient interface similar to the common UNIX command rcp has been written and is available for download from ftp://samba.anu.edu.au/pub/rsync.

Author: bugwz
Link: https://bugwz.com/2019/10/20/the-rsync-algorithm/
License: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit 咕咕 when reprinting.