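How Rsync Is Implemented
Foreword
The original rsync documents, the Rsync technical report and Andrew Tridgell's PhD thesis (pdf), are both excellent descriptions of the principles behind the rsync algorithm. However, they concentrate on the algorithm itself and say comparatively little about how it is implemented.
This article analyses the implementation of the rsync utility on Linux/Unix and covers the following topics:
- a non-mathematical overview of the rsync algorithm;
- how the algorithm is implemented in the rsync utility;
- the protocol used by the rsync utility;
- the identifiable roles the rsync processes play.
The main purpose of this article is to give the reader a foundation from which to better understand:
- how rsync works;
- the limitations of rsync;
- why a requested feature is unsuited to the code base.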
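Processes and Roles: Common Terminology
When we talk about rsync, we use specific terms to refer to the roles and processes the rsync utility takes on at different stages of its work. The terms used in the rest of this article are:
Term | Type | Description |
client | role | The client initiates the synchronisation. |
server | role | The remote rsync process, or the system the client connects to through a remote shell or a socket. "Server" is a general term and should not be confused with the daemon. Once the connection from client to server is established, the client/server distinction is superseded by the sender/receiver roles. |
daemon | role and process | An rsync process that waits for connections from clients. On some platforms a daemon is also called a service. |
remote shell | role and set of processes | One or more processes that provide connectivity between the client and the remote server. |
sender | role and process | The rsync process that has access to the source files being synchronised. |
receiver | role and process | As a role: the destination system of the synchronisation. As a process: the process on the destination system that receives data and writes it to disk. |
generator | process | The generator process identifies changed files and manages the file-level logic. |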
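Process Startup
When an rsync client starts, it first establishes a connection with a server process, either through pipes or over the network.
What the client does next depends on how the remote connection is established.
When the remote end is a non-daemon server reached through a remote shell, the client forks the remote shell, which in turn starts an rsync server on the remote system. From then on the client and the server communicate over pipes through the remote shell. As far as the rsync processes themselves are concerned, no network is involved. In this mode the rsync options for the server process are passed on the command line used to start the remote shell.
When rsync talks to a daemon, it communicates directly over the network. In this mode the rsync options must be sent over the socket, as follows:
At the very start of the communication, the client and the server each send their version number to the other side, and the lower of the two is used for the transfer. If the server is running in daemon mode, the rsync options are sent from the client to the server. After that, the client sends the server the exclude list, that is, the list of files to be excluded.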
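The version handshake described above can be sketched in a few lines. This is only an illustration of the "use the lower version" rule; the function name and the version numbers are hypothetical and nothing here reproduces rsync's actual handshake code.

```python
def negotiate_protocol(local_version: int, remote_version: int) -> int:
    """Return the protocol level both sides agree on: the lower of the two."""
    return min(local_version, remote_version)


# Hypothetical version numbers, chosen only for illustration.
client_version = 31
server_version = 29

# Each side receives the other's number and applies the same rule,
# so both arrive at the same result without a further round trip.
agreed_on_client = negotiate_protocol(client_version, server_version)
agreed_on_server = negotiate_protocol(server_version, client_version)
print(agreed_on_client, agreed_on_server)
```

Because the rule is symmetric, neither side needs to confirm the choice to the other.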
Local Rsync jobs (when the source and destination are both on locally mounted filesystems) are done exactly like a push. The client, which becomes the sender, forks a server process to fulfill the receiver role. The client/sender and server/receiver communicate with each other over pipes.
The File List
The file list includes not only the pathnames but also ownership, mode, permissions, size and modtime. If the --checksum option has been specified it also includes the file checksums.
The first thing that happens once the startup has completed is that the sender will create the file list. While it is being built, each entry is transmitted to the receiving side in a network-optimised way.
When this is done, each side sorts the file list lexicographically by path relative to the base directory of the transfer. (The exact sorting algorithm varies depending on what protocol version is in effect for the transfer.) Once that has happened all references to files will be done by their index in the file list.
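A minimal sketch of why sorting matters: if both sides sort the same entries by the same rule, a bare index is enough to name a file. The `FileEntry` type and the plain lexicographic sort below are assumptions for illustration; as noted above, the real ordering depends on the protocol version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str   # path relative to the base directory of the transfer
    size: int
    mtime: int


def sort_file_list(entries):
    # Plain lexicographic order on the relative path (an assumption;
    # rsync's exact ordering varies with the protocol version).
    return sorted(entries, key=lambda entry: entry.path)


# The entries may arrive in any order on either side.
entries = [
    FileEntry("src/main.c", 1200, 1_300_000_000),
    FileEntry("README", 300, 1_300_000_100),
    FileEntry("src/Makefile", 450, 1_300_000_200),
]
sender_list = sort_file_list(entries)
receiver_list = sort_file_list(list(reversed(entries)))

# After sorting, index N refers to the same file on both sides.
for index, entry in enumerate(sender_list):
    assert receiver_list[index] == entry
```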
If necessary the sender follows the file list with id→name tables for users and groups, which the receiver will use to do an id→name→id translation for every file in the file list.
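The id→name→id translation can be pictured as two dictionary lookups. The tables and the function name below are hypothetical, and falling back to the sender's numeric id when no name matches is an assumption of this sketch.

```python
def translate_id(sender_id, sender_id_to_name, receiver_name_to_id):
    """Map a numeric id from the sender to the matching id on the receiver."""
    name = sender_id_to_name.get(sender_id)
    if name is None:
        # No name was sent for this id: keep the number (assumed fallback).
        return sender_id
    # The name is unknown locally: also keep the number (assumed fallback).
    return receiver_name_to_id.get(name, sender_id)


# Hypothetical tables: the same user names map to different numeric ids.
sender_users = {1000: "alice", 1001: "bob"}
receiver_users = {"alice": 501, "carol": 502}

print(translate_id(1000, sender_users, receiver_users))  # alice exists on both sides
print(translate_id(1001, sender_users, receiver_users))  # bob has no local account
```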
After the file list has been received by the receiver, it will fork to become the generator and receiver pair completing the pipeline.
The Pipeline
Rsync is heavily pipelined. This means that it is a set of processes that communicate in a (largely) unidirectional way. Once the file list has been shared the pipeline behaves like this:
generator → sender → receiver
The output of the generator is input for the sender and the output of the sender is input for the receiver. Each process runs independently and is delayed only when the pipelines stall or when waiting for disk I/O or CPU resources.
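The shape of this pipeline can be imitated with three workers joined by bounded queues. This is a sketch only: rsync uses separate processes joined by pipes or sockets, whereas the threads, queue sizes and payload strings here are stand-ins chosen for illustration.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def generator(file_list, to_sender):
    # Emits the index of every file that needs work.
    for index in range(len(file_list)):
        to_sender.put(index)
    to_sender.put(DONE)


def sender(file_list, from_generator, to_receiver):
    # Turns each requested index into (index, data) for the receiver.
    while True:
        index = from_generator.get()
        if index is DONE:
            to_receiver.put(DONE)
            return
        to_receiver.put((index, "data for " + file_list[index]))


def receiver(from_sender, results):
    while True:
        item = from_sender.get()
        if item is DONE:
            return
        results.append(item)


def run_pipeline(file_list):
    # Small bounded queues make a slow stage stall the stage feeding it.
    gen_to_send = queue.Queue(maxsize=2)
    send_to_recv = queue.Queue(maxsize=2)
    results = []
    workers = [
        threading.Thread(target=generator, args=(file_list, gen_to_send)),
        threading.Thread(target=sender, args=(file_list, gen_to_send, send_to_recv)),
        threading.Thread(target=receiver, args=(send_to_recv, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Each stage blocks only when its input queue is empty or its output queue is full, which mirrors the "delayed only when the pipelines stall" behaviour described above.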
The Generator
The generator process compares the file list with its local directory tree. Prior to beginning its primary function, if --delete has been specified, it will first identify local files not on the sender and delete them on the receiver.
The generator will then start walking the file list. Each file will be checked to see if it can be skipped. In the most common mode of operation files are not skipped if the modification time or size differs. If --checksum was specified a file-level checksum will be created and compared. Directories, device nodes and symlinks are not skipped. Missing directories will be created.
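The default skip test for regular files can be sketched as a comparison of size and modification time. The function name is hypothetical, and comparing whole seconds is a simplification made for this example.

```python
import os
import tempfile


def can_skip(local_path, expected_size, expected_mtime):
    """Return True when the local file matches the sender's size and mtime."""
    try:
        info = os.stat(local_path)
    except FileNotFoundError:
        return False  # a missing file always has to be transferred
    return info.st_size == expected_size and int(info.st_mtime) == expected_mtime


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    os.utime(path, (1_300_000_000, 1_300_000_000))

    unchanged = can_skip(path, 5, 1_300_000_000)
    size_differs = can_skip(path, 6, 1_300_000_000)
    missing = can_skip(os.path.join(workdir, "absent.txt"), 5, 1_300_000_000)
```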
If a file is not to be skipped, any existing version on the receiving side becomes the "basis file" for the transfer, and is used as a data source that will help to eliminate matching data from having to be sent by the sender. To effect this remote matching of data, block checksums are created for the basis file and sent to the sender immediately following the file's index number. An empty block checksum set is sent for new files and if --whole-file was specified.
The block size and, in later versions, the size of the block checksum are calculated on a per file basis according to the size of that file.
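As a rough sketch of what the generator produces, the code below splits a basis file into blocks and records a weak and a strong checksum for each. The weak checksum follows the two-sum form given in the rsync technical report. The square-root block-size heuristic, its floor of 700 bytes, and the use of MD5 as the strong checksum are assumptions made for this example.

```python
import hashlib
import math


def choose_block_size(file_size, floor=700):
    # Assumed heuristic: grow the block size with the square root of the
    # file size, never going below a fixed floor.
    return max(floor, math.isqrt(file_size))


def weak_checksum(block):
    # a: sum of the bytes; b: position-weighted sum; both modulo 2**16.
    length = len(block)
    a = sum(block) & 0xFFFF
    b = sum((length - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def block_signatures(basis, block_size):
    """Return one (weak, strong) pair per block of the basis data."""
    signatures = []
    for offset in range(0, len(basis), block_size):
        block = basis[offset:offset + block_size]
        signatures.append((weak_checksum(block), hashlib.md5(block).digest()))
    return signatures


# A new file has no basis, so its checksum set is empty.
print(block_signatures(b"", 4))
```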
The Sender
The sender process reads the file index numbers and associated block checksum sets one at a time from the generator. For each file id the generator sends, it will store the block checksums and build a hash index of them for rapid lookup.
Then the local file is read and a checksum is generated for the block beginning with the first byte of the local file. This block checksum is looked for in the set that was sent by the generator. If no match is found, the non-matching byte is appended to the non-matching data and the block starting at the next byte is compared. This is what is referred to as the "rolling checksum".
If a block checksum match is found, the block is considered a matching block. Any accumulated non-matching data is sent to the receiver, followed by the offset and length of the matching block in the receiver's file, and the block checksum generator is advanced to the first byte after the matching block.
Matching blocks can be identified in this way even if the blocks are reordered or at different offsets. This process is the very heart of the rsync algorithm.
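The matching loop can be sketched as follows. The rolling update uses the recurrence from the rsync technical report. The instruction tuples, the MD5 strong checksum and the choice to ignore a short final block are simplifications assumed for this example, not rsync's wire format.

```python
import hashlib


def weak_parts(block):
    length = len(block)
    a = sum(block) & 0xFFFF
    b = sum((length - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return a, b


def roll(a, b, out_byte, in_byte, length):
    # Slide the window one byte without re-reading the whole block.
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - length * out_byte + a) & 0xFFFF
    return a, b


def build_index(basis, block_size):
    index = {}
    for offset in range(0, len(basis) - block_size + 1, block_size):
        block = basis[offset:offset + block_size]
        strong = hashlib.md5(block).digest()
        index.setdefault(weak_parts(block), []).append((strong, offset))
    return index


def compute_delta(source, basis, block_size):
    """Return ("data", bytes) and ("copy", offset, length) instructions."""
    index = build_index(basis, block_size)
    instructions, literal = [], bytearray()
    pos, fresh_window = 0, True
    a = b = 0
    while pos + block_size <= len(source):
        window = source[pos:pos + block_size]
        if fresh_window:
            a, b = weak_parts(window)
            fresh_window = False
        match = None
        for strong, offset in index.get((a, b), ()):
            # The weak checksum only nominates candidates; confirm them.
            if hashlib.md5(window).digest() == strong:
                match = offset
                break
        if match is not None:
            if literal:
                instructions.append(("data", bytes(literal)))
                literal.clear()
            instructions.append(("copy", match, block_size))
            pos += block_size
            fresh_window = True
        else:
            literal.append(source[pos])
            if pos + block_size < len(source):
                a, b = roll(a, b, source[pos], source[pos + block_size], block_size)
            pos += 1
    literal.extend(source[pos:])
    if literal:
        instructions.append(("data", bytes(literal)))
    return instructions


# The two blocks of the basis appear in the source swapped and shifted.
delta = compute_delta(b"XXefghabcd", b"abcdefgh", 4)
```

The dictionary keyed on the weak checksum plays the part of the hash index: the expensive strong checksum is computed only for windows whose weak checksum already matches some block.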
In this way, the sender will give the receiver instructions for how to reconstruct the source file into a new destination file. These instructions detail all the matching data that can be copied from the basis file (if one exists for the transfer), and include any raw data that was not available locally. At the end of each file's processing a whole-file checksum is sent and the sender proceeds with the next file.
Generating the rolling checksums and searching for matches in the checksum set sent by the generator require a good deal of CPU power. Of all the rsync processes it is the sender that is the most CPU intensive.
The Receiver
The receiver will read from the sender data for each file identified by the file index number. It will open the local file (called the basis) and will create a temporary file.
The receiver will expect to read non-matched data and/or to match records all in sequence for the final file contents. When non-matched data is read it will be written to the temp-file. When a block match record is received the receiver will seek to the block offset in the basis file and copy the block to the temp-file. In this way the temp-file is built from beginning to end.
The file's checksum is generated as the temp-file is built. At the end of the file, this checksum is compared with the file checksum from the sender. If the file checksums do not match the temp-file is deleted. If the file fails once it will be reprocessed in a second phase, and if it fails twice an error is reported.
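A minimal sketch of the receiver's rebuild step, assuming a simplified instruction format of ("data", bytes) and ("copy", offset, length) tuples and MD5 as the whole-file checksum. An in-memory buffer stands in for the temp-file.

```python
import hashlib
import io


def rebuild(basis, instructions, expected_digest):
    """Rebuild a file from instructions; return None if the checksum fails."""
    temp = io.BytesIO()       # stands in for the temp-file
    running = hashlib.md5()   # updated as the output is written
    for instruction in instructions:
        if instruction[0] == "data":
            chunk = instruction[1]
        else:
            _, offset, length = instruction
            chunk = basis[offset:offset + length]  # seek and copy from the basis
        temp.write(chunk)
        running.update(chunk)
    if running.digest() != expected_digest:
        return None  # the caller would discard the temp-file and retry
    return temp.getvalue()


basis = b"abcdefgh"
instructions = [("data", b"XX"), ("copy", 4, 4), ("copy", 0, 4)]
good_digest = hashlib.md5(b"XXefghabcd").digest()

rebuilt = rebuild(basis, instructions, good_digest)
rejected = rebuild(basis, instructions, hashlib.md5(b"something else").digest())
```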
After the temp-file has been completed, its ownership and permissions and modification time are set. It is then renamed to replace the basis file.
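The "write to a temp-file, set metadata, then rename" sequence can be sketched with the standard library. The function name is hypothetical, and setting ownership is left out because it normally needs elevated privileges.

```python
import os
import tempfile


def install_file(dest_path, contents, mode, mtime):
    """Write contents next to dest_path, set metadata, then rename into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Creating the temp-file in the destination directory keeps the final
    # rename on a single filesystem.
    fd, temp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as temp:
            temp.write(contents)
        os.chmod(temp_path, mode)
        os.utime(temp_path, (mtime, mtime))
        os.replace(temp_path, dest_path)  # replaces the old file in one step
    except BaseException:
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

Readers of the destination see either the old file or the complete new one, never a partially written file.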
Copying data from the basis file to the temp-file makes the receiver the most disk intensive of all the rsync processes. Small files may still be in disk cache, mitigating this, but for large files the cache may thrash as the generator has moved on to other files and there is further latency caused by the sender. As data is read, possibly at random, from one file and written to another, a seek storm can occur if the working set is larger than the disk cache, further hurting performance.
The Daemon
The daemon process, like many daemons, forks for every connection. On startup, it parses the rsyncd.conf file to determine what modules exist and to set the global options.
When a connection is received for a defined module the daemon forks a new child process to handle the connection. That child process then reads the rsyncd.conf file to set the options for the requested module, which may chroot to the module path and may drop setuid and setgid for the process. After that it will behave just like any other rsync server process adopting either a sender or receiver role.
The Rsync Protocol
A well-designed communications protocol has a number of characteristics.
- Everything is sent in well defined packets with a header and an optional body or data payload.
- In each packet's header a type and/or command is specified.
- Each packet has a definite length.
In addition to these characteristics, protocols have varying degrees of statefulness, inter-packet independence, human readability, and the ability to reestablish a disconnected session.
Rsync's protocol has none of these good characteristics. The data is transferred as an unbroken stream of bytes. With the exception of the unmatched file-data, there are no length specifiers nor counts. Instead the meaning of each byte is dependent on its context as defined by the protocol level.
As an example, when the sender is sending the file list it simply sends each file list entry and terminates the list with a null byte. Within the file list entries, a bitfield indicates which fields of the structure to expect and those that are variable length strings are simply null terminated. The generator sending file numbers and block checksum sets works the same way.
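To make the idea of a context-dependent byte stream concrete, here is a toy encoding in the same spirit: a flags byte per entry, null-terminated strings, optional fields selected by the flags, and a zero byte ending the list. This is an invented format for illustration only; the flag values and field widths are hypothetical and do not correspond to rsync's actual wire format.

```python
import struct

# Hypothetical flag bits. ENTRY is always set so a flags byte is never zero.
ENTRY, HAS_SIZE, HAS_MTIME = 0x80, 0x01, 0x02


def encode_file_list(entries):
    stream = bytearray()
    for entry in entries:
        flags = ENTRY
        if "size" in entry:
            flags |= HAS_SIZE
        if "mtime" in entry:
            flags |= HAS_MTIME
        stream.append(flags)
        stream += entry["path"].encode("utf-8") + b"\0"  # null-terminated string
        if flags & HAS_SIZE:
            stream += struct.pack("<I", entry["size"])
        if flags & HAS_MTIME:
            stream += struct.pack("<I", entry["mtime"])
    stream.append(0)  # a zero byte terminates the list
    return bytes(stream)


def decode_file_list(stream):
    entries, pos = [], 0
    while True:
        flags = stream[pos]
        pos += 1
        if flags == 0:
            return entries
        end = stream.index(b"\0", pos)
        entry = {"path": stream[pos:end].decode("utf-8")}
        pos = end + 1
        # No length prefixes: the flags alone say which fields follow.
        if flags & HAS_SIZE:
            (entry["size"],) = struct.unpack_from("<I", stream, pos)
            pos += 4
        if flags & HAS_MTIME:
            (entry["mtime"],) = struct.unpack_from("<I", stream, pos)
            pos += 4
        entries.append(entry)


sample = [{"path": "README", "size": 300}, {"path": "src", "mtime": 1_300_000_000}]
wire = encode_file_list(sample)
```

A reader that loses its place in such a stream cannot resynchronise, which illustrates why this style is compact but hard to debug or extend.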
This method of communication works quite well on reliable connections and it certainly has less data overhead than the formal protocols. It unfortunately makes the protocol extremely difficult to document, debug or extend. Each version of the protocol will have subtle differences on the wire that can only be anticipated by knowing the exact protocol version.
Notes
This document is a work in progress. The author expects that it has some glaring oversights and some portions that may be more confusing than enlightening for some readers. It is hoped that this could evolve into a useful reference.
Specific suggestions for improvement are welcome, as would be a complete rewrite.
Sync Algorithm: RSync vs. RDC
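Note:
The first half of this article is a translation; the original is available from the official rsync website. For lack of time the translation is unfinished, and the translated parts do not always convey the original meaning well. I will revise them when I have time. The second half is reposted from another user's article; the original is here.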