Over the years I've been (mildly) fascinated by how various version control tools and file backup utilities work. Especially the core algorithm that drives many of these file send/diff/backup/de-duplication programs.
Rsync being the most widely used tools and the basis for many extensions, I naturally tried to wrap my head around it's working. But I thought the details were somewhat hazy. Maybe it was just me but I was looking for a simpler, clearer implementation of the algorithm and not a fully functioning program.
Recently, I gave it another shot. I waded through some of the material available on the interwebs and bravely set out to implement it to see how much of I had understood.
So, here is the basic implementation in Java. It may not be a faithful implementation of the paper but the gist is:
- Create a summary out of fixed blocks of input text (original)
- Use these blocks as reference against another text (modified)
- This modified text is slightly different from the original text
- Hence the assumption that the original text can be transformed to the modified text without having to send the entire modified text back
- The modified text can now be transformed into a combination of:
- References to those original blocks where there were no changes
- And any differences as simple text
Some notes on the implementation:
- It only handles Java Strings
- It uses a combination of Rabin-Karp rolling hash for quick, incremental hashing of blocks and CRC32 for hash conflict resolution. In reality a much more robust hash should be used instead of CRC32
- It assumes that the list of generated "blocks" is available on the other side to generate the patch. In reality there has to be a more clearly defined mechanism/protocol to exchange these blocks
- The overall algorithm to identify common/repeating hashes should be smarted than this
- http://en.wikipedia.org/wiki/Rsync#Algorithm
- https://github.com/basak/ddar/blob/7bee594a62a2b8fe43b1570cbe62438fb732a9fc/scan.c
- https://github.com/MatthewSteeples/rsync.net/blob/master/Generator.cs
- https://github.com/JohannesBuchner/Jarsync/tree/master/jarsync
- http://tutorials.jenkov.com/rsync/checksums.html
- http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/
- http://stackoverflow.com/questions/2557927/how-does-the-rsync-algorithm-correctly-identify-repeating-blocks
- http://en.wikipedia.org/wiki/Rzip
- http://moinakg.wordpress.com/2012/11/15/inside-content-defined-chunking-in-pcompress-part-2/
- http://www.infoarena.ro/blog/rolling-hash
- http://code.activestate.com/recipes/577518-rsync-algorithm/
- https://code.google.com/p/ngramhashing/
- http://distributeddatastore.blogspot.com/2013/07/cassandra-using-merkle-trees-to-detect.html
Until next time!
0 comments:
Post a Comment