Apr 26

Cross-Datacenter File Replication

Asides, Linux

Anyone have any favorite tricks for geographically diverse real-time file replication on Linux? It seems like most information is pretty dispersed, and suggestions range from every-30-seconds rsync to putting all files as BLOBs in MySQL and replicating that. There has to be a better way. (The scariest part is Microsoft seems to show up first for most Googles I can think of, but Windows is not an option.)


23 Comments

  • Daniel Westermann-Clark April 26, 2006 @ 11:17 am

    I haven’t had time to play with it myself, but MogileFS claims automatic file replication.

  • Matt April 26, 2006 @ 11:20 am

    I think Mogile solves a different problem, which is how to efficiently store static files in one datacenter. We basically want a synced copy of files in two datacenters. It looks like if we had a Mogile node in each, we’d have to replicate the tracker, and some file calls would go to the other DC, which would add a lot of latency. We want all files to be local for the web nodes in each place.

  • Jeremy Zawodny April 26, 2006 @ 11:27 am

    I believe some folks have built a system on top of spread: http://www.spread.org/

  • michael April 26, 2006 @ 11:46 am

    Have you looked at csync2? It is asynchronous like rsync, but it may be better suited to what you are trying to accomplish.

  • westi April 26, 2006 @ 12:21 pm

    I’ve found unison useful as an rsync-like two-way replicator to ensure that changes at both ends are propagated correctly.

  • Ben Williams April 26, 2006 @ 12:24 pm

    I haven’t used it but there is tsync:
    http://tsyncd.sourceforge.net/

  • ryan king April 26, 2006 @ 1:44 pm

    You could try rdist, but you may have to do a bit of work to get it integrated.

  • Alt April 26, 2006 @ 2:14 pm

    You can try OpenAFS or CodaFS. (http://www.openafs.org/ or http://www.coda.cs.cmu.edu/) A while back, I was looking for a similar solution and tested out coda briefly. The setup for both is pretty complex. I have since reverted to using rsync :-/
    If you are able to spend $$$, Veritas has some cool replication solutions that I’ve set up and used at previous employers of mine.

  • Eric Pierce April 26, 2006 @ 2:21 pm

    If you have a fast link to the other datacenter, you can use DRBD. I haven’t used it yet, but I’ve been looking at it as part of a DR plan for our operations.

  • Matt April 26, 2006 @ 3:20 pm

    The link is just a normal net connection, ping times of about 35ms and 12 hops between them.
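To put the 35 ms figure in context, here is a rough back-of-the-envelope sketch. It assumes fully synchronous replication, where each write must be acknowledged by the remote site before the next one starts, and it treats the quoted ping time as the round-trip time. Both are assumptions for illustration, not measurements of DRBD or any other tool mentioned here.

```python
# Assumption: every write waits for one full round trip to the remote
# datacenter before the next write can begin (serialized, synchronous).
RTT_SECONDS = 0.035  # ~35 ms ping time quoted in the thread


def max_serialized_writes_per_second(rtt_seconds: float) -> float:
    """Upper bound on back-to-back synchronous writes over one link."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    return 1.0 / rtt_seconds


def seconds_for_writes(count: int, rtt_seconds: float) -> float:
    """Lower bound on the time needed to complete `count` serialized writes."""
    return count * rtt_seconds


if __name__ == "__main__":
    print(f"{max_serialized_writes_per_second(RTT_SECONDS):.1f} writes/s")  # 28.6 writes/s
    print(f"{seconds_for_writes(1000, RTT_SECONDS):.0f} s for 1000 writes")  # 35 s for 1000 writes
```

Under these assumptions a single writer tops out below 30 acknowledged writes per second, which is why asynchronous approaches come up so often for links between distant datacenters.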

  • W. Andrew Loe III April 26, 2006 @ 3:21 pm

    I think the simplest solution would be to use rsync. Are you copying from one “live” server to a backup, or do they both get updated independently and want both sets of data available on each?

  • BillSaysThis April 26, 2006 @ 4:56 pm

    You should talk with Jason Hoffman, CTO of your and WordPress’s hosting company*. I saw his Scale With Rails presentation this weekend and he showed how Solaris ZFS with copy on write iSCSI storage is much better and, overall, cheaper than replication at the file or database level.

    * Of course you know who he is but for the benefit of others…

  • Elliott Back April 26, 2006 @ 6:50 pm

    You should take a look at GFS, the Google Filesystem they built, which seems to do exactly what you want.

  • Brendan April 26, 2006 @ 7:02 pm

    It’s true – rsync kicks butt, however its limitation would be that you would have to ‘mirror’ each resource separately, which would result in a heck of a lot of code duplication to get it happening.

    Rails is also cool; however, unless one is using Rails for content delivery, it’s applying a fairly large amount of resources to what is essentially a relatively simple concept – file replication.

    The following looks to be a simple approach to file and database replication.

  • Matt April 27, 2006 @ 12:11 am

    For small amounts of data we use a cron scheduled rsync. We’re replicating data from a Linux box in the midwest to one on the west coast. It’s been perfectly reliable thus far.
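A minimal sketch of what a cron-driven rsync job could look like, written as a small Python wrapper. The paths, host name, and script name are hypothetical, and the choice of flags (archive mode, compression, deletion of stale files, SSH transport) is an assumption about what a one-way mirror would want, not a description of the setup in the comment above.

```python
import subprocess


def build_rsync_command(source_dir: str, remote_target: str) -> list:
    """Return an rsync argument list that mirrors source_dir to remote_target."""
    # A trailing slash makes rsync copy the directory's contents rather
    # than creating the directory itself under the target.
    source = source_dir.rstrip("/") + "/"
    return [
        "rsync",
        "-a",         # archive mode: recurse, keep permissions and times
        "-z",         # compress data in transit
        "--delete",   # remove remote files that no longer exist locally
        "-e", "ssh",  # use SSH as the transport
        source,
        remote_target,
    ]


def run_sync(source_dir: str, remote_target: str, runner=subprocess.run) -> int:
    """Run one sync pass and return rsync's exit code (0 means success)."""
    result = runner(build_rsync_command(source_dir, remote_target))
    return result.returncode


# Hypothetical usage from a script that cron runs every five minutes:
#   */5 * * * * /usr/bin/python3 /usr/local/bin/sync_uploads.py
# where sync_uploads.py calls:
#   run_sync("/var/www/uploads", "backup@west.example.com:/var/www/uploads/")
```

Note that `--delete` makes the target an exact mirror, so it only suits the one-way case; if both sides accept changes, a two-way tool is the safer choice.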

  • JP April 27, 2006 @ 10:37 am

    Lots of options, but you didn’t provide much detail on file sizes, pipe sizes, protocol/firewall constraints, or what you define as “real-time”. I did see a comment that they are “normal” net connections (and to me a normal net connection is an OC148 – but that’s not normal to most).
    iFolder3 is an open source project that provides “real-time” sync of data, encryption, and delta sync. http://www.ifolder.com/index.php/Home
    Rsync
    SyncML (more protocol/language than actual product)

    There are many others that claim to be file sharing, but lack the features that you may need. There are a lot of commercial/closed source packages that have the file replication built into them, but you would have to purchase the whole enchilada to get it (portlock.com)

  • Scott Johnson of Ookles April 28, 2006 @ 6:17 am

    I cannot, in any way, recommend anything related to Spread. Once upon a time I tried to use mod_log_spread and even though we retained the author / brother of the author to make it work we found that, well, we’d have accomplished more with our time by taking several thousand dollars in singles, lighting them on fire and dancing naked in the moonlight.

    I personally think that Spread and anything related to it should be covered with camel dung and then placed on a rocket and shot directly into the sun. Sorry Theo.

    Honestly this problem is really, really hard. The big question for me is how much data and how many files. Give us a grasp of scale and then people can think more on it. Or feel free to give me a call Matt; you’ve got the number or see /contact/. Always happy to help out WordPress.

  • jason hoffman April 28, 2006 @ 5:04 pm

    The problem is a piece of cake.

  • Matt April 28, 2006 @ 8:10 pm

    Haha, Jason do tell!

  • Maarten Sander April 29, 2006 @ 6:26 am

    I have to agree with Jason, but I wish I could say the same about the solution!

  • Andrew July 21, 2007 @ 8:36 pm

    Well… I stumbled on this thread recently because I was looking for a similar solution.
    As far as open source, real-time file synchronisation across multiple hosts goes, where changes to files can happen on any one of them, tsyncd looks like the best!!
    The first time I had come across this program was in this thread when Ben Williams mentioned it… Thought I’d have a look and it turns out it works a treat!
    It took a little playing around and tweaking here and there (it’s still in beta), but I’m pretty confident with it and currently use it for the following 2 solutions:

    1. Online backup of remote client server systems (6 clients now replicate their /home and /shares (samba shares) to my file servers).

    2. Realtime content replication of 30+ websites across 3 datacentres (was originally a twice daily batch process). This allows a user to update the closest file server to them and the changes are automatically replicated to the other datacentres instantaneously so the content is the same no matter which web server visitors look at (no more 12 hour delay)!!!

    Thanks Ben for your suggestion!! This is a gem!!
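The move from a twice-daily batch to change-driven replication can be illustrated with a small polling sketch: take a snapshot of file metadata, compare it with the previous one, and push only what differs. This is not how tsyncd works internally; the function names are hypothetical, and a production setup on Linux would more likely react to kernel change notifications (inotify) than poll.

```python
import os


def snapshot(root: str) -> dict:
    """Map each file's path (relative to root) to its (mtime_ns, size)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            state[os.path.relpath(full_path, root)] = (info.st_mtime_ns, info.st_size)
    return state


def diff_snapshots(old: dict, new: dict) -> tuple:
    """Return (changed_or_added, removed) as sorted lists of relative paths."""
    changed = sorted(path for path, meta in new.items() if old.get(path) != meta)
    removed = sorted(path for path in old if path not in new)
    return changed, removed
```

A replication loop would call `snapshot` periodically, pass the previous and current results to `diff_snapshots`, and hand the two lists to whatever transfer mechanism is in use. Comparing only modification time and size is cheap but can miss edits that preserve both, so checksums are the usual fallback when that matters.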
