Cross-Datacenter File Replication

Anyone have any favorite tricks for geographically diverse, real-time file replication on Linux? Most of the information out there is pretty dispersed, and suggestions range from an rsync every 30 seconds to putting all files as BLOBs in MySQL and replicating that. There has to be a better way. (The scariest part is that Microsoft seems to show up first for most Googles I can think of, but Windows is not an option.)

23 thoughts on “Cross-Datacenter File Replication”

  1. I think Mogile solves a different problem, which is how to efficiently store static files within one datacenter. We basically want a synced copy of files in two datacenters. It looks like if we had a Mogile node in each, we’d have to replicate the tracker, and some file calls would go to the other DC, which would add a lot of latency. We want all files to be local to the web nodes in each place.

  2. You can try OpenAFS or Coda (http://www.openafs.org/ or http://www.coda.cs.cmu.edu/). A while back I was looking for a similar solution and briefly tested Coda. The setup for both is pretty complex, and I have since reverted to using rsync :-/
    If you are able to spend $$$, Veritas has some cool replication solutions that I’ve set up and used at previous employers.

  3. I think the simplest solution would be to use rsync. Are you copying from one “live” server to a backup, or do both get updated independently, with both sets of data needed in each place?
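
    For the one-way case, a minimal rsync invocation might look something like this (the hostname and paths are placeholders):

        # Mirror the tree to the backup datacenter, deleting remote
        # files that no longer exist on the source.
        rsync -az --delete /var/www/files/ backup.example.com:/var/www/files/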

  4. You should talk with Jason Hoffman, CTO of your and WordPress’s hosting company*. I saw his Scale With Rails presentation this weekend, and he showed how Solaris ZFS with copy-on-write iSCSI storage is much better and, overall, cheaper than replication at the file or database level.

    * Of course you know who he is, but for the benefit of others…
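
    For a flavor of the snapshot-based approach (a minimal sketch, not the exact setup from the presentation; the pool, dataset, and host names are hypothetical), ZFS can ship point-in-time snapshots between datacenters:

        # Initial full copy: the first receive creates the dataset
        # on the remote side.
        zfs snapshot tank/files@rep1
        zfs send tank/files@rep1 | ssh dc2.example.com zfs receive tank/files

        # Subsequent runs send only the delta between snapshots.
        zfs snapshot tank/files@rep2
        zfs send -i rep1 tank/files@rep2 | ssh dc2.example.com zfs receive tank/files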

  5. It’s true, rsync kicks butt; however, its limitation is that you have to ‘mirror’ each resource separately, which results in a heck of a lot of duplicated scripting to get it happening (though a small shell loop, as sketched below, cuts that down).

    Rails is also cool; however, unless one is using Rails for content delivery, it’s applying a fairly large amount of resources to what is essentially a relatively simple concept: file replication.

    The following looks to be a simple approach to file and database replication.
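
    To avoid duplicating the rsync call per resource, one minimal approach (the paths and hostname are placeholders) is:

        # Push each top-level resource with one shared rsync command,
        # deleting remote files that no longer exist on the source.
        for dir in /var/www /home /srv/uploads; do
            rsync -az --delete "$dir/" mirror.example.com:"$dir/"
        done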

  6. For small amounts of data we use a cron-scheduled rsync. We’re replicating data from a Linux box in the Midwest to one on the West Coast, and it’s been perfectly reliable thus far.
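
    As a concrete starting point for that kind of setup, a crontab entry might look like this (the interval, paths, and host are placeholders):

        # Push changes every 30 minutes; rsync only transfers deltas.
        */30 * * * * rsync -az --delete /srv/data/ west.example.com:/srv/data/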

  7. Lots of options, but you didn’t provide much detail on file sizes, pipe sizes, protocol/firewall constraints, or what you define as “real-time”. I did see a comment that they are “normal” net connections (to me a normal net connection is an OC148, but that’s not normal to most).
    iFolder3 is an open source project that provides “real-time” sync of data, encryption, and delta sync: http://www.ifolder.com/index.php/Home
    rsync
    SyncML (more a protocol/language than an actual product)

    There are many others that claim to do file sharing but lack the features you may need. There are also a lot of commercial/closed-source packages that have file replication built in, but you would have to purchase the whole enchilada to get it (portlock.com).

  8. I cannot, in any way, recommend anything related to Spread. Once upon a time I tried to use mod_log_spread, and even though we retained the author / brother of the author to make it work, we found that, well, we’d have accomplished more with our time by taking several thousand dollars in singles, lighting them on fire, and dancing naked in the moonlight.

    I personally think that Spread and anything related to it should be covered with camel dung, placed on a rocket, and shot directly into the sun. Sorry, Theo.

    Honestly, this problem is really, really hard. The big question for me is how much data and how many files. Give us a grasp of the scale and then people can think more on it. Or feel free to give me a call, Matt; you’ve got the number, or see /contact/. Always happy to help out WordPress.

  9. Well… I stumbled on this thread recently because I was looking for a similar solution.
    As far as open source, real-time file synchronisation across multiple hosts where changes can originate on any of them, tsyncd looks like the best!!
    The first time I came across this program was in this thread, when Ben Williams mentioned it… I thought I’d have a look, and it turns out it works a treat!
    A little playing around and tweaking here and there (it’s still in beta), but I’m pretty confident in it and currently use it for the following two setups:

    1. Online backup of remote client server systems: six clients now replicate their /home and /shares (Samba shares) to my file servers.

    2. Real-time content replication of 30+ websites across three datacentres (originally a twice-daily batch process). This allows a user to update the file server closest to them, and the changes are automatically replicated to the other datacentres instantaneously, so the content is the same no matter which web server visitors hit (no more 12-hour delay)!!!

    Thanks, Ben, for your suggestion!! This is a gem!!
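
    For anyone curious about the general shape of this kind of event-driven replication, here is a minimal sketch using inotifywait (from inotify-tools) to trigger rsync on change; it is not tsyncd’s own configuration, and the hosts and paths are placeholders:

        #!/bin/sh
        # Watch a tree and push every change with rsync. A real daemon
        # such as tsyncd batches events and handles multi-master
        # conflicts; this naive one-way loop does not.
        WATCH_DIR=/var/www/content
        inotifywait -m -r -e modify,create,delete,move "$WATCH_DIR" |
        while read -r dir event file; do
            for host in dc2.example.com dc3.example.com; do
                rsync -az --delete "$WATCH_DIR/" "$host:$WATCH_DIR/"
            done
        done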
