Anyone have any favorite tricks for geographically diverse real-time file replication on Linux? It seems like most information is pretty dispersed, and suggestions range from every-30-seconds rsync to putting all files as BLOBs in MySQL and replicating that. There has to be a better way. (The scariest part is Microsoft seems to show up first for most Googles I can think of, but Windows is not an option.)
I haven’t had time to play with it myself, but MogileFS claims automatic file replication.
I think Mogile solves a different problem: how to efficiently store static files within one datacenter. We basically want a synced copy of the files in two datacenters. It looks like if we had a Mogile node in each, we’d have to replicate the tracker, and some file calls would go to the other DC, which would add a lot of latency. We want all files to be local to the web nodes in each place.
I believe some folks have built a system on top of spread: http://www.spread.org/
Have you looked at csync2? It is asynchronous like rsync, but it may be better suited to what you are trying to accomplish.
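If it helps, a csync2 setup is basically a shared key plus a config on each host, and then running csync2 from cron or some other trigger. The hostnames and paths below are just placeholders, so check the docs for the exact syntax:

    # /etc/csync2.cfg (rough sketch)
    group web {
        host web-east web-west;
        key /etc/csync2.key_web;
        include /var/www;
        exclude *.tmp;
    }

Then something like csync2 -xv on each box pushes out whatever has changed since the last run.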
I’ve found unison useful as an rsync-like two-way replicator for ensuring that changes at both ends are propagated correctly.
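For anyone curious, a two-way sync can be as simple as a one-liner like this (hostname and path are made up):

    # reconcile /srv/www in both directions over ssh, no interactive prompts
    unison /srv/www ssh://web-west.example.com//srv/www -batch -times

The -batch flag skips the interactive questions, which is what you want if it’s running from cron.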
I haven’t used it but there is tsync:
http://tsyncd.sourceforge.net/
You could try rdist, but you may have to do a bit of work to get it integrated.
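If I remember the man page right, the integration work is mostly writing a Distfile along these lines (hosts and paths are placeholders):

    HOSTS = ( web-east web-west )
    FILES = ( /var/www )

    ${FILES} -> ${HOSTS}
        install ;

and then kicking off rdist from cron or a commit hook.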
You can try OpenAFS or CodaFS. (http://www.openafs.org/ or http://www.coda.cs.cmu.edu/) A while back, I was looking for a similar solution and tested out coda briefly. The setup for both is pretty complex. I have since reverted to using rsync :-/
If you are able to spend $$$, Veritas has some cool replication solutions that I’ve set up and used at previous employers of mine.
If you have a fast link to the other datacenter, you can use DRBD. I haven’t used it yet, but I’ve been looking at it as part of a DR plan for our operations.
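For reference, a DRBD resource is declared roughly like this, with protocol A being the asynchronous mode you’d want over a WAN link (hostnames, devices, and addresses are made up):

    resource r0 {
        protocol A;              # asynchronous replication, tolerates WAN latency
        on web-east {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.1.10:7789;
            meta-disk internal;
        }
        on web-west {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.2.10:7789;
            meta-disk internal;
        }
    }

One thing to keep in mind: DRBD replicates a block device, and normally only one side is primary (mounted read-write) at a time, so on its own it may not fit the “files local to the web nodes in both DCs” requirement.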
The link is just a normal net connection, ping times of about 35ms and 12 hops between them.
I think the simplest solution would be to use rsync. Are you copying from one “live” server to a backup, or do they both get updated independently and want both sets of data available on each?
You should talk with Jason Hoffman, CTO of your and WordPress’s hosting company*. I saw his Scale With Rails presentation this weekend and he showed how Solaris ZFS with copy on write iSCSI storage is much better and, overall, cheaper than replication at the file or database level.
* Of course you know who he is but for the benefit of others…
You should take a look at GFS, the Google Filesystem they built, which seems to do exactly what you want.
It’s true, rsync kicks butt; however, its limitation is that you would have to “mirror” each resource separately, which results in a heck of a lot of duplication to get it happening.
Rails is also cool; however, unless one is using Rails for content delivery, it’s applying a fairly large amount of resources to what is essentially a relatively simple concept: file replication.
The following looks to be a simple approach to file and database replication.
For small amounts of data we use a cron scheduled rsync. We’re replicating data from a Linux box in the midwest to one on the west coast. It’s been perfectly reliable thus far.
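In case it’s useful, ours boils down to a crontab entry along these lines (host and paths are just examples):

    # push the web content to the west-coast box every 5 minutes over ssh
    */5 * * * * rsync -az --delete /srv/www/ backup.example.com:/srv/www/

The --delete flag keeps the remote copy from accumulating files that have been removed on the source, so be careful with it.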
Lots of options, but you didn’t provide much detail on file sizes, pipe sizes, protocol/firewall constraints, and what you define as “real-time”. I did see a comment that they are “normal” net connections (and to me a normal net connection is an OC148, but that’s not normal to most).
iFolder3 is an open source project that provides “real-time” sync of data, encryption, and delta sync. http://www.ifolder.com/index.php/Home
Rsync
SyncML (more protocol/language than actual product)
There are many others that claim to do file sharing but lack the features you may need. There are also a lot of commercial/closed-source packages with file replication built in, but you would have to purchase the whole enchilada to get it (portlock.com).
I cannot, in any way, recommend anything related to Spread. Once upon a time I tried to use mod_log_spread and even though we retained the author / brother of the author to make it work we found that, well, we’d have accomplished more with our time by taking several thousand dollars in singles, lighting them on fire and dancing naked in the moonlight.
I personally think that Spread and anything related to it should be covered with camel dung and then placed on a rocket and shot directly into the sun. Sorry Theo.
Honestly this problem is really, really hard. The big question for me is how much data and how many files. Give us a grasp of scale and then people can think more on it. Or feel free to give me a call Matt; you’ve got the number or see /contact/. Always happy to help out WordPress.
The problem is a piece of cake.
Haha, Jason do tell!
I have to agree with Jason, but I wish I could say the same about the solution!
Well… I stumbled on this thread recently because I was looking for a similar solution.
As far as open-source, real-time file synchronisation across multiple hosts goes, where changes to files can be made on any one of them, tsyncd looks like the best!!
The first time I came across this program was in this thread, when Ben Williams mentioned it… Thought I’d have a look, and it turns out it works a treat!
A little playing around and tweaking here and there (it’s still in beta), but I’m pretty confident with it and currently use it for the following two solutions:
1. Online backup of remote client server systems (6 clients now replicate their /home and /shares (Samba shares) to my file servers).
2. Real-time content replication of 30+ websites across 3 datacentres (originally a twice-daily batch process). This lets a user update the file server closest to them, and the changes are replicated to the other datacentres automatically, so the content is the same no matter which web server visitors hit (no more 12-hour delay)!!!
Thanks Ben for your suggestion!! This is a gem!!