
Why Sanoid’s ZFS replication matters

Published by Jim Salter // November 19th, 2014


If you’re an old hand in the storage game and are familiar with rsync (which is an amazing tool, btw), you might not be quite sure why block-level replication matters.

So, let’s do a thought experiment

What if you want daily offsite DR of an entire VM image? Let’s say the image is about 2TB in size. If you just copy the whole thing, like with FTP or SCP or any other simple copy tool, you’re looking at squeezing 2TB, byte by byte, over your offsite internet connection. Even with a completely uncontested 100mbps connection, with roughly 11.5MB/sec of throughput, that’s going to take a solid 50 hours. Ouch. OK, that’s a non-starter.
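
Just to put numbers on that “copy the whole thing” plan – the file name below is made up for illustration, but the arithmetic isn’t:

root@remotebackup:~# scp root@10.10.10.1:/data/images/fileserver.raw /backup/images/
# 2TB is roughly 2,000,000 MB; at ~11.5MB/sec that's ~174,000 seconds – call it 48-50 hours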

OK, so what about rsync? Rsync is a highly advanced userland tool that is specifically designed to only migrate CHANGED data across the network connection. But in order to do this, rsync first has to read the entire file, on both ends, then tokenize it in chunks, then compare those tokens, which will tell it what chunks have changed and therefore need to be sent across the pipe. This works great for moderately sized files. But if you stop and think about it, that means for this scenario, rsync needs to read an entire 2TB file – on both ends – before it can even start actually moving data. So, let’s say you’ve got really great storage on both ends and you can pull sustained 100MB/sec reads over the entire 2TB on each end… you’re still looking at six hours, minimum, during which your production system will be drowning in I/O requests and nearly unusable. Even if you only changed a single byte of data. Well, crap.
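
For concreteness, here’s roughly what that rsync run would look like – again with a hypothetical file name, and the usual flags for syncing one enormous file:

root@remotebackup:~# rsync -av --inplace --partial root@10.10.10.1:/data/images/fileserver.raw /backup/images/
# before the first changed block ever moves, rsync has to read and checksum
# the entire 2TB file on BOTH ends to figure out which chunks differ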

This brings us to syncoid, Sanoid’s replication tool. Syncoid makes using the underlying ZFS replication simple and easy – you give it a source, you give it a target, and it makes sure the target is up-to-date with all of the snapshots available on the source. Syncoid (and ZFS) don’t need to tokenize anything first – they already know what individual blocks have changed, because that information is implicit in the snapshots themselves (which ultimately are just a list of pointers to blocks in the filesystem). So Syncoid just has to see what the newest snapshot is on the target, then send an incremental stream from the source to the target updating it with any more recent snapshots – meaning it starts working immediately, and it doesn’t generate any unnecessary load on the source or target machines.
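
If you’re curious what syncoid is orchestrating under the hood, a single incremental update boils down to plain old zfs send and zfs receive – something like this, with illustrative snapshot names standing in for the real ones:

root@remotebackup:~# ssh root@10.10.10.1 zfs send -i data/images@monday data/images@tuesday | zfs receive backup/images
# only the blocks that changed between the two snapshots ever cross the wire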

Ooh, a case study! Everybody loves case studies!

How’s this work in practice? Well, luckily, I just so happen to have a >2TB dataset in production. Almost the full 2TB of it is a single VM (which is the fileserver for an engineering office), with a couple hundred gigabytes thrown in for other, smaller application server VMs in that office. The big fileserver is Linux, two of the application server VMs are as well, and finally there’s a Windows VM running a server copy of Quickbooks. So let’s see how this plays out in real life.

The laughably small internet connection

root@remotebackup:~# iperf -c 10.10.10.1
------------------------------------------------------------
Client connecting to 10.10.10.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.2 port 39796 connected with 10.10.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.0 sec  2.00 MBytes  1.52 Mbits/sec

First question – just how crappy is the bandwidth we’re working with? Well, it’s an OpenVPN tunnel running across a low-dollar residential cable connection – the offsite DR is actually to the home of one of the firm partners. Our iperf test here shows us we’re getting a whopping 1.5mbps – enough for about 160KB/sec or so of throughput. Yuck.

2TB of VM images? Well, 2.2TB really. 2.6TB including snapshots.

root@remotebackup:~# ssh 10.10.10.1 zfs list data/images
NAME        USED  AVAIL REFER MOUNTPOINT
data/images 2.64T 2.86T 2.19T /data/images

We can see that on the source server, we have 2.64TB of storage used, out of which a little under 2.2TB are the images themselves (the rest is data contained in snapshots, but not contained in the current state of data/images). We do have a local copy, of course – like I said, this is a production example – but it’s close to a day out of date, during which time an office full of engineers have been busily working on drawings and documents. So, how do we replicate this monster over a $40/mo internet connection?

Getting the job done: syncoid!

root@remotebackup:~# /usr/local/bin/syncoid root@10.10.10.1:data/images backup/images
Sending incremental syncoid_remotebackup_2014-11-17:22:00:04 ... syncoid_cseremotebackup_2014-11-18:11:08:51 (~ 3.1 GB):
3.04GB 0:53:17 [ 998kB/s] [=============================> ] 99%

Easy peasy – a single command takes care of it. Over this ludicrously tiny internet connection, it did take a little under an hour to run… but that’s not so bad for a daily backup routine, and the 1MB/sec that our little internet pipe limited us to obviously isn’t putting a big strain on the underlying storage in the meantime. (If you’re wondering how we got 1MB/sec over a 1.5mbps pipe that should only move 160KB/sec or so, the answer is built-in LZO compression. Snazzy, right?)
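
If you wanted to roll that compression into a by-hand pipeline, it would look something like this – a sketch of the kind of plumbing syncoid sets up for you, assuming lzop is installed on both ends and using the same illustrative snapshot names as before:

root@remotebackup:~# ssh root@10.10.10.1 'zfs send -i data/images@monday data/images@tuesday | lzop -c' | lzop -dc | zfs receive backup/images
# the stream gets compressed before it hits the slow WAN link, and decompressed on arrival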

I love this stuff. I really, really do.

  • "I honestly feel as though my business would not be where it is today were it not for us happening into the hiring of Jim Salter."

    W. Chris Clark, CPA // President // Clark Eustace Wagner, PA

  • "Jim’s advice has always been spot on – neither suggesting too little or too much. He understands that companies have limited resources and he does not offer up solutions beyond those that we have truly needed."

    Paul Yoo // President // US Patriot Tactical

  • "Jim Salter is an indispensable part of our team, maintaining our network and being proactive on all of our IT needs. We have comfort knowing that there are redundant bootable backups of all files and databases, offsite and onsite."

    Regina R. Floyd, AIA, LEED AP BD+C // Principal // Watson Tate Savory

  • Recent Thoughts

  • Demonstrating ZFS pool write distribution
  • One of my pet peeves is people talking about zfs “striping” writes across a pool. It doesn’t help any that zfs core developers use this terminology too – but it’s sloppy and not really correct. ZFS distributes writes among all the vdevs in a pool.  If your vdevs all have the same amount of free space available, this will resemble a simple striping action closely enough.  But if you have different amounts of free space on different […]