Recent Articles

Verifying copies

Published by Jim Salter // June 29th, 2016


Most of the time, we explicitly trust that our tools are doing what we ask them to do. For example, if we rsync -a /source /target, we trust that the contents of /target will exactly match the contents of /source. That’s the whole point, right? And rsync is a tool with a long and impeccable lineage. So we trust it. And, really, we should.

But once in a while, we might get paranoid. And when we do, and we decide to empirically and thoroughly check our copied data… things might get weird. Today, we’re going to explore the weirdness, and look at how to interpret it.

So you want to make a copy of /usr/bin…

Well, let’s be honest, you probably don’t. But it’s a convenient choice, because it has plenty of files in it (but not so many that copies will take forever), it has folders beneath it, and it has a couple of other things in there that can throw you off balance.

On the machine I’m sitting in front of now, /usr/bin is on an ext4 filesystem – the root filesystem, in fact. The machine also has a couple of ZFS filesystems. Let’s start out by using rsync to copy it off to a ZFS dataset.

We’ll use the -a argument to rsync, which stands for archive, and is a shorthand for -rlptgoD. This means recursive, copy symlinks as symlinks, maintain permissions, maintain timestamps, maintain group ownership, maintain ownership, and preserve devices and other special files as devices and special files where possible. Seems fairly straightforward, right?

root@banshee:/# zfs create banshee/tmp
root@banshee:/# rsync -a /usr/bin /banshee/tmp/
root@banshee:/# du -hs /usr/bin ; du -hs /banshee/tmp/bin
353M	/usr/bin
215M	/banshee/tmp/bin

Hey! We’re missing data! What the hell? No, we’re not actually missing data – we just didn’t think about the fact that the du command reports space used on disk, which is related but not at all the same thing as amount of data stored. In this case, obvious culprit is obvious:

root@banshee:/# zfs get compression banshee/tmp
NAME         PROPERTY     VALUE     SOURCE
banshee/tmp  compression  lz4       inherited from banshee
root@banshee:/# zfs get compressratio banshee/tmp
NAME         PROPERTY       VALUE  SOURCE
banshee/tmp  compressratio  1.55x  -

So, compression tricked us.

I had LZ4 compression enabled on my ZFS dataset, which means that the contents of /usr/bin – much of which are highly compressible – do not take up anywhere near as much space on disk when stored there as they do on the original ext4 filesystem. For the purposes of our exercise, let’s just wipe it out, create a new dataset that doesn’t use compression, and try again.

root@banshee:/# zfs destroy -r banshee/tmp
root@banshee:/# zfs create -o compression=off banshee/tmp
root@banshee:/# rsync -ha /usr/bin /banshee/tmp/
root@banshee:/# du -hs /usr/bin ; du -hs /banshee/tmp/bin
353M	/usr/bin
365M	/tmp/bin

Wait, what the hell?! Compression’s off now, so why am I still seeing discrepancies – and in this case, more space used on the ZFS target than the ext4 source? Well, again – du measures on-disk space, and these are different filesystems. The blocksizes may or may not be different, and in this case, ZFS stores redundant copies of the metadata associated with each file where ext4 does not, so this adds up to measurable discrepancies in the output of du.

Semi-hidden, arcane du arguments

You won’t find any mention at all of it in the man page, but info du mentions an argument called –apparent-size that sounds like the answer to our prayers.

`--apparent-size'
     Print apparent sizes, rather than disk usage.  The apparent size
     of a file is the number of bytes reported by `wc -c' on regular
     files, or more generally, `ls -l --block-size=1' or `stat
     --format=%s'.  For example, a file containing the word `zoo' with
     no newline would, of course, have an apparent size of 3.  Such a
     small file may require anywhere from 0 to 16 KiB or more of disk
     space, depending on the type and configuration of the file system
     on which the file resides.

Perfect! So all we have to do is use that, right?

root@banshee:/# du -hs --apparent-size /usr/bin ; du -hs --apparent-size /banshee/tmp/bin
349M	/usr/bin
353M	/banshee/tmp/bin

We’re still consuming 4MB more space on the target than the source, dammit. WHY?

Counting files and comparing notes

So we’ve tried everything we could think of, including some voodoo arguments to du that probably cost the lives of many Bothans, and we still have a discrepancy in the amount of data present in source and target. Next step, let’s count the files and folders present on both sides, by using the find command to enumerate them all, and piping its output to wc -l, which counts them once enumerated:

root@banshee:/# find /usr/bin | wc -l ; find /banshee/tmp/bin | wc -l
2107
2107

Huh. No discrepancies there. OK, fine – we know about the diff command, which compares the contents of files and tells us if there are discrepancies – and handily, diff -r is recursive! Problem solved!

root@banshee:/# diff -r /usr/bin /banshee/tmp/bin 2>&1 | head -n 5
diff: /banshee/tmp/bin/arista-gtk: No such file or directory
diff: /banshee/tmp/bin/arista-transcode: No such file or directory
diff: /banshee/tmp/bin/dh_pypy: No such file or directory
diff: /banshee/tmp/bin/dh_python3: No such file or directory
diff: /banshee/tmp/bin/firefox: No such file or directory

What’s wrong with diff -r?

Actually, problem emphatically not solved. I cheated a little, there – I knew perfectly well we were going to get a ton of errors thrown, so I piped descriptor 2 (standard error) to descriptor 1 (standard output) and then to head -n 5, which would just give me the first five lines of errors – otherwise we’d be staring at screens full of crap. Why are we getting “no such file or directory” on all these? Let’s look at one:

root@banshee:/# ls -l /banshee/tmp/bin/arista-gtk
lrwxrwxrwx 1 root root 26 Mar 20  2014 /banshee/tmp/bin/arista-gtk -> ../share/arista/arista-gtk

Oh. Well, crap – /usr/bin is chock full of relative symlinks, which rsync -a copied exactly as they are. So now our target, /banshee/tmp/bin, is full of relative symlinks that don’t go anywhere, and diff -r is trying to follow them to compare contents. /usr/share/arista/arista-gtk is a thing, but /banshee/share/arista/arista-gtk isn’t. So now what?

find, and xargs, and md5sum, oh my!

Luckily, we have a tool called md5sum that will generate a pretty strong checksum of any arbitrary content we give it. It’ll still try to follow symlinks if we let it, though. We could just ignore the symlinks, since they don’t have any data in them directly anyway, and just sorta assume that they’re targeted to the same relative target that they were on the source… yeah, sure, let’s try that.

What we’ll do is use the find command, with the -type f argument, to only find actual files in our source and our target, thereby skipping folders and symlinks entirely. We’ll then pipe that to xargs, which turns the list of output from find -type f into a list of arguments for md5sum, which we can then redirect to text files, and then we can diff the text files. Problem solved! We’ll find that extra 4MB of data in no time!

root@banshee:/# find /usr/bin -type f | xargs md5sum > /tmp/sourcesums.txt
root@banshee:/# find /banshee/tmp/bin -type f | xargs md5sum > /tmp/targetsums.txt

… okay, actually I’m going to cheat here, because you guessed it: we failed again. So we’re going to pipe the output through grep sudo to let us just look at the output of a couple of lines of detected differences.

root@banshee:/# diff /tmp/sourcesums.txt /tmp/targetsums.txt | grep sudo
< 866713ce6e9700c3a567788334be7e25  /usr/bin/sudo
< e43c40cfbec06f191191120f4332f60f  /usr/bin/sudoreplay
> e43c40cfbec06f191191120f4332f60f  /banshee/tmp/bin/sudoreplay
> 866713ce6e9700c3a567788334be7e25  /banshee/tmp/bin/sudo

Ah, crap – absolute paths. Every single file failed this check, since we’re seeing it referenced by absolute path, and /usr/bin/sudo doesn’t match /banshee/tmp/bin/sudo, even though the checksums match! OK, we’ll fix that by using relative paths when we generate our sums… and I’ll save you a little more time; that might fail too, because there’s not actually any guarantee that find will return the files in the same sort order both times. So we’ll also pipe find‘s output through sort to fix that problem, too.

root@banshee:/banshee/tmp# cd /usr ; find ./bin -type f | sort | xargs md5sum > /tmp/sourcesums.txt
root@banshee:/usr# cd /banshee/tmp ; find ./bin -type f | sort | xargs md5sum > /tmp/targetsums.txt
root@banshee:/banshee/tmp# diff /tmp/sourcesums.txt /tmp/targetsums.txt
root@banshee:/banshee/tmp# 

Yay, we finally got verification – all of our files exist on both sides, are named the same on both sides, are in the same folders on both sides, and checksum out the same on both sides! So our data’s good! Hey, wait a minute… if we have the same number of files, in the same places, with the same contents, how come we’ve still got different totals?!

Macabre, unspeakable use of ls

We didn’t really get what we needed out of find. We do at least know that we have the right total number of files and folders, and that things are in the right places, and that we got the same checksums out of the files that we could check. So where the hell are those missing 4MB of data coming from?! We’re using our arcane invocation of du -s –apparent-size to make sure we’re looking at bytes of data and not bytes of allocated disk space, so why are we still seeing differences?

This time, let’s try using ls. It’s rarely used, but ls does have a recursion argument, -R. We’ll use that along with -l for long form and -a to pick up dotfiles, and we’ll pipe it though grep -v ‘\.\.’ to get rid of the entry for the parent directory (which obviously won’t be the same for /usr/bin and for /banshee/tmp/bin, no matter how otherwise perfect matches they are), and then, once more, we’ll have something we can diff.

root@banshee:/banshee/tmp# ls -laR /usr/bin > /tmp/sourcels.txt
root@banshee:/banshee/tmp# ls -laR /banshee/tmp/bin > /tmp/targetls.txt
root@banshee:/banshee/tmp# diff /tmp/sourcels.txt /tmp/targetls.txt | grep which
< lrwxrwxrwx  1 root   root         10 Jan  6 14:16 which -> /bin/which
> lrwxrwxrwx 1 root   root         10 Jan  6 14:16 which -> /bin/which

Once again, I cheated, because there was still going to be a problem here. ls -l isn’t necessarily going to use the same number of formatting spaces when listing the source or the target – so our diff, again, frustratingly fails on every single line. I cheated here by only looking at the output for the which file so we wouldn’t get overwhelmed. The fix? We’ll pipe through awk, forcing the spacing to be constant. Grumble grumble grumble…

root@banshee:/banshee/tmp# ls -laR /usr/bin | awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11}' > /tmp/sourcels.txt
root@banshee:/banshee/tmp# ls -laR /banshee/tmp/bin | awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11}' > /tmp/targetls.txt
root@banshee:/banshee/tmp# diff /tmp/sourcels.txt /tmp/targetls.txt | head -n 25
1,4c1,4
< /usr/bin:          
< total 364620         
< drwxr-xr-x 2 root root 69632 Jun 21 07:39 .  
< drwxr-xr-x 11 root root 4096 Jan 6 15:59 ..  
---
> /banshee/tmp/bin:          
> total 373242         
> drwxr-xr-x 2 root root 2108 Jun 21 07:39 .  
> drwxr-xr-x 3 root root 3 Jun 29 18:00 ..  
166c166
< -rwxr-xr-x 2 root root 36607 Mar 1 12:35 c2ph  
---
> -rwxr-xr-x 1 root root 36607 Mar 1 12:35 c2ph  
828,831c828,831
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-check  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-create  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-remove  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-touch  
---
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-check  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-create  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-remove  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-touch  
885,887c885,887

FINALLY! Useful output! It makes perfect sense that the special files . and .., referring to the current and parent directories, would be different. But we’re still getting more hits, like for those four lockfile- commands. We already know they’re identical from source to target, both because we can see the filesizes are the same and because we’ve already md5sum‘ed them and they matched. So what’s up here?

And you thought you’d never need to use the ‘info’ command in anger

We’re going to have to use info instead of man again – this time, to learn about the ls command’s lesser-known evils.

Let’s take a quick look at those mismatched lines for the lockfile- commands again:

< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-check  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-create  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-remove  
< -rwxr-xr-x 4 root root 14592 Dec 3 2012 lockfile-touch  
---
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-check  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-create  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-remove  
> -rwxr-xr-x 1 root root 14592 Dec 3 2012 lockfile-touch

OK, ho hum, we’ve seen this a million times before. File type, permission modes, owner, group, size, modification date, filename, and in the case of the one symlink shown, the symlink target. Hey, wait a minute… how come nobody ever thinks about the second column? And, yep, that second column is the only difference in these lines – it’s 4 in the source, and 1 in the target.

But what does that mean? You won’t learn the first thing about it from man ls, which doesn’t tell you any more than that -l gives you a “long listing format”. And there’s no option to print column headers, which would certainly be helpful… info ls won’t give you any secret magic arguments to print column headers either. But however grudgingly, it will at least name them:

`-l'
`--format=long'
`--format=verbose'
     In addition to the name of each file, print the file type, file
     mode bits, number of hard links, owner name, group name, size, and
     timestamp

So it looks like that mysterious second column refers to the number of hardlinks to the file being listed, normally one (the file itself) – but on our source, there are four hardlinks to each of our lockfile- files.

Checking out hardlinks

We can explore this a little further, using find‘s –samefile test:

root@banshee:/banshee/tmp# find /usr/bin -samefile /usr/bin/lockfile-check
/usr/bin/lockfile-create
/usr/bin/lockfile-check
/usr/bin/lockfile-remove
/usr/bin/lockfile-touch
root@banshee:/banshee/tmp# find /banshee/tmp/bin -samefile /usr/bin/lockfile-check
root@banshee:/banshee/tmp# 

Well, that explains it – on our source, lockfile-create, lockfile-check, lockfile-remove, and lockfile-touch are all actually just hardlinks to the same inode – meaning that while they look like four different files, they’re actually just four separate pointers to the same data on disk. Those and several other sets of hardlinks add up to the reason why our target, /banshee/tmp/bin, is larger than our source, /usr/bin, even though they appear to contain the same data exactly.

Could we have used rsync to duplicate /usr/bin exactly, including the hardlink structures? Absolutely! That’s what the -H argument is there for. Why isn’t that rolled into the -a shortcut argument, you might ask? Well, probably because hardlinks aren’t supported on all filesystems – in particular, they won’t work on FAT32 thumbdrives, or on NFS links set up by sufficiently paranoid sysadmins – so rather than have rsync jobs fail, it’s left up to you to specify if you want to recreate them.

Simpler verification with tar

This was cumbersome as hell – could we have done it more simply? Well, of course! Actually… maybe.

We could have used the venerable tar command to generate a single md5sum for the entire collection of data in one swell foop. We do have to be careful, though – tar will encode the entire path you feed it into the file structure, so you can’t compare tars created directly from /usr/bin and /banshee/tmp/bin – instead, you need to cd into the parent directories, create tars of ./bin in each case, then compare those.

Let’s start out by comparing our current copies, which differ in how hardlinks are handled:

root@banshee:/tmp# cd /usr ; tar -c ./bin | md5sum ; cd /banshee/tmp ; tar -c ./bin | md5sum
4bb9ffbb20c11b83ca616af716e2d792  -
2cf60905cb7870513164e90ac82ffc3d  -

Yep, they differ – which we should be expecting ; after all, the source has hardlinks and the target does not. How about if we fix up the target so that it also uses hardlinks?

root@banshee:/tmp# rm -r /banshee/tmp/bin ; rsync -haH /usr/bin /banshee/tmp/
root@banshee:/tmp# cd /usr ; tar -c ./bin | md5sum ; cd /banshee/tmp ; tar -c ./bin | md5sum
4bb9ffbb20c11b83ca616af716e2d792  -
f654cf7a1c4df8482d61b59993bb92f7  -

Well, crap. That should’ve resulted in identical md5sums, shouldn’t it? It would have, if we’d gone from one ext4 filesystem to another:

root@banshee:/banshee/tmp# rsync -haH /usr/bin /tmp/
root@banshee:/banshee/tmp# cd /usr ; tar -c ./bin | md5sum ; cd /tmp ; tar -c ./bin | md5sum
4bb9ffbb20c11b83ca616af716e2d792  -
4bb9ffbb20c11b83ca616af716e2d792  -

So why, exactly, didn’t we get the same md5sum values for a tar on ZFS, when we did on ext4? The metadata gets included in those tarballs, and it’s just plain different between different filesystems. Different metadata means different contents of the tar means different checksums. Ugly but true. Interestingly, though, if we dump /banshee/tmp/bin back through tar and out to another location, the checksums match again:

root@banshee:/tmp/fromtar# tar -c /banshee/tmp/bin | tar -x
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
root@banshee:/tmp/fromtar# cd /tmp/fromtar/banshee/tmp ; tar -c ./bin | md5sum ; cd /usr ; tar -c ./bin | md5sum
4bb9ffbb20c11b83ca616af716e2d792  -
4bb9ffbb20c11b83ca616af716e2d792  -

What this tells us is that ZFS is storing additional metadata that ext4 can’t, and that, in turn, is what was making our tarballs checksum differently… but when you move that data back onto a nice dumb ext4 filesystem again, the extra metadata is stripped again, and things match again.

Picking and choosing what matters

Properly verifying data and metadata on a block-for-block basis is hard, and you need to really, carefully think about what, exactly, you do – and don’t! – want to verify. If you want to verify absolutely everything, including folder metadata, tar might be your best bet – but it probably won’t match up across filesystems. If you just want to verify the contents of the files, some combination of find, xargs, md5sum, and diff will do the trick. If you’re concerned about things like hardlink structure, you have to check for those independently. If you aren’t just handwaving things away, this stuff is hard. Extra hard if you’re changing filesystems from source to target.

Shortcut for ZFS users: replication!

For those of you using ZFS, and using strictly ZFS, though, there is a shortcut: replication. If you use syncoid to orchestrate ZFS replication at the dataset level, you can be sure that absolutely everything, yes everything, is preserved:

root@banshee:/usr# syncoid banshee/tmp banshee/tmp2
INFO: Sending oldest full snapshot banshee/tmp@autosnap_2016-06-29_19:33:01_hourly (~ 370.8 MB) to new target filesystem:
 380MB 0:00:01 [ 229MB/s] [===============================================================================================================================] 102%            
INFO: Updating new target filesystem with incremental banshee/tmp@autosnap_2016-06-29_19:33:01_hourly ... syncoid_banshee_2016-06-29:19:39:01 (~ 4 KB):
2.74kB 0:00:00 [26.4kB/s] [======================================================================================>                                         ] 68%           
root@banshee:/banshee/tmp2# cd /banshee/tmp ; tar -c . | md5sum ; cd /banshee/tmp2 ; tar -c . | md5sum
e2b2ca03ffa7ba6a254cefc6c1cffb4f  -
e2b2ca03ffa7ba6a254cefc6c1cffb4f  -

Easy, peasy, chicken greasy: replication guarantees absolutely everything matches.

TL;DR

There isn’t one, really! Verification of data and metadata, especially across filesystems, is one of the fundamental challenges of information systems that’s generally abstracted away from you almost entirely, and it makes for one hell of a deep rabbit hole to fall into. Hopefully, if you’ve slogged through all this with me so far, you have a better idea of what tools are at your fingertips to test it when you need to.

  • "I honestly feel as though my business would not be where it is today were it not for us happening into the hiring of Jim Salter."

    W. Chris Clark, CPA // President // Clark Eustace Wagner, PA

  • "Jim’s advice has always been spot on – neither suggesting too little or too much. He understands that companies have limited resources and he does not offer up solutions beyond those that we have truly needed."

    Paul Yoo // President // US Patriot Tactical

  • "Jim Salter is an indispensable part of our team, maintaining our network and being proactive on all of our IT needs. We have comfort knowing that there are redundant bootable backups of all files and databases, offsite and onsite."

    Regina R. Floyd, AIA, LEED AP BD+C // Principal // Watson Tate Savory

  • Recent Thoughts

  • Demonstrating ZFS pool write distribution
  • One of my pet peeves is people talking about zfs “striping” writes across a pool. It doesn’t help any that zfs core developers use this terminology too – but it’s sloppy and not really correct. ZFS distributes writes among all the vdevs in a pool.  If your vdevs all have the same amount of free space available, this will resemble a simple striping action closely enough.  But if you have different amounts of free space on different […]