Updated Jan. 21, 2008: The Python script may not work very well; please check this post for the proper shell script format based on Gregg’s comments here.
Updated May 23, 2007 to fix link to Python text file.
What happens when you have two directories or folders that are supposed to be identical (think backup situation)? They’re supposed to contain all the same files and identical directory structure all the way down, right? But, what if those directories contain thousands of files and subdirectories? How can you be sure they’re identical without masochistically reviewing every… single… painful… thing? 36 lines of code later…
The Foof was having some serious problems with her G4 TiBook at work. Things like: startup took 15-20 minutes, Samba shares were dropping randomly and required logout/in to allow reconnection, and the added bonus of randomly booting into single-user mode. For someone who’s not used to the bleak blackness of the standard *nix login screen, having it come up after churning through startup for 15 minutes is probably pretty disconcerting.
About a week ago, I leapfrogged the system from Jag 10.2.8 to Panther 10.3.3 as part of the whole-office upgrade initiative. I’d hoped that such a huge rev bump would inherently fix her PowerBook’s problems. It didn’t.
Last night, I brought it home with three other iBooks; they were getting upgrades, but The Foof’s machine was getting the full treatment: back up everything, erase hard drive, erase it again, and again for good measure, then reinstall everything. It was the “back up everything” step that took the most time. I didn’t want The Foof to lose any data whatsoever, and she has a lot of stuff on that machine.
The simplest thing to do is copy her entire user directory to the Firewire drive. Except some files have really long names, and OS X takes a shit when you try to copy them to another volume. You can choose to skip those files, but the system doesn’t tell you which files are the offenders; how are you to find out which files weren’t copied without scouring through both places and looking at every single file? The horror.
The next-to-simplest thing to do is tar up the whole fuckin’ thing and copy the tarball to the Firewire. Except that tar takes a shit on the same files (those with very long names). It’s only slightly more informative, though, as the tar process actually gives you a list of files that it has problems with… but they’re not going to exist in the backup anyway, so you’re no better off.
If only there were a way to do a diff on two separate branches of a file system. Ideally, it would tell you which files were present in one branch but missing in the other, or whether two identically-named files in respectively-identical locations had different file sizes. That way, you could rename and copy things around until both branches were exactly the same. Kind of like a manual rsync process, keeping you in the loop and informed, but without the hassle of setting up and configuring rsync for a one-off backup. Besides, I don’t know if rsync would similarly barf on files with long names.
A Little Help
As far as I know, there is no command-line utility or available application to find differences between two file structures (correct me if I’m wrong, please!). So, I wrote a quick and dirty Python script to help. It’ll only work on OS X, Linux or other *nix operating systems that have Python installed.
Basically, you specify a directory, and it walks down the file system from there, recording every file, its size, and every subdirectory, then does the same inside each subdirectory until there’s nothing left below. It simply spits out a text file with the information, like this:
features.html >> 11989
index.html >> 11980
specs.html >> 12777
diagram.gif >> 17953
photo.jpg >> 21825
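For the curious, the walking step can be sketched in a few lines. This is a minimal illustration, not the script itself: the function name list_tree is made up for this example, and I'm assuming paths recorded relative to the starting directory (so the two listings line up) and files only, with no separate lines for subdirectories. The " >> " separator matches the sample output above.

```python
import os


def list_tree(root):
    """Return sorted 'relative/path >> size' lines for every file under root.

    Hypothetical helper for illustration; not the original script.
    """
    lines = []
    # os.walk visits root and every subdirectory below it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Relative paths (an assumption) keep two branches comparable.
            rel = os.path.relpath(full, root)
            # lstat does not follow symlinks, so a broken link won't raise;
            # for a symlink this is the size of the link itself.
            size = os.lstat(full).st_size
            lines.append("%s >> %d" % (rel, size))
    return sorted(lines)
```

Writing the result out is then just a matter of joining the lines with newlines and dumping them into a text file, once for each branch.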
But how does this help? Simple. Specify your live branch and run it. Then specify your backup branch and run it again. You get two text files containing data representing both branches in full. Now, open up BBEdit and “Find Differences…” on both files… or use command-line diff to determine what’s not where, and if files have different sizes.
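If you'd rather not eyeball a diff, the comparison itself is easy to sketch too. Again, this is only an illustration under assumptions: the names parse_listing and compare_listings are invented here, and the input is assumed to be lines in the "name >> size" format shown above, one file per line.

```python
def parse_listing(lines):
    """Turn 'path >> size' lines into a {path: size} dict."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\n")
        if " >> " not in line:
            continue  # skip blank or malformed lines
        # Split on the last separator in case a file name contains ' >> '.
        path, size = line.rsplit(" >> ", 1)
        sizes[path] = int(size)
    return sizes


def compare_listings(live_lines, backup_lines):
    """Return (missing from backup, missing from live, size mismatches)."""
    live = parse_listing(live_lines)
    backup = parse_listing(backup_lines)
    missing_from_backup = sorted(set(live) - set(backup))
    missing_from_live = sorted(set(backup) - set(live))
    size_differs = sorted(
        path for path in set(live) & set(backup) if live[path] != backup[path]
    )
    return missing_from_backup, missing_from_live, size_differs
```

Feed it the lines of the two text files (live first, backup second) and you get three lists back: what never made it to the backup, what exists only in the backup, and what exists in both places with different sizes.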
It’s not perfect, it could be a hell of a lot better, but it helps for now. Here’s the code.