Diff’ing Directory Structures

Updated Jan. 21, 2008: The python script may not work very well; please check this post for the proper shell script format based on Gregg’s comments here.

Updated May 23, 2007 to fix link to Python text file.

What happens when you have two directories or folders that are supposed to be identical (think backup situation)? They’re supposed to contain all the same files and identical directory structure all the way down, right? But, what if those directories contain thousands of files and subdirectories? How can you be sure they’re identical without masochistically reviewing every… single… painful… thing? 36 lines of code later…

The Setup
The Foof was having some serious problems with her G4 TiBook at work. Things like: startup took 15-20 minutes, Samba shares were dropping randomly and required logout/in to allow reconnection, and the added bonus of randomly booting into single-user mode. For someone who’s not used to the bleak blackness of the standard *nix login screen, having it come up after churning through startup for 15 minutes is probably pretty disconcerting.

About a week ago, I leapfrogged the system from Jag 10.2.8 to Panther 10.3.3 as part of the whole-office upgrade initiative. I’d hoped that such a huge rev bump would inherently fix her PowerBook’s problems. It didn’t.

Last night, I brought it home with three other iBooks; they were getting upgrades, but The Foof’s machine was getting the full treatment: backup everything, erase hard drive, erase it again, and again for good measure, then reinstall everything. It was the “backup everything” step that took the most time. I didn’t want The Foof to lose any data whatsoever, and she has a lot of stuff on that machine.

The Conundrum
The simplest thing to do is copy her entire user directory to the Firewire drive. Except some files have really long names, and OS X takes a shit when you try to copy them to another volume. You can choose to skip those files, but the system doesn’t tell you which files are the offenders; how are you to find out which files weren’t copied without scouring through both places and looking at every single file? The horror.

The next-to-simplest thing to do is tar up the whole fuckin’ thing and copy the tarball to the Firewire. Except that tar takes a shit on the same files (those with very long names). It’s only slightly more informative, though, as the tar process actually gives you a list of files that it has problems with… but they’re not going to exist in the backup anyway, so you’re no better off.

If only there was a way to do a diff on two separate branches of a file system. Ideally, it would tell you what files were present in one branch, but missing in another, or if two identically-named files in respectively-identical locations had different file sizes. That way, you could rename and copy things around until both branches were exactly the same. Kind of like a manual rsync process, keeping you in the loop and informed, but without the hassle of setting up and configuring rsync for a one-off backup. Besides, I don’t know if rsync would similarly barf on files with long names.

A Little Help
As far as I know, there is no command-line utility or available application to find differences between two file structures (correct me if I’m wrong, please!). So, I wrote a quick and dirty Python script to help. It’ll only work on OS X, Linux or other *nix operating systems that have Python installed.

Basically, you specify a directory location, and it will walk down the rest of the file system recording all files, their file sizes, and subdirectories… traverse all subdirectories and their subdirectories recording file info, etc. until it goes as far downstream as the file system exists. It simply spits out a text file with the information, like this:

### /whatever

features.html >> 11989
index.html >> 11980
specs.html >> 12777

### /whatever/graphics

diagram.gif >> 17953
photo.jpg >> 21825

But how does this help? Simple. Specify your live branch and run it. Then specify your backup branch and run it again. You get two text files containing data representing both branches in full. Now, open up BBEdit and “Find Differences…” on both files… or use command-line diff to determine what’s not where, and if files have different sizes.

It’s not perfect, it could be a hell of a lot better, but it helps for now. Here’s the code.


3 thoughts on “Diff’ing Directory Structures

  1. Actually let me append to my snarky comment… Panther `diff` is GNU diff, which does everything you want it to do in 1 line, probably in less than 36 characters (if you have tab completion). Jag diff is probably an older BSD diff, I don’t remember, nor do I remember if older BSD diff does everything GNU diff does.

    You can do the same thing in 3 lines, with `find`, `md5` or your system equivilent, and old shitty `diff`, by doing

    find [path] -name “*” -exec md5sum > outfile1 \;
    # then again as above w/ 2nd path and outfile2
    diff outfile1 outfile2

    nice use of python, to be sure, but your hands will be saved by using less typing and 30+ years of unix tradition. (insert smiley here)

  2. You fucking rule. I wish you were at work the day I had the question as to HOW to do that. None of the people had any idea (regardless of their years of unix experience and tradition). That would have precluded the Python, not to mention a lot of extraneous typing incl. that huge post and, um, this comment.

Comments are closed.