Friday, July 31, 2009

Time Machine and Mail: a match made in hell

I've just spent the last two days debugging a problem I was having with Time Machine backups being freakishly slow (20 minutes to back up 20 MB). There's still some weird stuff going on, but along the way I discovered a problem that nearly everyone who uses a Mac will almost certainly encounter sooner or later, yet I haven't found a single write-up on it. So here goes:

Time Machine is pretty smart about only backing up things that have changed since the last backup, and storing things on the backup drive in such a way that it *appears* that you have a ton of complete (not incremental) backups without actually duplicating files that haven't changed. It does this by using two things: an OS X feature called FSEvents, which allows Time Machine to find changed files without having to scan the entire file system, and hard links, which allow Time Machine to store multiple backups efficiently by letting several backups point to the same file on disk.
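If hard links are new to you, here is a minimal Python sketch of the idea. It has nothing to do with Time Machine's actual code, and the file names are made up for illustration. It just shows that two directory entries can refer to the same data on disk without a second copy being made.

```python
import os
import tempfile


def hard_link_demo():
    """Create a file, hard-link it under a second name, and compare the two."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical names standing in for the same file in two backups.
        first = os.path.join(root, "backup1_message.txt")
        second = os.path.join(root, "backup2_message.txt")

        with open(first, "w") as f:
            f.write("hello")

        # os.link creates a second directory entry for the same file data.
        os.link(first, second)

        stat_first = os.stat(first)
        stat_second = os.stat(second)

        # Same inode means both names point at one file on disk;
        # st_nlink counts how many names that file has.
        same_file = stat_first.st_ino == stat_second.st_ino
        return same_file, stat_first.st_nlink
```

Calling `hard_link_demo()` returns whether both names share one inode, plus the link count of the file, which is 2 once the second name exists.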

This combination of features has some unintended consequences, most notably if you use a mail program like Entourage, which stores all its messages in one big monolithic file. The contents of this file change every time you get a new mail message. So every time you get a new mail message, Time Machine has to back up *all* your mail, since it doesn't know how to look inside an Entourage mail file to find just the parts that have changed.

To avoid this situation, Apple has been recommending to developers that they store data in lots of little files instead of one big file, and they followed this recommendation in their own Mail application, which stores every message in a separate file. Unfortunately, it turns out this doesn't help matters at all, and in fact seems to make the situation worse.

The problem is this: imagine you have a folder with a lot of mail messages in it. A new message arrives, which gets stored in a new file. Time Machine knows that only this one file has changed, and so only this one file has to be copied. Unfortunately, the folder itself has now changed, so to make the new backup appear complete, Time Machine has to create a hard link for every other file in that folder, even though it doesn't have to *copy* any of them. And for small files, which most mail messages are unless they have a lot of attachments, creating a hard link is no faster than actually copying the file.
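To make the cost concrete, here is a toy model in Python of backing up one changed folder. This is only a sketch of the behavior described above, not Time Machine's real implementation: the function name `snapshot_folder` and its arguments are hypothetical, it handles a single flat folder of files, and it compares file contents directly instead of using FSEvents.

```python
import filecmp
import os
import shutil


def snapshot_folder(source, previous, new):
    """Build `new` so it looks like a complete copy of `source`.

    Files unchanged since the `previous` snapshot are hard-linked,
    everything else is copied. Returns (linked, copied) counts.
    """
    os.makedirs(new)
    linked = 0
    copied = 0
    for name in sorted(os.listdir(source)):
        src = os.path.join(source, name)
        old = os.path.join(previous, name)
        dst = os.path.join(new, name)
        if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
            # Unchanged: no data is copied, but a link still has to be made.
            os.link(old, dst)
            linked += 1
        else:
            # New or modified: copy the file into the new snapshot.
            shutil.copy2(src, dst)
            copied += 1
    return linked, copied
```

In this model, a folder holding 5 old messages and 1 new one gives `(5, 1)` on the second snapshot: one copy, but five link operations. The number of links grows with the size of the folder, not with the amount of new mail.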

So what happens over time is that your inbox grows and grows. Every time Time Machine runs you've almost certainly gotten at least one new mail message. So every time it has to create new hard links for every one of those zillion messages. So gradually Time Machine will get slower and sslloowweerr and sssssllllllloooooowwwweeeeerrrr, until one day you suddenly realize that it's pretty frickin' slow.

Fortunately, it's pretty easy to fix this problem: just create an archive folder, and periodically dump your old mail in there. As long as the archive folder doesn't change, it doesn't get copied at all. You should also do this with your sent mail folder, and your junk mail folder if you don't have it set up to automatically delete messages. If you're using SpamSieve, this can be a little tricky. This is actually what happened to me. I had set up SpamSieve according to the default instructions, which send spam to a folder that does not get automatically purged. Over time I picked up an obscene amount of spam, all of which was in one folder, none of which was being deleted, and which was constantly being added to. Time Machine was freaking out.
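If you're not sure which folders are the culprits, a quick file count can point you at them. The sketch below is just one way to do that: the function name `largest_folders` is made up, and pointing it at `~/Library/Mail` assumes that is where your mail client keeps its per-message files. It only reads the directory tree and never moves or deletes anything.

```python
import os


def largest_folders(root, top=10):
    """Return up to `top` (file_count, path) pairs, biggest folders first."""
    counts = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Count only the files directly inside this folder.
        counts.append((len(filenames), dirpath))
    counts.sort(reverse=True)
    return counts[:top]


if __name__ == "__main__":
    # Assumed location of Mail's message store; adjust for your setup.
    mail_root = os.path.expanduser("~/Library/Mail")
    for file_count, path in largest_folders(mail_root):
        print(file_count, path)
```

Folders near the top of the list that also receive new messages regularly are the ones worth archiving or purging first.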

This is not a hard problem to avoid once you know what's going on, but it does seem to be one of the best-kept secrets of OS X.

2 comments:

zeckalpha@gmail.com said...

Or use online mail.

If that bothers you (security issues? confidentiality?), then set up a mail server and use bongo. That can be backed up, too, though setup is not as easy as on your Mac.

Curt Sampson said...

Oof. You'd think they'd just do the thing right, keep diffs of changes (using the rsync algorithm or similar), and have a special filesystem that gives a view of the backup.