Aymeric on Software

Because we needed another of these blogs...

HFS+ Bit Rot

HFS+ is a terribly old filesystem with serious flaws. I sincerely hope that Apple comes with an update of the filesystem for WWDC 2015. After all they have been working on Swift for the past 4 years and we have just learned about it last week.

HFS+ is seriously old

HFS+ was released in 1998, in the era of Mac OS Classic. It predates the current Unix based version of OS X by at least three years. It was created in a period where Apple’s business was in such a dire situation that Michael Dell’s uttered this now infamous quote “What would I do? I’d shut it down and give the money back to the shareholders”.

Technically HFS+ is small evolution of its predecessor HFS which dates back to 1985. The major change from HFS to HFS+ is the transition of block addresses from 16 bits to 32 bits. This change was really needed, as hard drives capacity exploded in the late 90s. On a 16 bit addressing scheme, a file containing a single byte would use 16KB on a 1GB hard drive, and around 16MB on a 1TB hard drive. The other changes included the transition to longer filenames (from 31 to 255 characters) and a switch to Unicode encoding.

By and large the rest of the design of HFS+ has remained unchanged since 1985.

HFS+ has serious limitations and flaws

When it was first released, HFS+ did not support hard links, journaling, extended attributes, hot files, and online defragmentation. These features were gradually added with subsequent releases of Mac OS X. But they are basically hacked to death, which leads to a complicated, slow and not so reliable implementation.

In the early days, the system had a hard limit to the number of files that could be written and deleted over the lifetime of the volume. It was 2,147,483,648 (i.e. 2\^31). After that, the volume would stop being able to add any more files or directories. On HFS+, every entry in the filesystem is associated to a CNID (Catalog Number ID). The early implementations used a simple a global counter nextCatalogID stored in a volume header, that could only be incremented by one until the maximum value was reached. More recent versions of Mac OS X can now recycle old unused CNIDs, but this gives you an idea of the types of considerations that went into the design of HFS+.

More recently Apple added support for full disk encryption with FileVault 2 and Fusion Drive. But these features are implemented in a layer underneath the file system, by Core Storage, a logical volume manager. Additional features like Snapshotting and Versioning would probably require much tighter integration to the file system… and they would also make TimeMachine extremely efficient and reliable. Currently TimeMachine is built on top of the file system and relies on capturing I/O events, which adds overhead and complexity. Another hack.

Finally there is Bit Rot. Over time data stored on spinning hard disks or SSDs degrade and become incorrect. Modern file systems like ZFS, which Apple considered but abandoned as a replacement, include checksums of all ~~meta data structures~~ content [1]. That means that when the file is accessed, the filesystem detects the corruption and throws an error. This prevents incorrect data from propagating to backups. With ZFS, you can also scrub your disk on a regular basis and verify if existing files have been corrupted preemptively.

A concrete example of Bit Rot

I have a large collection of photos, which starts around 2006. Most of these files have been kept on HFS+ volumes since their existence.

In addition to TimeMachine backups, I also use two other backup solutions that I described in a previous blog post. I keep a copy of the photos on a Linux microserver using ext3, which I checksum and verify regularly using snapraid. I also keep off-site backups using ARQ and Amazon Glaciers.

Before I acquired the Linux Microserver, I used to keep a copy of all my photos on a Dreamhost account. I recently compared these photos against their current versions on the iMac and was a bit shocked by the results.

The photos were taken between 2006 and 2011, most of them after 2008. There are 15264 files, which represent a total of 105 GiB. 70% of these photos are CR2 raw files from my old EOS 350D camera. The other photos are regular JPEGs which come from the cameras of friends and relatives.

HFS+ lost a total of 28 files over the course of 6 years.

Most of the corrupted files are completely unreadable. The JPEGs typically decode partially, up to the point of failure. So if you’re lucky, you may get most of the image except the bottom part. The raw .CR2 files usually turn out to be totally unreadable: either completely black or having a large color overlay on significant portions of the photo. Most of these shots are not so important, but a handful of them are. One of the CR2 files in particular, is a very good picture of my son when he was a baby. I printed and framed that photo, so I am glad that I did not lose the original.

If you’re keeping all your files and backups on HFS+ volumes, you’re doing it wrong.

How to check for file corruptions

I used the following technique to compare the photos from the Dreamhost backup against my main HFS+ volume.

I ran the shasum commmand line tool to compute SHA1 hashes of every single file in the backup folder, except .DS_Store files. Then, I ran shasum in verify mode to check the files on my main volume against the hashes. Differences either indicate voluntary modifications (which did not apply in my case), or corruptions courtesy of HFS+ (which was my case).

# Compute checksums
find . -type f -a ! -name ".DS_Store" -exec  shasum '{}' \; > shasums.txt
# Check against checksums
shasum -c < shasums.txt  > check.txt
# Filter out differences
cat check.txt | fgrep -v OK

You can use the same technique to check for corruptions on a single volume. You need to compute checksums and verify against them from time to time. If you use clone backups, it is probably a good idea to check for corruptions before doing the clone.

Addendum - June 11th, 2014

Thanks to everyone who spent the time to send feedback. There are a few things I would like to add:

  • [1] Erratum. ZFS uses checksums for everything, not just the meta-data.
  • I understand the corruptions were caused by hardware issues. My complain is that the lack of checksums in HFS+ makes it a silent error when a corrupted file is accessed.
  • This not an issue specific to HFS+. Most filesystems do not include checksums either. Sadly…
  • Other people have written articles on similar topics. Jim Salter and John Siracusa for Ars Technica in particular.

Why the Blockchain and the Bitcoin Wallet Balances Differ

If you look at a website like blockchain.info or blockexplorer.com, you may notice it is possible to find out the details about a particular bitcoin address, such as the last transactions and of course the balance.

If you try this on a Bitcoin address that belongs to you, and fire up the Bitcoin Qt client (aka Bitcoin Core), you may have noticed a discrepancy. It’s very likely for the balance displayed on the website to be less than the one displayed by the software wallet.

The discrepancy is caused by the nature of bitcoin. Instead of storing actual coins, the bitcoin protocol should be seen as a distributed public database of transactions which together form the blockchain. You “receive” bitcoins when another party decides to use their private key to sign a transaction and send some amount of bitcoins to your public address. Bitcoins only exist in the sense that you can trace the chain of valid transactions until you reach special coinbase transaction, i.e. some mined bitcoins. You can almost think of all the transactions forming a singly linked list, that stops at one end with mined bitcoins, and on the other end with unspent bitcoins… except for the fact that each transaction can have multiple inputs or outputs. (Please keep in mind this is voluntarily simplified, if you wish to know more check the protocol documentation)

One of the quirks of the protocol, is that the amount of the inputs and outputs in the transaction must match (in reality, the output can be less than the input, and the remainder then constitutes the optional transaction fee). That rule greatly simplifies the validation of transactions, since there is no need to extract the entire history of transactions to figure out how much funds are spent or unspent for a given transaction: it’s either all or nothing.

The drawback of this solution arises when you need to spend only a fraction of the amount received in a previous transaction. In that case, the wallet software automatically creates two outputs to the transaction: one output is used to send money to the intended recipient, one output is used to send the remainder to the sender.

At this stage, it’s probably simpler to reason with an example. Let’s imagine Alice wants to sent 1.2 BTC to Bob. Alice previously received 1 BTC from Chip and 0.5 BTC from Dale. The new transaction she makes has to reference both previous unspent transactions as inputs, since neither of these transactions taken individually have enough funds. One of the outputs of the transaction must be the 1.2 BTC that are sent to Bob. But Alice also need to add a 0.3 BTC output that are sent back to herself. In the future, if she could use these 0.3 BTC coins that remain in her wallet, by referencing this 0.3 BTC output as an input for a new transaction.

It would be possible to use the same public address to send the money back to the sender, but the Bitcoin Qt software sends it to a new address instead, for privacy reasons. A bitcoin wallet contains at least a hundred of such addresses which constitute the key pool. The key pool is pre-allocated (therefore many addresses will have a balance of zero) so that slightly out of date backups of wallet files result in no loss of bitcoins. Every time a transaction that requires return funds is made, these returned funds seem to “disappear” from the balance of the wallet’s public address. It’s possible to reach a balance of zero on your public address in that way.