Aymeric on Software

Because we needed another of these blogs...

Compute the Balance of a Bitcoin Wallet With node.js and blockchain.info

The Problem

As I have explained in great details in a previous post, the balance of a Bitcoin wallet differs from the balance that can be computed by looking at the blockchain. The reason is that practically every time you make a transaction, the Bitcoin software sends a fraction of the funds back to you, to a new address. Such addresses are invisible in the user interface, and for all practical purposes can be ignored by end users. These addresses are pre-allocated and constitute the key pool.

In this article I explain how you can compute the balance of a BitcoinQt wallet.dat file using node.js and the JSON API of blockchain.info. I will show how you can extract the public addresses contained in the wallet, whether it is encrypted or not, and demonstrate how to use the blockchain.info API efficiently.

Extracting the Keys

Bitcoin relies on public/private key cryptography. The public key is what is turned into a public address through hashing (detailed below) and the private key is the one used to sign the transactions. In order to compute the balance of a wallet, we need to extract all the public keys from the key pool.

The wallet.dat is a Berkley DB database. This is an old NoSQL embedded database from long before the term was coined. Unlike relational databases such as SQLite, it is not possible to make complicated queries in SQL. The wallet uses a single key/value dictionary to store all public and private keys, the last transactions, and all other account informations.

The keys from the key pool are stored directly into the database as individual records. The database keys are split into two parts. The first part of the DB key is some kind of identifier, that describes the type of record. Non encrypted Bitcoin keys are identified with key while encrypted keys are identified with ckey. The second part of the DB key is the Bitcoin public key itself, preceded by a single byte that indicates the length in bytes of the public key. The database record associated to the DB key is the private Bitcoin key, which may or may not be encrypted. This splitting of the DB key into two parts may sound odd, you have to remember that the DB keys must be unique to identify a record.

Here’s a drawing, in case you’re starting to be confused between the DB keys and the Bitcoin public and private keys.

You can see details about the key loarding algorithm in walletdb.cpp from the source code of Bitcoin. Look at the CWalletDB::ReadKeyValue() method.

I had no luck using node-bdb for reading the wallet with node.js, because cursors are not supported. Instead, I used a crude algorithm looking for the ckey and key strings within the wallet. This works because BerkeleyDB does not use any compression, but that’s not ideal.

Computing Bitcoin public addresses

Once you have obtained the Bitcoin public keys from the pool, you can turn them into a Bitcoin addresses through a complicated process of hashing, which is well documented.

However, I found a discrepancy with the documentation. The length of the keys I read from my wallet were all 33 bytes instead of the expected 65 bytes. Bitcoin uses ECDSA, a variant of elliptic curve cryptography. Both public and private keys are points on a specific elliptic curve (Secp256k1) whose equation is . The keys are the and coordinates stored as 32 bit integers preceded by one extra header byte that OpenSSL uses to store format information. Since and satisfy the known equation of the curve, you can store a single coordinate and a sign to compute the other coordinate with . This is what is called a compressed key, and it is 33 bytes long.

I initially thought that I would have to decompress the key, and convert it to the 65 bytes format in order to compute the address. But it turns out that Bitcoin hashes the OpenSSL ECDSA public key as it is. I suspect that means you could have multiple public addresses associated to the same public key: one public address for uncompressed 65 bytes key, and one public address for compressed 33 bytes keys…

You can simply hash the public keys with with the ripemd160 and sha256 algorithms as documented. The details of the hashing are strangely complicated, but easy to follow, and you end up with a 25 bytes binary address. This binary address is then converted into an ASCII string using the Base58 Check encoding. As explained in the source code and the docs, this encoding is used instead of the easier to calculate and more base64 encoding, to avoid confusion between similar characters like 0OIl.

Implementation with node.js is relatively straightforward using the crypto and bs58 modules

Computing the balance with Blockchain.info

You can use the JSON API from blockchain.info to compute the balance of each address in the key pool, and sum them up to obtain the final balance. These APIs are public and can be used without authentication, but they are rate limited. This can be a problem if you use naively the single address API. Instead you should use the multi address API or alternatively the unspent outputs API.

Source code

I wrote a quick and dirty sample using node.js and published it on Github.

HFS+ Bit Rot

HFS+ is a terribly old filesystem with serious flaws. I sincerely hope that Apple comes with an update of the filesystem for WWDC 2015. After all they have been working on Swift for the past 4 years and we have just learned about it last week.

HFS+ is seriously old

HFS+ was released in 1998, in the era of Mac OS Classic. It predates the current Unix based version of OS X by at least three years. It was created in a period where Apple’s business was in such a dire situation that Michael Dell’s uttered this now infamous quote “What would I do? I’d shut it down and give the money back to the shareholders”.

Technically HFS+ is small evolution of its predecessor HFS which dates back to 1985. The major change from HFS to HFS+ is the transition of block addresses from 16 bits to 32 bits. This change was really needed, as hard drives capacity exploded in the late 90s. On a 16 bit addressing scheme, a file containing a single byte would use 16KB on a 1GB hard drive, and around 16MB on a 1TB hard drive. The other changes included the transition to longer filenames (from 31 to 255 characters) and a switch to Unicode encoding.

By and large the rest of the design of HFS+ has remained unchanged since 1985.

HFS+ has serious limitations and flaws

When it was first released, HFS+ did not support hard links, journaling, extended attributes, hot files, and online defragmentation. These features were gradually added with subsequent releases of Mac OS X. But they are basically hacked to death, which leads to a complicated, slow and not so reliable implementation.

In the early days, the system had a hard limit to the number of files that could be written and deleted over the lifetime of the volume. It was 2,147,483,648 (i.e. 2\^31). After that, the volume would stop being able to add any more files or directories. On HFS+, every entry in the filesystem is associated to a CNID (Catalog Number ID). The early implementations used a simple a global counter nextCatalogID stored in a volume header, that could only be incremented by one until the maximum value was reached. More recent versions of Mac OS X can now recycle old unused CNIDs, but this gives you an idea of the types of considerations that went into the design of HFS+.

More recently Apple added support for full disk encryption with FileVault 2 and Fusion Drive. But these features are implemented in a layer underneath the file system, by Core Storage, a logical volume manager. Additional features like Snapshotting and Versioning would probably require much tighter integration to the file system… and they would also make TimeMachine extremely efficient and reliable. Currently TimeMachine is built on top of the file system and relies on capturing I/O events, which adds overhead and complexity. Another hack.

Finally there is Bit Rot. Over time data stored on spinning hard disks or SSDs degrade and become incorrect. Modern file systems like ZFS, which Apple considered but abandoned as a replacement, include checksums of all ~~meta data structures~~ content [1]. That means that when the file is accessed, the filesystem detects the corruption and throws an error. This prevents incorrect data from propagating to backups. With ZFS, you can also scrub your disk on a regular basis and verify if existing files have been corrupted preemptively.

A concrete example of Bit Rot

I have a large collection of photos, which starts around 2006. Most of these files have been kept on HFS+ volumes since their existence.

In addition to TimeMachine backups, I also use two other backup solutions that I described in a previous blog post. I keep a copy of the photos on a Linux microserver using ext3, which I checksum and verify regularly using snapraid. I also keep off-site backups using ARQ and Amazon Glaciers.

Before I acquired the Linux Microserver, I used to keep a copy of all my photos on a Dreamhost account. I recently compared these photos against their current versions on the iMac and was a bit shocked by the results.

The photos were taken between 2006 and 2011, most of them after 2008. There are 15264 files, which represent a total of 105 GiB. 70% of these photos are CR2 raw files from my old EOS 350D camera. The other photos are regular JPEGs which come from the cameras of friends and relatives.

HFS+ lost a total of 28 files over the course of 6 years.

Most of the corrupted files are completely unreadable. The JPEGs typically decode partially, up to the point of failure. So if you’re lucky, you may get most of the image except the bottom part. The raw .CR2 files usually turn out to be totally unreadable: either completely black or having a large color overlay on significant portions of the photo. Most of these shots are not so important, but a handful of them are. One of the CR2 files in particular, is a very good picture of my son when he was a baby. I printed and framed that photo, so I am glad that I did not lose the original.

If you’re keeping all your files and backups on HFS+ volumes, you’re doing it wrong.

How to check for file corruptions

I used the following technique to compare the photos from the Dreamhost backup against my main HFS+ volume.

I ran the shasum commmand line tool to compute SHA1 hashes of every single file in the backup folder, except .DS_Store files. Then, I ran shasum in verify mode to check the files on my main volume against the hashes. Differences either indicate voluntary modifications (which did not apply in my case), or corruptions courtesy of HFS+ (which was my case).

# Compute checksums
find . -type f -a ! -name ".DS_Store" -exec  shasum '{}' \; > shasums.txt
# Check against checksums
shasum -c < shasums.txt  > check.txt
# Filter out differences
cat check.txt | fgrep -v OK

You can use the same technique to check for corruptions on a single volume. You need to compute checksums and verify against them from time to time. If you use clone backups, it is probably a good idea to check for corruptions before doing the clone.

Addendum - June 11th, 2014

Thanks to everyone who spent the time to send feedback. There are a few things I would like to add:

  • [1] Erratum. ZFS uses checksums for everything, not just the meta-data.
  • I understand the corruptions were caused by hardware issues. My complain is that the lack of checksums in HFS+ makes it a silent error when a corrupted file is accessed.
  • This not an issue specific to HFS+. Most filesystems do not include checksums either. Sadly…
  • Other people have written articles on similar topics. Jim Salter and John Siracusa for Ars Technica in particular.