CHAPTER 5
Binary Encodings
Binary encodings convert byte sequences into displayable, printable sequences (usually some subset of ASCII). They are very useful when you need to store binary data in systems or protocols that do not offer native binary storage support, but do support text or ASCII-subset storage (for example, browser cookies). It is important to understand the most common binary encoding schemes and the trade-offs involved in choosing which one to use and when. The Base64, Base32, and Base16 encodings we discuss next are covered in RFC 4648.
Base64 encoding converts an 8-bit sequence into a 6-bit sequence, where each output character comes from a range of 2⁶ = 64 different values. The least common multiple (LCM) of 8 and 6 is 24, so 24 input bits, represented by 3 bytes, can be Base64-encoded into four 6-bit values. Therefore, Base64 encoding has a 4/3 ≈ 1.33x length bloat factor compared to the 1x length of the source. The 64 output ASCII values are composed of 26 lowercase letters, 26 uppercase letters, 10 digits, and two additional characters, / and +. A distinct padding character, = (equal sign), is used at the end to indicate that the source sequence was one byte short of a multiple of 3 (a single = at the end) or two bytes short of a multiple of 3 (== at the end). Each encoded four-character block has a minimum of two non-pad characters, since a single 8-bit byte takes at least two 6-bit characters to encode.
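To make the bit regrouping concrete, here is a minimal sketch of the 3-bytes-to-four-characters mapping (our illustration of the mechanics, not the framework's implementation):

const string alphabet =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
byte b0 = 0x4D, b1 = 0x61, b2 = 0x6E;       // the ASCII bytes of "Man"
int bits = (b0 << 16) | (b1 << 8) | b2;     // 24 bits = one Base64 group
new string(new[] {
	alphabet[(bits >> 18) & 0x3F],          // top 6 bits of b0
	alphabet[(bits >> 12) & 0x3F],          // low 2 bits of b0 + top 4 of b1
	alphabet[(bits >>  6) & 0x3F],          // low 4 bits of b1 + top 2 of b2
	alphabet[ bits        & 0x3F] }).Dump();// low 6 bits of b2 -> "TWFu"
Convert.ToBase64String(new byte[] { b0, b1, b2 }).Dump(); // "TWFu" as well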
Base64 encoding is extremely popular due to its small bloat factor and built-in availability in most libraries and frameworks. The .NET Framework has the Convert.ToBase64String() and Convert.FromBase64String() methods. Note that the /, +, and = characters in default Base64 often do not play nice when embedded into other text protocols in which these characters have special meaning (for example, in URLs and file paths). To address these scenarios, .NET also has a safe-alphabet Base64 implementation, UrlTokenEncode() and UrlTokenDecode() in System.Web.HttpServerUtility, which replaces / and + with _ and -, and always appends a single-digit pad count (either 0, 1, or 2) instead of the = pad.
Code Listing 10
var a = new byte[] { 0 };
var b = new byte[] { 0, 0 };
var c = new byte[] { 0, 0, 0 };

Convert.ToBase64String(a).Dump();                      // "AA=="
System.Web.HttpServerUtility.UrlTokenEncode(a).Dump(); // "AA2"
Convert.ToBase64String(b).Dump();                      // "AAA="
System.Web.HttpServerUtility.UrlTokenEncode(b).Dump(); // "AAA1"
Convert.ToBase64String(c).Dump();                      // "AAAA"
System.Web.HttpServerUtility.UrlTokenEncode(c).Dump(); // "AAAA0"
As you can see, safe Base64 can be longer than, equal to, or shorter than the equivalent default Base64 encoding. The .NET implementation of safe Base64 (the UrlTokenEncode/Decode methods) calls its own default Base64 (the Convert methods) and then does two more loops: one to replace the two special characters with their safe equivalents, and one to replace the variable-length = pad with a fixed-length digit pad. This unoptimized implementation makes UrlTokenEncode/Decode needlessly slower than the default Base64, which is something to be aware of in extreme edge cases where micro-optimizations are your last resort. It is definitely not a reason to roll your own "optimized" version without first measuring to obtain concrete evidence that this is your performance bottleneck. Despite the small overhead, the UrlTokenEncode/Decode methods should be preferred to the equivalent Convert methods because they integrate better into other text protocols. If you prefer to avoid taking a dependency on System.Web.dll, we suggest using the ToB64Url()/FromB64Url() methods from the Inferno library (code sample).
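If you want to avoid both System.Web.dll and an external dependency, the idea behind safe Base64 is easy to sketch. The B64UrlEncode/B64UrlDecode names below are ours, and this variant drops the pad entirely instead of appending a digit count, so it is not wire-compatible with UrlTokenEncode:

// Encode: default Base64, then substitute the URL-hostile characters and drop '=' pads.
static string B64UrlEncode(byte[] data)
{
	return Convert.ToBase64String(data)
		.TrimEnd('=')       // pad is redundant: it is implied by the length
		.Replace('+', '-')
		.Replace('/', '_');
}

// Decode: undo the substitutions, restore the pad to a multiple-of-4 length, decode.
static byte[] B64UrlDecode(string s)
{
	string b64 = s.Replace('-', '+').Replace('_', '/');
	switch (b64.Length % 4)
	{
		case 2: b64 += "=="; break; // source was 1 byte past a multiple of 3
		case 3: b64 += "=";  break; // source was 2 bytes past a multiple of 3
	}
	return Convert.FromBase64String(b64);
}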
Base32 encoding converts an 8-bit sequence into a 5-bit sequence, where each output character comes from a range of 2⁵ = 32 different values. The LCM of 8 and 5 is 40, so 40 input bits, represented by 5 bytes, can be Base32-encoded into eight 5-bit values. Therefore, Base32 encoding has an 8/5 = 1.6x bloat factor. The 32 output ASCII values are composed of 26 same-case letters (either lower or upper) and six digits between 2 and 7. Many other Base32 alphabets are possible, however (with different features). The same = pad is used as necessary (repeated up to six times) to indicate source sequences whose length is not a multiple of 5 bytes.
Base32 encoding seems very unattractive due to its 60 percent bloat factor (almost double the Base64 bloat), which raises the question of why anyone would use Base32 encoding instead of Base64. One Base32 advantage is that, unlike Base64, it’s not case sensitive, and thus is easier for human beings to type. Another consequence of case insensitivity is that Base32 encoding can survive when embedded in case-insensitive protocols or other scenarios where case conversion might occur outside of your control.
One example is ASP.NET SessionID, which is a 15-byte sequence that Microsoft Base32-encodes into 24 characters (15 bytes = 120 bits, which is exactly 24 five-bit characters with no padding; e.g., fes21mw1yitkbkqgkxek0zhp). We speculate that Microsoft chose Base32 to encode SessionID because it had to support cookieless sessions, which required passing SessionID within the URL, which in turn required case insensitivity (various proxies and web servers can uppercase or lowercase a URL, which is outside of Microsoft's control). Microsoft probably also wanted to avoid Base32 padding, which adds quite a lot of noise and bloat, so its SessionID length choices were limited to multiples of 5 bytes. 5-byte (40-bit) and 10-byte (80-bit) sequences were too short, but 15 bytes (120 bits) gave adequate security against brute-force guessing. A longer, 20-byte (160-bit) or 25-byte (200-bit) Base32-encoded SessionID might be a slightly better choice today, but Microsoft is unlikely to change its defaults, because a longer SessionID might break existing applications due to URL length limitations in various systems and other incompatibilities triggered by longer SessionIDs.
Another example of Base32 relevancy is encoding short binary sequences that are intended to be typed or re-typed by human beings. Suppose that instead of relying on your users to come up with a sufficiently high-entropy password, you want to generate a high-entropy, 128-bit (16-byte) password for each user, which you have to binary-encode somehow for display. You want to avoid Base64 so that your users do not have to struggle with the Shift and Caps Lock keys when typing. You could use Base32, but there are some issues you need to address first. Base32 works best on multiple-of-5-byte sequences, so you could either settle for a 15-byte key (120 bits) or go for a longer, 20-byte key (160 bits). Alternatively, you could go for a 16-byte key like you wanted and remove the six = pad characters from the end. This would create a 26-character, Base32-encoded key that your users can re-type, but it would also force you to assume a fixed key length (you have to add the missing pad characters back yourself if you want to decode Base32 into a byte sequence). This approach is similar to what Mozilla Sync does.
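Restoring the stripped pad before decoding is a one-liner, since the padded length is just the encoded length rounded up to a multiple of 8. A sketch of the repadding step (the decode itself still needs a Base32 decoder, such as Inferno's):

// A 26-character key covers three full 8-character blocks plus 2 characters,
// so rounding up to 32 appends the six missing '=' pad characters.
Func<string, string> repad = s => s.PadRight((s.Length + 7) / 8 * 8, '=');
repad(new string('a', 26)).Length.Dump(); // 32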
Another important Base32 consideration is what alphabet to use for encoding. We prefer the following:
Code Listing 11
static readonly char[] base32table =
{
	// 'i', 'l', and 'o' are omitted to avoid confusion.
	// Sort order is maintained.
	'1','2','3','4','5','6','7','8','9','a','b','c','d','e','f','g',
	'h','j','k','m','n','p','q','r','s','t','u','v','w','x','y','z'
};
This alphabet maintains the source sequence sort order within the output sort order, and it also removes the 0, o, i, and l characters, which can be easily misread. 1 is usually distinct and clear enough in most fonts not to be mistaken for l (lowercase L) or I (capital i).
The .NET Framework has no public APIs for Base32 encoding or decoding, even though Microsoft uses Base32 encoding internally to encode SessionID. You can find all encode and decode implementations we discuss in the Inferno crypto library (code sample).
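For illustration, here is a minimal pad-free Base32 encoder over the Code Listing 11 alphabet (a sketch of ours, not the Inferno implementation; it assumes the base32table from Code Listing 11 is in scope):

static string Base32Encode(byte[] data) // pad-free: the decoder must know the length
{
	var sb = new System.Text.StringBuilder((data.Length * 8 + 4) / 5);
	int buffer = 0, bitCount = 0;
	foreach (byte b in data)
	{
		buffer = (buffer << 8) | b;  // shift in 8 more bits
		bitCount += 8;
		while (bitCount >= 5)        // drain complete 5-bit groups
		{
			bitCount -= 5;
			sb.Append(base32table[(buffer >> bitCount) & 31]);
		}
	}
	if (bitCount > 0)                // left-align the final partial group
		sb.Append(base32table[(buffer << (5 - bitCount)) & 31]);
	return sb.ToString();
}

// Base32Encode(new byte[16]).Length is 26: the pad-free key length discussed above.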
Base16 encoding converts 8-bit sequences into 4-bit sequences, where each output character comes from a range of 2⁴ = 16 different values. The LCM of 8 and 4 is 8, so 8 input bits represented by a single byte can be Base16-encoded into two 4-bit values. Therefore, Base16 encoding has a 2/1 = 2x bloat factor (the worst so far).
A hexadecimal (hex) encoding, which is likely very familiar to you, is a particular alphabet of Base16 encoding. Hex encoding uses 10 digits (0–9) and six letters (A–F or a–f); since either letter case decodes to the same values, hex is case insensitive.
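In the .NET Framework, a serviceable hex encode/decode pair can be built from BitConverter and Convert.ToByte; a minimal sketch (the HexEncode/HexDecode names are ours):

// Encode: BitConverter produces "4A-0B-..."; stripping the dashes leaves plain hex.
static string HexEncode(byte[] data)
{
	return BitConverter.ToString(data).Replace("-", "");
}

// Decode: parse each 2-character group as a base-16 byte.
// Convert.ToByte accepts uppercase and lowercase, so decoding is case-insensitive.
static byte[] HexDecode(string hex)
{
	if (hex.Length % 2 != 0)
		throw new ArgumentException("Hex string must have even length.");
	var bytes = new byte[hex.Length / 2];
	for (int i = 0; i < bytes.Length; ++i)
		bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
	return bytes;
}

// HexEncode(new byte[] { 0xDE, 0xAD }) returns "DEAD"; HexDecode("dead") round-trips it.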
There are other Base16 encodings, such as modified hexadecimal (ModHex) encoding, which uses a different 16-character alphabet (CBDEFGHIJKLNRTUV instead of 0123456789ABCDEF) for keyboard layout independence.
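Because every Base16 variant maps the same sixteen 4-bit values, converting hex to ModHex is a simple character substitution at matching alphabet indexes; a quick sketch:

const string hexAlphabet    = "0123456789ABCDEF";
const string modHexAlphabet = "CBDEFGHIJKLNRTUV";
// Substitute each uppercase-hex character with the ModHex character at the same index.
var sb = new System.Text.StringBuilder();
foreach (char c in "DEAD")
	sb.Append(modHexAlphabet[hexAlphabet.IndexOf(c)]);
sb.ToString().Dump(); // "TULT"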
Despite its 2x bloat factor, hex encoding is probably the most popular binary encoding used. You can find it anywhere, from databases (blob display) to CSS stylesheets (hex colors). Hex benefits include padless conversion, case insensitivity, and to/from conversion logic that is so simple it can easily be done mentally. Microsoft’s usage of hex encoding within the .NET Framework includes storing arbitrary-length binary sequences such as encryption and validation keys within .NET configuration files (a good encoding choice) and encoding ASP.NET FormsAuthentication cookies (a not-so-good choice, as we will discuss later).