CHAPTER 6
Text encodings build on the ISO Universal Character Set (UCS), which assigns every character in every language a unique name and an integer number called its code point. The current Unicode 8.0 standard (an extension of UCS) has 1,114,112 code points, of which 260,319 are assigned and 853,793 are unassigned. These code point integers must somehow be encoded into byte sequences, and there is more than one way to do it. Standardized code-points-to-bytes encodings are called UCS Transformation Formats (UTF); you likely know them as UTF-32, UTF-16, UTF-8, and UTF-7. The minimal bit combination that can represent a single character under a particular UTF encoding is called a code unit.
The entire million-plus Unicode code point range fits in about 21 bits, which means that every code point can be comfortably encoded in 32 bits; that is exactly what UTF-32 does. UTF-32 is a fixed-width encoding that uses 4-byte code units, one code unit per character. For example, “hello” is encoded in 20 bytes under UTF-32. The main advantage of UTF-32 is that its fixed width makes it easily indexable (random access) and searchable, but it is very space inefficient. The .NET Framework provides Encoding.UTF32 in System.Text.
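A quick way to verify this in .NET (a minimal sketch using top-level statements):

```csharp
using System;
using System.Text;

// 5 characters × 4-byte code units = 20 bytes (GetBytes emits no BOM).
byte[] bytes = Encoding.UTF32.GetBytes("hello");
Console.WriteLine(bytes.Length); // 20
```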
UTF-16 encoding uses 2-byte code units. Because a single 16-bit code unit cannot represent every Unicode code point, some characters require two code units (a surrogate pair), which makes UTF-16 a variable-width encoding. .NET strings and JavaScript strings are UTF-16-encoded for in-memory representation. The Windows OS APIs mostly use UTF-16 for string representation as well, which allows clean interop from .NET.
The original Unicode proposal (back in the 90s) was grounded in the assumption that every code point can be encoded in 16 bits (Unicode originally called for a 16-bit fixed-width encoding). UTF‑16 was created to deal with over-16-bit code points, which also made it variable width. There are also two flavors of UTF-16 due to endianness—UTF-16-BE and UTF-16-LE—and an optional byte order mark (BOM) to indicate which UTF-16 flavor is being used. The same two-flavor endianness/BOM concept applies to UTF-32 as well.
As long as the complexity of UTF-16 representation and processing is taken care of for you by the underlying framework (OS/.NET Framework/JavaScript engine, etc.), you have little to worry about. However, the moment you need to take UTF-16 strings out of the safety of your framework and into some other storage (for example, to write them into a file), UTF-16 complexity will impose challenges you could avoid with a different UTF encoding. The .NET Framework confusingly calls its UTF‑16 implementation “Unicode” and provides Encoding.Unicode (UTF‑16‑LE) and Encoding.BigEndianUnicode (UTF-16-BE) in System.Text.
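The endianness difference is easy to see in .NET (a minimal sketch using top-level statements; the BitConverter formatting is ours):

```csharp
using System;
using System.Text;

// "A" (U+0041) under the two UTF-16 flavors: the byte order flips.
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("A")));          // 41-00 (LE)
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("A"))); // 00-41 (BE)

// The optional BOM each flavor would prepend to a file:
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
```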
UTF-8 encoding uses single-byte code units and is a variable-width encoding. UTF-8 code unit values are backward-compatible with ASCII codes, but it can take up to four 8-bit code units to represent some Unicode code points. You might wonder why three 8-bit code units would not suffice, since 3 × 8 = 24 bits can cover the ~21-bit Unicode code point space. The reason is that not all 8 bits within a variable-width code unit carry code point data; some bits are used for metadata, such as whether a code unit is a continuation of a multi-unit sequence. UTF-8 does not suffer from endian ambiguities, and it is more compact than the other encodings for ASCII-dominant text, but less compact for Asian-character-dominant text. The .NET Framework provides Encoding.UTF8 in System.Text.
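The variable width is easy to observe (a minimal sketch using top-level statements; the sample characters are ours, chosen to cover one through four UTF-8 bytes):

```csharp
using System;
using System.Text;

// UTF-8 byte counts grow with the code point value; UTF-16 and UTF-32 shown for comparison.
foreach (string s in new[] { "A", "é", "日", "😀" })
    Console.WriteLine($"{s}: UTF-8={Encoding.UTF8.GetByteCount(s)}, " +
                      $"UTF-16={Encoding.Unicode.GetByteCount(s)}, " +
                      $"UTF-32={Encoding.UTF32.GetByteCount(s)}");
// A: UTF-8=1, UTF-16=2, UTF-32=4
// é: UTF-8=2, UTF-16=2, UTF-32=4
// 日: UTF-8=3, UTF-16=2, UTF-32=4
// 😀: UTF-8=4, UTF-16=4, UTF-32=4
```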
UTF-7 is not part of the Unicode standard. It was proposed as a way to encode Unicode text using only the 128 ASCII characters, more efficiently than combining UTF-8 with Base64 or quoted-printable (QP). UTF-7 allows multiple encodings of the same source string, which can lead to various security vulnerabilities and attacks; do not use it, even though the .NET Framework supports it.
Table 4: UTF comparison
| | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Endianness/BOM/cross-platform problems | no | yes | yes |
| Fixed-width benefits | no | no | yes |
| Fewest bytes per char | 1 | 2 | 4 |
| Most bytes per char | 4 | 4 | 4 |
| ASCII compatibility | yes | no | no |
| Lexicographical order | yes | no | yes |
| Bytes per ASCII char | 1 | 2 | 4 |
| Bytes per Asian char | 3 | 2 | 4 |
| Bytes per East-European & Middle-Eastern char | 2 | 2 | 4 |
| Framework in-memory representation | not common | common | not common |
| Web & Internet friendly (machine-to-machine) | yes | no | no |
Unless you have specific requirements for UTF-16 or UTF-32, we recommend UTF-8 for storage as your default choice. Reduction of complexity is more important for security than potential storage savings for Asian alphabets. Text processing is probably already taken care of by the framework you use, but if you need to do your own byte-level processing, you are likely to make fewer mistakes by converting to UTF-8 first.
The .NET Framework exposes public static UTF Encoding.* properties in System.Text (Encoding.UTF8, Encoding.Unicode, Encoding.UTF32, and so on), which are thread-safe and widely used to convert a byte sequence into a particular UTF text representation. Here is a typical example:
Code Listing 12
```csharp
var encoding = Encoding.UTF8;
for (int i = 0; i < 1000; ++i)
{
    // Random bytes are very unlikely to form valid UTF-8.
    byte[] bytes = Guid.NewGuid().ToByteArray();
    encoding.GetString(bytes).Dump(); // .Dump() is a LINQPad extension
}
```
Not every byte combination forms a well-formed code unit sequence, and not every well-formed code unit sequence maps to a valid code point. When a valid mapping is not possible, the .NET UTF implementations apply one of the fallback strategies: a “replacement” fallback (substitute a replacement string, the replacement character � by default), an “exception” fallback (throw), or a custom fallback with your own mapping logic. The public static Encoding.* UTF properties all use the replacement fallback, which means that they never throw and silently produce � on mapping failures. The code in Code Listing 12 never throws, which is precisely the problem, because silent � substitutions can lead to various security vulnerabilities and attack vectors. Microsoft’s own internal CryptoUtil helper class uses a custom UTF-8 instance constructed as follows:
Code Listing 13
```csharp
public static readonly UTF8Encoding SecureUTF8Encoding = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
```
You can create similar UTF-16 or UTF-32 instances configured to throw on invalid bytes, as sketched below. Do what Microsoft does: always use explicitly throwing UTF instances instead of the convenient-but-dangerous Encoding.* ones.
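A minimal sketch of such throwing instances (the class and field names are ours; the UnicodeEncoding and UTF32Encoding constructor parameters are the standard System.Text ones):

```csharp
using System.Text;

public static class SecureEncodings
{
    // UTF-16-LE: no BOM, throw on invalid bytes instead of silently substituting �.
    public static readonly UnicodeEncoding SecureUTF16Encoding = new UnicodeEncoding(
        bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true);

    // UTF-32-LE: no BOM, throw on invalid input instead of silently substituting �.
    public static readonly UTF32Encoding SecureUTF32Encoding = new UTF32Encoding(
        bigEndian: false, byteOrderMark: false, throwOnInvalidCharacters: true);
}
```

Despite Microsoft’s obvious understanding of the dangers of fallback-based encodings, they make the strange choice of using a fallback-based UTF-8 encoding in their Rfc2898DeriveBytes implementation: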
Code Listing 14
```csharp
string s1 = "\ud8ab";
string s2 = "\ud8cd";

var salt = Guid.Empty.ToByteArray(); // fixed salt
var skm1 = new Rfc2898DeriveBytes(password: s1, salt: salt).GetBytes(16);
var skm2 = new Rfc2898DeriveBytes(password: s2, salt: salt).GetBytes(16);
```
s1 and s2 in Code Listing 14 are two valid but distinct .NET strings (they have distinct hash codes and fail equality comparison), and they can be passed around just like any other .NET string. Both s1 and s2 also happen to be invalid UTF-16: each is an unpaired surrogate code unit that does not correspond to a valid Unicode code point. .NET does not protect you from storing invalid UTF-16 inside the string type, since a .NET string is just a sequence of char instances, and char is a 2-byte container that can be losslessly cast to ushort.
Since s1 and s2 represent different passwords, you would expect the skm1 and skm2 byte arrays to be different as well, because that is the whole purpose of password-based key derivation. Yet skm1 and skm2 are equal byte-for-byte: Rfc2898DeriveBytes converts the password with a fallback-based UTF-8 encoding, so both unpaired surrogates silently collapse into the same replacement-character bytes and thus into the same derived key. If you needed one more reason not to use Rfc2898DeriveBytes, this is it. Our PBKDF2 implementation does not have this problem because it serializes strings properly.
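You can observe the collapse directly (a minimal sketch using top-level statements; the lenient and strict variable names are ours):

```csharp
using System;
using System.Text;

string s1 = "\ud8ab", s2 = "\ud8cd";
Console.WriteLine(s1 == s2); // False: distinct strings

// Replacement-fallback UTF-8 (what Rfc2898DeriveBytes effectively uses) maps both
// unpaired surrogates to the same bytes: EF-BF-BD, the encoded � replacement character.
var lenient = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
Console.WriteLine(BitConverter.ToString(lenient.GetBytes(s1))); // EF-BF-BD
Console.WriteLine(BitConverter.ToString(lenient.GetBytes(s2))); // EF-BF-BD

// A throwing UTF-8 instance refuses instead of silently substituting.
var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
try { strict.GetBytes(s1); }
catch (EncoderFallbackException) { Console.WriteLine("strict encoding throws"); }
```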
.NET strings do not enforce Unicode validity of their contents, as we discussed previously. When we need to convert a string to byte[] and back to string, we usually try to leverage one of the Unicode encoding schemes to do the job. Unicode conversion forces us to make a choice: either we use a replacement fallback and substitute � for whatever we cannot map to a code point, or we throw an exception. The first approach is bad for all the reasons we already discussed. The second approach is good when you are trying to convert from byte[] into a properly encoded string, i.e., when you want to enforce a valid Unicode bytes-to-text conversion. However, it is bad in scenarios such as entropy extraction, key derivation, string preservation, and round-tripping, where you do not want to throw.
We need an alternative approach for string-to-byte[] conversion that does not involve Unicode. Since every string is just a sequence of chars, and every char is losslessly represented as two bytes, we can represent every .NET string of length n as a byte[] of length 2n. We should also prepend the byte-encoded length n to avoid length-extension vulnerabilities. A quick way of byte-encoding a 32-bit integer n is simply as four bytes. There is also a so-called “compressed” way of byte-encoding a 32-bit integer, implemented in .NET by the Write7BitEncodedInt method on BinaryWriter. Write7BitEncodedInt makes more sense for byte-encoding n, since most .NET string lengths fall within a tiny subset of the 32-bit integer space.
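A minimal sketch of this serialization approach (our illustrative code, not the Inferno implementation; the varint length logic is inlined because Write7BitEncodedInt is protected on BinaryWriter in the .NET Framework):

```csharp
using System;
using System.IO;

public static class StringSerializer
{
    // string -> byte[]: 7-bit-encoded char count, then 2 bytes per char (no Unicode validation).
    public static byte[] Serialize(string s)
    {
        using (var ms = new MemoryStream())
        using (var bw = new BinaryWriter(ms))
        {
            uint n = (uint)s.Length;
            while (n >= 0x80) { bw.Write((byte)(n | 0x80)); n >>= 7; } // varint length prefix
            bw.Write((byte)n);

            foreach (char c in s) bw.Write((ushort)c); // lossless: every char is exactly 2 bytes
            return ms.ToArray();
        }
    }

    // byte[] -> string: reverses Serialize exactly, regardless of the Unicode validity of the content.
    public static string Deserialize(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var br = new BinaryReader(ms))
        {
            int n = 0, shift = 0;
            byte b;
            do { b = br.ReadByte(); n |= (b & 0x7F) << shift; shift += 7; } while ((b & 0x80) != 0);

            var chars = new char[n];
            for (int i = 0; i < n; ++i) chars[i] = (char)br.ReadUInt16();
            return new string(chars);
        }
    }
}
```

For the s1 and s2 strings from Code Listing 14, Serialize produces distinct byte arrays (01-AB-D8 versus 01-CD-D8), so nothing silently collapses and round-tripping preserves both strings exactly.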
The downside of string serialization is that it always takes two bytes per char (UTF-8 could be more compact), but the upside is that serialization and deserialization never fail and never lose data (assuming you are deserializing what was previously serialized). This is the approach Microsoft uses internally in forms authentication for storing user-provided strings inside the authentication ticket, but it exposes no public APIs for it. We speculate that Microsoft chose fallback-based UTF-8 encoding over string serialization in its Rfc2898DeriveBytes implementation out of a desire to match published test vectors, which do not use multiple-of-two byte arrays for passwords. We provide working implementations of string serialization and deserialization to binary in the Inferno crypto library.