CHAPTER 3

HBase Table Design


Row Key Structure

There are no relations between tables in HBase, and there are no indexes, data types, default values, computed columns, or any other fancy features you get in a modern SQL database, or even other NoSQL databases. Because of that, data modelling in HBase is deceptively simple—the only constructs you have to work with are tables, rows, and column families.

But access to data in HBase is always via the row key, so how you structure the key has a huge impact on the performance and usability of your database. In the access-logs table from the last chapter, I used the row key format {systemID}|{userID}|{period}. That format means the table is good for one read scenario, but bad for others.

If I typically read that table in the context of a specific system and user, then the row key format is great. To get a user's access logs, I can use a scan with boundaries that exclude all other users and systems, and I'll get a very quick response. Even if the table has hundreds of millions of rows, a scan that returns a few hundred rows will take seconds to run.
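
As a sketch of that read scenario in the HBase Shell—assuming the {systemID}|{userID}|{period} key structure, with 'jericho' and 'dave' as example values—a bounded scan looks like this. The trailing '~' simply sorts after the '|' separator, so the boundaries cover every period for that user without touching any other rows:

# all access logs for user dave on system jericho
scan 'access-logs', {STARTROW => 'jericho|dave|', STOPROW => 'jericho|dave|~'}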

But if I want to find all the systems that anyone used on a particular day, this row key structure is not optimal. Because the period is at the end of the key, I can't set boundaries for a scan and I'll need to read every row. With hundreds of millions of rows, that will likely take hours. If that's how I want to use the table, then a row key structure of {period}|{systemID}|{userID} would be better.

That alternative structure has its own problems, as we'll see shortly. It makes queries by period fast, but if I want to find one user's access logs for one system, I'll be back to reading every row.

Tip: The only “right” way to design row keys is to know how you'll be accessing the table, and model that access in the key structure. Often you'll need to compromise secondary read scenarios in favor of the primary scenario.

Read and Write Performance

Data access patterns are about what your queries are doing, and also how many queries you're running—and your row key structure has a big impact on that, too. HBase is a distributed database and it has the capacity to support high concurrency for reads and writes, but only if your table design allows it.

A single, logical table in HBase is actually split at the storage layer and stored in many parts, called regions. Access to the data in a region is provided by one instance of an HBase Region Server, and in a production environment, you will have many Region Servers running in a cluster. If you design your tables correctly, different regions can be hosted by different Region Servers, giving you high performance for concurrent reads and writes to that table.

But that also depends on your row key. Tables are split by row key, with each region having a start and end row key. If your design means all your row keys start with similar values, then they won't be distributed among many regions and you won't get high concurrency.

Table 3 shows some more sample row keys in the access-logs table, using different structures (with spaces added around the pipes for readability):

Option 1: {period}|{systemID}|{userID}   Option 2: {systemID}|{userID}|{period}   Option 3: {userID}|{systemID}|{period}

201510 | jericho | dave                  jericho | dave | 201510                  dave | jericho | 201510
201510 | jericho | elton                 jericho | elton | 201510                 elton | jericho | 201510
201511 | discus | elton                  discus | elton | 201511                  elton | discus | 201511
201510 | jericho | fred                  jericho | fred | 201510                  fred | jericho | 201510

Table 3: Row Key Designs

With Option 1, every row starts with the same five characters; the distance between the rows is very small, so they will probably all be in the same region. That means they'll all be served by the same Region Server, and if we read and write a set of rows like that simultaneously, we won't balance the load across multiple Region Servers, and we won't get maximum performance.

Tip: Don't use sequential values (based on dates or timestamps) as row keys for tables that need high performance. Sometimes you need to use sequential values to support the way you want to read the data, but be aware that you are limiting concurrent performance when you do.

Option 2 is better—from just the first character, we have two different values. In this case, we might find three rows in one region and one in another. If we had many systems that we were logging access for, and the IDs were sufficiently different, we could have dozens of regions, and support a high level of concurrent read/write access, balanced across multiple Region Servers.

Option 3 is the best if we need maximum performance. We're likely to have more users than systems, so the set of user IDs will run to hundreds or thousands of values, all likely to have good distance between them. We could have many hundreds of regions with this approach, and if we were to scale out a production cluster by adding more Region Servers, we would still have multiple regions on each server and support high concurrency.

Note: You need to balance performance considerations for concurrency with scan performance. A row key design that supports thousands of concurrent writes per second is good for event streams, but if it doesn't let you read data efficiently for your scenario, it's not the right choice.

Region Splits and Pre-Splitting

HBase runs its own administration jobs to maintain or improve performance as data volumes increase, and one part of that administration is splitting large tables into regions. Technically you can leave it to HBase to automatically create regions for your tables, but you can also do it yourself explicitly when you create tables.

With the simple create statements we used in Chapter 2, our tables are created with a single region, and they won't be split until they grow beyond the configured maximum size for a region, when HBase splits them in two. That maximum varies with configuration, but it's typically measured in gigabytes, which is usually too large to wait for—you've lost a lot of performance by the time the split happens, so it's better to pre-split your table.

Pre-splitting means you tell HBase, in the create statement, how many regions the table should start with and the upper row key boundary for each region. That works most effectively if you can define boundaries that give each region approximately the same share of the data.

That's not as hard as it may sound. Universally Unique IDs (UUIDs), or partial UUIDs, are a good choice for row keys (or for the first part of row keys). With a hex representation, you can split on the first character, giving you 16 regions in your table. Code Listing 18 shows the create command with the SPLITS property defining the boundaries for regions:

Code Listing 18: Create Table with Pre-split Regions

create 'access-logs', 't', {SPLITS => ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f']}

The HBase load balancer does a good job of spreading regions around the server estate, so in a small cluster with four Region Servers, we can expect four access-logs regions running on each server; if we scale up to 16 Region Servers, we'd have a single region on each server, and we can expect high concurrency for reads and writes.

If your data is not evenly distributed among the regions, then you will get hot-spots—cases where some regions have more than their fair share of data, and have to process more load. HBase still runs the admin jobs for pre-split tables, so if you do have an oversize region, then at some point it will be automatically split by HBase.
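
You don't have to wait for the admin job, either—you can trigger a split yourself from the HBase Shell. As a sketch, this command splits the access-logs table at a supplied boundary key (the boundary value here is just an example):

# manually split access-logs at row key '8'
split 'access-logs', '8'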

Tip: You can create many more regions, but every region has its own memory and disk allocation, so there's a cost to having more regions. A good rule of thumb is to aim for around 100 regions in total per region server—so you should allocate more regions to the most performance-critical tables.

Using UUIDs will give you good data distribution among your regions if you generate them randomly, but if you want good distribution and want to use transactional IDs for keys, you can hash them to produce the row key.

Table 4 shows how rows from the access-logs table would be distributed if we build the row key from an MD5 hash of the user ID. MD5 gives a simple way of generating deterministic hashes; the algorithm is not suitable for securing data, but it's extremely unlikely that you'll get collisions with small input values like user IDs:

Row key                     Part-hashed row key          Region
dave | jericho | 201510     16108 | jericho | 201510     1 (keys 0 to 1)
elton | jericho | 201510    d5fe7 | jericho | 201510     13 (keys d to e)
fred | jericho | 201510     570a9 | jericho | 201510     5 (keys 5 to 6)

Table 4: Part-hashing Row Keys
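
Because the HBase Shell runs on JRuby, you can compute part-hashed keys inline when you're experimenting. A minimal sketch, using the same five-character MD5 prefix as Table 4 (the 't:1' column qualifier in the put is a hypothetical example):

require 'digest'

# builds '16108|jericho|201510' for user dave, per Table 4
rk = Digest::MD5.hexdigest('dave')[0, 5] + '|jericho|201510'
put 'access-logs', rk, 't:1', '60'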

Column Families

As we'll see in Chapter 8, "Inside the Region Server," column families within regions are the physical storage unit in HBase. All the columns in a single family for one region are stored in the same file, so columns that are typically accessed at the same time should be located in the same column family.

Column families can contain large numbers of columns, but like regions, there is an overhead with having many column families. Typically, one column family per table is all you need, and the official HBase documentation recommends no more than three families.

You add more than one column family to a table if the data has different access patterns, like in the social-usage table from Chapter 1, which has separate families for Facebook and Twitter usage.

In the original sample, I included extra column families to help illustrate how tables are structured in HBase, but in a real system, I would rationalize the design, removing the column families for identifiers and totals and storing that data as columns within the other families.

The revised design would have just two column families, fb and tw, consolidating all the data:

fb = the user's Facebook details and activity:

  fb:id = user ID
  fb:t = total usage
  fb:{period} = usage within that period

tw = the user's Twitter details and activity:

  tw:id = user ID
  tw:t = total usage
  tw:{period} = usage within that period
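
Creating that revised design in the HBase Shell is a one-liner, assuming the social-usage table name from Chapter 1:

create 'social-usage', 'fb', 'tw'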

This design would be suitable if we expected to capture similar amounts of data for each social network. But if the usage was heavily skewed, we could have a problem with the split of column families within regions.

If our table grew to 100 million rows and we were running a large cluster, we may decide to split it into 1,000 regions. That would give us 1,000 data files for the Facebook column family, and 1,000 for Twitter (that's not exactly correct, but we'll unpack that statement in Chapter 9).

Within those 100 million rows, if only one million contain any Twitter data, then our region splits will negatively impact performance when we query for Twitter data. If we do a large scan for Twitter rows, it could require reads across all 1,000 regions.

For one million rows, we might find optimal performance with just 100 regions, so by having multiple column families in the table, we've harmed performance for the less-populated family. In that case, the better design would be two tables, facebook-usage and twitter-usage, each with a single column family, so they can be independently tuned.
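
In the Shell, the split-table alternative is just two create statements—a sketch using the table names above, where only the heavily populated table is pre-split (the split points are illustrative):

create 'facebook-usage', 'fb', {SPLITS => ['4', '8', 'c']}
create 'twitter-usage', 'tw'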

Tip: Design tables with a single column family unless you know the data will have separate access requirements but similar cardinality. Tables with two or three column families work best if the majority of rows have data in every column family.

Columns

All data in HBase is stored as byte arrays, so cell values can be representations of any type of data. Strings are the most portable data type; you can use a standard encoding (like UTF-8), and different clients will be able to work with the data in the same way.

If you work with a single HBase client, then using native data types rather than strings will save a little storage, but that's probably a micro-optimization, outweighed by the advantage of having a standard data type in all your columns.

Consistently using one data type makes your data access code simpler—you can centralize the encoding and decoding logic rather than having different approaches for different columns. Your string values can be complex objects, too: storing JSON in HBase cells and de-serializing it in the client is a perfectly valid pattern.
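
As a quick sketch of that JSON pattern in the HBase Shell (the table name, row key, and column qualifier here are all hypothetical):

# store a serialized object as a string cell value
put 'books', 'hbase-succinctly', 'd:json', '{"title":"HBase Succinctly","author":"Elton Stoneman"}'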

In HBase terms, whatever you store is a byte array, and although the client may interpret them differently, to the server the only difference with different data types is the amount of storage they use. HBase doesn't enforce a limit on the size of the byte array in a single cell, but you should keep cell sizes under 10MB for best performance.

Table 5 shows a sample HBase table we could use to store all the books in Syncfusion's Succinctly series, in a simple table with a single column family b:

Column Qualifier   Client Data Type   Contents
b:t                String             Book title
b:a                String             Author name
b:d                Long               Release date (UNIX timestamp)
b:c                Byte array         Cover image (PNG)
b:f                Byte array         Download file (PDF)

Table 5: Syncfusion's Succinctly Library

Counter Columns

There's one exception to HBase's bytes-in, bytes-out approach to cell values: counter columns. Counter columns live inside a column family in a table in the same way as other columns, but they are updated differently—HBase provides operations for atomically incrementing the value of a counter without having to read it first.

With the HBase Shell, the incr command increments a counter cell value, or creates a counter column if it doesn't exist. You can optionally specify an amount to increment; if you don't, then HBase will increment by one. Code Listing 19 shows this with two commands that add counter cells to the row rk1—the first adds a new cell, c:1, with the default increment of 1, and the second adds the cell c:2 with an increment of 100:

Code Listing 19: Incrementing Counter Columns

hbase(main):006:0> incr 'counters', 'rk1', 'c:1'

COUNTER VALUE = 1

0 row(s) in 0.0130 seconds

hbase(main):007:0> incr 'counters', 'rk1', 'c:2', 100

COUNTER VALUE = 100

0 row(s) in 0.0090 seconds

The response from the Shell displays the value of the counter after updating it. Counters are a very useful feature, especially for high-volume, high-concurrency systems. In the access-logs table, I could use counter columns to record how much time the user spent on the system.

A system auditing component would need to add the current usage to any already recorded for the period, and if the user has multiple sessions open, we could have concurrent updates to the same cell. If different instances of the audit component each read the existing value, add to it, and put the update at the same time, we have a race condition and updates will be lost.

With a counter column, each instance of the auditing component issues an increment command without having to read the existing value, and HBase takes care of correctly updating the data, preventing any race conditions.

Counter columns can be read in the same way as other cell values, although in the HBase Shell they are shown as a hex representation of the raw eight-byte array, as in Code Listing 20, which reads the values set by the previous commands. Note that the Shell renders printable bytes as ASCII characters, so the trailing \x01 in c:1 is the value 1, and the trailing d in c:2 is the character for byte value 100:

Code Listing 20: Counter Column Values

hbase(main):008:0> get 'counters', 'rk1'

COLUMN                          CELL                                                                                   

 c:1                            timestamp=1446726973017, value=\x00\x00\x00\x00\x00\x00\x00\x01                        

 c:2                            timestamp=1446726979178, value=\x00\x00\x00\x00\x00\x00\x00d                           

2 row(s) in 0.0140 seconds
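
If you want the decoded number rather than the raw bytes, the Shell also has a get_counter command, which reads a single counter cell and displays its value:

hbase(main):009:0> get_counter 'counters', 'rk1', 'c:2'

COUNTER VALUE = 100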

Note: You should always create a counter column using the increment command from the client. If you create it as a normal column with a custom value and then try to increment it, you will get the following error: “Attempted to increment field that isn't 64 bits wide.” This is HBase saying you can't increment a value that isn't in a counter column.

Summary

In this chapter we looked at the key parts of table design in HBase: structuring the row key, pre-splitting regions, and using columns and column families.

There is no single design that fits all HBase problems; you need to be aware of the performance and usage considerations, particularly with your row key design. You need to design tables based on the access patterns you expect, and it's not unusual to redesign your tables during development as you learn more about your requirements.

Now we have a good working knowledge of how HBase stores data; in the next few chapters, we'll look at accessing that data remotely, with the APIs that HBase provides out of the box: Java, Thrift, and REST.

