Hive Succinctly®
by Elton Stoneman
CHAPTER 5

External Tables Over HBase

Why use Hive with HBase?

HBase is a Big Data storage technology that provides real-time access to huge quantities of data. We won’t go into much HBase detail here, but Syncfusion has it covered with another free eBook in their Succinctly series—HBase Succinctly (also written by me).

The architecture of HBase allows you to quickly read cells from specific rows and columns, but its semistructured nature means the data in the cells can sometimes be difficult to work with. By mapping HBase tables as Hive tables, you get all the benefits of a fixed structure, which you can query with HiveQL, along with all the speed benefits of HBase.

In HBase, data is stored as rows inside tables. All rows have a row key as unique identifier, and all tables have one or more column families specified. Column families are dynamic structures that can contain different columns for different rows in the same table.

Hive allows you to create a table over HBase that exposes specific columns within a column family, or exposes a whole column family as a MAP column of key-value pairs. As with HDFS, Hive doesn’t import any data from HBase; when you query an HBase table with Hive, the query runs as a map/reduce job whose tasks use the HBase Java API.

Combining HBase and Hive offers major advantages over using HBase alone. HBase doesn't provide indexes—you will need to query tables by their row key, which can be a slow process. Using Hive, however, you can create an index over any column in an HBase table, which means you can efficiently query HBase on fields other than the row key.
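As a sketch of that approach, a Hive index over a mapped column might look like the statement below, assuming the device_events table mapped in Code Listing 35. The index name is illustrative, and note that Hive’s CREATE INDEX feature was removed in Hive 3.0, so this applies to older releases such as the ones used in this book:

```sql
-- Hypothetical secondary index over a column in an HBase-backed table.
-- Hive stores the index as its own table; REBUILD populates it.
CREATE INDEX device_events_name_idx
ON TABLE device_events (eventname)
AS 'COMPACT' WITH DEFERRED REBUILD;

-- Populate (and later refresh) the index data.
ALTER INDEX device_events_name_idx ON device_events REBUILD;
```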

Defining external tables over HBase tables

In Hive, tables using HBase as storage are defined as external tables, and they use the same command syntax as HDFS-stored tables. There’s no need to specify a data format or SerDe with HBase, because Hive uses the HBase API to access data and the internal data format need not be known.

HBase tables must be declared using a specific storage engine in the stored by clause that includes properties to identify the HBase table name. Code Listing 35 shows a create table statement for accessing data in the HBase table called device-events.

Code Listing 35: Mapping an HBase Table in Hive

CREATE EXTERNAL TABLE device_events(rowkey STRING, data STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:data')
TBLPROPERTIES ('hbase.table.name' = 'device-events');

We are only mapping the row key from the HBase table with this simple statement, along with one column from one column family. The :key column is provided for all HBase tables by the storage handler, and cf1:data indicates the data column in the cf1 column family (this is specific to my table).

For external HBase tables, here are the clauses you need to specify:

  • STORED BY—this will be the fixed value org.apache.hadoop.hive.hbase.HBaseStorageHandler for all HBase tables. The storage handler is part of the Hive release, but you will need to make additional HBase libraries available to Hive (as we’ll see later in this chapter).
  • WITH SERDEPROPERTIES—the source columns from HBase are specified in the hbase.columns.mapping property. They are positional, so that the first column in the table definition is mapped to the first column in the property list.
  • TBLPROPERTIES—as a minimum, provide the source table name in the hbase.table.name property. You can also provide a schema name.
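On that last point, if the HBase table lives in a namespace, the namespace is prefixed to the table name in hbase.table.name using HBase’s {namespace}:{table} syntax. A minimal sketch, assuming a hypothetical reporting namespace:

```sql
-- Hypothetical: maps a device-events table that lives in the
-- 'reporting' HBase namespace rather than the default namespace.
CREATE EXTERNAL TABLE reporting_device_events(rowkey STRING, data STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:data')
TBLPROPERTIES ('hbase.table.name' = 'reporting:device-events');
```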

HBase stores all data as byte arrays, and it is the client’s responsibility—in this case, the Hive storage handler’s responsibility—to decode the arrays into the relevant format. When you declare data types for columns in Hive, you must be sure the encoded byte array in HBase can be decoded to the type you specify. If Hive cannot decode the data, it will return NULLs.
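If the HBase cells were written with HBase’s Bytes utility rather than as UTF-8 strings, you can suffix a column mapping with #binary (or #b) so the storage handler decodes the raw bytes instead of parsing string values. A minimal sketch, assuming a hypothetical device-counters table whose cf1:count cells hold binary-encoded longs:

```sql
-- Hypothetical table: cf1:count holds 8-byte values written with
-- HBase's Bytes.toBytes(long), so it is mapped with the #b suffix.
CREATE EXTERNAL TABLE device_counters(rowkey STRING, counter BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:count#b')
TBLPROPERTIES ('hbase.table.name' = 'device-counters');
```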

Mapping columns and column families

In order to minimize storage and maximize access performance, tables in HBase typically include just one or two column families with very short names. With Hive we can map individual columns within column families as primitive data types or map the entire column family as a MAP.

I use two column families in my device-events table in HBase—e for the event data and m for the metadata properties. If I want to expose that table through Hive, with specific event columns and all the metadata columns, I can use the with serdeproperties clause, as shown in Code Listing 36.

Code Listing 36: Mapping Specific HBase Columns

CREATE EXTERNAL TABLE device_events(rowkey STRING, eventName STRING, receivedAt STRING, payload STRING, metadata MAP<string, string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,e:n,e:t,e:p,m:')
TBLPROPERTIES ('hbase.table.name' = 'device-events');

Column mappings are supplied as a comma-separated string in which specific columns are named using the {column family}:{column name} syntax, and whole families are named using the {column family}: syntax.

Table 1 shows the HBase source for each of the Hive columns.

Table 1: Table Structure in HBase

Hive Column | HBase Column Family | HBase Column | Notes
----------- | ------------------- | ------------ | -----------------------
rowkey      | -                   | -            | Built-in :key property
eventName   | e                   | n            |
receivedAt  | e                   | t            |
payload     | e                   | p            |
metadata    | m                   | -            | Whole family

With this mapping, we can read HBase data in Hive in a more structured format, and we can use higher-level HiveQL functionality to derive useful information from the data.

Code Listing 37 shows how the raw data looks in HBase while using the HBase Shell to read all the cells with row key rk1 in the table device-events.

Code Listing 37: Reading Data in HBase

hbase(main):011:0> get 'device-events', 'rk1'
COLUMN                             CELL
 e:n                               timestamp=1453966450613, value=power.on
 e:p                               timestamp=1453966518206, value={"some":"json"}
 e:t                               timestamp=1453966495748, value=1453562878
 m:d                               timestamp=1453966537534, value=device-id
 m:u                               timestamp=1453966556996, value=elton
 m:v                               timestamp=1453966547593, value=1.0.0

Note that the timestamps shown are internal fields in which HBase records the last modification time of the cell value. The cell values are all stored as strings, which simplifies interop between HBase and other tools.

Code Listing 38 shows that same row fetched through Hive.

Code Listing 38: Reading HBase Data in Hive

> select * from device_events where rowkey='rk1';
+-----------------------+--------------------------+---------------------------+------------------------+--------------------------------------------+--+
| device_events.rowkey  | device_events.eventname  | device_events.receivedat  | device_events.payload  |           device_events.metadata           |
+-----------------------+--------------------------+---------------------------+------------------------+--------------------------------------------+--+
| rk1                   | power.on                 | 1453562878                | {"some":"json"}        | {"d":"device-id","u":"elton","v":"1.0.0"}  |
+-----------------------+--------------------------+---------------------------+------------------------+--------------------------------------------+--+

Code Listing 39 depicts how we can use basic HiveQL to make more sense of the data—in this case showing the UNIX timestamp in the receivedAt field as a date and extracting the user name from the metadata map.

Code Listing 39: Formatting HBase Data in Hive

> select eventname, from_unixtime(cast(receivedat as int)), payload, metadata['u'] from device_events where rowkey='rk1';
+------------+----------------------+------------------+--------+--+
| eventname  |         _c1          |     payload      |  _c3   |
+------------+----------------------+------------------+--------+--+
| power.on   | 2016-01-23 15:27:58  | {"some":"json"}  | elton  |
+------------+----------------------+------------------+--------+--+

Converting data types

Because HBase stores all data as byte arrays, conversion issues between systems can arise if you use HBase in a multiplatform environment. In those scenarios it’s common to sacrifice some storage performance in order to support interoperable data and store all HBase data as standard UTF-8 encoded strings.

We’ll see more HiveQL syntax in Chapter 9, Querying with HiveQL, but here it is useful to know that the cast function converts between primitive data types, and that we can include data conversion in a view in order to give more logical access to HBase data.
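For instance, here is a minimal sketch of that pattern over the device_events table from Code Listing 36 (the view and column names are illustrative):

```sql
-- Illustrative view: converts the string UNIX timestamp into a
-- readable datetime at query time, hiding the cast from users.
CREATE VIEW device_events_typed(rowkey, eventname, received) AS
SELECT rowkey, eventname, from_unixtime(cast(receivedat AS INT))
FROM device_events;
```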

Direct access to rows always comes via the row key in HBase, and the keys are often cryptic combinations of multiple values. We can use string functions from HiveQL to split the row key into component parts, surfacing them as typed columns in a view and making the data much more readable.

Code Listing 40 shows a sample cell from a row in our HBase table where all values are being stored as encoded strings.

Code Listing 40: Fetching One Cell in HBase

hbase(main):002:0> get 'device-events', 'uuid|20160128', 'e:n'
COLUMN                             CELL
 e:n                               timestamp=1454002528064, value=power.off

The row key is constructed from a device ID (which would be a real UUID in the actual database) and a date period, separated by the pipe character. A common issue in HBase is that the row key design must support the primary data access vector (in this case the device ID), and if you want to query by a secondary vector (the date period), you must perform a full table scan.

Code Listing 41 shows how to create a view over the Hive table that will, by splitting the row key into two parts, expose the HBase data in a more useful way.

Code Listing 41: Splitting HBase Row Keys with Hive Views

CREATE VIEW device_events_period(rowkey, deviceId, period, eventname, receivedat) AS
SELECT rowkey, split(rowkey, '\\|')[0], split(rowkey, '\\|')[1], eventname, receivedat FROM device_events;

Tip: In this example, the view maintains the full row key as a separate column. That’s a good practice because your results contain the row key, which you can use to query HBase directly if you want to read the source data.

Now we can search for rows in a given period with a clear and logical HiveQL statement, as in Code Listing 42.

Code Listing 42: Querying HBase by Partial Row Key

1: jdbc:hive2://127.0.0.1:10000> select * from device_events_period where substr(period, 1, 6) = '201601';
+------------------------------+--------------------------------+------------------------------+---------------------------------+----------------------------------+--+
| device_events_period.rowkey  | device_events_period.deviceid  | device_events_period.period  | device_events_period.eventname  | device_events_period.receivedat  |
+------------------------------+--------------------------------+------------------------------+---------------------------------+----------------------------------+--+
| uuid|20160128                | uuid                           | 20160128                     | power.off                       | 1453564612                       |
+------------------------------+--------------------------------+------------------------------+---------------------------------+----------------------------------+--+

Here we use the substr function to compare the first six characters of the string field with the literal '201601', so we return rows whose date falls in January 2016. That simple query reads a subset of the HBase data by matching a portion in the middle of the row key—something you cannot do with HBase alone.

Hive can’t currently create indexes over views, which would otherwise let us build a secondary index for HBase tables from the parts of the row key. We also can’t include functions in the create index statement, so if we want a secondary index over the parts of the row key, we’ll need an ETL process (which we’ll see in Chapter 6, ETL with Hive).
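As a rough preview of that ETL approach, one option is to materialize the row-key parts into an internal table partitioned by period, so queries on the date only scan the relevant partitions. The table name and layout here are illustrative, not the book’s actual Chapter 6 solution:

```sql
-- Illustrative internal table: the period from the row key becomes
-- a partition column, so date-based queries prune partitions.
CREATE TABLE device_events_by_period(rowkey STRING, deviceId STRING, eventname STRING, receivedat STRING)
PARTITIONED BY (period STRING);

-- Dynamic partitioning: the partition column is the last SELECT column.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE device_events_by_period PARTITION (period)
SELECT rowkey, split(rowkey, '\\|')[0], eventname, receivedat, split(rowkey, '\\|')[1]
FROM device_events;
```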

Bad and missing data

Because column families are dynamic in HBase, the mappings we define in Hive might not apply to every row. In fact, rows might have no cells for specific columns we expect to find, or the data in mapped cells might not be in the expected format.

Hive takes the same optimistic approach with HBase that it takes with other sources. When rows do not contain the expected data, it maps any columns that are correct and returns NULL for any columns it cannot map. Even if most columns can’t be mapped, Hive will return a row that is mostly NULLs rather than no row at all.
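Queries over loosely structured HBase data therefore often filter NULLs defensively. A small sketch, assuming the device_events table from Code Listing 36:

```sql
-- Rows whose e:t cell is missing or can't be decoded come back with
-- receivedat as NULL, so exclude them before converting the value.
SELECT rowkey, from_unixtime(cast(receivedat AS INT))
FROM device_events
WHERE receivedat IS NOT NULL;
```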

Table 2 shows two rows in the HBase table that were mapped in Code Listing 36 and that don’t conform to the expected Hive structure.

Table 2: Unexpected Data in HBase

Row | Row Key        | Event Name (e:n) | Valid in HBase | Valid in Hive
--- | -------------- | ---------------- | -------------- | -------------
1   | uuid3|20160128 | -                | Yes            | Yes
2   | uuid2          | power.off        | Yes            | No

The first row includes the row key in the valid format but no data in the eventName (e:n) column. The second row has data in the eventName column but an unexpected row key format. Both are valid in HBase, which doesn’t require rows to have columns or to adhere to an explicit row key format.

Hive does what it can with that data. If we explicitly map a column that doesn’t exist, Hive won’t include any rows in the response that do not have that column. However, if we exclude that column from the query, the rows with missing values do get returned, as we see in Code Listing 43, in which the rowkey for uuid3 is seen in the first set of results, but not the second.

Code Listing 43: Querying Unexpected HBase Data in Hive

> select rowkey, payload from device_events;
+-----------------+-------------------+--+
|     rowkey      |      payload      |
+-----------------+-------------------+--+
| uuid2           | {"other":"json"}  |
| uuid3|20160128  | {"more":"json"}   |
| uuid|20160128   | {"other":"json"}  |
+-----------------+-------------------+--+
3 rows selected (0.32 seconds)

> select rowkey, eventname from device_events;
+----------------+------------+--+
|     rowkey     | eventname  |
+----------------+------------+--+
| uuid2          | power.off  |
| uuid|20160128  | power.off  |
+----------------+------------+--+

Swallowing exceptions means Hive might not return all the data in the source, but it also means that long-running queries won’t be broken by bad data.

Connecting Hive to HBase

Hive can run on an HBase cluster or on a separate Hadoop cluster configured to talk to HBase. Which option to choose depends on your usage patterns—if you share hardware, the two systems will be competing for resources.

If you'll be running Hive loads at the same time as heavy HBase usage, separate clusters is the way to go. But if your Hive queries are overnight jobs and HBase is used during the day, sharing a cluster works fine.

Hive comes packaged with the HBase storage handler, and, provided all the HBase libraries are available, you need only to configure Hive with the address of the HBase Zookeeper quorum.

When you query HBase data from Hive, the Hive compiler defers to the HBase storage handler in order to generate the data access jobs. In this case, Hive will generate Java code to query HBase using the native Java API, and it will use the configured Zookeeper quorum to find the address of the HMaster and HRegion nodes.

HBase is designed for real-time access, and queries are typically executed very quickly. For optimum performance, your tables should be structured so that the storage handler can query HBase by scanning a subset of rows rather than the entire table (that’s something we don’t have space for here, but if you’re interested, the area to research is “filter push-down”).
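For example, an equality or range predicate on the mapped row key is the kind of condition the storage handler can push down to HBase as a key-range scan. This is a sketch against the device_events table from earlier in the chapter; the key values are illustrative:

```sql
-- Range predicate on the mapped row key; with filter push-down the
-- storage handler can scan only the matching HBase key range
-- instead of reading the whole table.
SELECT rowkey, eventname
FROM device_events
WHERE rowkey >= 'uuid|20160101' AND rowkey < 'uuid|20160201';
```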

Configuring Hive to connect to HBase

HBase uses ZooKeeper for configuration and notifications, including sending and listening for heartbeats to determine which servers are active. ZooKeeper contains connection details for the servers in the HBase cluster, which means that, provided Hive can reach ZooKeeper, it can get all the other information it needs.

Code Listing 44 shows a snippet from hive-site.xml that contains the HBase ZooKeeper quorum addresses—typically multiple servers (here named zk1 to zk3), and they should all be accessible to Hive via DNS or using fully qualified domain names.

Code Listing 44: Configuring the HBase Zookeeper Address

<property>
     <name>hbase.zookeeper.quorum</name>
     <value>zk1,zk2,zk3</value>
</property>

Hive ships with the HBase storage handler and the Zookeeper library, but you will need to add the correct dependencies for your version of HBase (which you can discover from your running HBase server) and make them available in HDFS.

Hive uses the HBase libraries at two points in the runtime—on the server when a query is compiled, in order to build the map/reduce job, and also on each data node in the cluster when the job runs. Hive supports shipping dependencies to data nodes by making the libraries available in HDFS and listing all the dependencies in config. Code Listing 45 shows a sample of the HBase libraries configured in the hive-succinctly Docker image.

Code Listing 45: Specifying Auxiliary Libraries

<property>
     <name>hive.aux.jars.path</name>
     <value>hdfs://localhost:9000/hive/auxlib/hbase-server-1.1.2.jar,hdfs://localhost:9000/hive/auxlib/protobuf-java-2.5.0.jar,
      ...</value>
</property>

Tip: You can get the correct list of HBase dependencies by logging onto the HBase Master node and running the command hbase mapredcp | tr ':' '\n'. You can download the HBase tarball, extract the files in the list, put them into HDFS, and add them to the hive-site.xml config.

Hive, HBase and Docker Compose

In order to try out connecting Hive to a remote HBase server, we can use two Docker containers and configure them using Docker Compose, which is an extension to Docker used for specifying groups of containers that run together as a logical unit.

The easiest way to accomplish that is to use my hbase-succinctly and hive-succinctly images on the Docker Hub, which you can connect together using the Docker Compose YAML file on my GitHub repository with the code for this course.

Docker Compose uses a very simple syntax in the YAML file. The HBase container exposes all the ports Hive needs and is given the hostname hbase. The Hive container is linked to the HBase container so that it can access the exposed ports (which aren’t publicly exposed and therefore aren’t available outside Docker containers). And the Hive container will have an entry for hbase in its HOSTS file, so that it can connect to the HBase container by host name.

In the hive-succinctly container image, the configuration contains the HBase Zookeeper quorum address (set to hbase), and the Hive library folder (/hive/auxlib) will already have all the additional libraries needed to talk to HBase, which means you won't need to configure anything yourself.

You will need to install Docker Compose as well as Docker, then download docker-compose.yml and navigate to the directory where you saved it. Code Listing 46 shows how to control the lifecycle of the containers through Docker Compose.

Code Listing 46: Starting and Stopping Containers

docker-compose up -d
docker-compose stop
docker-compose start
docker-compose kill
docker-compose rm

Here are the definitions of Code Listing 46’s key terms:

  • Up—get the latest images, then create, configure and start containers.
  • Stop—stop the containers running, but leave them in-place with their state saved.
  • Start—start (or restart) the saved containers.
  • Kill—stop the containers and destroy their state.
  • Rm—remove the containers.

Both containers are already set up with the sample data, tables, and views from this chapter, so that you can connect to the Hive container, run Beeline, and begin querying data that lives in the HBase container, which will be running in the background.

Tip: Both those containers use HDFS in pseudo-distributed mode, so that when you first run them they will take a few minutes to be ready. If you then use docker-compose stop when you’re done, they’ll be ready straight away when you next run docker-compose start.

When you run HBase queries in Hive, the compiler builds a job that uses the HBase API, then Hive fetches the data from the HBase server(s) and maps it to the table format. HBase owns reading (and writing) the data, and Hive owns query preparation and data translation.

Summary

Hive works well with HBase, taking advantage of real-time data access for fast querying and adding structured layers on top of the semistructured data in HBase.

The nature of HBase storage means that column and family names are typically very short (one character is common) in order to minimize disk usage. At the same time, row keys are often cryptic structures with multiple data items concatenated. Hive can expose that data in structured, typed columns, adding a layer of meaning and making queries easier to comprehend.

As with HDFS, Hive presents a consistent SQL interface for HBase, so that users need not have programming knowledge in order to query (HBase provides multiple server interfaces, but they all require programming).

We’ve now seen how to define Hive tables using internal storage, external HDFS files, and HBase tables. External tables are a powerful Hive concept when you want to use existing data in a more accessible way, but the full range of Hive functionality is only available for internal tables.

When you need the more advanced functionality that comes with internal tables, you can still use HDFS and HBase files as the source and load them into Hive tables using ETL functions, which we'll cover in the next chapter.

