HBase Succinctly®
by Elton Stoneman

CHAPTER 5

Connecting with Python and Thrift

Overview

HBase is well suited to cross-platform solutions, and the Thrift API is an alternative to the Java API. Thrift is Apache's generic API interface, which supports client connections from different languages to Java servers. We’ll use Python in this chapter, but you can use any language with a Thrift binding (including Go, C#, Haskell, and Node).

The Thrift API is an external interface, so it requires an additional JVM to run. You can start it with the HBase daemon script hbase-daemon.sh start thrift. It can be hosted separately from the rest of the HBase cluster, or it can run on the Region Servers. Thrift doesn't have a native load balancer, but the transport is TCP, so you can use an external load balancer (like HAProxy).

By default, the Thrift server listens on port 9090, and it's already running on the hbase-succinctly Docker image.

Thrift is more lightweight than REST, so it can offer better performance, but it's not as user-friendly. To consume a Thrift API, in most cases you need to build Thrift from source, generate a binding to the API from the public .thrift file that describes the interface, and then import the Thrift transport and the bindings for your client app.

Note: The Thrift API is documented in the .thrift file. That file doesn't ship with the HBase binaries, so you'll need to fetch the correct version from source. For release 1.1.2, the file is on GitHub here.

Working with Thrift Clients

The generated code from the .thrift definition file contains classes for working with the Thrift server at a relatively low level. Figure 4 shows the sequence for connecting to a server:

Figure 4: Connecting to HBase with Thrift

First, you need to create an instance of TTransport, and then an instance of TBinaryProtocol using the transport. Then you can create an HBase client using the protocol and open the transport to start the connection. Thrift provides a lot of functionality and benefits from wide platform reach, but in use the client can be cumbersome, and it takes a lot of code to do straightforward tasks.
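The sequence in Figure 4 might look like the following in Python. This is a minimal sketch under stated assumptions, not the definitive client setup: it assumes the Apache thrift package is installed and that the bindings generated from the HBase .thrift file are importable as a package named hbase (the actual name depends on how you ran the Thrift compiler). The function name open_hbase_client is hypothetical, and wrapping the socket in a buffered transport is one common choice rather than a requirement.

```python
DEFAULT_THRIFT_PORT = 9090  # default port of the HBase Thrift server


def open_hbase_client(host="127.0.0.1", port=DEFAULT_THRIFT_PORT):
    """Open a raw Thrift connection and return (client, transport)."""
    # Imports are deferred so this module loads even without the bindings.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase  # ASSUMPTION: generated bindings live in 'hbase'

    # 1. Create the transport (a buffered wrapper around a TCP socket).
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    # 2. Create the binary protocol on top of the transport.
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    # 3. Create the HBase client from the protocol.
    client = Hbase.Client(protocol)
    # 4. Open the transport to start the connection.
    transport.open()
    return client, transport
```

Remember to call transport.close() when you have finished with the client.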

It's worth looking for a community wrapper for Thrift in the language you want to use. In this chapter I'll use the HappyBase library for Python, which is a wrapper for Thrift that takes care of binding generation and imports, and also exposes a much friendlier client API than the raw Thrift version.

Connecting to Thrift with HappyBase

You need to install HappyBase in your environment. It's publicly available on the Python Package Index, so assuming you already have Python and Pip (the Python package manager), you can install it with the command in Code Listing 35:

Code Listing 35: Installing the HappyBase Python package

$ pip install happybase

Now you can start Python and set up all the dependencies by importing the HappyBase package with import happybase. HappyBase is designed to expose HBase features in a Python-like way, so in Code Listing 36, we create a connection object that will automatically connect to the HBase Thrift server running locally:

Code Listing 36: Connecting to HBase with HappyBase

>>> connection = happybase.Connection('127.0.0.1')

The Connection object is the starting point for Thrift connections. It handles DDL operations such as creating and deleting tables, and it gives you access to Table objects, which are used for DML statements. From a Table object you can get Batch objects, which are used for batched data updates.

Reading Data with Python

HappyBase makes HBase interaction very simple. Use the table() method on a connection to get a Table object, and then use the row() method on the table to read individual cells, column families, or a whole row. Results are returned in dictionaries; Code Listing 37 shows the values returned for a row, column family, and cell:

Code Listing 37: Reading a Row with HappyBase

>>> table = connection.table('access-logs')

>>> print table.row('elton|jericho|201511')

{'t:1106': '120', 't:1107': '650'}

>>> print table.row('elton|jericho|201511', ['t'])

{'t:1106': '120', 't:1107': '650'}

>>> print table.row('elton|jericho|201511', ['t:1106'])

{'t:1106': '120'}
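Because every result is a plain dictionary keyed by family:qualifier, it is easy to reshape in client code. The helper below is a sketch only: parse_row is a hypothetical name, and it assumes the cell values are decimal strings, as they are in the access-logs table. It accepts both text and bytes, since newer HappyBase releases running on Python 3 are expected to return bytes for keys and values.

```python
def parse_row(row):
    """Turn {'t:1106': '120'} into {('t', '1106'): 120}."""
    parsed = {}
    for column, value in row.items():
        # Accept bytes as well as text for both keys and values.
        if isinstance(column, bytes):
            column = column.decode("utf-8")
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        # Split only on the first colon; qualifiers may contain colons.
        family, qualifier = column.split(":", 1)
        # ASSUMPTION: values are decimal strings, as in access-logs.
        parsed[(family, qualifier)] = int(value)
    return parsed
```

With a real table you would pass the result of table.row() straight in, for example parse_row(table.row('elton|jericho|201511')).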

You can also read multiple rows by providing a list of keys to the rows() method, which returns a list containing a tuple for each row. The tuple contains the row key and a dictionary of column values, as in Code Listing 38, where two rows are returned:

Code Listing 38: Reading Multiple Rows with HappyBase

>>> print table.rows(['elton|jericho|201511', 'elton|jericho|201510'])

[('elton|jericho|201511', {'t:1106': '120', 't:1107': '650'}), ('elton|jericho|201510', {'t:2908': '80'})]

The key list is an explicit set of keys, not the start and end points for a range (for that you need a scanner, which we'll use in the next section). If a key you request doesn't exist, then it isn't returned in the response. If none of the keys you request exist, then you get an empty list back.
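Since missing keys are silently dropped, a client that needs to know which keys were not found has to compare the response with the request. A minimal sketch, where find_missing_keys is a hypothetical helper and table is any object with a HappyBase-style rows() method:

```python
def find_missing_keys(table, row_keys):
    """Return the requested row keys that rows() did not return."""
    # rows() yields (row_key, data) tuples only for keys that exist.
    found = {key for key, _data in table.rows(row_keys)}
    # Preserve the order in which the keys were requested.
    return [key for key in row_keys if key not in found]
```

If none of the keys exist, rows() returns an empty list and the helper returns every requested key.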

The rows() method also allows filtering by column family or column; if you request columns which don't exist for a row in the row key list, that row isn't returned in the response. In Code Listing 39, the request is for the t:1106 column from two rows, but only one row has that column, so the other row isn't returned:

Code Listing 39: Filtering Columns from Multiple Rows with HappyBase

>>> print table.rows(['elton|jericho|201511', 'elton|jericho|201510'], ['t:1106'])

[('elton|jericho|201511', {'t:1106': '120'})]

The row() and rows() methods can include an option to return the timestamp for each cell in the response, but if you have a table with multiple versions in a column family, these methods only return the most recent version.

To read more than one version from a column, HappyBase has the cells() method, which takes a row key and column name, together with the number of versions to return (and optionally the timestamp of the data), as shown in Code Listing 40:

Code Listing 40: Reading Multiple Cell Versions with HappyBase

>>> versionedTable = connection.table('with-custom-config')

>>> print versionedTable.cells('rk1', 'cf1:data', 3)

['v2', 'v1', 'v0']

>>> print versionedTable.cells('rk1', 'cf1:data', 3, include_timestamp=True)

[('v2', 1447399969699), ('v1', 1447399962115), ('v0', 1447399948404)]

The cells() method returns cell versions in descending order of timestamp.
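The timestamps are easier to read as dates. The sketch below is written for Python 3 and assumes the timestamps are milliseconds since the Unix epoch, which is what HBase assigns by default and is consistent with the values in Code Listing 40; a table written with custom timestamps may not follow that convention. The names to_utc_datetime and cell_history are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_datetime(timestamp_ms):
    """Convert a cell timestamp (milliseconds since epoch) to UTC."""
    # Integer arithmetic avoids floating-point rounding of the millis.
    seconds, millis = divmod(timestamp_ms, 1000)
    return (datetime.fromtimestamp(seconds, tz=timezone.utc)
            + timedelta(milliseconds=millis))


def cell_history(table, row_key, column, versions):
    """Return [(value, datetime), ...] for a cell, newest version first."""
    cells = table.cells(row_key, column, versions, include_timestamp=True)
    return [(value, to_utc_datetime(ts)) for value, ts in cells]
```

For the first version in Code Listing 40, to_utc_datetime(1447399969699) gives 13 November 2015, 07:32:49.699 UTC.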

For rows with counter columns, the row() and cells() methods return the data, but as the raw 8-byte value, which prints as unfriendly hex escapes. HappyBase also includes a counter_get() method to read the current value of a counter column as a long integer.

Code Listing 41 shows the different results for reading a counter column:

Code Listing 41: Reading Counter Columns with HappyBase

>>> counterTable = connection.table('counters')

>>> print counterTable.row('rk1')

{'c:1': '\x00\x00\x00\x00\x00\x00\x00\x01'}

>>> print counterTable.counter_get('rk1', 'c:1')

1
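If you already have the raw bytes from row() or cells(), you can decode them on the client instead of making a second call. HBase stores a counter as an 8-byte, big-endian signed integer, so the standard struct module can unpack it. This is a small sketch; decode_counter is a hypothetical helper name.

```python
import struct


def decode_counter(raw):
    """Decode a counter cell: an 8-byte big-endian signed integer."""
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    # '>q' means big-endian ('>') signed 64-bit integer ('q').
    return struct.unpack(">q", raw)[0]
```

Applied to the value in Code Listing 41, decode_counter(b'\x00\x00\x00\x00\x00\x00\x00\x01') returns 1.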

Working with Scanners in Python

The Table object has a scan() method to create a scanner on the region server, which you can iterate through on the client. You can use scan in the same way as the HBase Shell, passing start and stop rows to define the boundaries, as in Code Listing 42:

Code Listing 42: Scanning Rows with HappyBase

>>> access_logs = connection.table('access-logs')

>>> scanner = access_logs.scan('elton|jericho|201510', 'elton|jericho|x')

>>> for key, data in scanner:

...     print key, data

...

elton|jericho|201510 {'t:2908': '80'}

elton|jericho|201511 {'t:1106': '120', 't:1107': '650'}

There are some friendly additions to the scan() method. You can pass a row key prefix instead of start and stop rows, and HappyBase sets the boundaries for you; you can also pass a list of column family names or column names to restrict the data in the response, as in Code Listing 43:

Code Listing 43: Scanning Rows by Prefix with HappyBase

>>> scanner = access_logs.scan(row_prefix='elton|jericho|', columns=['t:1106'])

>>> for key, data in scanner:

...     print key, data

...

elton|jericho|201511 {'t:1106': '120'}

The scan() method returns an iterable object, which you can loop through as a single result set, although HappyBase actually reads the results in batches from Thrift. You can specify a batch_size argument to tune the reads from the scanner; it defaults to 1,000, a reasonable default that favors fewer, larger reads over many small ones.

If you are working with a wide table or with large cell sizes, then you may need to reduce the batch size to improve overall performance. If you are reading small cell values from many rows, a larger batch size may be better.
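As an illustration of passing batch_size, the sketch below counts the rows under a key prefix. The helper name count_rows is hypothetical, and the right batch size for your data is something to measure rather than assume.

```python
def count_rows(table, row_prefix, batch_size=1000):
    """Count the rows under a key prefix, reading batch_size rows per fetch."""
    # HappyBase fetches the results in chunks of batch_size behind the
    # scenes; the loop still sees one continuous stream of rows.
    scanner = table.scan(row_prefix=row_prefix, batch_size=batch_size)
    return sum(1 for _key, _data in scanner)
```

For a wide table you might call count_rows(access_logs, 'elton|jericho|', batch_size=100) and compare the timing against the default.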

Scanners and Filters in Python

Thrift supports scanners with filters that run on the Region Server. The scanner reads rows efficiently from the supplied row key boundaries, and the filter extracts just the rows or columns you want returned.

HappyBase lets you create filtered scanners from the client in the scan() method. This is one area where HappyBase doesn't abstract the complexity, and you have to construct the filter as a string, in line with the Thrift API filter language.

The general format of the filter string is {filter name} ( {arguments} ). Code Listing 44 shows a filter that returns only the columns whose qualifiers start with the prefix "11", so rows with no matching columns are left out of the results:

Code Listing 44: Scanning and Filtering Rows with HappyBase

>>> access_logs = connection.table('access-logs')

>>> scanner = access_logs.scan('elton|jericho|201510', 'elton|jericho|x', filter="ColumnPrefixFilter('11')")

>>> for key, data in scanner:

...     print key, data

...

elton|jericho|201511 {'t:1106': '120', 't:1107': '650'}
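Building filter strings by hand is error-prone, so a small helper can keep the quoting consistent. This is a sketch, not part of HappyBase: filter_call and all_of are hypothetical names, and the helper only covers string and integer arguments. It refuses values containing a single quote rather than guessing at an escaping rule.

```python
def filter_call(name, *args):
    """Build one filter expression such as ColumnPrefixFilter('11')."""
    rendered = []
    for arg in args:
        if isinstance(arg, str):
            if "'" in arg:
                # Escaping is out of scope for this sketch.
                raise ValueError("single quotes are not supported: %r" % arg)
            rendered.append("'%s'" % arg)  # strings go in single quotes
        else:
            rendered.append(str(arg))      # integers are written bare
    return "%s(%s)" % (name, ", ".join(rendered))


def all_of(*expressions):
    """Combine filter expressions so that every one of them must match."""
    return " AND ".join(expressions)
```

The result can be passed as the filter argument to scan(), for example filter=all_of(filter_call('PrefixFilter', 'elton|'), filter_call('ColumnPrefixFilter', '11')).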

Tip: The Thrift API is well documented in the HBase online documentation, and the available filters and the arguments you need to supply are covered here.

Writing Data from Python

The put() method on the HappyBase table object works a lot like the put command in the HBase Shell, taking the row key, column name, and value. With HappyBase though, you can update and insert multiple cell values with a single statement, passing a dictionary of key:value pairs as in Code Listing 45:

Code Listing 45: Updating Data with HappyBase

>>> access_logs.put('elton|jericho|201511', {'t:1309':'400', 't:1310':'200'})

>>> print access_logs.row('elton|jericho|201511', ['t:1309', 't:1310'])

{'t:1310': '200', 't:1309': '400'}

The put() method is limited to a single row, but HappyBase provides a useful mechanism for batching updates. This is a common requirement in HBase clients, particularly in event-streaming applications where you may receive hundreds or even thousands of events per second that you want to buffer in your processor.

The Batch class in HappyBase allows you to do that without writing custom code to maintain the buffer of pending updates. You can create a batch object from a table and use it within a context manager block. When the block ends, the send() method is called on the batch, which sends all the updates to the Thrift server, as shown in Code Listing 46:

Code Listing 46: Batch Updating Data with HappyBase

>>> with access_logs.batch() as batch:

...     batch.put('elton|jericho|201512', {'t:0110':'200'})

...     batch.put('elton|jericho|201512', {'t:0210':'120', 't:0211':'360'})

...

>>> print access_logs.row('elton|jericho|201512')

{'t:0211': '360', 't:0210': '120', 't:0110': '200'}

The put method on the Batch class has the same signature as on the Table class, so you can make one or many updates to a row with each put.

Note: Batching is supported by the Thrift API, so when a batch of updates (called mutations in the native Thrift API) is sent, that's done in a single call to the Thrift connection.
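For a long-running stream of events you may not want to hold every update in memory until the block ends. HappyBase's batch() method accepts a batch_size argument that sends the pending mutations each time that many have accumulated, with a final send when the block exits. The sketch below assumes events arrive as (row key, column dictionary) pairs; write_events is a hypothetical helper name and 500 is an arbitrary starting point.

```python
def write_events(table, events, batch_size=500):
    """Write (row_key, {column: value}) pairs through a size-limited batch."""
    count = 0
    # The batch flushes itself whenever batch_size mutations are pending,
    # and once more when the with-block exits.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, columns in events:
            batch.put(row_key, columns)
            count += 1
    return count
```

Because events can be any iterable, including a generator, the helper works equally well for a list of buffered updates or a live feed.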

Thrift also supports incrementing counter columns, which you can do in HappyBase with the counter_inc method, optionally providing an amount to increment by, as shown in Code Listing 47:

Code Listing 47: Incrementing Counters with HappyBase

>>> counterTable.counter_get('rk1', 'c:1')

1

>>> counterTable.counter_inc('rk1', 'c:1')

2

>>> counterTable.counter_inc('rk1', 'c:1', 100)

102

Note that the counter_inc method returns the cell value after the increment is applied, unlike the put() method, which has no return.

Summary

In this chapter, we looked at the principal external API for HBase, the Thrift server. Thrift can run on the Region Servers and provides a fully featured alternative to the Java API. You generate a Thrift client in your preferred language, and then you work with the classes natively.

The Thrift client API operates at a low level, and in many languages there are community wrappers to make the generic Thrift classes easier to work with, and to give them a more native feel in the platform. We used the HappyBase library, which is a wrapper for the Python client.

Thrift provides all the key features you need to read, scan, and write data—all neatly wrapped in HappyBase to give an intuitive interface. You can fetch rows by key, selecting the columns to return, and you can scan for a set of rows with a more complex filter applied at the server.

You can write cell values individually or in bulk, and for many-row updates, Thrift supports a batch interface that allows multiple updates in a single server call. Thrift also supports counter columns, so you can read and increment counters with HappyBase.

The biggest downside with Thrift is getting started with the Thrift client and generating the language bindings for HBase. If there's no friendly wrapper in your preferred language, then the amount of setup work can be quite involved.

In the next chapter, we'll look at the other external API for HBase—the REST API, Stargate. That provides access to HBase data over HTTP, so it offers an even simpler cross-platform experience, although without all the features provided by Thrift.
