
HBase Succinctly®
by Elton Stoneman


CHAPTER 4

Connecting with the Java API



Overview

The native API for client connections to HBase is the Java API. The functionality can be divided into two parts: metadata and admin functions, which connect to the Master Server, and data access functions, which connect to the Region Servers (we'll cover those servers in more depth in Chapter 7, "The Architecture of HBase").

You don't need to do anything to support the Java API from the server, other than ensure the ports are open (by default, 2181 for Zookeeper, 60000 for the Master, and 60020 for the Region Servers).

The HBase client package is in the Maven Repository, with JAR files for all the released versions of HBase. At the time of writing, the latest version is 1.1.2 (which is the version I use in the Docker container for the book), but 0.9x versions are common and still available in Maven.

Note: I won't cover getting up and running with Maven or a Java IDE here, but the source code for the book contains a NetBeans Java project with sample code that uses Maven.

Connecting to HBase with the Java Client

The HBase Java client is available in the Maven Central repository, and is versioned so the client version number matches the server version. For the current version of HBase (and the version running in the hbase-succinctly Docker container), we need a dependency to version 1.1.2 of the org.apache.hbase.hbase-client package, as in Code Listing 21:

Code Listing 21: The Maven HBase Client Dependency

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.1.2</version>
</dependency>

With the Java API, you start with a Configuration object, which contains the connection details for the server, and you use that when you create client objects for tables or administration. When you create a configuration object, by default it will look for an hbase-site.xml file in the resources of the running app that contains the configuration settings.

The hbase-site.xml configuration file also lives on the server, and you can use the same contents for the client connection—it specifies key details like the server ports and the Zookeeper quorum addresses. Code Listing 22 shows some sample properties from the site file:

Code Listing 22: The hbase-site.xml Config File

<configuration>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.master.port</name>
        <value>60000</value>
    </property>
    ...
</configuration>

The Java client only needs to know the Zookeeper quorum addresses; it gets the Master and Region Server addresses from Zookeeper.

Note: Zookeeper stores addresses as host names rather than IPs, so you need to be sure the host names of the Region Servers are accessible to the machine running the Java client. If you are using the Docker run command from Code Listing 6, then the host name will be hbase, and you should add a line to your hosts file mapping hbase to 127.0.0.1.

You connect to HBase by using the ConnectionFactory class to create a Connection object, which uses the configuration from the local hbase-site.xml file, as in Code Listing 23:

Code Listing 23: Getting a Connection to HBase

Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);

You can set properties on the configuration object in code, but using the server's XML configuration file is more manageable.

Connection objects are expensive, and should be re-used. They are used to create Table and Admin objects for DML and DDL operations. The Connection object should be closed when your data access is finished, typically calling close() within a finally block.
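The close-in-finally advice can be sketched with the standard library alone. FakeConnection below is a hypothetical stand-in, not an HBase class; it only records when it is opened, used, and closed. The second method shows try-with-resources, which should give the same guarantee for any object that implements AutoCloseable.

```java
import java.util.ArrayList;
import java.util.List;

class CloseDemo {
    // Records the lifecycle events so the order can be inspected.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for an expensive client object.
    static class FakeConnection implements AutoCloseable {
        FakeConnection() { events.add("open"); }
        void read() { events.add("read"); }
        @Override public void close() { events.add("close"); }
    }

    // Classic form: close() is called in a finally block.
    static void withFinally() {
        FakeConnection connection = new FakeConnection();
        try {
            connection.read();
        } finally {
            connection.close();
        }
    }

    // Equivalent form: try-with-resources calls close() for you.
    static void withResources() {
        try (FakeConnection connection = new FakeConnection()) {
            connection.read();
        }
    }

    public static void main(String[] args) {
        withFinally();
        withResources();
        System.out.println(events);
    }
}
```

In real client code the resource would be the Connection, and the Table objects created from it, rather than the stand-in.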

Reading Data with Java

Using the Connection object, you can get a reference to a specific table, and you can use that to read and write data. The Java API works at the byte array level, so you need to decide on how to encode your data (natively, or converting all values to strings), and encode and decode all data.
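To make the encoding choice concrete, here is a small sketch that uses only the standard library. It encodes the same number both ways: as its string form, and natively as a fixed-width integer. The class and method names are made up for the illustration, and the native layout shown (4 bytes, big-endian) is simply what ByteBuffer produces by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // String encoding: the digits of the number as UTF-8 text.
    static byte[] encodeAsString(int value) {
        return String.valueOf(value).getBytes(StandardCharsets.UTF_8);
    }

    static int decodeFromString(byte[] bytes) {
        return Integer.parseInt(new String(bytes, StandardCharsets.UTF_8));
    }

    // Native encoding: always 4 bytes, big-endian (ByteBuffer's default order).
    static byte[] encodeNative(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    static int decodeNative(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        System.out.println(encodeAsString(120).length); // one byte per digit
        System.out.println(encodeNative(120).length);   // fixed width
        System.out.println(decodeNative(encodeNative(120)));
    }
}
```

Whichever scheme you pick, the reader and the writer have to agree on it, because the stored bytes carry no type information.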

There are helper classes in the HBase client package, which simplify that. Code Listing 24 shows how to get a Table object from the connection, and fetch a whole row using a Get object:

Code Listing 24: Reading a Row with Java

Table access_logs = connection.getTable(TableName.valueOf("access-logs"));
Get get = new Get(Bytes.toBytes("elton|jericho|201511"));
Result result = access_logs.get(get);

Note that the table name is created using the TableName class, and the row key is encoded to bytes using the Bytes utility class. When this code runs, the whole row will be in the result object, with every cell held as byte arrays.

The Result class has a listCells() method, which returns a list of Cell objects; navigating the byte arrays in those objects is cumbersome, but another helper class, CellUtil, simplifies it. Code Listing 25 shows how to navigate the Cell array, printing out the column name and value for each cell:

Code Listing 25: Reading Cell Values with Java

for (Cell cell : result.listCells()) {
  System.out.println(Bytes.toString(CellUtil.cloneFamily(cell)) + ":" +
                     Bytes.toString(CellUtil.cloneQualifier(cell)) + " = " +
                     Bytes.toString(CellUtil.cloneValue(cell)));
}
//output -
//t:1106 = 120
//t:1107 = 650

The Get class can be used to return a restricted set of cells from the row. Code Listing 26 shows the use of the addFamily() method to return the cells in one column family for the row, and the addColumn() method to limit the response to a single cell.

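Again, the identifiers need to be byte arrays, so the Bytes class is used to encode string values. The printCells() call in the listing is a helper method that writes out each cell in the result, along the lines of Code Listing 25: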

Code Listing 26: Reading Specific Cells with Java

get = new Get(Bytes.toBytes("elton|jericho|201511"));
get.addFamily(Bytes.toBytes("t"));
result = access_logs.get(get);
printCells(result);
//output - single column family:
//t:1106 = 120
//t:1107 = 650

get = new Get(Bytes.toBytes("elton|jericho|201511"));
get.addColumn(Bytes.toBytes("t"), Bytes.toBytes("1106"));
result = access_logs.get(get);
printCells(result);
//output - single column:
//t:1106 = 120

Tip: If you see code samples that use the HTable class and instantiate it directly with a Configuration object, that approach is deprecated in the API. The newer way, which I'm using, is with the ConnectionFactory, Connection, and Table classes.

Working with Scanners in Java

To scan for a range of rows, you create a Scan object with the start and (optionally) stop row key boundaries, and pass it to the getScanner() method on the Table class. That creates the scanner on the server and returns a ResultScanner object you can use to iterate the rows.

Each iteration returns a Result object, as in Code Listing 27, where I use a helper method printCells() to write the output:

Code Listing 27: Scanning Rows with Java

Table access_logs = connection.getTable(TableName.valueOf("access-logs"));
Scan scan = new Scan(Bytes.toBytes("elton|jericho|201510"),
                     Bytes.toBytes("elton|jericho|x"));
ResultScanner scanner = access_logs.getScanner(scan);
for (Result result : scanner) {
    printCells(result);
}

//output - three cells, two whole rows:
//[elton|jericho|201510] t:2908 = 80
//[elton|jericho|201511] t:1106 = 120
//[elton|jericho|201511] t:1107 = 650
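The way the start and stop keys bound the scan can be sketched with a sorted map from the standard library. This is only an analogy: the TreeMap stands in for the table, and the two rows outside the range are hypothetical, added to show what gets excluded. It assumes the start key is inclusive and the stop key exclusive, and that keys sort in the same order as their bytes (true for the plain ASCII keys used here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ScanRangeDemo {
    // Hypothetical in-memory stand-in for a table: row keys kept in sorted order.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("elton|jericho|201510", "t:2908 = 80");
        rows.put("elton|jericho|201511", "t:1106 = 120, t:1107 = 650");
        rows.put("alice|jericho|201511", "t:0900 = 10");  // hypothetical row, sorts before the range
        rows.put("elton|sixeyed|201511", "t:0900 = 20");  // hypothetical row, sorts after the range
        return rows;
    }

    // Returns the row keys from startRow (inclusive) up to stopRow (exclusive).
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        System.out.println(scan(sampleTable(), "elton|jericho|201510", "elton|jericho|x"));
    }
}
```

The stop key "elton|jericho|x" works because the letter x sorts after every digit, so every period under the elton|jericho prefix falls inside the range.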

You can tune the scanner performance by specifying properties on the Scan object:

  • setCaching – Specify how many rows are fetched from the server in each call and buffered for the client. Larger values mean fewer round trips, so the client can iterate the scanner more quickly, at the cost of memory.
  • setMaxResultSize – Specify the maximum size, in bytes, of the results returned by each fetch from the server. Useful for capping memory use when rows are large; it does not limit the total number of cells the scan returns.
  • setBatch – Specify the maximum number of cells to return in each Result as you iterate, so a very wide row is split across several iterations.
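As a rough illustration of what a batch limit means for a wide row, the sketch below splits one row's cells into chunks of at most n cells, the way successive iterations would see them. It is plain Java with no HBase classes; the split() helper is hypothetical, and it rejects non-positive values purely to keep the example simple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BatchDemo {
    // Splits the cells of one row into consecutive chunks of at most 'batch' cells.
    static <T> List<List<T>> split(List<T> cells, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < cells.size(); i += batch) {
            int end = Math.min(i + batch, cells.size());
            chunks.add(new ArrayList<>(cells.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Five cells with a batch of two arrive as chunks of 2, 2, and 1.
        System.out.println(split(Arrays.asList("c1", "c2", "c3", "c4", "c5"), 2));
    }
}
```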

Note that the Scan instance is modified as the scanner is iterated, so you should create new instances for each scan you want to do. Other than the original Connection object, the client objects are cheap to create and don’t need to be reused.

You can also restrict the cells in the results from the scanner with the addFamily() and addColumn() methods, which work in the same way as for the Get class.

Scanners and Filters in Java

By adding a filter to a scanner, you can perform complex queries. All the selection happens server-side, but you need to remember that while the row scan is fast, the column filtering is slower, especially for wide tables with lots of columns.

Filters are strongly typed in the Java API, inheriting from the abstract FilterBase class. There are a number of filters of varying usefulness—the tree view in the Javadoc for the package org.apache.hadoop.hbase.filter is a good place to examine them.

The ValueFilter is a useful example; it filters cells by comparing their value against a comparator, using a comparison operator you provide. If you store cell values as strings, you can filter the response to column values that match a regular expression, as in Code Listing 28:

Code Listing 28: Scanning and Filtering Rows with Java

scan = new Scan(Bytes.toBytes("elton|jericho|201510"),
                Bytes.toBytes("elton|jericho|x"));
scan.setFilter(new ValueFilter(CompareOp.EQUAL,
                new RegexStringComparator("[5-9][0-9]0")));
scanner = access_logs.getScanner(scan);
for (Result result : scanner) {
    printCells(result);
}

//output - one cell:
//[elton|jericho|201511] t:1107 = 650

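The combination of ValueFilter and RegexStringComparator means cells will only be included in the response if their value contains a three-digit sequence between 500 and 990, ending in zero. The pattern is not anchored, so for the sample data this is the same as matching the whole value; with longer values you would write ^[5-9][0-9]0$ to match the complete value. That filter works across all columns in all families; a family name or qualifier isn't required.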
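You can try the pattern against the sample values with java.util.regex before running it on the server. This sketch assumes the comparator searches for the pattern anywhere in the value, which is what the results in these listings suggest; the matching() helper and the value 16500 are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexFilterDemo {
    // Keeps the values in which the pattern is found anywhere (an unanchored search).
    static List<String> matching(String regex, List<String> values) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value).find()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("80", "120", "650");
        System.out.println(matching("[5-9][0-9]0", sample));
        // A longer, hypothetical value also matches unless the pattern is anchored.
        System.out.println(matching("[5-9][0-9]0", Arrays.asList("16500")));
        System.out.println(matching("^[5-9][0-9]0$", Arrays.asList("650", "16500")));
    }
}
```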

With the Java API, you can also combine many filters using a FilterList object, and specify the inclusion criteria, whether rows must match all of the filters or just one.

You can combine any filters in a list. Code Listing 29 shows a list that filters on column qualifier name and cell value, both using regular expressions:

Code Listing 29: Scanning with Multiple Filters in Java

FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filterList.addFilter(new QualifierFilter(CompareOp.EQUAL,
                      new RegexStringComparator("[0-9]{2}0[7-8]")));
filterList.addFilter(new ValueFilter(CompareOp.EQUAL,
                      new RegexStringComparator("[0-9]0")));
scan = new Scan(Bytes.toBytes("elton|jericho|201510"),
                Bytes.toBytes("elton|jericho|x"));
scan.setFilter(filterList);
scanner = access_logs.getScanner(scan);
for (Result result : scanner) {
    printCells(result);
}

//output - two cells:
//[elton|jericho|201510] t:2908 = 80
//[elton|jericho|201511] t:1107 = 650
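The difference between the two inclusion criteria can be sketched with standard-library predicates over the three sample cells. This is a simplified model, not the HBase filter classes: SampleCell and the helper methods are hypothetical, and each predicate does an unanchored regular expression search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class FilterListDemo {
    // Hypothetical stand-in for a cell: a qualifier and a string value in family t.
    static final class SampleCell {
        final String qualifier;
        final String value;
        SampleCell(String qualifier, String value) {
            this.qualifier = qualifier;
            this.value = value;
        }
        @Override public String toString() { return "t:" + qualifier + " = " + value; }
    }

    static List<SampleCell> sampleCells() {
        return Arrays.asList(new SampleCell("2908", "80"),
                             new SampleCell("1106", "120"),
                             new SampleCell("1107", "650"));
    }

    static Predicate<SampleCell> qualifierMatches(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return cell -> pattern.matcher(cell.qualifier).find();
    }

    static Predicate<SampleCell> valueMatches(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return cell -> pattern.matcher(cell.value).find();
    }

    static List<String> keep(List<SampleCell> cells, Predicate<SampleCell> predicate) {
        List<String> kept = new ArrayList<>();
        for (SampleCell cell : cells) {
            if (predicate.test(cell)) kept.add(cell.toString());
        }
        return kept;
    }

    // Both conditions must hold, like a list where every filter must pass.
    static List<String> mustPassAll(List<SampleCell> cells) {
        return keep(cells, qualifierMatches("[0-9]{2}0[7-8]").and(valueMatches("[0-9]0")));
    }

    // Either condition is enough, like a list where one passing filter suffices.
    static List<String> mustPassOne(List<SampleCell> cells) {
        return keep(cells, qualifierMatches("[0-9]{2}0[7-8]").or(valueMatches("[0-9]0")));
    }

    public static void main(String[] args) {
        System.out.println(mustPassAll(sampleCells()));
        System.out.println(mustPassOne(sampleCells()));
    }
}
```

With the stricter criterion the cell t:1106 drops out because its qualifier fails the first pattern; with the looser one it comes back because its value, 120, still satisfies the second.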

Writing Data in Java

The Java API provides basic data updates, much like the put command in the HBase Shell, but also adds some more flexible functionality.

The Put class is the write equivalent of the Get class. You instantiate it for a specified row key, and then you can add one or more column values, before making the changes by calling the put() method on the relevant Table object, as in Code Listing 30:

Code Listing 30: Updating Cells with Put in Java

Table access_logs = connection.getTable(TableName.valueOf("access-logs"));
Put log = new Put(Bytes.toBytes("elton|jericho|201511"));
log.addColumn(Bytes.toBytes("t"),      //family
              Bytes.toBytes("1621"),   //qualifier
              Bytes.toBytes("340"));   //value
access_logs.put(log);

//result - updated cell value:
//t:1621 = 340

You can add multiple cell values to a Put object, which will atomically set multiple values on a single row, and an overload of the addColumn() method allows you to specify a timestamp for the cell.

The Put object is also used in the Table method checkAndPut(), which makes a conditional update to a cell. The method takes a column name and cell value to check before making the update. If the provided value matches, then the put is made atomically; if not, the row is not changed.

Code Listing 31 shows how checkAndPut() is used to add a new cell to a row, but only if an existing cell in that same row has the expected value. In this case, I'm telling HBase to add a column t:1622, but only if the value of t:1621 is 34000, which it isn't, so the update should not be made:

Code Listing 31: Updating Conditionally with Java

Put newLog = new Put(Bytes.toBytes("elton|jericho|201511"));
newLog.addColumn(Bytes.toBytes("t"),        //family
                 Bytes.toBytes("1622"),     //qualifier
                 Bytes.toBytes("100"));     //value
access_logs.checkAndPut(Bytes.toBytes("elton|jericho|201511"),  //row to check
                        Bytes.toBytes("t"),                     //family
                        Bytes.toBytes("1621"),                  //qualifier
                        Bytes.toBytes("34000"),                 //expected value
                        newLog);

//result - not updated, checked value doesn't match
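The check-then-write behavior can be modeled in a few lines of plain Java. This is a conceptual sketch only: CheckAndPutDemo is a hypothetical stand-in for a single row, and a lock plays the part of the guarantee that nothing changes between the check and the write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class CheckAndPutDemo {
    // Hypothetical stand-in for one row: column name mapped to its value.
    private final Map<String, String> row = new HashMap<>();

    synchronized void put(String column, String value) {
        row.put(column, value);
    }

    synchronized String get(String column) {
        return row.get(column);
    }

    // The check and the write happen under one lock, so no other writer
    // can change the checked column in between. Returns true if the write was made.
    synchronized boolean checkAndPut(String checkColumn, String expected,
                                     String putColumn, String putValue) {
        if (!Objects.equals(row.get(checkColumn), expected)) {
            return false;
        }
        row.put(putColumn, putValue);
        return true;
    }

    public static void main(String[] args) {
        CheckAndPutDemo demo = new CheckAndPutDemo();
        demo.put("t:1621", "340");
        // Expected value does not match, so t:1622 is not written.
        System.out.println(demo.checkAndPut("t:1621", "34000", "t:1622", "100"));
        // Expected value matches, so t:1622 is written.
        System.out.println(demo.checkAndPut("t:1621", "340", "t:1622", "100"));
    }
}
```

The checkAndPut() method on Table also reports success as a boolean, so your code can tell whether the put was applied.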

Code Listing 32 shows the result of running the two put methods, from the HBase Shell. Cell t:1621 has the value 340, so the new cell t:1622 hasn't been added:

Code Listing 32: Fetching Cells Updated with Java

hbase(main):002:0> get 'access-logs', 'elton|jericho|201511'
COLUMN                CELL
 t:1106               timestamp=1447703111745, value=120
 t:1107               timestamp=1447703111735, value=650
 t:1621               timestamp=1447709413579, value=340
3 row(s) in 0.0730 seconds

The Java API also lets you make multiple updates to different rows in a single batch. The same Put class is used to define the changes, and multiple Put objects are added to a list. The list is used with the batch() method on the Table class, which submits the updates together in one call, as shown in Code Listing 33:

Code Listing 33: Batch Updating Cells with Java

List<Row> batch = new ArrayList<Row>();

Put put1 = new Put(Bytes.toBytes("elton|jericho|201512"));
put1.addColumn(Bytes.toBytes("t"),
               Bytes.toBytes("0109"),
               Bytes.toBytes("670"));
batch.add(put1);

Put put2 = new Put(Bytes.toBytes("elton|jericho|201601"));
put2.addColumn(Bytes.toBytes("t"),
               Bytes.toBytes("0110"),
               Bytes.toBytes("110"));
batch.add(put2);

Put put3 = new Put(Bytes.toBytes("elton|jericho|201602"));
put3.addColumn(Bytes.toBytes("t"),
               Bytes.toBytes("0206"),
               Bytes.toBytes("500"));
batch.add(put3);

Table access_logs = connection.getTable(TableName.valueOf("access-logs"));
Object[] results = new Object[batch.size()];
access_logs.batch(batch, results);

You can include other operations in a batch, so you could add Delete objects with Put objects. The batch can include Get objects to return a set of results, but the ordering of the batch is not guaranteed—so if a Get contains the same cells as a Put, you may get the data in the state it was before the Put.

Code Listing 34 shows the result of executing that batch, as seen from a scan command in the HBase Shell:

Code Listing 34: Fetching Cells from a Java Batch Update

hbase(main):003:0> scan 'access-logs', {STARTROW => 'elton|jericho|201512'}
ROW                   COLUMN+CELL
 elton|jericho|201512 column=t:0109, timestamp=1447710255527, value=670
 elton|jericho|201601 column=t:0110, timestamp=1447710255527, value=110
 elton|jericho|201602 column=t:0206, timestamp=1447710255527, value=500
3 row(s) in 0.0680 seconds

Tip: This is the first mention of deleting data in HBase. I haven't covered it because I find you do it rarely, but you can delete individual cells and rows in tables. HBase uses a delete marker to flag deleted values, rather than immediately removing the data from disk, so the delete operation is fast.
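The delete-marker idea can be sketched as a toy store in plain Java. Everything here is hypothetical and heavily simplified (no timestamps or versions): delete() only records a marker, reads hide marked values, and a separate purge() step stands in for whatever later cleanup physically removes the data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DeleteMarkerDemo {
    private final Map<String, String> stored = new HashMap<>();   // data still held in storage
    private final Set<String> deleteMarkers = new HashSet<>();    // keys flagged as deleted

    void put(String key, String value) {
        stored.put(key, value);
        deleteMarkers.remove(key); // simplification: a new write makes the key visible again
    }

    // Fast: only a marker is written, the stored value is left in place.
    void delete(String key) {
        deleteMarkers.add(key);
    }

    // Reads treat marked keys as absent.
    String get(String key) {
        return deleteMarkers.contains(key) ? null : stored.get(key);
    }

    int storedCount() {
        return stored.size();
    }

    // Hypothetical cleanup pass: physically drops marked values and their markers.
    void purge() {
        stored.keySet().removeAll(deleteMarkers);
        deleteMarkers.clear();
    }

    public static void main(String[] args) {
        DeleteMarkerDemo store = new DeleteMarkerDemo();
        store.put("elton|jericho|201511/t:1106", "120");
        store.delete("elton|jericho|201511/t:1106");
        System.out.println(store.get("elton|jericho|201511/t:1106")); // hidden from reads
        System.out.println(store.storedCount());                      // value still stored
        store.purge();
        System.out.println(store.storedCount());                      // value removed
    }
}
```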

Summary

The Java API is the richest client interface for HBase, and there are many more features than I've covered here, including incrementing counter columns, functions for accessing random data items (useful for integration testing), and a set of admin operations.

Java access is direct to the region servers (or master server for admin functions), and it's the most efficient API. It doesn't require any additional JVMs to run on the region servers, and the client is region-aware, so it requests data directly from the server that is hosting the region.

Java is also the native language for extending HBase on the server. Co-processors are an advanced topic that I won't cover in this book, but they are valuable in many scenarios. You write a co-processor in Java and make the package available to the Region Servers on HDFS. Then your code can be invoked on the server in response to data operations, like when rows are added or cell values are changed, similar to triggers in SQL databases.

Even if you work exclusively in Java, it's good to know what the external HBase APIs offer and how they work. In the next chapter, we'll cover the Thrift API, which has near-parity with the Java feature set, and can be consumed from many client libraries. We'll cover using Thrift with Python.
