HBase Succinctly®
by Elton Stoneman

CHAPTER 2

HBase and the HBase Shell


HBase Run Modes

You can run HBase on a single machine for development and test environments. HBase supports three run modes: Standalone, Pseudo-Distributed, and Distributed. Distributed mode is for a full cluster, backed by HDFS, with multiple servers running different components of the HBase stack, which we will cover in Chapter 7, “The Architecture of HBase.”

Standalone mode is for a single machine, where all components run in a single Java Virtual Machine, and the local file system is used for storage rather than HDFS. Pseudo-Distributed mode runs each HBase component in a different JVM on one server, and it can use HDFS or the local filesystem.

Tip: Pseudo-Distributed mode is a good option for running locally—you get a production-style separation between components, but without having to run multiple machines.

The HBase documentation covers installing and running HBase locally, so I won’t duplicate it here, but the easiest way to run HBase locally is with Docker. There are a few HBase images on the Docker Hub, including one of my own, which I’ve built to go along with this book.

Docker is an application container technology. A container is a fast, lightweight unit of compute that lets you run multiple workloads on a single machine. Containers are conceptually similar to virtual machines, but much lighter on disk, CPU, and memory usage. Docker runs on Linux, OS X, and Windows machines, and installation instructions are available on the Docker website. The Docker Hub is a public registry of pre-built images, and my image for this book is available there as sixeyed/hbase-succinctly.

Running HBase in a Docker Container

The advantage of using a Docker container is that you can spin up and kill instances with very little overhead, and you don’t need to worry about any software or service conflicts with your development machine.  

The image hbase-succinctly is one I’ve put together specially to go along with this book, and it sets up the services we’ll be using. To use that image, install Docker and execute the command in Code Listing 5:

Code Listing 5: Running HBase in Docker

docker run -d -p 2181:2181 \
 -p 60010:60010 -p 60000:60000 \
 -p 60020:60020 -p 60030:60030 \
 -p 8080:8080 -p 8085:8085 \
 -p 9090:9090 -p 9095:9095 \
 --name hbase -h hbase \
 sixeyed/hbase-succinctly

Some of the settings in the docker run command are optional, but if you want to code along with the samples in this book, you'll need to run the full command. If you're not familiar with Docker, here’s what the command does:

  • Pulls the image called hbase-succinctly from the sixeyed repository in the public Docker Hub
  • Runs the image in a container locally, with all the key ports exposed for the servers and Web UI
  • Names the container hbase, so we can control it with other Docker commands without knowing the container ID Docker will assign
  • Gives the container the hostname hbase, so we can access it using that name

The first time the command runs, it will take a while to pull the image from the registry to your local machine, but for future runs, the container will start in a few seconds, and you’ll have a functioning HBase instance with the Java, REST, and Thrift APIs running.

Note: This Docker command exposes all the ports for ZooKeeper, the HBase external APIs, and the Web UIs. In case they conflict with other servers on your machine, you can change the local port by altering the number before the colon on the -p argument. For example, to have Thrift listening on port 91 on the Docker host, instead of 9090, use -p 91:9090.
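
As a rough illustration of that host:container mapping, here is a minimal Python sketch that rebuilds the arguments from Code Listing 5 and lets you override individual host ports. The function name docker_run_args and the overrides dictionary are hypothetical helpers of mine, not part of Docker, and the script only prints the command rather than running it.

```python
# Container ports published by the docker run command in Code Listing 5.
CONTAINER_PORTS = [2181, 60010, 60000, 60020, 60030, 8080, 8085, 9090, 9095]


def docker_run_args(host_port_overrides=None):
    """Build the docker run argument list (hypothetical helper).

    host_port_overrides maps a container port to the host port that should
    be used instead of the default same-numbered port.
    """
    overrides = host_port_overrides or {}
    args = ["docker", "run", "-d"]
    for container_port in CONTAINER_PORTS:
        # The number before the colon is the host port; after it, the container port.
        host_port = overrides.get(container_port, container_port)
        args += ["-p", f"{host_port}:{container_port}"]
    args += ["--name", "hbase", "-h", "hbase", "sixeyed/hbase-succinctly"]
    return args


# Publish Thrift (container port 9090) on host port 91, as in the note above.
args = docker_run_args({9090: 91})
print(" ".join(args))
```

Only the number before the colon changes; clients inside the container still see Thrift on 9090.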

This image uses the local filesystem for HBase data storage, which isn’t preserved when the container is removed, so you can reset your database back to its original state with docker rm -f hbase (which kills the container and removes it, freeing the hbase name) and then repeat the docker run command.

Using the HBase Shell

HBase comes with a command line interface, the HBase Shell. The Shell can’t be used on a remote machine, so you need to run it from the local machine (for HBase Standalone and Pseudo-Distributed modes), or by logging on to the Master Server (for Distributed mode).

From the HBase bin directory, run hbase shell to start the Shell. If you’re running HBase through my Docker image, connect by running the interactive command in Code Listing 6:

Code Listing 6: Running HBase Shell in Docker

docker exec -it hbase hbase shell

The HBase Shell is based on JRuby, and you can use it to execute script files, as well as interactive commands. A large number of commands are available in the Shell. It’s the ideal place to start with HBase, and you will also use it in production for administering the cluster.

We’ll cover the main commands for defining tables and reading and writing data in this chapter.

Working with Tables

Three shell commands will get you started with a new HBase database:

  • list
  • create
  • describe

To see all the tables in the database, run list and you’ll see the output in two forms: as a plain text list, and as an array representation. Code Listing 7 shows the sample output:

Code Listing 7: Listing Tables

hbase(main):001:0> list
TABLE
api-logs
social-usage
2 row(s) in 0.2550 seconds

=> ["api-logs", "social-usage"]

The list command supplies only the table names, with no other details. For a large database, you can filter the output by supplying a regular expression to match against the table names; for example, list 'so.*' shows the tables whose names start with so.
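
To get a feel for how the filter behaves, the following Python sketch applies the same pattern to the table names from Code Listing 7. It is only a model: list_tables is a hypothetical function, I am assuming the pattern has to match the whole table name, and Python's regular expression dialect is not identical to the Java one that HBase uses.

```python
import re

# Table names from Code Listing 7.
tables = ["api-logs", "social-usage"]


def list_tables(names, pattern=".*"):
    """Return the names that the pattern matches in full (hypothetical model)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


matches = list_tables(tables, "so.*")
print(matches)  # ['social-usage']
```

Under that whole-name assumption, a pattern such as 'logs' would match nothing, whereas '.*logs' would find api-logs.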

To create a table, use the create command, specifying the table name and the column family names. Optionally, you can pass configuration settings for the column families, and this is how you can change the number of cell versions that HBase stores. Code Listing 8 shows two create commands:

Code Listing 8: Creating Tables

hbase(main):007:0> create 'with-default-config', 'cf1'
0 row(s) in 1.2380 seconds

=> Hbase::Table - with-default-config
hbase(main):008:0> create 'with-custom-config', {NAME => 'cf1', VERSIONS => 3}
0 row(s) in 1.2320 seconds

=> Hbase::Table - with-custom-config

The table with-default-config has a single column family, cf1, with no configuration specified, so it will use the HBase defaults (including having a single cell version). Table with-custom-config also has a column family named cf1, but with a custom configuration setting specifying three cell versions.

Note: The HBase Shell uses Ruby syntax: curly braces define an object, and its properties are specified as name-value pairs, with ‘=>’ between each name and its value.

To see the configuration for a table, use the describe command. The output tells you whether the table is enabled for client access, and includes all the column families with all their settings, as in Code Listing 9:

Code Listing 9: Describing Tables

hbase(main):009:0> describe 'with-custom-config'
Table with-custom-config is ENABLED
with-custom-config
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1240 seconds

Now that we have some tables with different configurations, we can start adding and retrieving data with the HBase Shell.

Writing Cell Values

With the HBase Shell, you use the put command to write data for individual cells, specifying the table name, row key, column name, and cell value. The command in Code Listing 10 sets the value of the data column in the cf1 column family to v1 for the row with key rk1:

Code Listing 10: Writing Data

hbase(main):019:0> put 'with-default-config', 'rk1', 'cf1:data', 'v1'
0 row(s) in 0.0070 seconds

The put command works like an insert/update. If the row doesn’t exist, the command creates it, creates the column, and sets the cell value. If the row exists but the column doesn’t, the command adds the column to the row and sets the cell value.

Using the default column family configuration with a single cell version, if the row and column already exist, then the command overwrites the existing value with the new one. But with a column family with more than one version, the put command adds a new cell version with the value and sets the timestamp.

You can optionally specify a custom timestamp value to use, as a long integer after the cell value. Code Listing 11 explicitly specifies a timestamp in the past—one millisecond after the UNIX epoch (January 1, 1970):

Code Listing 11: Writing Data with a Timestamp

hbase(main):020:0> put 'with-default-config', 'rk1', 'cf1:data', 'v0', 1
0 row(s) in 0.0060 seconds

Tip: Be careful about specifying timestamps in updates. If your timestamp is earlier than the other versions of the cell (even if there’s only one version configured in the column family), your data is assumed to be older than the current data, and the cell value won’t be overwritten—and your update gets lost.
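
The versioning rules are easier to see in a toy model. The Python sketch below is my own simplification, not HBase code: the Cell class is hypothetical, the timestamp is always passed explicitly (HBase would default it to the current time), and surplus versions are pruned immediately, whereas I understand HBase to clean them up later in the background.

```python
class Cell:
    """Toy model of one HBase cell with a bounded version history."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        # A put with an existing timestamp replaces that version.
        self.versions[timestamp] = value
        # Keep only the newest max_versions timestamps.
        newest = sorted(self.versions, reverse=True)[: self.max_versions]
        self.versions = {ts: self.versions[ts] for ts in newest}

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        newest = sorted(self.versions, reverse=True)[:versions]
        return [(ts, self.versions[ts]) for ts in newest]


# Single version: a put with an older timestamp is silently lost.
single = Cell(max_versions=1)
single.put("v1", timestamp=1446655928221)
single.put("v0", timestamp=1)
print(single.get())  # [(1446655928221, 'v1')]

# Three versions: each put adds a version, and reads return the newest first.
multi = Cell(max_versions=3)
multi.put("v1", timestamp=1446655928221)
multi.put("v2", timestamp=1446655929977)
multi.put("v3", timestamp=1446655931606)
print(multi.get(versions=3))
```

The first case is the lost update from the tip above: the write with timestamp 1 succeeds, but it never becomes the current value.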

Reading Data

You use the get command to read the data on a row, but unlike the put command, you can use get to read multiple cell values. To read a whole row, use get with the table name and row key. HBase will return the most recent version of every cell value in the row, as shown in Code Listing 12:

Code Listing 12: Reading a Whole Row

hbase(main):017:0> get 'social-usage', 'a'
COLUMN                          CELL
 i:tw                           timestamp=1446655378543, value=@EltonStoneman
 t:tw                           timestamp=1446655459639, value=900
 tw:2015110216                  timestamp=1446655423853, value=310
 tw:2015110316                  timestamp=1446655409785, value=270
 tw:2015110417                  timestamp=1446655398909, value=320
5 row(s) in 0.0360 seconds
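
The timestamp values in that output are long integers counting milliseconds since the UNIX epoch, the same unit used for custom timestamps with put. If you want to read them as dates, a small Python sketch like this one does the conversion; the two function names are hypothetical helpers, and I am treating all values as UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_hbase_timestamp(millis):
    """Convert milliseconds since the UNIX epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def to_hbase_timestamp(moment):
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


# Timestamp of the i:tw cell in the listing above.
written = from_hbase_timestamp(1446655378543)
print(written.isoformat())  # 2015-11-04T16:42:58.543000+00:00
```

Going the other way, to_hbase_timestamp gives you a value you could pass as the final argument to put.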

You can restrict the results to specific columns or families, with a comma-separated list of column family names or qualifiers after the row key. In Code Listing 13, we return the whole i family, and just one column from the tw family:

Code Listing 13: Reading Specific Columns

hbase(main):022:0> get 'social-usage', 'a', 'tw:2015110316', 'i'
COLUMN                          CELL
 i:tw                           timestamp=1446655378543, value=@EltonStoneman
 tw:2015110316                  timestamp=1446655409785, value=270
2 row(s) in 0.0090 seconds

You can also pass objects with many properties, instead of just strings, to return more specific details from the rows. I can return multiple versions for a column by specifying the VERSIONS property, with the number to return, as shown in Code Listing 14:

Code Listing 14: Reading Multiple Cell Versions

hbase(main):027:0> get 'with-custom-config', 'rk1', {COLUMN => 'cf1:data', VERSIONS => 3}
COLUMN                          CELL
 cf1:data                       timestamp=1446655931606, value=v3
 cf1:data                       timestamp=1446655929977, value=v2
 cf1:data                       timestamp=1446655928221, value=v1
3 row(s) in 0.0120 seconds

Scanning Rows

Rows in HBase tables are physically stored in order, sorted by row key. We'll look at the structure of the data and the indexing approach in Chapter 8, “Inside the Region Server,” but for now, we just need to know that:

  • Sorted tables mean fast direct access by key.
  • Sorted tables mean slow searches by value.

In fact, you can't search for row keys matching an arbitrary pattern in HBase. If you have a table storing system access logs, with row keys that start {systemID}|{userID}, you can't search for the logs for one particular user because the user ID is in the middle of the row key.

With HBase, you find matching rows by scanning the table, providing start and end boundaries for the scan. Logically, HBase then works like a cursor, positioning the table to the start row (or a partial match of the start row) and reading until the end row.

The scan command is straightforward, but the results can be unexpected for new users of HBase. Table 2 shows some sample data from my access-logs table:

Row key
jericho|dave|201510
jericho|elton|201510
jericho|elton|201511
jericho|fred|201510

Table 2: Sample Row Keys

We have four row keys here, all for the system called Jericho, for users Dave, Elton and Fred, for the months of October and November, 2015. The rows are listed in Table 2 in the same lexicographical order as they are stored in the HBase table.
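
You can reproduce that ordering outside HBase. The Python sketch below shuffles the four keys from Table 2 and sorts them by their raw bytes, which is my approximation of how HBase compares row keys; for plain ASCII keys like these, it gives the same result as an ordinary string sort.

```python
# Row keys from Table 2, deliberately out of order.
row_keys = [
    "jericho|fred|201510",
    "jericho|elton|201511",
    "jericho|dave|201510",
    "jericho|elton|201510",
]

# Compare keys byte by byte, approximating the stored order in the table.
stored_order = sorted(row_keys, key=lambda key: key.encode("utf-8"))

for key in stored_order:
    print(key)
```

The output matches Table 2: dave, then the two elton rows in date order, then fred.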

To find all the rows for Jericho access, I can scan the table with a STARTROW value, as in Code Listing 15:

Code Listing 15: Scanning with a STARTROW

hbase(main):006:0> scan 'access-logs', {STARTROW => 'jericho'}
ROW                             COLUMN+CELL
 jericho|dave|201510            column=t:3015, timestamp=1446706437576, value=60
 jericho|elton|201510           column=t:3015, timestamp=1446706444028, value=700
 jericho|elton|201511           column=t:0416, timestamp=1446706449473, value=800
 jericho|fred|201510            column=t:0101, timestamp=1446706454401, value=450
4 row(s) in 0.0540 seconds

To find all Elton's access of Jericho, the STARTROW needs to contain the user ID, and I need to add an ENDROW value to exclude any rows after Elton. This is where the scan gets more interesting. I could use an ENDROW value of jericho|f, and that would get me just Elton's logs, as in Code Listing 16:

Code Listing 16: Scanning with an ENDROW

hbase(main):010:0> scan 'access-logs', {STARTROW => 'jericho|elton', ENDROW => 'jericho|f'}
ROW                             COLUMN+CELL
 jericho|elton|201510           column=t:3015, timestamp=1446706444028, value=700
 jericho|elton|201511           column=t:0416, timestamp=1446706449473, value=800
2 row(s) in 0.0190 seconds

That query works for now, but if we later add rows for a user called Ernie, the same query will return their logs too. So the STARTROW and ENDROW for a scan need to be as specific as you can make them, without losing the flexibility of your query.

A query that returns all of Elton's logs for any time period could use STARTROW => 'jericho|elton|' and ENDROW => 'jericho|elton|x'. Knowing your ASCII character codes helps here.

The pipe character (ASCII 124) has a higher value than any alphanumeric character, so including the pipe after the username ensures no other users' logs will creep into the scan. The character x (ASCII 120) is higher than any digit (ASCII 48 to 57), so using it at the end of the ENDROW value means the query will return rows for any year.
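
Here is a quick way to check those character codes and boundaries in Python. It is a sketch of the comparison logic only, not a real scan: in_range is a hypothetical function, and the ernie and eltonjohn rows are invented users that I have added to test the edges.

```python
# Character codes that the scan boundaries rely on.
print(ord("|"), ord("z"), ord("x"), ord("9"))  # 124 122 120 57

START_ROW = "jericho|elton|"
END_ROW = "jericho|elton|x"


def in_range(row_key):
    """Model of a scan boundary check: start inclusive, end exclusive."""
    return START_ROW <= row_key < END_ROW


keys = [
    "jericho|dave|201510",
    "jericho|elton|201510",
    "jericho|elton|201511",
    "jericho|eltonjohn|201510",  # hypothetical user
    "jericho|ernie|201510",      # hypothetical user
    "jericho|fred|201510",
]

matched = [key for key in keys if in_range(key)]
print(matched)  # ['jericho|elton|201510', 'jericho|elton|201511']
```

The eltonjohn row is excluded because j sorts before the pipe, so that key falls before the start row, and the ernie row is excluded because it sorts after the end row.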

One final point about scans: the upper boundary is an 'up to' value, not an 'up to and including' value. If I want to see all Elton's access in October and November of 2015, the query in Code Listing 17 isn't correct:

Code Listing 17: Scanning 'up to' the ENDROW

hbase(main):014:0> scan 'access-logs', {STARTROW => 'jericho|elton|201510', ENDROW => 'jericho|elton|201511'}
ROW                             COLUMN+CELL
 jericho|elton|201510           column=t:3015, timestamp=1446706444028, value=700
1 row(s) in 0.0140 seconds

Using an ENDROW value of jericho|elton|201511 means HBase reads up to that row and then stops, without returning it. To include rows from 201511, I need an end row value that sorts after those rows; I could use jericho|elton|201512 in this case.
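
The 'up to' behavior is the familiar half-open range. As a minimal sketch, assuming the rows are held in a sorted Python list rather than in HBase, the hypothetical scan function below positions itself at the start row like a cursor and stops just before the end row:

```python
from bisect import bisect_left

# Row keys from the access-logs sample, in stored (sorted) order.
ROWS = [
    "jericho|dave|201510",
    "jericho|elton|201510",
    "jericho|elton|201511",
    "jericho|fred|201510",
]


def scan(rows, start_row, end_row):
    """Return the keys in [start_row, end_row): start inclusive, end exclusive."""
    first = bisect_left(rows, start_row)  # first key >= start_row
    last = bisect_left(rows, end_row)     # first key >= end_row, not returned
    return rows[first:last]


october_only = scan(ROWS, "jericho|elton|201510", "jericho|elton|201511")
both_months = scan(ROWS, "jericho|elton|201510", "jericho|elton|201512")

print(october_only)  # ['jericho|elton|201510']
print(both_months)   # ['jericho|elton|201510', 'jericho|elton|201511']
```

The first call mirrors Code Listing 17, and the second shows the effect of moving the end row on to 201512.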

Summary

In this chapter, we got started with HBase, running it locally from a Docker container. Then we connected using the HBase Shell and used table management commands to list all tables in the database, create new ones, and describe their configuration.

We saw how to write data using the put command, and read it using the get command, including working with multiple versions of a cell. And we saw how to scan tables for a range of rows, using start and end row keys for the boundaries of the scan.

We'll return to the HBase Shell in Chapter 9, “Monitoring and Administering HBase,” which covers more advanced topics. In the next chapter, we'll look more closely at HBase table design, row keys, and regions.

