CHAPTER 3

Internal Hive Tables



Why use internal tables?

Internal tables, also known as native or managed tables, are controlled by Hive; in fact, Hive owns the storage of these tables. They're still stored in HDFS, which means you get all the benefits of reliable, widely available data, and, if you choose a common format, you can still query Hive table files using other Hadoop tools.

You get the major benefit of more Hive functionality when you use internal tables. Currently, updating and deleting data are only possible with managed tables (we'll cover this more in Chapter 7, DDL and DML in Hive), and there are many HiveQL edge cases that work only with internal tables.

Using internal tables lets you focus on modeling the data in the way you want to use it while Hive worries about how the data is physically maintained. And for large datasets, you can configure sharding, so that tables are physically split across different files in order to improve performance.
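As a sketch of what that physical splitting looks like in practice, Hive expresses it through partitions and buckets declared at table creation. The table and column names below are hypothetical:

create table page_views (user_id bigint, url string)
partitioned by (event_date string)
clustered by (user_id) into 16 buckets;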

Hive also provides and manages temporary tables, which are useful for storing intermediate result sets that Hive destroys when the session ends. The full set of Hive's Extract, Transform, and Load (ETL) tools is available for internal tables, which makes them a good choice for storing new data to be accessed primarily through Hive.
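As an example of the temporary tables mentioned above, they are declared with the temporary keyword and dropped automatically when the session ends. This is only a sketch, it assumes Hive 0.14 or later, and the table name is made up:

create temporary table working_set (id int, label string);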

Hive stores internal tables as files in HDFS, which allows you to access them using other Hadoop tools, although the more optimized storage options are not supported by the entire Hadoop ecosystem. You can still use internal Hive tables if you are working with a range of tools, but you will need to choose an interoperable format.

Defining internal tables

The create table statement will create an internal table unless you specify the external modifier (which we will cover in Chapter 4, External Tables Over HDFS, and Chapter 5, External Tables Over HBase). The simplest statement, shown in Code Listing 11, will create a table with a single column, using all default values.

Code Listing 11: Creating an Internal Hive Table

create table dual(r string);

The default root location in HDFS for Hive tables is /user/hive/warehouse, and in Code Listing 12 we can see that when the create table statement runs, Hive creates a directory called dual, but the directory will be empty.

Code Listing 12: HDFS Folder Created by Hive

root@hive:/hive-setup# hdfs dfs -ls /user/hive/warehouse/   

Found 1 item

drwxrwxr-x   - root root       4096 2016-01-25 18:02 /user/hive/warehouse/dual

root@hive:/hive-setup# hdfs dfs -ls /user/hive/warehouse/dual

root@hive:/hive-setup#
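You can also confirm the location Hive has assigned without leaving Beeline. The standard describe formatted command reports the table's storage directory along with other metadata (output omitted here):

describe formatted dual;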

When we insert data into the new table, Hive will create a file in HDFS and populate it. Code Listing 13 shows a statement inserting a row into the new table, with all the output from Beeline, which lets us see what Hive is doing.

Code Listing 13: Inserting a Row into a Hive Table

> insert into dual(r) values('1');

INFO  : Number of reduce tasks is set to 0 since there's no reduce operator

INFO  : number of splits:1

INFO  : Submitting tokens for job: job_local1569518498_0001

INFO  : The url to track the job: http://localhost:8080/

INFO  : Job running in-process (local Hadoop)

INFO  : 2016-01-25 18:07:39,487 Stage-1 map = 100%,  reduce = 0%

INFO  : Ended Job = job_local1569518498_0001

INFO  : Stage-4 is selected by condition resolver.

INFO  : Stage-3 is filtered out by condition resolver.

INFO  : Stage-5 is filtered out by condition resolver.

INFO  : Moving data to: file:/user/hive/warehouse/dual/.hive-staging_hive_2016-01-25_18-07-37_949_178012634824589876-2/-ext-10000 from file:/user/hive/warehouse/dual/.hive-staging_hive_2016-01-25_18-07-37_949_178012634824589876-2/-ext-10002

INFO  : Loading data to table default.dual from file:/user/hive/warehouse/dual/.hive-staging_hive_2016-01-25_18-07-37_949_178012634824589876-2/-ext-10000

INFO  : Table default.dual stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1]

No rows affected (1.724 seconds)

There is a lot of detail in the INFO level output, and some of it offers useful information:

  • The URL for tracking the job in Hive's Web UI (useful for long-running queries).
  • How the job is running (in-process rather than through YARN).
  • How the job is structured (into map and reduce stages).
  • What Hive is doing with the data (first loading it to a staging file).
  • How many rows were returned from the query. The 'No rows affected' response does not mean that no rows were inserted; it is Beeline's counter reporting how many rows the statement returned.

Hive handles appending data to HDFS via the staging file: it writes all the new data there, and when the write is committed it moves the file to its final location in HDFS. Inserting rows into internal tables is an ACID operation, and when it finishes we can view the file and its contents in HDFS using the list and cat commands in Code Listing 14.

Code Listing 14: Viewing File Contents for a Hive Table

root@hive:/hive-setup# hdfs dfs -ls /user/hive/warehouse/dual/

Found 1 items

-rwxrwxr-x   1 root root          2 2016-01-26 07:00 /user/hive/warehouse/dual/000000_0

root@hive:/hive-setup# hdfs dfs -cat /user/hive/warehouse/dual/000000_0

1

The hdfs dfs -ls command tells us there is now one file in the directory for the dual table, called 000000_0, and that file's contents make up one row with a single character, the '1' we inserted into the table. We can read the contents of the file because the default format is text, but Hive also supports other, more efficient file formats.
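Each further insert goes through the same staging process and adds new files alongside the existing one in the table directory, rather than rewriting it. As a hypothetical follow-up, this statement appends two more rows in a single insert (the multi-row values syntax assumes Hive 0.14 or later):

insert into dual(r) values ('2'), ('3');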

File formats

The create table command supports the stored as clause, which specifies the physical file format for the data files. The clause is optional, and if you omit it, as we did in Code Listing 11, Hive assumes the default, stored as textfile.

Hive has native support for other file types that are also used by other tools in the Hadoop ecosystem. Here are three of the most popular:

  • AVRO—schema-based binary format, interoperable across many platforms.
  • ORC—Optimized Row Columnar format, built for Hive as an efficient format.
  • PARQUET—a compressed columnar format, widely used in Hadoop.

If you are creating new data with Hive, the ORC format is typically the optimal choice for storage size and access performance, but it is not widely supported by other Hadoop tools. If you want to use other tools to access the same data (such as Pig and Spark), Parquet and Avro are more commonly supported.
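As a sketch of the syntax, choosing a format is just a matter of adding the stored as clause to the table definition. The table names here are hypothetical, and the avro shorthand assumes Hive 0.14 or later:

create table events_orc (id bigint, payload string) stored as orc;
create table events_parquet (id bigint, payload string) stored as parquet;
create table events_avro (id bigint, payload string) stored as avro;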

Columnar file formats have a more intelligent structure than flat text files: they typically store data in blocks along with a lightweight index that Hive uses to locate the exact block it needs to read. Read operations don't need to scan the entire file; they only need to read the index and the block containing the data.

Avro, ORC, and Parquet all provide compression by default. With large data sizes, the overhead in CPU time needed to compress and decompress data is usually negligible compared to the time saved in transferring smaller files across the network or blocks from disk into memory.

Note: You can change the file format of an existing table using alter table ... set fileformat, but Hive does not automatically convert the existing data. If you change the format of a table that already contains data and then write more data, you will have files in different formats, and Hive won't be able to read from the table. If you want to change the format, you should use one of the ETL options described in Chapter 6, ETL with Hive.
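For reference, the statement itself looks like the sketch below. As the note says, it changes only the table metadata and leaves any existing files in their original format:

alter table dual set fileformat orc;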

Simple data types

One of Hive's most appealing features is its ability to apply structure to unstructured or semi-structured Hadoop data. When you define a table in Hive, each column uses a specific data type. With internal tables, you don't need to worry about how the data is mapped, because Hive owns the storage and takes care of serializing and deserializing the data on disk.

Hive provides all the basic data types used in typical databases, and it also includes some higher-value data types that allow you to more accurately model your data. For complex data types, Hive uses columns that can contain multiple values in different types of collections. We'll look at those in Chapter 4, External Tables Over HDFS, within the context of mapping existing data.

Number types

With built-in functions for mathematical operations such as log, square root, and modulus, Hive offers richer support for numerical data than many relational databases. Hive can also implicitly convert between multiple integer and floating-point types (so long as you are converting from a smaller capacity to a larger one).
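A quick illustrative query (results not shown) exercising some of those built-in functions and mixing integer widths, which Hive widens implicitly:

select log(2, 8), sqrt(2), 10 % 3, 1Y + 1000000;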

In ascending order of capacity, here are the four integer types:

  • TINYINT—from -128 to +127, postfix with 'Y' in literals.
  • SMALLINT—from -32768 to +32767, postfix with 'S' in literals.
  • INT—from -2147483648 to +2147483647, no postfix needed.
  • BIGINT—from -9223372036854775808 to +9223372036854775807, postfix 'L'.

INT is the default integer type, but the +/-2 billion range will be limiting if you are storing sequences, counts, or UNIX timestamps (although Hive has a specific data type for timestamps).

If you exceed the capacity of an integer field, Hive doesn't produce a runtime error. Instead, it wraps the value (from positive to negative and vice versa), so be careful that your calculations aren't skewed by silently breaching column capacity.

Code Listing 15 shows the result of breaching column capacity—in this case turning a positive TINYINT into a negative value and a negative SMALLINT into a positive value.

Code Listing 15: Wrapping Numbers from Positive to Negative

> select (127Y + 1Y), (-32759S - 10S);

+-------+--------+--+

|  _c0  |  _c1   |

+-------+--------+--+

| -128  | 32767  |

+-------+--------+--+

Floating point types allow for greater precision and a larger range of numbers:

  • FLOAT—single precision, 4-byte capacity.
  • DOUBLE—double precision, 8-byte capacity.
  • DECIMAL—variable precision, 38-digit capacity.

DECIMAL types default to a precision of 10 and a scale of zero, but they are typically defined with specific values—e.g., a five-digit field with two decimal places would be defined as DECIMAL(5,2) and could hold a maximum value of 999.99.
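As a minimal declaration sketch (the table and column names here are made up), such a field would be defined like this:

create table prices (sku string, amount decimal(5,2));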

With the DECIMAL type using zero scale, you can store larger integers than is possible with BIGINT. The postfix for integer numbers represented as a decimal is BD (the DECIMAL type is based on Java's BigDecimal type), and as Code Listing 16 shows, this allows you to work beyond the BIGINT limits.

Code Listing 16: Representing Large Integers with BigDecimal

> select (9223372036854775807L * 10), (9223372036854775807BD * 10);

+------+-----------------------+--+

| _c0  |          _c1          |

+------+-----------------------+--+

| -10  | 92233720368547758070  |

+------+-----------------------+--+

Hive supports both scientific notation and standard notation for floating-point numbers, and it allows a mixture of them in the same rows and tables.
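For instance, the same value can be written in either notation within a single statement; this is purely an illustration:

select 0.0001, 1E-4;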

Hive will approximate to zero or infinity when you reach the minimum and maximum limits of the floating-point types it can store. But those limits are around 10 to the power of +/-308, which means you are unlikely to hit them. Code Listing 17 shows what happens if you do.

Code Listing 17: Breaching Limits for Floating Point Numbers

> select 1E308, 1E330, -1E-308, -1E-330;

+----------+-----------+------------+-------+--+

|   _c0    |    _c1    |    _c2     |  _c3  |

+----------+-----------+------------+-------+--+

| 1.0E308  | Infinity  | -1.0E-308  | -0.0  |

+----------+-----------+------------+-------+--+

Character types

The primary character type in Hive is STRING, which does not impose a maximum length and can hold strings of practically any size. User-Defined Functions over character types typically use strings, but there is a choice of types available:

  • STRING—unlimited size, literals can be delimited with single or double quotes.
  • VARCHAR—specified maximum size, whitespace in input is preserved.
  • CHAR—fixed length, smaller input is padded with whitespace.

You can compare strings in columns with different character types, but you need to be aware of how whitespace affects the comparison. STRING and VARCHAR types preserve whitespace, so values with the same text but different amounts of trailing whitespace are not equal. CHAR fields are always padded to the set length with spaces, so only the text is compared.

Code Listing 18 creates a table with three character columns and inserts rows that include the same text in each field and with differing amounts of trailing whitespace.

Code Listing 18: Creating Tables with Character Fields

> create table strings(a_string string, a_varchar varchar(10), a_char char(10));

No rows affected (0.192 seconds)

> insert into strings(a_string, a_varchar, a_char) values('a', 'a', 'a');

No rows affected (1.812 seconds)

> insert into strings(a_string, a_varchar, a_char) values('b        ', 'b     ', 'b');

No rows affected (1.381 seconds)

Because the table is in the default file format, it is stored as text, and when we read the files we can see where the whitespace is being persisted, as in Code Listing 19 (where '#' represents the default separator \0001).

Code Listing 19: Storage of Whitespace in Character Fields

root@hive:/hive-setup# hdfs dfs -cat /user/hive/warehouse/strings/*             

a#a#a               

b        #b     #b

If we query that table to find rows with matching columns, we will receive only the first row because the fields in the second row have differences in the trailing whitespace. But if we use the trim() function to clear the whitespace for the comparison, both rows are returned—as in Code Listing 20.

Code Listing 20: Comparing Values in Character Fields

> select * from strings where a_string = a_varchar and a_string = a_char;

+-------------------+--------------------+-----------------+--+

| strings.a_string  | strings.a_varchar  | strings.a_char  |

+-------------------+--------------------+-----------------+--+

| a                 | a                  | a               |

+-------------------+--------------------+-----------------+--+

1 row selected (0.088 seconds)

> select * from strings where trim(a_string) = trim(a_varchar) and trim(a_string) = a_char;

+-------------------+--------------------+-----------------+--+

| strings.a_string  | strings.a_varchar  | strings.a_char  |

+-------------------+--------------------+-----------------+--+

| a                 | a                  | a               |

| b                 | b                  | b               |

+-------------------+--------------------+-----------------+--+

2 rows selected (0.095 seconds)

Date and time types

Hive explicitly supports date and time data with types capable of storing high-precision timestamps or dates without a time component:

  • TIMESTAMP—UNIX-style timestamps, recording time elapsed since epoch. Precision can vary from seconds to nanoseconds.
  • DATE—a date with no time component.

Both types support literal expressions as strings in the formats 'yyyy-MM-dd' (for dates) and 'yyyy-MM-dd HH:mm:ss.fff' (for timestamps, with up to nine decimal places supporting nanosecond precision).
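Both literal formats can also be converted explicitly with cast, as in this illustrative query:

select cast('2016-01-27' as date), cast('2016-01-27 07:19:01.001' as timestamp);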

If you have values recorded as integer UNIX timestamps, you can insert them into TIMESTAMP columns using the from_unixtime() function. Note that only second precision is supported here, and you cannot use functions in an insert … values statement, which means the syntax differs for string and integer timestamp insertion, as shown in Code Listing 21.

Code Listing 21: Converting Dates and Timestamps

> create table datetimes(a_timestamp timestamp, a_date date);

No rows affected (0.068 seconds)

> insert into datetimes(a_timestamp, a_date) values('2016-01-27 07:19:01.001', '2016-01-27');

...

No rows affected (1.387 seconds)

> from dual insert into table datetimes select from_unixtime(1453562878), '2016-01-23';

...

No rows affected (1.296 seconds)

Here we create a table with a timestamp and a date column, and we insert two rows with different levels of precision in the timestamp. The from dual clause is a trick that lets us use a select statement with functions as the source for an insert (we'll cover that more in Chapter 6, ETL with Hive).

Hive supports conversion between timestamps and dates, and the built-in date functions apply to both types of column. With timestamps, the time portion will be lost if you convert to date type, and with dates any time-based functions will return NULL. Code Listing 22 shows those conversions and some sample functions.

Code Listing 22: Working with Date Types

> select a_timestamp, year(a_timestamp), a_date, year(a_date) from datetimes;

+--------------------------+-------+-------------+-------+--+

|       a_timestamp        |  _c1  |   a_date    |  _c3  |

+--------------------------+-------+-------------+-------+--+

| 2016-01-27 07:19:01.001  | 2016  | 2016-01-27  | 2016  |

| 2016-01-23 15:27:58.0    | 2016  | 2016-01-23  | 2016  |

+--------------------------+-------+-------------+-------+--+

2 rows selected (0.067 seconds)

> select a_timestamp, hour(a_timestamp), a_date, hour(a_date) from datetimes;

+--------------------------+-------+-------------+-------+--+

|       a_timestamp        |  _c1  |   a_date    |  _c3  |

+--------------------------+-------+-------------+-------+--+

| 2016-01-27 07:19:01.001  | 7     | 2016-01-27  | NULL  |

| 2016-01-23 15:27:58.0    | 15    | 2016-01-23  | NULL  |

+--------------------------+-------+-------------+-------+--+

2 rows selected (0.073 seconds)

> select cast(a_timestamp as date), cast(a_date as timestamp) from datetimes;

+--------------+------------------------+--+

| a_timestamp  |         a_date         |

+--------------+------------------------+--+

| 2016-01-27   | 2016-01-27 00:00:00.0  |

| 2016-01-23   | 2016-01-23 00:00:00.0  |

+--------------+------------------------+--+

2 rows selected (0.067 seconds)

Other types

There are two other simple types in Hive:

  • BOOLEAN—for true/false values.
  • BINARY—for arbitrary byte arrays, which Hive does not interpret.

Boolean values are represented with the literals true and false; you can cast other types to boolean, but you might not get the result you expect. Unlike in other languages, in Hive only the literal false or the numeric value 0 represents false; any other value (such as the number 1 or -1, or the string 'true' or 'false') represents true.
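To see the numeric rule in action, a quick sketch (results omitted; only the zero should come back as false):

select cast(0 as boolean), cast(1 as boolean), cast(-1 as boolean);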

Binary values can be used to store any array of bytes, but the data type does not behave like the blob type in SQL databases. Binary column values in Hive are stored inline in the data file with the rest of the row, rather than as a pointer to a blob. Because Hive doesn't interpret binary data, binary columns are not widely used.

Summary

In this chapter we've looked at using internal tables with Hive. The underlying data for internal tables is stored as files in HDFS, which means Hive gets all the reliability and scalability of Hadoop for free. By using internal tables, Hive controls reading and writing at the file level, so that the full feature set of Hive is available.

We also looked at the different file formats Hive provides, with text files as the default and more efficient columnar formats, such as ORC and Parquet, also natively supported. The format you choose for internal tables will depend on whether any other systems need to access the raw data. If not, Hive's ORC format is a good choice; otherwise Parquet and Avro are well supported across Hadoop, and flat text files can be read by almost any tool.

Lastly, we looked at all the simple data types available in Hive, noting that they are equally suitable for internal and external tables (provided they can be correctly mapped from the source data). In the next chapters we'll look at using external tables instead, and we'll see how to use Hive with existing HDFS files and with HBase tables.
