Hive Succinctly®
by Elton Stoneman

CHAPTER 7

DDL and DML in Hive

HiveQL and ANSI-SQL

We’ve looked at some HiveQL queries in previous chapters, and we can see that they are predominantly SQL, including some Hive-specific statements and clauses. HiveQL isn’t fully ANSI-SQL compatible (although achieving SQL-92 compatibility is an aim for future releases), but the differences are found around the edges—anyone with SQL experience can easily pick up HiveQL.

As with SQL, HiveQL statements either define the structure of the database with Data Definition Language (DDL), change the content of the data with Data Manipulation Language (DML), or read data.

Hive provides only table, view, and index objects, which means there are a limited number of DDL statements, and because its primary function is as a data warehouse, the standard SQL DML statements aren't supported in all cases.

In this chapter we’ll cover the key parts of HiveQL for defining data structures and writing data. We’ll cover the major statements, but the Language Manual on the Hive Wiki has excellent documentation for every statement, including the versions of Hive that support it.

Data definition

DDL statements are used to define or change Hive databases and database objects. The functionality of HiveQL has evolved with each release, which means not all statements, and not all clauses, are available in all versions. In this chapter we’ll cover the most commonly used DDL statements in the current version at the time of writing, 1.2.1.

Databases and schemas

All objects in Hive live inside a schema, but you need not specify a particular one; if you don't, the default schema is used. If you want to segregate your data, you can create separate schemas and refer to objects by prefixing them with the schema name.

The terms database and schema are interchangeable in Hive, so the following statements can be used with schema or database and work in the same way:

  • CREATE SCHEMA [name]—create a new schema.
  • USE [name]—switch to the named schema.
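
Both statements can be sketched together as follows—the sensors schema name is hypothetical, and the dual table is assumed to live in the default schema:

```sql
-- create a schema, only if one of that name doesn't already exist
create schema if not exists sensors;

-- make it the current schema for subsequent statements
use sensors;

-- objects in other schemas stay reachable with a schema prefix
select * from default.dual;
```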

Schemas can be created (or altered) using the with dbproperties clause in order to store a collection of key-value pairs as metadata about the schema. That can be useful for storing a database version, or any other informational values for the schema, which you can view with the extended describe statement, as in Code Listing 62.

Code Listing 62: Schema Properties

> create database iot with dbproperties('maintainer'='[email protected]', 'release'='2016R1');
No rows affected (0.249 seconds)

> describe schema extended iot;
+----------+----------+---------------------------------------------------+-------------+-------------+--------------------------------------------+--+
| db_name  | comment  |                     location                      | owner_name  | owner_type  |                 parameters                 |
+----------+----------+---------------------------------------------------+-------------+-------------+--------------------------------------------+--+
| iot      |          | hdfs://localhost:9000/user/hive/warehouse/iot.db  | root        | USER        | {[email protected], release=2016R1}  |
+----------+----------+---------------------------------------------------+-------------+-------------+--------------------------------------------+--+

Creating database objects

The only database objects in Hive are tables, views, and indexes, which means the only create statements are the ones we already worked with in Chapter 1, Introducing Hive:

  • CREATE TABLE—create an internal Hive table.
  • CREATE EXTERNAL TABLE—create a table in which data is stored outside of Hive.
  • CREATE VIEW—create a view over one or more tables.
  • CREATE INDEX—create an index over an existing table (internal or external).

All of those statements support the if not exists clause, which tells Hive to create the object only if it doesn’t already exist in the database.
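
As a sketch, using the syslogs column structure shown later in this chapter, the clause looks like this:

```sql
-- no error is raised if syslogs is already defined; Hive simply skips the create
create table if not exists syslogs (
  loggedat  timestamp,
  host      string,
  process   string,
  pid       int,
  message   string);
```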

Tables are created specifying the column names and data types, the storage engine, and the data location. Internal tables can also be partitioned to improve scalability, which we'll cover in more detail in Chapter 8, Partitioning Data.

Views can be created over any HiveQL query that returns a result set, which means you can create views to provide simple access to complex combinations of data, using join or union constructs as required.

Indexes are created for one or more columns over a single table.
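
A sketch of how the index rebuilt later in this chapter might have been defined—the indexed column name, data, is an assumption here:

```sql
-- create a compact index over one column of an existing table;
-- 'with deferred rebuild' leaves the index empty until an explicit rebuild
create index ix_hbase_table_cf1_data
on table hbase_table (data)  -- column name is illustrative
as 'COMPACT'
with deferred rebuild;
```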

Tip: Defining objects means only that Hive saves the definition in its metadata store. The store is used when objects are queried, but at create time there's only basic validation of the DDL—Hive makes sure the HiveQL is valid, and the source for the data exists, but any column mappings are not verified.

The create statements are broadly standard SQL, but Hive provides a time-saving option to create tables with the same structure as an existing table, create table like—as shown in Code Listing 63.

Code Listing 63: Creating Tables like Other Tables

> describe dual;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| c         | string     |          |
+-----------+------------+----------+--+

> create table dual2 like dual;
> describe dual2;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| c         | string     |          |
+-----------+------------+----------+--+

The create table like statement is frequently useful for working with data in temporary tables or moving between internal and external tables in which the structure is the same. It does not copy any data, nor does it link the tables, which means changing the original table structure won't affect the new one.

Modifying database objects

Existing objects can be changed with alter statements, but typically these affect only the structure of the object in Hive's metadata store and will not change any of the existing data.

For that reason, you must be careful when altering table definitions, as you can easily modify your table and make reading from it impossible. With alter table you can rename the table, change the file format, and add, remove, or change columns.
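
Some common alter table forms, as a sketch—the renamed table and the severity column are hypothetical:

```sql
-- rename the table; for internal tables Hive also moves the warehouse folder
alter table syslogs rename to syslogs_archive;

-- append new columns after the existing ones (metadata only)
alter table syslogs_archive add columns (severity string);

-- change the declared file format; existing data files are not converted
alter table syslogs_archive set fileformat orc;
```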

To change an existing column, the syntax is alter table … change, which allows you to change the column name, data type, and order in the table. Code Listing 64 shows an existing column in the syslogs table being changed from a timestamp to a string type and moved to a later position in the table.

Code Listing 64: Altering Tables to Move Columns

> describe syslogs;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| loggedat  | timestamp  |          |
| host      | string     |          |
| process   | string     |          |
| pid       | int        |          |
| message   | string     |          |
+-----------+------------+----------+--+

> alter table syslogs change loggedat loggedat string after host;
> describe syslogs;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| host      | string     |          |
| loggedat  | string     |          |
| process   | string     |          |
| pid       | int        |          |
| message   | string     |          |
+-----------+------------+----------+--+

The change is made successfully, as we can see from the second describe statement, but Hive hasn't changed the data inside the table. The data files maintain the original structure, so that the first column is a timestamp (the original loggedAt column), and the second column is a string (the original host column).

If we try to query this table, Hive reads the timestamp value in the first column, which contains the loggedAt timestamp, but tries to read it as a string. The formats are not compatible, so we get an error as in Code Listing 65.

Code Listing 65: Tables with Structural Mismatch

> select * from syslogs limit 5;

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.TimestampWritable cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

We can still access the other columns in the table, but now the data is in the wrong place—the metadata says the loggedAt column is in the second position, but in the data files that position contains the host field, as we see in Code Listing 66.

Code Listing 66: Fetching the Wrong Data

0: jdbc:hive2://127.0.0.1:10000> select loggedat, process, pid, message from syslogs limit 1;
+------------+----------+------+------------------------------+--+
|  loggedat  | process  | pid  |           message            |
+------------+----------+------+------------------------------+--+
| sc-ub-xps  | anacron  | 804  | Job `cron.daily' terminated  |
+------------+----------+------+------------------------------+--+

Because alter table works at the metadata level, it’s easy to repair the damage by altering the table back to its original definition. If you do need to change the existing structure of a table (other than adding or renaming columns), a better approach is to define a new table and load it from the existing one, using whichever transforms you need.
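
In this case, the repair is a single statement, reversing the change made in Code Listing 64:

```sql
-- put loggedat back to its original type and position
alter table syslogs change loggedat loggedat timestamp first;
```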

With alter view you can change the select statement that projects the view. You can change the column layout, data types, and order by changing the query, and, provided the HiveQL query is valid, the view will be valid. Views are not materialized in Hive, which means there is no data file sitting behind the view that can get out of sync with the definition.

If the view does not already exist, alter view will raise an error. Code Listing 67 alters the existing view over the HBase device-events table, thereby removing the original rowkey column and adding a clause, so that only rows with a value in the period column will be returned.

Code Listing 67: Altering Views

> alter view device_events_period as select split(rowkey, '\\|')[0] deviceid, split(rowkey, '\\|')[1] period, eventname, receivedat from device_events where split(rowkey, '\\|')[1] is not null;

The alter index statement only allows you to rebuild an existing index—you can't change the column or table the index uses. Hive doesn't automatically rebuild indexes, which means you will need to manually rebuild using alter index whenever you modify data in the underlying table.

Code Listing 68 rebuilds the index (which is materialized as an internal Hive table) over the external hbase_table. Note that the table for the index must be explicitly specified.

Code Listing 68: Rebuilding Indexes

> alter index ix_hbase_table_cf1_data on hbase_table rebuild;

...

No rows affected (51.917 seconds)

Removing database objects

You can remove objects from Hive with the drop table, drop view, and drop index statements while optionally using the if exists clause.

When you drop an index, Hive removes both the index and the internal table used to store it. Removing an index has no functional impact unless you have explicitly referenced the internal index table in your queries; queries that implicitly used the index will still produce the same results, but they may run more slowly.
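
Dropping the index from Code Listing 68 would look like this—the if exists clause suppresses the error when the index isn't defined:

```sql
-- removes the index and the internal table that materializes it
drop index if exists ix_hbase_table_cf1_data on hbase_table;
```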

When you drop a view, it gets removed from the database, and any queries using it will fail with a 'table not found' error from Hive, as in Code Listing 69.

Code Listing 69: Dropping Views

> drop view device_events_period;

> select * from device_events_period;

Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'device_events_period' (state=42S02,code=10001)

Because there are no materialized views in Hive, dropping a view will not remove any data.

When you drop a table, the effect on the data varies depending upon the type of table being used. With external tables, Hive doesn’t remove the source data, so that if you drop a table based on HDFS files or an HBase table, the underlying data will remain, although you will not be able to access it through your Hive table.

Code Listing 70 shows the result of dropping the external device_events table in Hive and scanning the underlying device-events table in HBase.

Code Listing 70: Dropping External Tables

> drop table device_events;
> select * from device_events;
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'device_events' (state=42S02,code=10001)

hbase(main):004:0> scan 'device-events'
ROW                                COLUMN+CELL
 uuid2                             column=e:n, timestamp=1454520396475, value=power.off

With internal tables, Hive manages the storage so that when you drop them, all the data will be deleted. With some platform setups, the underlying files may be moved to a recoverable trash folder—but don't depend on it.

To ensure permanent deletion, specify the purge clause, as in Code Listing 71, which shows the HDFS file listing before and after the syslog_summaries table is dropped.

Code Listing 71: Dropping Internal Tables

root@hive:/hive-setup# hdfs dfs -ls /user/hive/warehouse/syslog_summaries
Found 1 items
-rwxrwxr-x   1 root supergroup        423 2016-02-02 21:21 /user/hive/warehouse/syslog_summaries/000000_0

> drop table syslog_summaries purge;

root@hive:/hive-setup# hdfs dfs -ls /user/hive/warehouse/syslog_summaries
ls: `/user/hive/warehouse/syslog_summaries': No such file or directory

Hive will not warn about dependencies when you drop a table—it will let you drop tables that have views or indexes based on them. When you drop a table referenced by views, the views will remain but will error if you try to query them. When you drop an indexed table, the indexes and the underlying index tables will be silently deleted.

If you want to remove the data from an internal Hive table but leave the table in place, the standard SQL truncate statement deletes all the rows, as in Code Listing 72.

Code Listing 72: Truncating Data

> select count(*) from syslogs;
+--------+--+
|  _c0   |
+--------+--+
| 15695  |
+--------+--+

> truncate table syslogs;
No rows affected (0.09 seconds)

> select count(*) from syslogs;
+------+--+
| _c0  |
+------+--+
| 0    |
+------+--+

The truncate statement works by deleting the underlying files in HDFS while leaving the folder structure in place, so that the table can be populated again. Because it is only valid for internal tables, Hive will raise the error 'Cannot truncate non-managed table' if you try to truncate an external table.

Data manipulation

The full range of DML statements is a recent addition to Hive and isn't supported by all storage engines. As a data warehouse, Hive wasn't originally conceived to update or delete existing data; it only supported appending data with load and import statements.

Since Hive 0.14, update and delete statements have been provided for storage engines that support them. The insert statement has also been extended in order to allow direct insertion of values, whereas in previous versions we could only insert the results of a select query.
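
For example, a row can now be inserted directly, without staging it through another table—the values here are illustrative:

```sql
-- insert literal values; before Hive 0.14 this required a select
-- from an existing table
insert into table syslogs
values ('2016-02-02 21:21:00', 'sc-ub-xps', 'anacron', 804, 'Job started');
```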

ACID storage and Hive transactions

The full set of DML statements is only available on tables that support the ACID properties of typical RDBMS designs. The ACID principles ensure data consistency, so that simultaneous reads and writes to the same table won't cause conflicts.

ACID principles are not inherent in HDFS, which doesn't allow data in files to be changed and doesn't lock files that are being appended. That means you can't update an existing data item in a file, and you can't stop readers from accessing new data as it's being written.

Hive works around the HDFS limitations by creating delta files—the Wiki page on Hive Transactions explains the complexity involved. Currently, only internal tables support Hive transactions, and only if they are stored in the correct file format and Hive is configured to support the transaction manager.

As of Hive 1.2.1, tables must have the following attributes in order to support ACID:

  • Internal table.
  • ORC format storage.
  • Bucketed but not sorted (more on that in Chapter 8, Partitioning Data).
  • Flagged with the transactional property.
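
Assuming the transaction manager is configured (hive.txn.manager set to the DbTxnManager, as described on the Hive Transactions wiki page), a table meeting all four requirements might be declared like this—the table and column names are hypothetical:

```sql
-- internal, ORC-backed, bucketed, and flagged as transactional
create table device_readings (
  deviceid  string,
  reading   decimal(10,2))
clustered by (deviceid) into 4 buckets
stored as orc
tblproperties ('transactional'='true');

-- with those properties in place, the full DML set is available
update device_readings set reading = 0 where deviceid = 'uuid2';
delete from device_readings where deviceid is null;
```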

Summary

Hive has fewer object types than SQL databases, so there are only a few DDL statements, but the variety of table definitions means there are many Hive-specific clauses. In this chapter we’ve looked at creating, altering, and dropping objects.

The key takeaway concerning DDL is the disconnect between the object structure, which is stored in the Hive metastore, and the actual structure of the data in files. Because Hive typically doesn’t enforce the structure, altering a table can leave structure and content out of sync, making the data unreachable.

From its origins as an append-or-overwrite data warehouse, Hive has grown to support the majority of SQL DML statements, albeit for a limited subset of table types. The support for ACID table types and transactions is useful, but keep in mind that Hive is not intended as a transactional database. If you find your implementation is limited by the DML support in Hive, you may not be using Hive appropriately.

In the next chapter, we’ll look at one of the major performance-boosting factors in Hive table design—partitioning data across multiple physical stores.
