
HDInsight Succinctly®
by James Beresford


CHAPTER 6

Using Hive to Store the Output



As per the official Apache Hive project page:


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional Map/Reduce programmers to plug in their custom Mappers and Reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive allows you to put a SQL-like structure on top of data stored as raw files, such as delimited text, and then query that data using HiveQL. It is architected in such a way as to allow the query jobs to be massively parallelized using the MapReduce paradigm, as Hive commands are transformed into MapReduce jobs in order to execute. As with Pig, this is not exposed to the Hive programmer directly.

There is some overlap in functionality between Hive and Pig, and there are cases where either tool can perform the same task. Which one you use is a decision based on the tasks to be performed, the developer's comfort with the given language and, of course, which approach will prove more efficient from a performance point of view. Hive is often a better entry point than Pig for those more familiar with traditional SQL.

Creating an External Table to Reference the Pig Output

With Pig having stored the output of the processing into a file on HDFS, we can now put a semantic layer on top of it using Hive. This is done by creating an entity in Hive called an External Table which describes a table-like structure for a file.

A code sample for referencing the output of the word counting process is below:

CREATE EXTERNAL TABLE words (
    word STRING,
    counts INT,
    sentiment INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '124'
STORED AS TEXTFILE
LOCATION '/user/hadoop/pig_out/words';

This SQL-like statement creates an "External Table" (that is, a reference to a file outside of the Hive environment) and defines its columns using the primitives available as data types. It then specifies the row format, the field delimiter (given here as an ASCII code: 124 is the pipe character, '|'), the storage type (in this case, a text file) and, of course, the location of the file.
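The delimiter can equally be written as a literal character rather than an ASCII code. A minimal sketch of an equivalent definition follows; the table name words_alt is illustrative only, and both delimiter forms resolve to the same pipe byte:

-- Equivalent external table definition using the literal pipe character
-- instead of its ASCII code (table name is illustrative)
CREATE EXTERNAL TABLE words_alt (
    word STRING,
    counts INT,
    sentiment INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/hadoop/pig_out/words';

Because the table is external, dropping it removes only the Hive metadata; the underlying file produced by Pig remains on HDFS.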

This text file can now be queried as if it were a normal table in the Hive environment. A sample query is below:

SELECT * FROM words
ORDER BY counts DESC
LIMIT 10;

This query selects the top 10 words by frequency count from the word count output.

Once data is present in a Hive table, it can be accessed via ODBC and imported into a tool of choice.

It is worth noting that Hive is not the only way to expose data. The output from Pig, or even from base Hadoop jobs, is simply a set of text files that can be processed by any tool that can interpret the structure in which the output is written. However, Hive has the advantage of being able to interactively filter the content using a WHERE clause, or to add fields and calculations using HiveQL.
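As a sketch of this kind of interactive filtering against the words table defined earlier, a HiveQL query might combine a WHERE clause with a calculated column. The threshold and the alias weighted_sentiment below are illustrative assumptions, not part of the original output:

-- Filter to negative-sentiment words and add a calculated column;
-- the threshold and alias are illustrative only
SELECT word,
       counts,
       counts * sentiment AS weighted_sentiment
FROM words
WHERE sentiment < 0
ORDER BY weighted_sentiment ASC
LIMIT 10;

This returns the ten words contributing most heavily to negative sentiment, something that would otherwise require custom parsing code against the raw text file.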

