
Hadoop Succinctly®
by Elton Stoneman


CHAPTER 8

The Hadoop Ecosystem


Introducing the ecosystem

We can't fully discuss Hadoop without mentioning the ecosystem. Because Hadoop is an open source platform with lots of potential and relatively few drawbacks, a huge community has developed around it, extending the reach and the capabilities of Hadoop. The ecosystem encompasses a wide range of technologies in different spheres, and in the final chapter of this e-book we'll look at a few of the major ones.

Most of the technologies in the ecosystem embrace the core Hadoop platform, plugging holes or providing alternative ways to work, but a few of them use the core features of Hadoop to support a completely different approach. All the tools we'll look at are open source, often with a large community following coupled with substantial investment from businesses.

Although Hadoop is a powerful and flexible platform, it is seldom used on its own. Augmenting Hadoop with tools from the ecosystem adds value to the basic proposition, making your cluster and its workload easier to manage or giving you faster and easier access to your data.

Managing Hadoop with Ambari

While installing Hadoop from scratch is not a daunting task, setting up a multinode cluster is far more complex, and it presents a much greater surface area for extension. If you decide to add Hive or Spark to your cluster, it's not trivial when you have 10 data nodes to deploy to and keep in sync. Ambari is the management console that addresses that problem and many more.

Ambari is a web application that aims to provide a single point of access for your Hadoop cluster. It has an extensible architecture with a growing library of components that provide functionality for administrating, monitoring, and using your cluster. Along with providing web views, it also exposes a REST API for automating access to your cluster.
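
If you want to script against that REST API, a few lines of Java are enough to call it. The sketch below is a minimal illustration rather than a definitive client: the host name, cluster name, and credentials are placeholder values you would replace with your own (port 8080 and admin/admin are the Ambari defaults).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Minimal sketch of calling the Ambari REST API from Java.
// The host, cluster name, and credentials are placeholders for illustration.
public class AmbariClient {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://ambari-host:8080/api/v1/clusters/mycluster/services";
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();

        // Ambari uses HTTP basic authentication; admin/admin is the default login.
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // Print the JSON response, which lists the services installed in the cluster.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

The same API also accepts PUT and POST requests for configuration and service changes, which is how the Ambari web views do their work behind the scenes.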

In the administration space, Ambari has all the major tasks covered:

  • Hadoop configuration: You can view and edit configuration settings without touching the XML files on the server.
  • Cluster setup: You can add nodes to and remove them from the cluster, stop and start services, and enter maintenance mode.
  • Component deployment: You can add and remove packaged components like Hive.
  • Component configuration: You can view and edit settings for installed components in the same way as the Hadoop configuration.

Figure 22 shows the Add Service Wizard in Ambari, where you can add other components from the ecosystem to your cluster with ease.

Figure 22: Adding Services with Ambari

For monitoring, Ambari collects stats from all the nodes in the cluster and all the jobs that have been run. You can view and drill down on graphs that show CPU, memory, disk, and network usage along with counts of all the known HDFS and YARN nodes and their status.

Figure 23 shows the Ambari home screen, which summarizes the health and status of the cluster.

Figure 23: The Ambari Home Screen

On the usage front, Ambari has views that integrate with other parts of the ecosystem. Pig and Hive are alternative ways of querying data in Hadoop, and Ambari has a view for each that lets you submit queries, see the results, and save the queries you've run so that you can load them again later.

Figure 24 shows the Hive view, in which you have a schema navigator, an input pane to write your query, and an output pane showing query results and logs.

Figure 24: The Hive View in Ambari

MapReduce alternatives: Pig

As Hadoop has gained popularity, more and more users have been exposed to MapReduce and have started to feel the pain of writing multiple Java classes to access their data, even for simple use cases.

Although MapReduce is hugely powerful, it's a complex technology that can require cooperation from multiple disciplines—analysts to define the desired results, engineers to build the programs, testers to verify the code, and so on.

The earliest alternative to the Java MapReduce API was Pig, which is a simplified Domain Specific Language (DSL) for building MapReduce queries. Although the Pig DSL (called Pig Latin) is a new language, it's a minimal one, and new users from different disciplines can be productive with it in a few hours. A sample Pig query is shown in Code Listing 37.

Code Listing 37: A Matching Line Count in Pig Latin

lines = LOAD 'input' USING PigStorage() AS (line);
matching = FILTER lines BY line MATCHES '.*dfs.*';
match_groups = GROUP matching BY line;
word_count = FOREACH match_groups GENERATE group AS line, COUNT(matching);
STORE word_count INTO 'output5';

This query is very similar to our custom MapReduce job from Chapter 2: reading a set of input files with the LOAD statement, using FILTER to find lines that contain the string 'dfs', then counting the occurrences with the GROUP and GENERATE COUNT statements. The STORE statement specifies where the result should be written.

The Pig script contains much less code, is easier to read, and doesn't require a Java IDE to work with, build, or package the query. You can run Pig interactively, building up a query line by line, or you can submit entire scripts for batch execution.

You can execute Pig queries with the Pig command line (or with the Ambari Pig View), and the Pig engine will build a Java MapReduce program from your query and submit it to the cluster. As far as YARN and HDFS are concerned, this is a standard MapReduce job and it runs in the same way as a normal Java job.
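
If you prefer to drive Pig from code instead of the command line, Pig also ships a Java class called PigServer that registers Pig Latin statements and submits the compiled job for you. The sketch below is illustrative rather than production code and simply reuses the statements from Code Listing 37.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Minimal sketch of embedding Pig in Java; the paths and aliases mirror Code Listing 37.
public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("lines = LOAD 'input' USING PigStorage() AS (line);");
        pig.registerQuery("matching = FILTER lines BY line MATCHES '.*dfs.*';");
        pig.registerQuery("match_groups = GROUP matching BY line;");
        pig.registerQuery("word_count = FOREACH match_groups GENERATE group AS line, COUNT(matching);");
        // store() triggers the compiled MapReduce job and writes the results to HDFS.
        pig.store("word_count", "output5");
    }
}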

Pig is usually the first tool that new Hadoop users pick up—it gives a good understanding of the structure of MapReduce without the complexity of the Java API. The scope of Pig is limited, which means you'll likely need to build custom MapReduce programs for complex jobs. But for simple jobs, Pig significantly lowers the entry barrier.

SQL on Hadoop: Hive

The next step up from Pig is an SQL-on-Hadoop framework that lets you query data in Hadoop using an SQL-like language. The open nature of Hadoop does lead to one downside—competing technologies appear in the same space, and it can be difficult to choose among them. The SQL-on-Hadoop space is a prime example, as there are several mature and well-featured alternatives, including Impala (from Cloudera), HAWQ (from Pivotal), and Apache Hive.

I'll focus on Hive here because it has a large user base, is under continuing development, and is the supported SQL-like language for Spark.

Data in Hadoop is widely variable, but groups of data often share similar traits. A folder containing web server logs might have files with different contents, but they might all be tab-separated text files with a number of key fields present in every file. With Hive, you impose a structure on your Hadoop data by defining tables in a similar way to a relational database.

Hive table definitions specify the source of the data in Hadoop, the column names and data types the files contain, and the mapping configuration, so that Hive knows how to convert the raw data into rows. In order to run our MapReduce query in Hive, we'd first need to create a table that represents the files in the input folder. Code Listing 38 shows an example.

Code Listing 38: HiveQL to Create a Table

CREATE EXTERNAL TABLE input(line STRING)
STORED AS TEXTFILE
LOCATION '/user/root/input'

Next, we can query that table with HiveQL, an SQL-like language that is very close to SQL-92 compliance. In order to count rows containing a string match, we can use a query like Code Listing 39.

Code Listing 39: HiveQL to Query a Table

SELECT COUNT(*) FROM input WHERE line LIKE '%dfs%'

Similar to Pig, Hive contains a compiler that builds a MapReduce program from the HiveQL, so that the cluster actually executes a normal MapReduce job. Hive is great for establishing a data warehouse over the main areas of your data landscape—there's a lot more to it and it's worth looking into.
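
Hive also exposes a JDBC interface through HiveServer2, so you can run HiveQL from ordinary Java code. The sketch below is a minimal illustration that reuses the table from Code Listing 38; the host name and login are placeholders, and 10000 is the default HiveServer2 port.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of querying Hive over JDBC; the host and credentials are placeholders.
public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver; the URL points at the 'default' database.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) FROM input WHERE line LIKE '%dfs%'")) {
            if (rs.next()) {
                System.out.println("Matching lines: " + rs.getLong(1));
            }
        }
    }
}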

Syncfusion has published Hive Succinctly (also by me), which covers the technology in more detail.

Real-Time Big Data: HBase

HBase is a departure from the other tools we've covered because it doesn't extend Hadoop; instead, it's an alternative Big Data technology that uses Hadoop as its foundation. HBase is a real-time Big Data store that keeps its data in HDFS, so it gains all the scalability and reliability of the storage layer without having to build them from scratch.

We use HBase like a NoSQL database. There are tables and rows, but the cells within rows don't belong to fixed columns. Instead, each table has one or more column families, and the column family is like a hash table—one row may have a column family with 1,000 columns while another row in the same table has only one column.
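
The HBase Java client reflects that model directly: you address a cell by row key, column family, and column qualifier, and you read rows back by key. The sketch below is a minimal illustration, and the table name, column family, and keys are assumed values for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch of the HBase Java client; the table, column family, and keys are placeholders.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("access-logs"))) {

            // Write one cell: row key 'row-1', column family 'cf', column 'url'.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
            table.put(put);

            // Read the row back by key, which is the access path HBase optimizes for.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            String url = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url")));
            System.out.println("url = " + url);
        }
    }
}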

HBase is hugely scalable: it can store tables with billions of rows and millions of columns. It provides real-time access because every row has a unique key, and HBase uses an efficient, indexed storage format to find rows by key quickly. HBase also employs a master/slave architecture, but it doesn't use YARN.

HDFS runs in the normal way with name and data nodes, but HBase has its own service processes. Because the requirements and access patterns are different with HBase, it typically has a dedicated cluster and the nodes aren't used for batch processing.

Other parts of the ecosystem integrate with HBase, so that we can map HBase tables in Hive and query them using HiveQL, which means we can use the Ambari Hive view to query HBase tables. HBase also uses another core Apache technology—Zookeeper—as a centralized configuration and alerting system between the nodes.

Syncfusion also has a dedicated title on HBase—again by me—HBase Succinctly.

The new Big Data toolkit: Spark

Spark is the new Big Thing in Big Data. It is a collection of technologies that lets you work with Big Data in new ways: there's an SQL engine, a machine-learning component, a streaming interface, and a dedicated graph component. Spark works with explicit units of data called Resilient Distributed Datasets (RDDs). Typically, you build up a chain of work on an RDD, which might include transforms or classic MapReduce patterns, and the RDD is only executed when results are needed.

Rather than the fixed map-and-reduce structure of a MapReduce job, Spark describes the work on an RDD as a Directed Acyclic Graph (DAG) of stages, and independent stages can execute concurrently. Spark moves data into memory when executing tasks, and commonly used data can be explicitly cached in memory, which means Spark is fast.
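
As a minimal illustration, the sketch below expresses the matching-line count from the earlier examples as an RDD chain using Spark's Java API; the HDFS path is a placeholder, and nothing runs until count() asks for a result.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch of the matching-line count as an RDD chain; the HDFS path is a placeholder.
public class SparkLineCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dfs-line-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/root/input");

            // Transformations are lazy: filter() and cache() only describe the work to be done.
            JavaRDD<String> matching = lines.filter(line -> line.contains("dfs")).cache();

            // count() is an action, so this is the point where Spark actually runs the job.
            System.out.println("Matching lines: " + matching.count());
        }
    }
}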

Speed of execution and breadth of functionality are making Spark the go-to toolkit for Big Data, but Spark is still rooted in Hadoop. Spark can be deployed on a shared Hadoop cluster, it queries data in HDFS, and Spark jobs can execute on YARN, which lets us run MapReduce and Spark jobs in the same cluster and prioritize work with the scheduler.

A key factor in Spark's adoption is that it supports interactive querying through notebooks such as Jupyter. That allows for fast, exploratory analysis of your data: you can quickly iterate queries over a small, cached sample until you have the results you want, and then submit the job to YARN for the entire dataset.

Spark is also covered in Syncfusion's library with Spark Succinctly, written by Marko Švaljek.

Summary

That quick tour of the Hadoop ecosystem ends this e-book. We looked at some key technologies that extend Hadoop or make use of it as a core platform, and most Big Data journeys will incorporate one or more of these tools.

If you're about to start your journey and you're wondering which tools you should look at, here's a suggested roadmap. Start with the hadoop-succinctly Docker container and some simple MapReduce programs in order to get a good feel for how the cluster operates in practice and to understand how to monitor what the cluster is doing.

Next, move on to a new installation with Ambari, which can be a single node, so that you can easily add components and see how the cluster services are surfaced in Ambari. Pig is a useful alternative to MapReduce if you have widely varying data, and Hive is a good technology to know if you want to provide a data warehouse that analysts can use (mastering HiveQL is much easier for nonengineers than learning Java).

A good time to look at Spark is when you're comfortable with running queries from different engines, monitoring job progress and the health of the cluster, and possibly extending your cluster with new nodes. Although you might find yourself using Spark exclusively in the long term, starting with a solid understanding of Hadoop will definitely prove helpful.
