CHAPTER 6
So far we've examined master nodes and worker nodes at a high level. Now we're going to look more closely at the Hadoop cluster in order to get a better understanding of how to set up machines and see how they work together.
A Hadoop cluster is a dynamic part of an infrastructure. If you have analytical problems that require a Big Data solution, your data collection will be ongoing and your cluster will need to grow periodically in order to meet increased storage demands. Additionally, when the benefits of rich analysis start to be realized, many companies accelerate their Big Data program, storing more data for longer periods and requiring different types of analysis.
Hadoop is inherently scalable and also richly configurable, which means you can tune your cluster to meet changing requirements and continue to store and process all your data from a single platform.
As you initially design a Hadoop cluster, you will need some understanding of your data requirements in order to provision the correct amount of hardware.
As an absolute minimum, Hadoop clusters used for production workloads need one master that will act as the HDFS name node and the YARN resource manager, and they will need three slaves to act as the HDFS data nodes and the YARN node managers. Figure 16 shows a minimal deployment of Hadoop in which the nodes all reside on a single rack.

Figure 16: A Minimal Hadoop Cluster
The hardware requirements are different for the master and slave servers.
Server specifications continuously improve, but at the time of writing a typical commodity server suitable for use as a Hadoop slave would have 12-16 CPU cores, 48-128 GB of RAM, and up to 12 disks, each with up to 4 TB of storage. A reasonable budget should allow you to settle on a 16-core machine with 48 GB of RAM and 36 TB of storage.
With three slaves of the same specification, you would have a storage capacity of 36 TB (because HDFS replicates three times, every block would be replicated on every server) and a compute capacity for perhaps 36 concurrent task containers (assuming a 25% CPU overhead for running the Hadoop services, leaving 12 cores on each node for YARN to allocate).
That specification may not sound minimal if you're used to dealing with ordinary app servers or even database servers, but in Big Data terms that’s quite a limited setup. There will be enough storage to collect 100 GB of data every day for just under a year, but the compute side is more limited.
At that rate, the cluster is collecting a little over 4 GB per hour, which is roughly 33 file blocks for each hour's worth of data. You would use the full concurrent task capacity of the cluster with a job processing little more than an hour of data. If you want to run a job over an entire month's worth of data (that's 3 TB spread across roughly 24,000 file blocks of 128 MB each), then, assuming each task takes 60 seconds, your job would max out the cluster for about 11 hours. Analyzing the entire year's worth of data would take more than five full days.
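The headline figures fall straight out of the block and task-slot counts. Ignoring scheduling overhead, the arithmetic for the month-long job is roughly:

blocks per month ≈ 3 TB / 128 MB ≈ 24,000
task waves ≈ 24,000 blocks / 36 task slots ≈ 667
job duration ≈ 667 waves × 60 s ≈ 40,000 s ≈ 11 hours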
A minimal cluster is really only a starting point, a reasonably small investment if you're evaluating Hadoop. Typically, a functional small cluster contains about 10 worker nodes. That's a comfortable amount for a single master to manage—beyond that you'd need to scale out your master services, which we'll look at shortly.
The small Hadoop cluster is likely to have servers on multiple racks, typically in a two-layer network infrastructure in which the servers have a fast network connection within the rack and an uplink switch to the rest of the network. Figure 17 shows a small cluster setup with 10 worker nodes across three racks.

Figure 17: A Small Hadoop Cluster
Adding more slave servers gives you additional storage and compute capacity, but it's not quite a linear progression for both. If we keep our replication factor at the default of 3, every three servers we add contribute roughly one server's worth of usable storage. If we added three nodes to our minimal cluster, we'd double the usable capacity to 72 TB, while scaling up to nine servers gives us 108 TB of total usable storage.
Similarly, our nine-server cluster now has 144 cores, so it could execute 108 tasks simultaneously. Unfortunately, that doesn't mean our job to analyze a month's data will take one-third of the time and complete in around 3.7 hours. With a replication factor of 3 and nine servers, YARN has a smaller chance of allocating a task to a node that holds the data locally, which means the average task time will increase as nodes read data over the network.
We can't predict the exact impact of losing locality, because that depends on which other jobs the cluster is running and how efficiently the scheduler works, but if we make a conservative guess that the average task time increases from 60 seconds to 75 seconds, the month-long job will take 4.6 hours. We've tripled our infrastructure, but the bottom-line job duration has only fallen to 40% of the original time, rather than the 33% that perfect scaling would give.
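The same arithmetic, with three times the task slots and the purely illustrative 75-second average task time, gives the 4.6-hour figure:

task waves ≈ 24,000 blocks / 108 task slots ≈ 222
job duration ≈ 222 waves × 75 s ≈ 16,700 s ≈ 4.6 hours, which is roughly 40% of the original 11 hours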
In the wild, there are Hadoop deployments with thousands of data nodes that store and process petabytes of data. One of the main drivers for the re-architecture of HDFS and YARN in Hadoop 2 was exactly this issue of scale, with the goal of pushing the limits on cluster size out to extraordinarily large deployments.
From the slave side, very large Hadoop clusters have large numbers of data nodes. You can separate the storage and compute functions, so that your HDFS data nodes run on one set of machines and your YARN node managers run on a separate set, but what you gain in distribution you will lose in data locality, so this is not a common pattern (except in cloud-hosted clusters, which we’ll see in Chapter 8).
From the master side, large clusters look very different because they do split out and federate the storage role. One name node cannot hold the entire file system namespace in memory if there are billions of file blocks. If you store data at the rate of 1 TB per hour, you are adding a million file blocks every week and you will quickly hit memory limitations. HDFS allows federation, which means multiple name nodes each store part of the namespace.
YARN currently doesn’t support federated resource managers. A high-performance resource manager can support about 4,000 node managers, which is the current practical limit of Hadoop (designs are being discussed to support federated resource managers, which you can follow on the Apache JIRA ticket YARN-2915).
Hadoop high availability and federation are beyond the scope of this e-book, but they are a well-trodden path. The key difference with large clusters is the requirement, usually filled by Apache ZooKeeper, for a fast subsystem for shared state and notifications.
Larger clusters don't only provide more storage and compute; they can also increase reliability, because there are more machines for HDFS and YARN to call on when hardware fails.
Hadoop is built on the expectation that hardware will fail and that Hadoop’s work will continue in the face of failure. Failure is more common in HDFS data nodes because they are more numerous and have more failure points—a server with 12 spinning disks will surely have a failure at some point. In fact, with large enough clusters, daily failures become a certainty.
Data nodes might have failed disks, they might lose network connectivity, or the server might fail altogether, but Hadoop has a simple mechanism to address all these issues—data nodes periodically send messages to the name node, which monitors them and responds to failures.
The data nodes send two types of messages—the heartbeat message, which confirms that the node is up and able to communicate, and blockreports, in which the node sends a list of every block it has stored, across all its disks. These mechanisms, shown in Figure 18, are enough for HDFS to maintain reliability.

Figure 18: Heartbeat and Blockreport Messaging in HDFS
If heartbeat messages from a data node stop, the name node assumes the data node is offline and marks it as unavailable. The block replicas recorded against that node are removed from the name node's block map, which can mean files no longer have the required number of block replicas. In that case, the name node will instruct other data nodes to make new copies of the blocks that were on the unavailable node, and, when the replication completes, the file system will be back to full replication.
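The timings involved here are configurable in the hdfs-site.xml file, which we'll cover later in this chapter. As a sketch only (these are not overrides from the hadoop-succinctly container, just the documented 2.7.3 defaults), the relevant properties look like this:

<configuration>
  <!-- Interval between data node heartbeats, in seconds (default 3) -->
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
  </property>
  <!-- How often the name node checks for dead data nodes, in milliseconds (default 300000) -->
  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>300000</value>
  </property>
  <!-- Interval between full blockreports, in milliseconds (default 21600000, i.e. six hours) -->
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>21600000</value>
  </property>
</configuration>

With those defaults, the name node waits for roughly ten minutes of missed heartbeats before it declares a data node dead, so a short network blip won't trigger a storm of re-replication.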
With the default configuration, a single disk failure in a node will cause the data node service to shut down, triggering another replication by the name node. You can configure HDFS to tolerate one or more disk failures in a data node, which is a reasonable approach provided you have monitoring in place to alert you when disks fail.
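The setting that controls this tolerance is dfs.datanode.failed.volumes.tolerated, which defaults to 0. As a sketch (this is not part of the hadoop-succinctly setup), an hdfs-site.xml override that lets each data node keep running with up to two failed disks would look like this:

<configuration>
  <!-- Number of failed disks a data node will tolerate before shutting down (default 0) -->
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>2</value>
  </property>
</configuration>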
HDFS endeavors to keep your data available, but it does not automatically make optimum use of space in the cluster. If a failed node is returned to the cluster, or if new nodes are added, file blocks become unevenly distributed across the nodes: a returning node can mean some blocks have more replicas than you need, while new nodes start off empty, which means they'll have no data locality for tasks.
Hadoop has the functionality to rebalance data across the cluster without any downtime, so that client requests are still serviced. It’s designed not to be an invasive process—it isn’t compute-intensive, although rebalancing typically takes a long time.
You can run the command hadoop balancer to start rebalancing data across the cluster, but, in order to conserve cluster capacity, Hadoop uses the compute power of the machine running the balancer when deciding where blocks should move, which means you should invoke it on a fast machine with fast access to the cluster. Even so, rebalancing many terabytes of data will take days.
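Part of the reason rebalancing is slow by design is that the network bandwidth each data node will spend on balancing traffic is capped, so normal client reads and writes aren't starved. The cap is the dfs.datanode.balance.bandwidthPerSec property in hdfs-site.xml, which defaults to a modest 1 MB/s in the Hadoop 2.7 line. A sketch that raises the cap to 10 MB/s (again, not an override used in the container) looks like this:

<configuration>
  <!-- Maximum bandwidth, in bytes per second, each data node may use for rebalancing (default 1048576) -->
  <property>
    <name>dfs.datanode.balance.bandwidthPerSec</name>
    <value>10485760</value>
  </property>
</configuration>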
Just as HDFS monitors data availability and remedies any lost blocks, YARN monitors job progress and can address issues with lost tasks and failed jobs. While tasks run, they are the responsibility of the application master, and they send it heartbeat messages in the same way that data nodes send heartbeats to the name node.
If a task fails unexpectedly, or if its heartbeat stops, the application master marks the task as failed and requests another container from the resource manager, ideally on a different node from the one where the task failed, in which to retry the task.
YARN lets you configure how many retries are permitted for map and reduce tasks. The defaults allow four attempts, and if any one task fails four times, the whole application is flagged as a failure. That is also configurable, which means you can set up YARN to allow a certain percentage of task failures while keeping the job running.
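These limits live in mapred-site.xml. A sketch showing the documented defaults (none of these are overridden in the hadoop-succinctly container) looks like this:

<configuration>
  <!-- Maximum attempts per map task before it is marked as failed (default 4) -->
  <property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
  </property>
  <!-- Maximum attempts per reduce task (default 4) -->
  <property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
  </property>
  <!-- Percentage of map tasks that may fail without failing the job (default 0) -->
  <property>
    <name>mapreduce.map.failures.maxpercent</name>
    <value>0</value>
  </property>
</configuration>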
Running applications are the responsibility of the resource manager, and application masters send it heartbeat messages of their own. If an application master ends in a failed state or stops sending heartbeats, the resource manager ends the application attempt and will retry it with a new application master. The number of retries for an application is also configurable, which means YARN allows for a lot of tuning to ensure that transient failures don't stop job completion and that a bad job (which could be down to defects in the code or the data) doesn't clog up the cluster.
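The application retry limit is a cluster-wide setting in yarn-site.xml, defaulting to two attempts. As a sketch (not an override used in the container):

<configuration>
  <!-- Maximum application master attempts before the application is failed (default 2) -->
  <property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
  </property>
</configuration>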
We've seen several points where Hadoop's behavior is configurable, and there are a huge number of switches we can tweak to alter the characteristics of the cluster.
Hadoop is configured using XML files that are present on each node. The configuration files are in the etc/hadoop directory within the Hadoop installation folder (which is typically referenced with the HADOOP_HOME environment variable, e.g., $HADOOP_HOME/etc/hadoop in the hadoop-succinctly Docker image). Typically, there are four files in which we configure settings, and we only need to specify settings where we wish to override the default values.
There are far too many options to cover here, but we'll look at the main configuration files and the specific setup used in the hadoop-succinctly container. The default settings are typically suitable for a basic development environment, which means there are some settings you always override.
The core-site.xml file holds the core, Hadoop-wide settings used by HDFS, YARN, the web servers, and the APIs. The default configuration for the relevant version is in the Hadoop documentation; for 2.7.3, these are the core-default.xml settings.
This file configures key settings such as the security used by Hadoop and the default file system (Hadoop can use file systems other than HDFS, but only HDFS provides the performance boost of data locality). The default is the operating system's file system, which means we need to override that in order to use HDFS. Code Listing 33 shows the setup for the Docker container.
Code Listing 33: Specifying the File System in core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
This is the entire core-site.xml, which means we specify only one property—all the other settings come from the default file. We specify fs.defaultFS, which is the default file system for Hadoop, and it is specified in terms of the base URI for the provider—in this case, an HDFS path on port 9000 of the local host (which is the Docker container).
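Any other core setting follows the same pattern. For example, another property that is often overridden in core-site.xml is hadoop.tmp.dir, the base directory Hadoop uses for its temporary and working files; the default sits under /tmp, which is usually unsuitable for real clusters. This snippet is only an illustration (the path is hypothetical, and the hadoop-succinctly container doesn't override it):

<configuration>
  <!-- Base for Hadoop's temporary directories (default is /tmp/hadoop-${user.name}) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>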
The hdfs-site.xml file holds site-specific overrides for the properties that configure HDFS. The available properties and default values are listed in the hdfs-default.xml documentation.
HDFS configuration properties let you specify the ports the services should listen on, whether or not data transfer should be encrypted, and the behavior of the file cache. For the Docker container, we only override one property, as seen in Code Listing 34.
Code Listing 34: Specifying the Replication Factor in hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
The dfs.replication property specifies the default replication factor for file blocks in HDFS. The default factor here is 3, but our container only runs a single data node service. When Hadoop starts up, it uses a special safe mode while it checks for file system consistency. If it expects three block replicas but only one data node ever reports back, Hadoop won't leave safe mode, and the cluster is effectively read-only.
Specifying a replication factor of 1 allows the pseudo-distributed cluster on the container to run correctly with a single data node process.
The mapred-site.xml file configures the behavior of MapReduce jobs. In v1 versions of Hadoop, this file configured the behavior of the MapReduce engine as well as the actual jobs, but with YARN, the runtime behavior lives in a separate file. The default property values are specified in the mapred-default.xml documentation.
In the mapred-site.xml, we can specify values such as the default number of reducers used for each job, whether or not the data output from mappers should be compressed, and the type of sort to perform on map keys. In the Docker setup, we have overridden one property, as in Code Listing 35.
Code Listing 35: Specifying the MapReduce Engine in mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
This specifies which engine to use for MapReduce jobs. Because YARN introduced a very different model of computation in Hadoop v2, the default does not assume the YARN engine, so clusters upgrading from older versions won't unknowingly switch to a new engine that is potentially incompatible. Setting the value to yarn ensures jobs run on the YARN runtime.
The last file we'll look at, yarn-site.xml, configures the behavior of YARN. As usual, the available properties and default settings are listed in the yarn-default.xml documentation.
We can configure a lot of low-level settings here, such as the heartbeat interval for application masters to send to the resource manager, the type of scheduler to use, and the minimum and maximum CPU core and memory allocation requests YARN will allow. The Docker container specifies two properties here, as in Code Listing 36.
Code Listing 36: Specifying the Shuffle Service in yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Between MapReduce v1 and YARN, the way the output from mappers gets sorted before being sent to the reducer(s) changed. The sorting of keys is the shuffle phase, and in MapReduce v1 the shuffle implementation was fixed. YARN allows for a pluggable shuffle framework, defined as an auxiliary service running on the node managers, and in this configuration we specify the standard ShuffleHandler class as the implementation of the mapreduce_shuffle service.
In this chapter, we looked more closely at the details of the Hadoop cluster. We examined both the hardware requirements for master and slave nodes and how Hadoop looks in clusters of different sizes. Because Hadoop clusters are capable of expanding gracefully, a good approach to sizing a new implementation is to estimate the rate of data capture and plan for a cluster that has enough capacity for a known period.
We saw how Hadoop nodes communicate among themselves, sending heartbeats and status reports, and we looked at how that facilitates reliability in the cluster. With HDFS, the loss of a data node triggers additional replication of data among the remaining nodes. With YARN, the loss of a node manager means the tasks and applications running on that node will be rescheduled on the cluster.
Lastly, we looked at the configuration framework in Hadoop and saw some of the properties we can override. All the key decision points in the operation of Hadoop have settings we can specify, which means we can tune our cluster to operate most efficiently for our requirements.
Deploying and configuring a Hadoop cluster is not a trivial task, and there are many commercial products that wrap Hadoop in customized and supported packages. Next, we'll look at some of the popular Hadoop distributions.