
Hadoop Succinctly®
by Elton Stoneman


CHAPTER 7

Hadoop Distributions



Introducing distributions

As Hadoop becomes increasingly prevalent across enterprises, the commercial ecosystem around Hadoop continues to grow. Commercial distributions of Hadoop typically bundle the core platform with additional Big Data technologies aimed at providing a single, easy-to-use platform for the entire data analysis landscape.

Commercially available distributions fall into two categories—packaged solutions that we deploy on-premise and pay for either through a support subscription or a product cost, and hosted solutions that run in the Cloud and that we pay for through the hourly (or per-minute) compute cost of the cluster and the storage we use.

In this chapter, we'll look at the major Hadoop distributions to see what they offer over the core Apache offering and how they differentiate themselves in the market.

Cloudera

Cloudera was the first commercial provider of Hadoop, and its chief architect, Doug Cutting, co-created the original Hadoop project. Cloudera's Distribution including Apache Hadoop (CDH) is a packaged Hadoop installation based on the core HDFS and YARN frameworks, and it also includes a broad selection of tools from the Hadoop ecosystem, including HBase and Hive.

The main differentiator for Cloudera over other distributions is the speed at which it adopts new developments in the Hadoop world. Cloudera engineers are actively involved in Apache Big Data projects, and the company is happy to promote early stage technology into its platform.

As the oldest commercial Hadoop provider, with hundreds of large-scale customer deployments, Cloudera has extensive experience in configuring Hadoop, and Cloudera Enterprise ships with default configurations that are better suited to large deployments and require less tuning than a standard Hadoop install.

CDH is free, but Cloudera Enterprise, a commercial offering, adds support and additional features, such as proactively monitoring your Hadoop cluster and comparing it to other customers' clusters. Cloudera uses that insight to power its predictive maintenance component, which can alert you to problems with your nodes before they have a serious impact.

MapR

MapR takes a very different approach for its Hadoop installation, using its own components to provide an optimized distribution rather than packaging the pure Apache components. At the core, their product uses a custom storage layer that is binary-compatible with HDFS but has a different implementation.

The MapR-FS file system exposes a Network File System (NFS) interface, so clients can connect to the storage layer as a mapped drive, while remaining compatible with HDFS. The file system distributes the metadata store across the cluster rather than holding it on a single master, which means it can scale to larger sizes than HDFS and doesn't have the same single point of failure.

The MapR distribution bundles other key tools from the ecosystem but typically reworks and rebrands them: the Big Data-scale queue technology MapR Streams is compatible with Apache Kafka, and, for real-time access, MapR-DB is compatible with HBase.

All the MapR customized components are API-compatible with their Apache versions, but they are tuned for performance and scale at the enterprise level. MapR has its own management portal, the MapR Control System, which lets you configure the system as well as monitor nodes and receive warning alerts.

Hortonworks

Hortonworks provides the most Hadoop-faithful of the commercial distributions. All of the included components are official Apache versions, which means the Hortonworks Data Platform (HDP) product is primarily about packaging, configuring, and supporting open source components.

The open source community has benefitted greatly from Hortonworks' approach. When there have been gaps in its product offering, rather than build a proprietary component, Hortonworks has ramped up support for Apache projects and even gone so far as to buy other companies and donate that technology to Apache. Two of the key open source tools around Hadoop came to Apache through Hortonworks: Ambari, for cluster management, and Ranger, for securing Hadoop.

The Hortonworks Data Platform uses Apache Ambari for configuration and monitoring of the cluster (which we'll see in Chapter 8), and the packaged product includes HBase, Spark, Hive, and more. Hortonworks is typically slightly more conservative than Cloudera in adopting new releases, so the versions of components in HDP can be behind Cloudera (at the time of writing, Cloudera's CDH 5.7 bundles HBase version 1.2.0, while HDP 2.4.0 has HBase 1.1.2).

HDP is distinct from the other leading distributions because it provides a variant that runs on Microsoft Windows. Although Hadoop is Java-based and is technically platform independent, the vast majority of deployments currently run on Linux platforms. With HDP, Hortonworks provides a version of Hadoop that will appeal to the many Microsoft clients now considering Hadoop.

Syncfusion and others

Syncfusion's Big Data platform is a relative newcomer to the market, but it has some key differentiators that make it an attractive option. First, it is Windows-native, which means it is a version of Hadoop specifically built to support Microsoft technologies.

The Big Data platform lets you run as a single node or connect to a cluster running on-premise or in the Cloud. The platform comes with a custom management portal that lets you submit different types of jobs to Hadoop, and the distribution includes sample jobs written in C#, so that Microsoft .NET becomes a first-class citizen.

Part of Syncfusion’s platform is the Big Data Studio, which gives you a nice front-end for navigating Hadoop and other parts of the ecosystem—we see the result from running the word count query in the Studio in Figure 19.


Figure 19: Syncfusion’s Big Data Studio

Also worthy of mention are Pivotal, whose HD distribution is extended with HDB, aiming to make SQL a native part of the Hadoop experience, and IBM, whose BigInsights Hadoop platform integrates deeply with the rest of its "Big" stack.

Amazon Web Services

Amazon provides a packaged Hadoop distribution that runs in the Cloud—Amazon Elastic MapReduce (EMR). EMR uses Amazon's S3 for the storage layer (rather than HDFS) and provisions compute using Elastic Compute Cloud (EC2). This is the same approach that Microsoft uses in the Azure Cloud. The main consideration with cloud-based Hadoop deployments is that you lose the benefit of data locality, although you gain with scale because you are able to easily add or remove compute nodes and have virtually limitless storage capacity.

With EMR, you can provision a cluster from some preconfigured options, including Hive or Spark, and you can specify that custom components be installed as part of the deployment. Amazon also provides a hosted MapR distribution as well as its own Hadoop bundle. Figure 20 shows the creation of a new EMR cluster through the AWS Management Console.


Figure 20: Creating an Elastic MapReduce Cluster
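The console workflow in Figure 20 can also be scripted. Here's a minimal sketch using the boto3 AWS SDK for Python; the cluster name, release label, instance types, and counts are illustrative assumptions for the example, not recommendations:

```python
def build_cluster_spec():
    """Build an illustrative EMR cluster spec (all values are example assumptions)."""
    return {
        "Name": "batch-analysis",                  # hypothetical cluster name
        "ReleaseLabel": "emr-4.7.0",               # example EMR release
        "Applications": [{"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,                    # 1 master + 2 core nodes
            "KeepJobFlowAliveWhenNoSteps": True,   # keep cluster up between jobs
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def create_cluster(spec):
    """Submit the spec to EMR; requires AWS credentials to be configured."""
    import boto3  # imported here so the spec can be built without the SDK installed
    emr = boto3.client("emr")
    return emr.run_job_flow(**spec)["JobFlowId"]
```

`run_job_flow` is the SDK call behind the console's Create Cluster button; the job flow ID it returns identifies the cluster for later step submission or termination.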

You will be charged an hourly rate to run the VMs that comprise the compute part of the cluster and a separate per-gigabyte charge for all the data you have stored in S3.

This is one of the major benefits of cloud-based Big Data platforms. If you don't need to run jobs 24/7, you can provision a cluster when you need it, run your batch jobs, then remove the cluster. All of your data remains in cheap object storage, but the expensive compute power is only used when you need it. It's also easy to scale up when you need to—if you need to add more jobs without extending the overall run time, you can simply change your provisioning scripts to create a bigger cluster.
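The scale-up arithmetic behind that last point can be sketched as a back-of-envelope sizing function; the per-node throughput figure is an assumption you would calibrate from your own job history:

```python
import math

def workers_needed(total_task_hours, window_hours, task_hours_per_node_per_hour=1.0):
    """Rough sizing: worker nodes required to finish a batch inside a fixed window.

    total_task_hours -- sum of the compute hours all the jobs need
    window_hours     -- wall-clock time available for the run
    """
    capacity_per_node = window_hours * task_hours_per_node_per_hour
    return math.ceil(total_task_hours / capacity_per_node)

# 40 node-hours of jobs in an 8-hour overnight window needs 5 workers;
# squeezing the same work into a 4-hour window doubles the cluster to 10.
```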

Cloud-based Hadoop distributions also integrate easily with the rest of the provider's stack—another plus point. Both AWS and Azure provide managed cloud-scale queue technologies suitable for receiving huge quantities of data from disparate sources and scalable compute frameworks you can use to pull data from the queues and store in object storage for later querying in Hadoop.

Microsoft Azure

The managed Big Data stack on Microsoft's Azure Cloud, called HDInsight, is actually powered by the Hortonworks Data Platform, and you can spin up clusters based on Windows or Linux VMs. HDInsight clusters use Azure Storage instead of HDFS, and the compute nodes run on Azure Virtual Machines where the cluster is part of a virtual network.

You can manage HDInsight clusters with PowerShell, with the cross-platform Azure CLI, or through the Azure Portal. Figure 21 shows the portal screen for resizing a cluster.


Figure 21: Scaling an HDInsight Cluster

HDInsight takes a similar approach to EMR, but instead of letting you configure the components to add to the cluster, it only allows you to select from one of the preconfigured cluster setups. In the current platform, you can deploy a plain Hadoop cluster, a cluster running Storm (for stream processing), HBase (for real-time data access), or Spark (for fast in-memory analysis). Because the storage and compute layers are disconnected, you can run multiple clusters that all have access to the same shared storage.

Although HDInsight is based on the open source HDP platform, Microsoft provides customizations at the client level. The Azure SDK integrates with Visual Studio to give you monitoring and deployment from within the IDE, and there are .NET and PowerShell packages to support development.

As with AWS, you can leverage other parts of Azure to feed data into object storage and query it with a cluster. Also, Azure has the Data Factory offering that lets you orchestrate entire runbooks, including spinning up a Hadoop cluster, submitting jobs, waiting for them to complete, and removing the cluster. With any cloud service, though, you need to be aware that the PaaS components are usually proprietary, so if you take a dependency on them, you will find it difficult to migrate to a different platform.

Choosing between distributions

Adoption of Hadoop can be business-driven, from stakeholders wanting to extract more value from their data, and it can be technology-driven, from engineers wanting to make use of the data they know is there. Typically, there's a prototype phase in which you'll work with Hadoop in a local installation, and it's a good idea to run that phase on pure Hadoop, so that you start to understand the capabilities and shortcomings of running your own cluster. The output of that phase should help the business users decide if the potential returns justify the outlay.

Next, you'll need to choose between an on-premise installation and running in the Cloud. The Cloud offers a great way to start cheaply. You won't incur the upfront costs and build time for provisioning your own cluster—you can spin up a cloud cluster within 30 minutes. If your data is cloud-sourced anyway (from global clients outside your own network), it makes good sense to capture and store that data in the Cloud. Initially, you can run your cluster on a part-time basis, when you need to run batch jobs, and your operating costs will be minimal.

However, there is a crunch point in cloud computing, when the economies flip and it becomes cheaper to run your cluster on-premise. If you run a relatively large Hadoop cluster in the Cloud 24x7 and store a large quantity of data in object storage, your Cloud costs could easily be $5K per month. While $5K buys you a pretty well-specified compute node, over a year you might find that your Cloud costs are the equivalent of owning your own small cluster—although keep in mind that your on-premise cluster will have its own running and management costs.
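That crunch point can be estimated with simple arithmetic. The rates below are illustrative assumptions, not real provider prices:

```python
def monthly_cloud_cost(nodes, hourly_node_rate, tb_stored, monthly_rate_per_tb,
                       hours_per_month=730):
    """Estimated monthly bill for a 24x7 cloud cluster plus object storage."""
    compute = nodes * hours_per_month * hourly_node_rate
    storage = tb_stored * monthly_rate_per_tb
    return compute + storage

# Illustrative figures: 10 always-on nodes at $0.50/hour plus 50 TB of object
# storage at $30/TB-month lands around the $5K/month mark mentioned above.
cost = monthly_cloud_cost(nodes=10, hourly_node_rate=0.50,
                          tb_stored=50, monthly_rate_per_tb=30)
```

Plugging in your own rates and a realistic utilization figure (rather than 24x7) is the quickest way to see which side of the crunch point you sit on.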

If you're planning to run on-premise and looking at anything above the minimal Hadoop cluster, a commercial distribution is worth investing in. All the platforms have strong offerings and will likely provide you with a more reliable, more efficient, and more easily managed cluster than you can achieve by spending the same amount on setting up your own cluster from the pure Apache platform.

Ultimately, the platform choice is usually driven by cost, but you should drill down into all the potential costs when you make your decision. Buying your own kit may seem cheaper than paying the ongoing price for a cloud platform, but if you need to upgrade your cluster every six months with an additional 100 TB of storage and another five compute nodes, the cost savings may be illusory.

Summary

In this chapter, we looked at the commercial options for running Hadoop. Although Hadoop is a fully open source platform, it is a complex product with multiple moving parts, and many enterprises don't like to take a dependency on it without having commercial support available. That commercial opportunity is taken up by packaged Hadoop distributions and hosted Hadoop platforms.

In the packaged space, the major players are Cloudera, MapR, and Hortonworks. Each takes a different approach, but there's a lot of commonality in their offerings, so it's worthwhile getting to know the platforms well if you intend to run on-premise. The commercial market is also expanding, with newer products from entrants like Pivotal, Syncfusion, and IBM offering their own strengths.

If you don't have a definite long-term data roadmap based on Hadoop, a Cloud-based platform may be a better option, certainly when you're starting out. Both Amazon and Microsoft provide feature-rich Big Data platforms that have efficient pricing. Data and compute are billed separately, so if you can group all your batch jobs into one run, you can power up a cluster to do all the jobs, then power it down, which means you pay the expensive compute costs only while the cluster is actually running.

Whichever option you choose, be mindful of the points at which the product departs from the standard Hadoop stack. If you focus your own investment on core Hadoop functionality (or functionality that is 100% compatible with Hadoop) and don't take a hard dependency on any proprietary technology, you'll be well placed to move to a different distribution if you later choose to do so.
