CHAPTER 1
Many of the leading companies in the world today face the problem of big data. There is no single, agreed-upon definition of what big data actually is; it’s a vague term, and most authors in the field have their own definition. The scope of this book is too limited to go into those discussions, but if you are a seasoned developer in the field, you probably have your own view of what big data is.
If you are a newcomer to the field, perhaps the simplest explanation of the concept is that big data is data that can’t be handled with traditional computing technologies, which mostly rely on a single machine to process data. I won’t go into corner cases like “what if you had a really powerful computer” and so on. The easiest way to think about big data is that it’s data that can’t be processed or stored by a single machine. In a typical big data use case, the flow of information usually looks like the following figure:

Figure 1: Big Data Information Flow
To work with big data, first you have to gather it. The gathered data can come from almost anything, e.g., weather stations, social media, or even another computer process. If you have ever worked with large data sets, you might know that ingested data is often not in the format we expect, or it might contain various invalid values, so data is usually preprocessed to keep it usable in our system. The next step in the information handling flow is storing the data. We’ll explain the store step in the chapters to come; for now, it’s sufficient to say that it simply holds data that is going to be processed at a later point in time. The final step is usually post-processing. This book is mostly about the post-processing step because that’s where Spark is most often used. Now, don’t get me wrong: you can use Spark in any of the steps, but most of the time Spark is used as the last step in the big data information flow. This is a somewhat simplified description of big data in the context of Spark usage, but it’s enough to get you going for the chapters to come. Let’s have a look at a high-level overview of Spark usage:

Figure 2: Spark in the Big Data Flow
Note that in the figure Spark takes input data and produces output data using the same storage. In practice this can vary, as data can come in from multiple data sources. Most of the examples provided in this book will be relatively simple when it comes to data output and the number of data processing results because the focus is on how to gather and process the data. Data output is the relatively easier step, but remember that in practice Spark writes its output to other storage systems, especially in big data production systems. This doesn’t mean that you can’t have any fun with Spark on your own, or that you can only use it on multiple machines with multiple cores. You can also use Spark to explore data on your own machine by using the shells bundled with Spark.
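For example, a quick data exploration session in the Scala shell that ships with Spark might look something like the following minimal sketch. The file name is just a placeholder, and `sc` is the SparkContext that the shell creates for you automatically.

```scala
// Started with ./bin/spark-shell; the shell already provides `sc`, a SparkContext.
val lines = sc.textFile("data/measurements.txt")             // hypothetical local text file
println(lines.count())                                       // how many lines does the file have?
lines.filter(_.contains("error")).take(5).foreach(println)   // peek at the first few matching lines
```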
There are many programming languages, and Spark as a technology is not bound to just one of them. The languages supported out of the box in the Spark 1.4 release are Scala, Java, Python, and R.
If you want to use Spark with other programming technologies, there is the concept of pipes, with which Spark can read and write data from standard Unix streams. We’ll discuss pipes in later chapters of the book. This has been a very high-level overview of what Spark is and how it can be used in the context of big data processing. Let’s dive into the basics of big data processing before getting to know Spark in more detail.
In the previous section, we mentioned that the simplest definition of big data would be that it is data that can’t be processed on a single computer, at least not within a reasonable timeframe. Theoretically, one could solve the problem by employing a faster computer with more memory, but that kind of specialized machine costs a great deal of money. The next logical step would be to split the computing among multiple commodity computers. But splitting the data and the processing among them is not as simple as it might seem:

Figure 3: The Problem of Splitting Data Processing
Doing such a thing involves a great deal of scheduling of processing time and I/O operations. Letting the nodes negotiate among themselves how to split the tasks would require complex, hard-to-debug technology, with significant overhead and negotiation time among the nodes. To avoid those problems, most modern systems use a master-worker architecture. The master node determines what data is processed on which worker node and when, as shown in the following figure:

Figure 4: Master-Worker Architecture
Some of you might think that the master node becomes a bottleneck in a master-worker architecture. That is true in some solutions, but most modern ones put the master node in charge only of delegating the work and monitoring the results; the worker nodes fetch the data themselves from the source according to the master’s instructions. Up until now, the data was simply shown as a small cloud and we didn’t go into it. Earlier we mentioned that big data also refers to data that can’t actually fit on a single computer. That’s why data is often saved on a distributed file system. In the world of big data, the most popular distributed file system at the moment is the Hadoop Distributed File System (HDFS). Some of you might now think something like: wait, this book is supposed to be about Spark and not Hadoop. And some of you might not even know what Hadoop is. If you want to find out more about Hadoop, please use your favorite search engine; Hadoop will be mentioned throughout the book because this framework has been the industry standard for big data processing for the past decade, and according to some people it still is. Many companies, especially in the Fortune 100, already have a lot of infrastructure built on that framework. The Spark creators didn’t want to reinvent the wheel, and they also wanted a better adoption rate for Spark in those companies. That is much easier on top of an existing infrastructure, so, for instance, Spark supports HDFS out of the box.
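As a small illustration of that out-of-the-box support, pointing Spark at an HDFS path looks no different from pointing it at a local file. The namenode host, port, and path below are placeholders used only for illustration.

```scala
// Reading a file stored on HDFS from Spark; the address and path are hypothetical.
val readings = sc.textFile("hdfs://namenode:8020/data/weather/readings.csv")
println(readings.count())   // the count is computed across the cluster, block by block
```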
Let’s have a look at the HDFS architecture. You might find it interesting because it’s also a master-worker architecture:

Figure 5: Overview of Master-Worker Hadoop Distributed File System Architecture
The nodes shown in the previous figure are not named master or worker, but from the architecture it’s quite clear that the NameNode plays the master’s role. Besides outlining the architecture, the previous figure is very important because it describes how distributed systems overcome failures. As systems get bigger, there is a significant statistical chance that some of the nodes will experience outages from time to time. Data on data nodes has redundant copies on other data nodes in case a particular node fails, and the master node knows where those copies are. If a data node fails, its data is simply read from the copies. The master node is essentially a single point of failure in HDFS; to cope with this, a backup node is added to the system. The backup node periodically checks whether the master node is alive and syncs data with it. If for some reason the name (master) node fails, the backup node takes over the master’s role in the cluster.
A basic hello world example in big data analysis is the word count, and we will go over how to do that with Spark in later chapters. Before going into the history of Spark and the basic concepts surrounding it, I would like to go over a very important data processing concept that has been around for quite some time now: a programming model called MapReduce. There are three basic phases in MapReduce: first the Map phase, followed by the Shuffle phase, with the Reduce phase at the end. A basic overview is displayed in the following figure:

Figure 6: Overview of MapReduce Programming Model
The easiest way to describe MapReduce is with the word-count example. Imagine for a moment that a very large file contains only three distinct words and that we are interested in the number of occurrences of each word. In the MapReduce programming model, every node would locally group the words together and count each of them in the map phase. The master node (not shown in the previous figure) would then assign every word to a specific node. The data is shuffled in the next step so that all counts for a given word end up on the node responsible for it. The shuffle phase usually takes the longest because nodes have to transfer the data over the network and store it in their temporary storage. Each node then combines the information that came in from other nodes and, as a final step, produces and stores a result, the word count in our example. At the end of every phase, data is written to disk. Writing data to disk is usually the slowest part of data processing, and with MapReduce there are at least three phases where this takes place.
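To make the phases a bit more concrete, here is a rough preview of how the word-count example might look in Spark; we will build it up properly in later chapters. The input and output paths are placeholders, and `sc` is the SparkContext.

```scala
// A minimal word-count sketch: flatMap and map correspond to the map phase,
// while reduceByKey triggers the shuffle and performs the reduce.
val lines = sc.textFile("hdfs://namenode:8020/data/words.txt")      // hypothetical input
val counts = lines
  .flatMap(line => line.split("\\s+"))    // split every line into words
  .map(word => (word, 1))                 // emit a (word, 1) pair per occurrence
  .reduceByKey(_ + _)                     // sum the counts for each word
counts.saveAsTextFile("hdfs://namenode:8020/output/word-counts")    // hypothetical output
```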
The basic concept behind the MapReduce model, used by Hadoop, is that it distributes the processing and disk load to the nodes in the cluster; this was the industry standard when Spark was introduced in 2009. The main difference between Spark and Hadoop is that Spark can distribute not only processor and disk usage, but also in-memory operations. Because of that, it can achieve much greater speeds while processing data. For orientation, let’s have a look at the time it takes to complete various computer-related operations, from one CPU cycle up to a physical system reboot. To make them easier to compare, we’ll add a human comparison column in which the fastest operation is mapped to a one-second interval:
Table 1: Computer Task Processing Times Scaled to Human Time Perception
| Event | Duration | Human comparison |
|---|---|---|
| 1 CPU Cycle | 0.3 ns | 1 s |
| L1 cache access | 0.9 ns | 3 s |
| L2 cache access | 2.8 ns | 9 s |
| L3 cache access | 12.9 ns | 43 s |
| RAM access from CPU | 120 ns | 6 min |
| SSD I/O operation | 50–150 µs | 2–6 days |
| Rotational disk I/O operation | 1–10 ms | 1–12 months |
| Internet: San Francisco to New York | 40 ms | 4 years |
| Internet: San Francisco to United Kingdom | 81 ms | 8 years |
| Internet: San Francisco to Australia | 183 ms | 19 years |
| TCP packet retransmit | 1–3 s | 105–317 years |
| Physical system reboot | 5 min | 32 millennia |
The most important comparison in the previous table, for our purposes, is the one between RAM access from the CPU and a rotational disk I/O operation. Once again, if one CPU cycle were a second, a RAM access would take six minutes and a rotational disk I/O operation somewhere from a month up to a whole year. Modern computer systems have significant amounts of RAM available to them, so it was probably only a matter of time before somebody figured out that RAM can be used as a distributed resource for processing large amounts of data. This is actually one of Spark’s first killer features, and we will describe the details in the chapters to come.
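To give a feel for what that means in practice, the following minimal sketch caches a dataset in cluster memory so that repeated passes over it avoid the slow disk I/O shown in the table; the input path is a placeholder.

```scala
// Keep the dataset in memory after the first pass; later actions read it from RAM.
val logs = sc.textFile("hdfs://namenode:8020/data/logs")     // hypothetical input path
logs.cache()
val errors   = logs.filter(_.contains("ERROR")).count()      // first pass: reads from disk
val warnings = logs.filter(_.contains("WARN")).count()       // second pass: served from memory
```

The first action pays the price of reading from disk; every action after that works against memory. Let’s have a look at how Spark developed in the next section.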
Spark was created at the UC Berkeley AMPLab in 2009 by Matei Zaharia. It might sound unbelievable, but the first version of Spark was written in only 1,600 lines of Scala code. One of the main goals behind Spark was to build a big data processing framework fast enough for machine learning. Hadoop was not usable for this because it involved a lot of disk usage, and near-real-time data processing took around twenty to thirty minutes at the time Spark was created. In its initial version, Spark could solve the same tasks in under a minute because of aggressive memory usage. The Spark project was open-sourced in 2010 under a BSD license. The BSD license allows unlimited redistribution for any purpose as long as its copyright notices and the license’s disclaimers of warranty are maintained. The license also contains a clause restricting use of the names of contributors for endorsement of a derived work without specific permission. In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. The Apache license is a bit more restrictive when it comes to copyright and redistributing code, and it also has special requirements, mostly pertaining to giving proper credit to those who have worked on the code and to maintaining the same license. In the same year, the creators of Spark started a company called Databricks, whose goal is to help clients with cloud-based big data processing using Spark. In February 2014, Spark became a top-level Apache project.
The next very important thing happened in November 2014, when Spark won the Daytona GraySort contest. Daytona GraySort is a competition in which various companies show off their big data processing frameworks and solutions. The basic goal is to sort 100 terabytes of data (consisting of one trillion records) as fast as possible. The data that needs to be sorted is located on HDFS, whose inner workings we described in the previous section. Sorting the data usually takes around 500 TB of disk I/O and around 200 TB of network I/O. Organizations from around the world often build dedicated sort machines with specialized hardware and software. Winning the 2014 sort competition was a very important milestone for the Spark project, especially when the previous world record, set by Yahoo with Hadoop MapReduce, is taken into account:
Table 2: Daytona GraySort 2014 Competition Results
| | 2013 Record: Hadoop | 2014 Record: Spark |
|---|---|---|
| Data Size | 102.5 TB | 100 TB |
| Elapsed Time | 72 minutes | 23 minutes |
| Number of Nodes | 2100 | 206 |
| Sort rate | 1.42 TB/min | 4.27 TB/min |
| Sort rate per node | 0.67 GB/min | 20.7 GB/min |
The margin by which the previous record was surpassed is simply incredible: Spark managed to sort the data three times faster with ten times fewer machines.

Figure 7: Number of Nodes Used in the GraySort Challenge by Spark and Hadoop
The results of this competition attracted a lot of developers from around the world, and Spark had over 465 contributors in 2014. That made it the most active project in the Apache Software Foundation and probably the most active open-source big data project. In the next section, we’ll give an overview of Spark as a platform for big data processing.
Spark consists of multiple components. The central one is Spark Core, and all of the other components are built on top of it. This approach has many benefits. One of them is that when an optimization is added to the core, all of the other components start to use it right away. Another is that the Spark code base remains compact and reinventing the wheel is reduced to a minimum. A Spark overview is shown in the following figure:

Figure 8: Spark Components
Spark Core is in charge of memory management, task scheduling, error recovery, and interaction with storage systems. It also defines resilient distributed datasets (RDDs). We’ll talk about them in the chapters to come, but for now it is sufficient to describe them as distributed collections of items that can be manipulated in parallel.
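As a tiny sketch of that idea, a local collection can be turned into a distributed one and processed in parallel across the cluster:

```scala
// Distribute a local collection across the cluster and operate on it in parallel.
val numbers = sc.parallelize(1 to 1000000)
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(sumOfSquares)
```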
Spark SQL is a package for working with structured data. The component started out as support for the Hive Query Language, but over time it grew into a component that supports working with almost any kind of data, from JSON to large datasets residing on distributed storage. Spark SQL gives developers a unique ability to combine programming constructs with SQL-like syntax when working with structured data.
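For instance, a minimal Spark SQL session, assuming the Spark 1.4 API and a hypothetical JSON file, could look like this:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("data/people.json")   // hypothetical JSON input
people.registerTempTable("people")                      // expose the data to SQL queries
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
```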
Spark Streaming enables processing of data streams. Some of you might wonder what a data stream is. In essence, it’s a continuous influx of data from all sorts of sources, like log files or queuing solutions that take in messages from clients. Traditional big data processing frameworks are oriented towards batch processing of data. This component is a step towards modern data processing frameworks because it’s designed from the beginning to work with live, incoming data and to generate results on the fly.
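A minimal sketch of such a streaming job, assuming a text stream arriving on a local socket, could look like the following; the host and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // process data in 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical socket source
lines.filter(_.contains("ERROR")).print()             // report matching lines as they arrive
ssc.start()                                           // start receiving and processing live data
ssc.awaitTermination()
```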
MLlib provides functionality for machine learning. In essence, it’s a loose collection of high-level algorithms. One of the main driving forces behind Spark’s creation was machine learning; the solutions available at the time were relatively slow because they didn’t exploit the speed of RAM. In the past decade, machine learning at this scale was based around a library called Mahout, which was primarily used in Hadoop ecosystems. In 2014, Mahout announced it would no longer accept Hadoop MapReduce code, and it switched new development to Spark’s MLlib library.
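As a small, hedged example of what the library offers, clustering a set of points with MLlib’s k-means implementation could look like this; the input file and its format are assumptions made for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one point per line, coordinates separated by spaces.
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(points, 2, 20)   // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)
```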
GraphX is a library for graph data manipulation. It allows users to create and manipulate graph data consisting of vertices and edges. With that, we complete our short overview of Spark. In the next chapter, we are going to go over the Spark installation steps.