CHAPTER 2
To help get a grasp on the tools within HDInsight, we will demonstrate their usage by applying a simple Sentiment Analysis process to a large volume of unstructured text data. This short, non-technical section looks at what Sentiment Analysis is and sets out the simple approach that we will use as we progress through our exploration of HDInsight.
Sentiment Analysis is the process of deriving emotional context from communications through analyzing the words and terms used in those communications. This can be spelled out in the simple example below:
Step 1: Take some simple free-form text such as text from a hotel review:
Title | Hotel Feedback |
Content | I had a fantastic time on holiday at your resort. The service was excellent and friendly. My family all really enjoyed themselves. The pool was closed, which kind of sucked though. |
Step 2: Take a list of words deemed as “positive” or “negative” in Sentiment:
Positive | Negative |
Good | Bad |
Great | Worse |
Fantastic | Rubbish |
Excellent | Sucked |
Friendly | Awful |
Awesome | Terrible |
Enjoyed | Bogus |
Step 3: Match the text to the Sentiment word list:
Title | Hotel Feedback |
Content | I had a fantastic time on holiday at your resort. The service was excellent and friendly. My family all really enjoyed themselves. The pool was closed, which kind of sucked though. |
Step 4: Count the Sentiment words in each category:
Positive | Negative |
Fantastic | Sucked |
Excellent | |
Friendly | |
Enjoyed | |
4 | 1 |
Step 5: Subtract the negative from the positive:
Positive Sentiment | 4 |
Negative Sentiment | 1 |
Overall Sentiment | 3 |
In this example, the overall Sentiment of this particular block of text is positive (4 − 1 = 3), so an automated system could interpret it as a positive review.
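The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used later in the book: the word lists are copied from Step 2, while the function name `score_sentiment` and the tokenizing regular expression are assumptions made for this example.

```python
import re

# Word lists copied from Step 2 (lowercased for matching).
POSITIVE = {"good", "great", "fantastic", "excellent", "friendly", "awesome", "enjoyed"}
NEGATIVE = {"bad", "worse", "rubbish", "sucked", "awful", "terrible", "bogus"}


def score_sentiment(text):
    """Return (positive_count, negative_count, overall) for a block of text."""
    # Step 3: split the text into lowercase words, dropping punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    # Step 4: count the words that appear in each list.
    positive = sum(1 for word in words if word in POSITIVE)
    negative = sum(1 for word in words if word in NEGATIVE)
    # Step 5: subtract the negative count from the positive count.
    return positive, negative, positive - negative


# Step 1: the hotel review text.
review = (
    "I had a fantastic time on holiday at your resort. "
    "The service was excellent and friendly. "
    "My family all really enjoyed themselves. "
    "The pool was closed, which kind of sucked though."
)

print(score_sentiment(review))  # (4, 1, 3)
```

Using sets for the word lists keeps each lookup cheap, which matters once the same matching is applied to a large volume of text.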
The view presented above is a very simplistic approach to Sentiment Analysis, as it examines individual words free of context and decides whether they are positive or negative. For example, consider this paragraph:
“I think you misunderstand me. I do not hate this and it doesn’t make me angry or upset in any way. I just had a terrible journey to work and am feeling a bit sick.”
Examined with the human ability to derive context, this is not a negative comment at all; it is quite apologetic. But it is littered with words that, assessed in isolation, would present a very negative view. Simple context can be added by considering the influence of modifying words such as “not”, though this has an impact on processing time. More complex context enters the domain of Natural Language Processing (NLP), a deep and complicated field that attempts to address these challenges.
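As a rough illustration of how a modifying word such as “not” might be taken into account, the sketch below ignores any negative word that follows a negator within the same clause. The word lists, the clause-splitting rule, and the function name `count_negative` are all assumptions made for this example; real negation handling is considerably more subtle.

```python
import re

# Hypothetical lists for this example only; they are not the lists used elsewhere.
NEGATIVE = {"hate", "angry", "upset", "terrible", "sick"}
NEGATORS = {"not", "no", "never", "don't", "doesn't", "didn't"}


def count_negative(text, handle_negation=True):
    """Count negative words, optionally skipping those that follow a negator."""
    # Normalize curly apostrophes so "doesn’t" matches "doesn't".
    text = text.lower().replace("\u2019", "'")
    count = 0
    # Treat sentence and clause punctuation as the limit of a negator's influence.
    for clause in re.split(r"[.!?;,]", text):
        negated = False
        for word in re.findall(r"[a-z']+", clause):
            if word in NEGATORS:
                negated = True
            elif word in NEGATIVE and not (handle_negation and negated):
                count += 1
    return count


paragraph = (
    "I think you misunderstand me. I do not hate this and it doesn\u2019t "
    "make me angry or upset in any way. I just had a terrible journey "
    "to work and am feeling a bit sick."
)

print(count_negative(paragraph, handle_negation=False))  # 5
print(count_negative(paragraph, handle_negation=True))   # 2
```

The extra pass over each clause is a small example of the additional processing that context handling requires.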
A second issue is the weight that is given to particular words. “Hate” is a stronger expression of dislike than “dislike” is, but where on that spectrum are “loathe” and “sucks”? A given person’s writing style would also affect the weight of such words. Someone prone to more dramatic expressions may declare that they “hate” something that is just a minor inconvenience, whereas a more diplomatic person may state that they are “concerned” about something that has actually caused them great difficulty.
This can be addressed in a couple of ways. The first way is to set aside the individual’s style and apply weighting to specific words according to a subjective judgment. This, of course, presents the challenge that the list of words will be long and, therefore, assigning weights will be a time-consuming effort. Also, it is quite probable that not all the words will be encountered in the wild. The second way, which reflects a technique used in the analytical world when addressing outcomes on a scale that is not absolute, is to use a simple approach that classifies each word as positive, negative, or, in the absence of a categorization, neutral, and sets the scale issue to one side.
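The difference between the two approaches can be sketched as follows. The weights are invented purely for illustration (they are exactly the kind of subjective judgment described above), and the function names are hypothetical.

```python
# Approach 1: hand-assigned weights. These values are illustrative assumptions.
WEIGHTS = {
    "loathe": -3,
    "hate": -3,
    "dislike": -1,
    "concerned": -1,
    "enjoyed": 2,
    "fantastic": 3,
}

# Approach 2: each word is positive, negative, or (if unlisted) neutral.
POSITIVE = {"enjoyed", "fantastic"}
NEGATIVE = {"loathe", "hate", "dislike", "concerned"}


def weighted_score(words):
    """Sum the subjective weight of every listed word; unlisted words score 0."""
    return sum(WEIGHTS.get(word, 0) for word in words)


def ternary_score(words):
    """Score +1 for a positive word, -1 for a negative word, 0 otherwise."""
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)


words = "i hate the queue but enjoyed the fantastic pool".split()

print(weighted_score(words))  # 2
print(ternary_score(words))   # 1
```

The ternary version needs only two word lists to maintain, which is the trade-off the second approach accepts in exchange for ignoring the strength of each word.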
A third issue is the distribution and use of words in a given scenario. In some cases, words that are common in the domain being analyzed may give false positives or negatives. For example, a pump manufacturer looking at reviews of its products should not count the word “sucks” as negative, since it would feature in descriptions of those products’ capabilities. This is a simpler issue to address: as part of any Sentiment Analysis, it is important to review the most frequent words that are affecting Sentiment, in case words are being scored as positive or negative when they are actually neutral in that specific domain.
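One way to carry out that review is to count how often each Sentiment word is matched and then exclude any that turn out to be neutral in the domain. The sketch below assumes made-up pump reviews, short hypothetical word lists, and a hypothetical `domain_neutral` parameter.

```python
import re
from collections import Counter

# Hypothetical word lists and reviews, for illustration only.
POSITIVE = {"great", "excellent"}
NEGATIVE = {"sucks", "terrible"}

reviews = [
    "This pump sucks water out of the cellar quickly. Great product.",
    "It sucks up sludge without clogging.",
    "Terrible packaging but the pump sucks well.",
]


def sentiment_word_counts(reviews, domain_neutral=frozenset()):
    """Count matched Sentiment words, skipping any marked neutral for the domain."""
    counts = Counter()
    for review in reviews:
        for word in re.findall(r"[a-z']+", review.lower()):
            if word in domain_neutral:
                continue
            if word in POSITIVE or word in NEGATIVE:
                counts[word] += 1
    return counts


# First pass: see which words dominate the matches.
print(sentiment_word_counts(reviews).most_common())

# Second pass: treat "sucks" as neutral for pump reviews.
print(sentiment_word_counts(reviews, domain_neutral={"sucks"}).most_common())
```

In the first pass “sucks” accounts for three of the five matches, which is the signal that it deserves a manual check before it is allowed to influence the score.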
For further reading on this field, it is recommended that you look at the work of Bing Liu, a professor at the University of Illinois at Chicago and an expert in this field, at http://www.cs.uic.edu/~liub/.
In this book, we will be discussing how to perform a simple, word-based Sentiment Analysis exercise using the HDInsight platform on Windows Azure. This process will consist of several steps:
Configuring an HDInsight cluster is an exercise that demonstrates the capacity of the cloud to deliver infrastructure simply and quickly. Provisioning a nine-node cluster (one head node and eight worker nodes) can take as little as 15 minutes.
HDInsight is delivered as part of the range of services available through the Windows Azure platform. HDInsight was formally launched as a publicly available service in October 2013.
Once access to the program is granted, HDInsight appears in the selection of available services:

HDInsight from the Azure portal
To create a cluster, select the HDInsight Service option. You will be directed to the Quick Create option, which creates a cluster using some basic presets. Cluster sizes range from four to 32 nodes. You will need an Azure storage account in the same region as your HDInsight cluster to hold your data. This will be discussed in a later section.

While you may be tempted to create the biggest cluster possible, a 32-node cluster could cost US$261.12 per day to run and may not necessarily give you a performance boost depending on how your job is configured.[5]
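To put the daily figure in perspective, it can be converted into rough hourly and monthly equivalents. The sketch below does only that arithmetic: it assumes the quoted rate stays constant and that the cluster runs continuously for a 30-day month, and it is not a statement of current Azure pricing.

```python
from decimal import Decimal

# Daily cost of a 32-node cluster, as quoted in the text.
DAILY_COST_USD = Decimal("261.12")

# Assumptions: the cluster is billed for all 24 hours and is never shut down.
hourly = DAILY_COST_USD / 24
monthly = DAILY_COST_USD * 30  # illustrative 30-day month

print(f"Per hour: US${hourly}")            # Per hour: US$10.88
print(f"Per 30-day month: US${monthly}")   # Per 30-day month: US$7833.60
```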
If you opt to custom create, you gain flexibility over your HDInsight version, the exact number of nodes, and the location. You can also select Azure SQL for a Hive and Oozie metastore, and you get more options over storage accounts, including selecting multiple accounts.