CHAPTER 6
The goal of sentiment analysis is to examine text data and predict if the mood is positive or negative. For example, you might want to programmatically process email messages from users of some product to predict if the sender is happy or not happy with the product.

Figure 6-1: Sentiment Analysis using Keras
The screenshot in Figure 6-1 shows a demonstration of sentiment analysis. The demo program begins by loading 620 training data items and 667 test data items into memory. Each item is a movie review with up to 50 words, where the review can be positive (1) or negative (0).
Behind the scenes, the demo program creates an LSTM (long short-term memory) neural network. The LSTM network has an embedding layer that converts each word in a review into a numeric vector with 32 values. The LSTM network has a memory cell size of 100. The network has a total of 4,209,845 weight and bias values that must be determined.
The LSTM model is trained using five epochs (a small number, so that the screenshot in Figure 6-1 stays compact). After training, the demo program computes the model's accuracy on the test data (81.71 percent, or about 545 of the 667 test items correct). The demo concludes by making a prediction for a new, previously unseen review of "the movie was a great waste of my time," and correctly predicts the sentiment is negative.
The IMDB (Internet Movie Database) movie review dataset consists of a total of 50,000 reviews. There are 25,000 training reviews and 25,000 test reviews. Each set has 12,500 positive reviews and 12,500 negative reviews. The raw data was collected as part of a research project and can be found here.
Getting the raw data into a usable format is a major challenge because the data is structured as one file per review. Here's an example of a positive review:
When I read other comment,i decided to watch this movie...<br /><br />First, cast specially Michael Madsen and Tamer Karadagli; good enough...<br /><br />Film,very intelligence and interesting because ,cast have a lot of international specially European actor and actress like from Turkey and Russsia...<br /><br />Second,Story is basic and you can guess but if you interesting action good play you'll like in my opinion...<br /><br />Third,Final chapter is not special or interesting,it's regular like other action movies...<br /><br />Finally,i recommend to watch this movie...And i hope You'll love it enjoy :D
Notice that there are misspelled words, incorrect grammar and punctuation, inconsistent capitalization, embedded HTML <br/> tags, and other factors to deal with. When working with natural language problems, the data preprocessing steps can often take 90 percent or more of the time and effort required to build a predictive model.
The Keras library has a built-in version of the IMDB dataset that can be loaded into memory like this:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()
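For example, a quick sketch like the following (not part of the demo program) lets you inspect what load_data() returns:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()
print(len(x_train), len(x_test))  # 25,000 reviews in each set
print(x_train[0][:10])            # first ten word indexes of the first training review
print(y_train[0])                 # label: 0 = negative, 1 = positive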
However, using this approach is somewhat artificial, in my opinion, and hides many important details. For simplicity, I created a file of training data and a file of test data where each movie review is up to 50 words in length. The resulting data looks like this:
0 0 0 0 0 0 13 510 4 115 1331 363 . . . 1708 298 0
0 0 12 28 111 6 172 7 32188 9 4 88 31 . . . 1487 151 0
Each line is one review. The first few values on each line are zeroes for padding so that all reviews have exactly 50 values. The last value on each line is the sentiment: 0 for a negative review, and 1 for a positive review.
Each word is encoded using the same scheme as used by the built-in Keras IMDB dataset. Values 0 to 3 have special meaning. A value of 0 is used for padding. A value of 1 is used to indicate the start of a review in situations where the data is not delimited by newlines. A value of 2 is used for out-of-vocabulary (OOV)—words in the test data that were not seen in the training data. A value of 3 is reserved for custom usage.
Additionally, all words are converted to lower case, and all punctuation is removed, except for the single quote character, which is important for contractions like don't and wouldn't.
Each word ID is based upon the frequency of the word in the training data, where 4 is the most frequent ("the"), 5 is the second most frequent ("a"), and so on. The training data has a total of 129,888 distinct words, so the last word in the vocabulary has index 129,888 + 4 - 1 = 129,891.
As you'll see shortly, when using LSTM networks for natural language problems such as sentiment analysis, there's a tight coupling between data encoding and the LSTM network, and you need to know exactly how words are indexed.
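To make the indexing scheme concrete, here is a short sketch (not part of the demo) that encodes a review using the built-in Keras word index, which maps "the" to 1; adding the offset of 3 for the special values gives the scheme described above, where "the" becomes 4. Because the demo builds its vocabulary from its own training data, the exact index values for less frequent words may differ, but the scheme is the same.
from keras.datasets import imdb
word_to_rank = imdb.get_word_index()  # built-in index: "the" -> 1, "and" -> 2, ...

def encode(review, max_len=50):
  ids = []
  for w in review.lower().split():
    rank = word_to_rank.get(w)                   # None if word is out-of-vocabulary
    ids.append(2 if rank is None else rank + 3)  # 2 = OOV marker, offset of 3 for specials
  ids = ids[:max_len]
  return [0] * (max_len - len(ids)) + ids        # left-pad with 0 values

print(encode("the movie was a great waste of my time"))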
The complete program that generated the output shown in Figure 6-1 is shown in Code Listing 6-1. The program begins with comments for the program file name and versions of Python, TensorFlow, and Keras used, and then imports the NumPy, Keras, TensorFlow, and OS packages:
# imdb_lstm.py
# Python 3.5.2, TensorFlow 1.7.0, Keras 2.1.5
import numpy as np
import keras as K
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
In a non-demo scenario, you'd want to include additional details in the comments. Because Keras and TensorFlow are under rapid development, you should always document which versions are being used. Version incompatibilities can be a significant problem when working with Keras and other open-source software.
Code Listing 6-1: IMDB Movie Review Sentiment Analysis Program
# imdb_lstm.py
# Python 3.5.2, TensorFlow 1.7.0, Keras 2.1.5
# ==================================================================================

import numpy as np
import keras as K
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

def main():
  # 0. get started
  print("\nIMDB sentiment analysis using Keras/TensorFlow ")
  np.random.seed(1)
  tf.set_random_seed(1)

  # 1. load data
  max_review_len = 50
  print("Loading train and test data, max len = %d words\n" % max_review_len)
  train_x = np.loadtxt(".\\Data\\imdb_train_50w.txt", delimiter=" ",
    usecols=range(0,max_review_len), dtype=np.float32)
  train_y = np.loadtxt(".\\Data\\imdb_train_50w.txt", delimiter=" ",
    usecols=[max_review_len], dtype=np.float32)
  test_x = np.loadtxt(".\\Data\\imdb_test_50w.txt", delimiter=" ",
    usecols=range(0,max_review_len), dtype=np.float32)
  test_y = np.loadtxt(".\\Data\\imdb_test_50w.txt", delimiter=" ",
    usecols=max_review_len, dtype=np.float32)

  # 2. define model
  e_init = K.initializers.RandomUniform(-0.01, 0.01, seed=1)
  init = K.initializers.glorot_uniform(seed=1)
  simple_adam = K.optimizers.Adam()
  nw = 129892         # must be > vocabulary size (don't forget +4)
  embed_vec_len = 32  # values per word -- 100-500 is typical
  model = K.models.Sequential()
  model.add(K.layers.embeddings.Embedding(input_dim=nw,
    output_dim=embed_vec_len, embeddings_initializer=e_init,
    mask_zero=True))
  model.add(K.layers.LSTM(units=100, kernel_initializer=init,
    dropout=0.2))
  model.add(K.layers.Dense(units=1, kernel_initializer=init,
    activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer=simple_adam,
    metrics=['acc'])
  print(model.summary())

  # 3. train model
  bat_size = 10
  max_epochs = 5
  print("\nStarting training ")
  model.fit(train_x, train_y, epochs=max_epochs, batch_size=bat_size,
    shuffle=True, verbose=1)
  print("Training complete \n")

  # 4. evaluate model
  loss_acc = model.evaluate(test_x, test_y, verbose=0)
  print("Test data: loss = %0.6f accuracy = %0.2f%% " % \
    (loss_acc[0], loss_acc[1]*100))

  # 5. save model
  print("Saving model to disk \n")
  mp = ".\\Models\\imdb_model.h5"
  model.save(mp)

  # 6. use model
  print("Sentiment for \"the movie was a great waste of my time\"")
  rev = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 4, 20, 16, 6, 86, 425, 7, 58, 64]], dtype=np.float32)
  prediction = model.predict(rev)
  print("Prediction (0 = negative, 1 = positive) = ", end="")
  print("%0.4f" % prediction[0][0])

# ==================================================================================
if __name__ == "__main__":
  main()
The program imports the entire Keras package and assigns an alias K. An alternative approach is to import only the modules you need, for example:
from keras.models import Sequential
from keras.layers import Dense, Activation
Even though Keras uses TensorFlow as its backend engine, you don't need to explicitly import TensorFlow, except in order to set its random seed. The OS package is imported only so that an annoying TensorFlow startup warning message will be suppressed.
The program structure consists of a single main function, with no helper functions. The program begins with:
def main():
# 0. get started
print("\nIMDB sentiment analysis using Keras/TensorFlow ")
np.random.seed(1)
tf.set_random_seed(1)
# 1. load data
max_review_len = 50
print("Loading train and test data, max len = %d words\n" % max_review_len)
train_x = np.loadtxt(".\\Data\\imdb_train_50w.txt", delimiter=" ",
usecols=range(0,max_review_len), dtype=np.float32)
train_y = np.loadtxt(".\\Data\\imdb_train_50w.txt", delimiter=" ",
usecols=[max_review_len], dtype=np.float32)
. . .
In most situations, you want to make your results reproducible. The Keras library makes extensive use of the NumPy global random-number generator, so it's good practice to set the seed value. The seed value used in the program, 1, is arbitrary. Similarly, because Keras uses TensorFlow, you'll usually want to set its seed, too. However, even if you set all random seeds, program results typically aren't completely reproducible, due in part to the order in which parallelized tasks perform numeric rounding.
I indent with two spaces rather than the normal four spaces because of page-width limitations. All normal error-checking has been removed to keep the main ideas as clear as possible.
The test data is read into memory using the same technique:
test_x = np.loadtxt(".\\Data\\imdb_test_50w.txt", delimiter=" ",
usecols=range(0,max_review_len), dtype=np.float32)
test_y = np.loadtxt(".\\Data\\imdb_test_50w.txt", delimiter=" ",
usecols=max_review_len, dtype=np.float32)
The program assumes that the training and test data files are located in a subdirectory named Data. Notice that the program contains no documentation of the structure of the data files. I strongly recommend that you include program comments describing your data format. Data format information is easy to remember when you're writing a program, but difficult to recall a couple of weeks later.
The training data is read into memory using the NumPy loadtxt() function. There are many ways to read data into memory, but the loadtxt() function is versatile enough to meet most problem scenarios. The NumPy genfromtxt() function is very similar but gives you a few additional options, such as dealing with missing data. The loadtxt() function has a large number of parameters, but in most cases you only need usecols, delimiter, and dtype.
Notice that usecols can accept a list such as [max_review_len] or a Python range such as range(0,max_review_len). If you use the range() function, be careful to remember that the first parameter is inclusive, but the second parameter is exclusive.
The default dtype parameter value is numpy.float, which is an alias for the Python float type and is the same as numpy.float64. However, the default data type for almost all Keras functions is numpy.float32, so the program specifies this type explicitly. The idea is that for the majority of machine learning problems, the precision gained by using 64-bit values is not worth the memory and performance penalty.
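A quick back-of-the-envelope sketch (not part of the demo) illustrates the memory difference for a 25,000-by-50 matrix:
import numpy as np
a64 = np.zeros((25000, 50), dtype=np.float64)
a32 = np.zeros((25000, 50), dtype=np.float32)
print(a64.nbytes)  # 10,000,000 bytes
print(a32.nbytes)  #  5,000,000 bytes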
Instead of using a NumPy function such as loadtxt() to read data into memory, a different approach is to use the Pandas ("panel data" or "Python Data Analysis Library") library, which has many advanced data manipulation features. However, Pandas has a significant learning curve.
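For comparison, here is a hedged sketch of how the same file might be read with Pandas; the read_csv() call and column slicing shown here are just one possible approach, not the demo's code:
import numpy as np
import pandas as pd

df = pd.read_csv(".\\Data\\imdb_train_50w.txt", sep=" ", header=None,
  dtype=np.float32)
train_x = df.iloc[:, 0:50].values  # first 50 columns are the padded word indexes
train_y = df.iloc[:, 50].values    # last column is the 0/1 sentiment label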
The program defines an LSTM neural network using this code:
# 2. define model
e_init = K.initializers.RandomUniform(-0.01, 0.01, seed=1)
init = K.initializers.glorot_uniform(seed=1)
simple_adam = K.optimizers.Adam()
nw = 129892         # must be > vocabulary size (don't forget +4)
embed_vec_len = 32  # values per word -- 100-500 is typical
model = K.models.Sequential()
model.add(K.layers.embeddings.Embedding(input_dim=nw, output_dim=embed_vec_len,
embeddings_initializer=e_init, mask_zero=True))
model.add(K.layers.LSTM(units=100, kernel_initializer=init, dropout=0.2))
model.add(K.layers.Dense(units=1, kernel_initializer=init, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=simple_adam, metrics=['acc'])
print(model.summary())
There's a lot going on here, so bear with me. The LSTM has two major components and several minor components. The first major component is the Embedding() layer. When working with natural language, you can feed word indexes such as 4 for "the" and 20 for "movie" directly to an LSTM network. However, this approach doesn't give very good results. A better approach is to convert each word index into a vector of real values such as (0.4508, 1.3233, . . ., 0.9305).
The vectors must be constructed in a way so that words that are close semantically, such as "excellent" and "wonderful," have vectors that are close numerically. There are three major ways to construct a set of word embeddings. First, you can create a custom set of embeddings based on your training data, using a separate tool, such as the Word2Vec library. Second, you can use a set of pre-built word embeddings based on a standard corpus, such as a large news feed of several hundred thousand stories from Google, or the text of all Wikipedia articles. The demo program uses a third approach, which is to compute the word embeddings on the fly, using the training data. This is a difficult problem, and it accounts for 4,156,544 of the model's 4,209,845 weights and biases.
Notice that the Embedding() constructor's input_dim parameter must be at least the maximum word index plus 1. The largest index in the demo training data is 129,891, so the demo passes 129,892; you could pass a larger value if you want to reserve extra indexes. The demo program specifies an embedding vector length of 32. This value is a free parameter. For larger problems, a typical vector length is 100 to 500. Table 6-1 summarizes the seven parameters for an Embedding() layer.
Table 6-1: Embedding Layer Parameters
| Name | Description |
|---|---|
| input_dim | Size of the vocabulary, i.e., maximum integer index + 1 |
| output_dim | Dimension of the dense embedding |
| embeddings_initializer | Initializer for the embeddings matrix |
| embeddings_regularizer | Regularizer function applied to the embeddings matrix |
| embeddings_constraint | Constraint function applied to the embeddings matrix |
| mask_zero | Whether or not the input value 0 is a padding value |
| input_length | Length of input sequences, when it is constant |
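If you want to verify where the model's 4,209,845 trainable values come from, a small arithmetic sketch using the demo's layer sizes reproduces the count reported by summary():
embed_params = 129892 * 32                # Embedding: input_dim * output_dim = 4,156,544
lstm_params = 4 * ((32 + 100 + 1) * 100)  # LSTM: 4 gates * (inputs + units + bias) * units = 53,200
dense_params = 100 * 1 + 1                # Dense: 100 weights + 1 bias = 101
print(embed_params + lstm_params + dense_params)  # 4,209,845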
The second major component of the LSTM network is the LSTM() layer. LSTMs are fantastically complex software modules, but the key idea is that they have a memory, or equivalently, they have state. Suppose you knew that one word of a sentence was "few" and you wanted to predict the next word. You'd certainly have to take a wild guess. But if you knew the previous words were "You can't make an omelet without breaking a few…", then you'd almost surely predict the next word to be "eggs." In short, LSTM networks have state and can work well for sequences of input words.
You can get a rough idea of what an LSTM cell is by examining the diagram in Figure 6-2.

Figure 6-2: A Simplified LSTM Cell
In Figure 6-2, x(t) is the input at time t and h(t) is the corresponding output. The vector c(t) is the cell state, or memory. The output, h(t), depends on the current input and the cell state. The internal plumbing of an LSTM cell is very complex, but fortunately, when using Keras you only need a few of the 23 LSTM() parameters.
The demo program specifies the memory cell size via the units=100 argument. Memory size is a free parameter. Because of the internal complexity of an LSTM() layer, you can't apply dropout to it by adding a standard Dropout() layer, so the LSTM() layer has a built-in dropout mechanism, which the demo uses via the dropout=0.2 argument.
After the LSTM() layer, the model has a single Dense() layer with sigmoid activation. The idea here is to compress the output of the LSTM() layer down to a single value between 0.0 and 1.0, which can be interpreted as the probability that the predicted class = 1. Put another way, this means that if the output is less than 0.5, the model predicts 0 = negative sentiment; otherwise, the model predicts 1 = positive sentiment.
The LSTM model is compiled using binary cross entropy as the loss function because the class labels are 0 or 1. In sentiment analysis scenarios where you have three or more class labels, such as negative = (1, 0, 0), neutral = (0, 1, 0) and positive = (0, 0, 1), you would change the activation function on the last network layer from sigmoid to softmax, and use categorical cross entropy for the loss function.
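For example, here is a hedged sketch of the two demo statements that would change for a three-class problem; it assumes one-hot encoded labels and the same init and simple_adam objects defined earlier in the program:
model.add(K.layers.Dense(units=3, kernel_initializer=init,
  activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=simple_adam,
  metrics=['acc'])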
You can loosely think of the compilation process as translating Keras code into TensorFlow code (or CNTK code or Theano code). You must pass values to the optimizer and loss parameters so that the fit() method will know how to train the model. The metrics parameter is optional. The program passes a Python list containing just 'acc' to indicate that classification accuracy (percentage correct predictions) should be computed during training.
The demo program displays a summary of the LSTM model using the summary() function. The primary purpose of using summary() is to check how many weights and biases your model has, which gives you an idea of how long training will take. It's possible to construct deep networks that just aren't trainable because they have too many weights and biases.
After training data has been read into memory and the LSTM network has been created, the demo program trains the model using these statements:
# 3. train model
bat_size = 10
max_epochs = 5
print("\nStarting training ")
model.fit(train_x, train_y, epochs=max_epochs, batch_size=bat_size,
shuffle=True, verbose=1)
print("Training complete \n")
The batch size is set to 10, so training is performed in mini-batches of 10 items (using a batch size of 1 is called online training). The batch size is a free parameter that must be determined by trial and error. Some of my colleagues like to use powers of two for their batch size: 4, 8, 16, and so on, but there is no research evidence that I'm aware of that indicates this practice is better or worse. As a general rule of thumb, LSTM neural networks are very sensitive to the batch size.
The max_epochs variable controls how many iterations will be used for training. The shuffle parameter in the fit() function indicates that the training items should be processed in random order. The default value is True, so the parameter could have been omitted. The verbose parameter controls how much information to display during training: 0 means display no information, 1 means display full information, and 2 means display a medium amount of information.
The fit() function returns a History object whose history attribute is a dictionary of the per-epoch loss and metric values. The demo program does not capture this return value.
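If you wanted to capture it, a minimal sketch (using the demo's variables) would be:
hist = model.fit(train_x, train_y, epochs=max_epochs, batch_size=bat_size,
  shuffle=True, verbose=1)
print(hist.history['loss'])  # list of per-epoch training loss values
print(hist.history['acc'])   # list of per-epoch training accuracy values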
After training, the demo program evaluates the model on the test data:
# 4. evaluate model
loss_acc = model.evaluate(test_x, test_y, verbose=0)
print("Test data: loss = %0.6f accuracy = %0.2f%% " % \
(loss_acc[0], loss_acc[1]*100))
The evaluate() function returns a list of values. The first value, at index [0], is always the value of the loss function specified in the compile() call, binary cross entropy in this case. Other values in the list are any optional metrics from the compile() call. In this example, 'acc' was passed, so the value at index [1] holds the classification accuracy. The program multiplies by 100 to convert accuracy from a proportion (like 0.8123) to a percentage (like 81.23 percent).
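If you forget the order of the returned values, the model's metrics_names attribute lists them, for example:
print(model.metrics_names)  # ['loss', 'acc']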
In most situations you'll want to save a trained model, especially if the training took hours or even longer. The demo program saves the trained model like so:
# 5. save model
print("Saving model to disk \n")
mp = ".\\Models\\imdb_model.h5"
model.save(mp)
The Keras save() function saves a trained model using the hierarchical data format (HDF) version 5. It is a binary format, so saved models can't be inspected with a text editor. In addition to saving an entire model, you can save just the model weights and biases, which is sometimes useful. You can also save just the model architecture, without the weights.
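A brief sketch of those two alternatives (the file names here are hypothetical):
model.save_weights(".\\Models\\imdb_weights.h5")  # weights and biases only
arch_json = model.to_json()                       # architecture only, as a JSON string
with open(".\\Models\\imdb_arch.json", "w") as f:
  f.write(arch_json)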
You can load a saved Keras model from a different program like this:
print("Loading saved IMDB sentiment model")
mp = ".\\Models\\imdb_model.h5"
model = K.models.load_model(mp)
The whole point of creating and training a model is so that it can be used to make predictions for new, previously unseen data:
# 6. use model
print("Sentiment for \"the movie was a great waste of my time\"")
rev = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 4, 20, 16, 6, 86, 425, 7, 58, 64]], dtype=np.float32)
prediction = model.predict(rev)
print("Prediction (0 = negative, 1 = positive) = ", end="")
print("%0.4f" % prediction[0][0])
Because the LSTM model was trained using reviews that have length padded to 50 encoded words, when making a prediction you must pass a new review to the predict() method using the same format. The encoded values for "the movie was a great waste of my time" were hard-coded. However, in a non-demo scenario, when you create the training and test data files, you would save the encodings to a text file, typically named something like vocab.txt, along the lines of:
the 4
waste 425
time 64
. . .
Then you could write a script that opens the vocabulary file and reads it into a dictionary object, where a word is the dictionary key, and the encoded index is the dictionary value.
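A hedged sketch of such a script, assuming a hypothetical vocab.txt with one "word index" pair per line:
vocab = {}
with open(".\\Data\\vocab.txt", "r") as f:
  for line in f:
    (word, idx) = line.split()
    vocab[word] = int(idx)

review = "the movie was a great waste of my time"
ids = [vocab.get(w, 2) for w in review.lower().split()]  # 2 = out-of-vocabulary
ids = [0] * (50 - len(ids)) + ids                        # left-pad to 50 values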
To create a classification prediction model where the input is a sequence of text such as sentences, you can use an LSTM network that consists of one or more LSTM cells plus some additional plumbing such as a dense layer.
When working with text input, words should be encoded as numeric vectors, a process called embedding. You can either create embeddings in a preprocessing phase, or you can create an embedding on the fly using an Embedding() layer.
Free parameters for LSTM models include the weight-initialization algorithm, the optimization algorithm and its parameters, the embedding vector length, the memory cell size, the dropout rate, the batch size, and the number of training epochs.
You can find the training and test data used by the demo program here.
The demo program uses just three of the 23 parameters for the LSTM() constructor. You can find additional information here.