CHAPTER 5
This chapter explains how to perform binary classification (the variable to predict can take one of just two possible values) using a neural network. Neural network binary classification is significantly more powerful than logistic regression binary classification, at the expense of a moderate increase in complexity.

Figure 5-1: Banknote Dataset Classification
If you’re new to machine learning, your first thought might be something like, “Binary classification using a neural network is no different than multi-class classification—there’s just two output nodes instead of three or more.” And you’d be mostly correct. However, the two side-by-side screenshots in Figure 5-1 indicate that there are two different techniques for neural network binary classification.
The two programs shown in Figure 5-1 work on the same raw data. The goal is to create a classification model that predicts whether a banknote (think a dollar bill or a euro) is a forgery or is authentic, based on four predictor variables: variance, skewness, kurtosis, and entropy. The program shown on the left uses essentially the same technique that we used in Chapter 4 to classify an iris flower as setosa, versicolor, or virginica. For neural network binary classification, this is called the two-node technique. The program shown on the right uses a significantly different approach, called the one-node technique.
Let me cut to the chase and state that in my opinion, the two-node technique is preferable to the one-node technique. But, for historical reasons, the neural network one-node technique is more common. You’ll almost certainly encounter the one-node technique, and should understand how it works.
Recall that the iris data was encoded as setosa = (1, 0, 0), versicolor = (0, 1, 0), virginica = (0, 0, 1). For the two-node technique, a banknote is encoded as forgery = (1, 0), authentic = (0, 1). After training, the two-node model is fed inputs (0.6, 1.9, -3.3, -0.3), and the prediction probabilities are (0.7678, 0.2322). This maps to (1, 0), and so the prediction is forgery/fake. In short, the neural network two-node binary classification technique is essentially the same as multi-class neural network classification.
The program shown on the right of Figure 5-1 uses a neural network with just a single output node. As you’ll see, this requires a change in the output layer activation function, a change in the training error function, and a program-defined classification accuracy function. For the one-node technique, authentic is encoded as 0, and forgery is encoded as 1. After training, the one-node model is fed the same inputs, (0.6, 1.9, -3.3, -0.3). The single output probability is 0.8468, which maps to 1, so the prediction is forgery/fake.
Both techniques give a similar quality prediction model—85% classification accuracy on the test data. Notice that during training, both techniques have roughly the same cross-entropy error, but that the one-node technique requires twice as many training iterations, 1000 versus 500, as the two-node technique. However, the one-node technique only updates half as many hidden-to-output node weights per iteration, so the increase in number of iterations is offset by faster training per iteration. The bottom line is that there’s no significant technical advantage to either of the techniques. I prefer the two-node technique because it’s slightly simpler, in my opinion.
The raw banknote data looks like:
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
. . .
-1.3971,3.3191,-1.3927,-1.9948,1
0.39012,-0.14279,-0.031994,0.35084,1
You can find this dataset in the UCI Machine Learning Repository (the banknote authentication dataset). The full dataset has 1,372 items. For simplicity, I selected just the first 50 authentic items (class forgery = 0) and the first 50 fake items (class forgery = 1). I wrote a short helper program to convert the raw data into two-node CNTK format that looks like:
|stats 3.62160000 8.66610000 -2.80730000 -0.44699000 |forgery 0 1 |# authentic
|stats 4.54590000 8.16740000 -2.45860000 -1.46210000 |forgery 0 1 |# authentic
. . .
|stats -1.39710000 3.31910000 -1.39270000 -1.99480000 |forgery 1 0 |# fake
|stats 0.39012000 -0.14279000 -0.03199400 0.35084000 |forgery 1 0 |# fake
The code for the helper program is shown in Code Listing 5-1. Because there are only 100 lines of data, the helper program uses print() to emit output to the shell in which it’s running. For a larger dataset, you’d open a text file for writing and use the write() function instead, as sketched after the listing.
I scraped the shell output and copied 80 items (40 authentic, 40 forgery) into a training file, and 20 items (10 authentic, 10 forgery) into a test file. The CNTK-formatted data can be found in the Appendix to this e-book.
Code Listing 5-1: Helper to Create CNTK-Format Data from Raw Data
# make_banknote_data.py
# input: raw banknote_100.txt
# output: banknote data in CNTK two-node format to screen
# for scraping (manually divide into train/test)

fin = open(".\\banknote_100_raw.txt", "r")
for line in fin:
  line = line.strip()
  tokens = line.split(",")
  if tokens[4] == "0":
    print("|stats %12.8f %12.8f %12.8f %12.8f |forgery 0 1 |# authentic" % \
      (float(tokens[0]), float(tokens[1]), float(tokens[2]), float(tokens[3])) )
  else:
    print("|stats %12.8f %12.8f %12.8f %12.8f |forgery 1 0 |# fake" % \
      (float(tokens[0]), float(tokens[1]), float(tokens[2]), float(tokens[3])) )
fin.close()
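As mentioned just before the listing, for a larger dataset you’d write directly to a file rather than scrape the shell output. Here is a minimal sketch of that variation; the output file name is my assumption and not part of Code Listing 5-1:

# write_banknote_data.py -- sketch: file-writing variation of Code Listing 5-1
fin = open(".\\banknote_100_raw.txt", "r")
fout = open(".\\banknote_100_cntk.txt", "w")   # assumed output file name
for line in fin:
  tokens = line.strip().split(",")
  if tokens[4] == "0":
    tag = "|forgery 0 1 |# authentic"
  else:
    tag = "|forgery 1 0 |# fake"
  fout.write("|stats %12.8f %12.8f %12.8f %12.8f %s\n" % \
    (float(tokens[0]), float(tokens[1]), float(tokens[2]), float(tokens[3]), tag))
fout.close()
fin.close()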
Because there are four predictors/features, it’s not feasible to graph the data. However, you can get a rough idea of the data’s structure by examining the two-dimensional graph, based on just variance and skewness, of the 80-item training data as shown in Figure 5-2. The data is not linearly separable, so logistic regression would not work well.
The program that produced the output shown in the left side of Figure 5-1 is presented in Code Listing 5-2. If you scan through the code listing, you’ll see there’s very little difference between two-node binary classification and three-node (or more) multi-class classification. In the program-defined create_reader() function, the field properties are changed to correspond to the stats and forgery tags in the training and test data files:
x_strm = C.io.StreamDef(field='stats', shape=input_dim, is_sparse=False)
y_strm = C.io.StreamDef(field='forgery', shape=output_dim, is_sparse=False)
The neural network output dimension is changed to 2 for the two-node technique:
input_dim = 4
hidden_dim = 10
output_dim = 2
As always, determining the number of hidden nodes to use is a matter of trial and error. As in multi-class classification, you use the cross_entropy_with_softmax() error metric. Because that function applies softmax to the raw output values for you (CNTK doesn’t have a plain, non-softmax cross-entropy function), you don’t apply an activation function on the output layer:
with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
  hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
    name='hidLayer')(X)
  oLayer = C.layers.Dense(output_dim, activation=None,
    name='outLayer')(hLayer)
nnet = oLayer  # train this
model = C.ops.softmax(nnet)  # predict with this
Note that because Python assigns by reference rather than by value, when you train the nnet object, the model object will be updated, too.
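To make the softmax step concrete, here’s a minimal NumPy sketch (the raw output node values are made up) showing how softmax converts the two raw output values of nnet into the probabilities produced by model:

import numpy as np

z = np.array([1.5, 0.3], dtype=np.float32)  # hypothetical raw output node values
probs = np.exp(z) / np.sum(np.exp(z))       # softmax: exp of each value / sum of all exps
print(probs)                                # approx [0.7685 0.2315] -- sums to 1.0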

Figure 5-2: Partial Banknote Data
As in multi-class classification, you set up objects to monitor cross-entropy error and classification error/accuracy:
tr_loss = C.cross_entropy_with_softmax(nnet, Y)
tr_clas = C.classification_error(nnet, Y)
This is significant, because as you’ll see shortly, you can’t use the built-in classification error function when using the one-node technique.
Code Listing 5-2: Two-Node Technique Binary Classification
# banknote_bnn.py
# CNTK 2.3 with Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)
# Use a one-hidden layer simple NN with 10 hidden nodes
# banknote_train_cntk.txt - 80 items (40 authentic, 40 fake)
# banknote_test_cntk.txt - 20 items (10 authentic, 10 fake)

import numpy as np
import cntk as C

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  # rnd_order -> usually True for training
  # sweeps -> usually C.io.INFINITELY_REPEAT for training OR 1 for eval
  x_strm = C.io.StreamDef(field='stats', shape=input_dim, is_sparse=False)
  y_strm = C.io.StreamDef(field='forgery', shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
  return mb_src

# ==================================================================================

def main():
  print("\nBegin banknote binary classification (two-node technique) \n")
  print("Using CNTK version = " + str(C.__version__) + "\n")

  input_dim = 4
  hidden_dim = 10
  output_dim = 2

  train_file = ".\\Data\\banknote_train_cntk.txt"
  test_file = ".\\Data\\banknote_test_cntk.txt"

  # two-node data files:
  # |stats 4.17110 8.72200 -3.02240 -0.59699 |forgery 0 1 |# authentic
  # |stats -0.20620 9.22070 -3.70440 -6.81030 |forgery 0 1 |# authentic
  # . . .
  # |stats 0.60050 1.93270 -3.28880 -0.32415 |forgery 1 0 |# fake
  # |stats 0.91315 3.33770 -4.05570 -1.67410 |forgery 1 0 |# fake

  # 1. create network
  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)

  print("Creating a 4-10-2 tanh-softmax NN for partial banknote dataset ")
  with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
      name='hidLayer')(X)
    oLayer = C.layers.Dense(output_dim, activation=None,
      name='outLayer')(hLayer)
  nnet = oLayer
  model = C.ops.softmax(nnet)

  # 2. create learner and trainer
  print("Creating an ordinary cross entropy batch=10 SGD LR=0.01 Trainer ")
  tr_loss = C.cross_entropy_with_softmax(nnet, Y)  # not model!
  tr_clas = C.classification_error(nnet, Y)

  max_iter = 500
  batch_size = 10
  learn_rate = 0.01
  learner = C.sgd(nnet.parameters, learn_rate)
  trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])

  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  banknote_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }

  # 4. train
  print("\nStarting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, input_map=banknote_input_map)
    trainer.train_minibatch(curr_batch)
    if i % 50 == 0:
      mcee = trainer.previous_minibatch_loss_average
      macc = (1.0 - trainer.previous_minibatch_evaluation_average) * 100
      print("batch %4d: mean loss = %0.4f, accuracy = %0.2f%% " \
        % (i, mcee, macc))
  print("\nTraining complete")

  # 5. evaluate model using test data
  print("\nEvaluating test data using built-in test_minibatch() \n")
  rdr = create_reader(test_file, input_dim, output_dim,
    rnd_order=False, sweeps=1)
  banknote_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  num_test = 20
  all_test = rdr.next_minibatch(num_test, input_map=banknote_input_map)
  acc = (1.0 - trainer.test_minibatch(all_test)) * 100
  print("Classification accuracy on the 20 test items = %0.2f%%" % acc)

  # (could save model here)

  # 6. use trained model to make prediction
  np.set_printoptions(precision = 1, suppress=True)
  unknown = np.array([[0.6, 1.9, -3.3, -0.3]], dtype=np.float32)  # likely 1 0 = fake
  print("\nPredicting banknote authenticity for input features: ")
  print(unknown[0])

  pred_prob = model.eval({X: unknown})
  np.set_printoptions(precision = 4, suppress=True)
  print("Prediction probabilities are: ")
  print(pred_prob[0])
  if pred_prob[0,0] < pred_prob[0,1]:  # maps to (0,1)
    print("Prediction: authentic")
  else:  # maps to (1,0)
    print("Prediction: fake")

  print("\nEnd banknote classification ")

# ==================================================================================

if __name__ == "__main__":
  main()
Training a two-node technique binary classifier neural network is exactly the same as training a multi-class network. When using a trained model to make a prediction, you have to map the two prediction probabilities to a predicted class:
unknown = np.array([[0.6, 1.9, -3.3, -0.3]], dtype=np.float32)
pred_prob = model.eval({X: unknown})
print("Prediction probabilities are: ")
print(pred_prob[0])
if pred_prob[0,0] < pred_prob[0,1]:  # maps to (0,1)
  print("Prediction: authentic")
else:  # maps to (1,0)
  print("Prediction: fake")
The return result from a call to the eval() function is an array-of-arrays, like [[ 0.7678, 0.2322 ]]. By selecting index [0] you get a single array with the two prediction probabilities, like [0.7678, 0.2322]. If the first of the two probabilities is less than the second, the prediction maps to the class that is encoded (0, 1); otherwise, the prediction maps to the class encoded as (1, 0).
Because the outputs are probabilities and there are only two values, you could also map a predicted class like this:
if pred_prob[0,0] < 0.5:  # maps to (0,1)
  print("Prediction: authentic")
else:  # maps to (1,0)
  print("Prediction: fake")
When using two-node neural network binary classification, you can encode either class as (0, 1), but it’s up to you to maintain the encoding meaning—it’s surprisingly easy to botch this.
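One simple way to keep the encoding meaning in a single place is to store the class names in an order that matches the one-hot encoding and index them with the position of the largest probability. This is just a sketch of an alternative mapping; the labels tuple is mine and is not part of Code Listing 5-2:

import numpy as np

labels = ("fake", "authentic")            # index 0 -> (1, 0) = fake, index 1 -> (0, 1) = authentic
pred_prob = np.array([[0.7678, 0.2322]])  # hypothetical result of model.eval()
idx = np.argmax(pred_prob[0])             # index of the largest probability
print("Prediction: " + labels[idx])       # "fake"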
I prepared the one-node version of the banknote classification technique by modifying the training and test data files, replacing (0, 1) with 0, and (1, 0) with 1:
|stats 3.62160000 8.66610000 -2.80730000 -0.44699000 |forgery 0 |# authentic
|stats 4.54590000 8.16740000 -2.45860000 -1.46210000 |forgery 0 |# authentic
. . .
|stats -1.39710000 3.31910000 -1.39270000 -1.99480000 |forgery 1 |# fake
|stats 0.39012000 -0.14279000 -0.03199400 0.35084000 |forgery 1 |# fake
Because there’s no change to the tag names stats and forgery, there’s no need to modify the program-defined create_reader() function. The complete program that generated the output shown on the right side of Figure 5-1 is presented in Code Listing 5-3.
Code Listing 5-3: One-Node Binary Classification Technique
# banknote_bnn_onenode.py
# CNTK 2.3 with Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)
# Use a one-hidden layer simple NN with 10 hidden nodes
# banknote_train_cntk.txt - 80 items (40 authentic, 40 fake)
# banknote_test_cntk.txt - 20 items (10 authentic, 10 fake)

import numpy as np
import cntk as C

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  # rnd_order -> usually True for training
  # sweeps -> usually C.io.INFINITELY_REPEAT for training OR 1 for eval
  x_strm = C.io.StreamDef(field='stats', shape=input_dim, is_sparse=False)
  y_strm = C.io.StreamDef(field='forgery', shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
  return mb_src

def class_acc(mb, x_var, y_var, model):
  num_correct = 0; num_wrong = 0
  x_mat = mb[x_var].asarray()  # batch_size x 1 x features_dim
  y_mat = mb[y_var].asarray()  # batch_size x 1 x 1
  for i in range(mb[x_var].shape[0]):  # each item in the batch
    p = model.eval(x_mat[i])  # 1 x 1
    y = y_mat[i]              # 1 x 1
    if p[0,0] < 0.5 and y[0,0] == 0.0 or p[0,0] >= 0.5 and y[0,0] == 1.0:
      num_correct += 1
    else:
      num_wrong += 1
  return (num_correct * 100.0) / (num_correct + num_wrong)

# ==================================================================================

def main():
  print("\nBegin banknote binary classification (one-node technique) \n")
  print("Using CNTK version = " + str(C.__version__) + "\n")

  input_dim = 4
  hidden_dim = 10
  output_dim = 1  # NOTE

  train_file = ".\\Data\\banknote_train_cntk_onenode.txt"  # NOTE: different file
  test_file = ".\\Data\\banknote_test_cntk_onenode.txt"    # NOTE

  # one-node data files:
  # |stats 4.17110 8.72200 -3.02240 -0.59699 |forgery 0 |# authentic
  # |stats -0.20620 9.22070 -3.70440 -6.81030 |forgery 0 |# authentic
  # . . .
  # |stats 0.60050 1.93270 -3.28880 -0.32415 |forgery 1 |# fake
  # |stats 0.91315 3.33770 -4.05570 -1.67410 |forgery 1 |# fake

  # 1. create network
  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)

  print("Creating a 4-10-1 tanh-logsig NN for partial banknote dataset ")
  with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
      name='hidLayer')(X)
    oLayer = C.layers.Dense(output_dim, activation=C.ops.sigmoid,
      name='outLayer')(hLayer)  # NOTE: sigmoid activation
  model = oLayer  # alias

  # 2. create learner and trainer
  print("Creating a binary cross entropy batch=10 SGD LR=0.01 Trainer \n")
  tr_loss = C.binary_cross_entropy(model, Y)  # NOTE: use model
  # tr_clas = C.classification_error(model, Y)  # NOTE: not available for one-node

  max_iter = 1000
  batch_size = 10
  learn_rate = 0.01
  learner = C.sgd(model.parameters, learn_rate)  # NOTE: use model
  trainer = C.Trainer(model, (tr_loss), [learner])  # NOTE: no classification error

  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  banknote_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }

  # 4. train
  print("Starting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, input_map=banknote_input_map)
    trainer.train_minibatch(curr_batch)
    if i % 100 == 0:
      mcee = trainer.previous_minibatch_loss_average  # built-in
      ca = class_acc(curr_batch, X, Y, model)         # program-defined
      print("batch %4d: mean loss = %0.4f accuracy = %0.2f%%" % (i, mcee, ca))
  print("\nTraining complete")

  # 5. evaluate test data (cannot use trainer.test_minibatch)
  print("\nEvaluating test data using program-defined class_acc() \n")
  rdr = create_reader(test_file, input_dim, output_dim,
    rnd_order=False, sweeps=1)
  banknote_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  num_test = 20
  all_test = rdr.next_minibatch(num_test, input_map=banknote_input_map)
  acc = class_acc(all_test, X, Y, model)
  print("Classification accuracy on the 20 test items = %0.2f%%" % acc)

  # (could save model here)

  # 6. use trained model to make prediction
  np.set_printoptions(precision = 1, suppress=True)
  unknown = np.array([[0.6, 1.9, -3.3, -0.3]], dtype=np.float32)  # likely fake
  print("\nPredicting banknote authenticity for input features: ")
  print(unknown[0])

  pred_prob = model.eval({X: unknown})
  print("Prediction probability is: ")
  print("%0.4f" % pred_prob[0,0])
  if pred_prob[0,0] < 0.5:  # prob(forgery) < 0.5
    print("Prediction: authentic")
  else:
    print("Prediction: fake")

  print("\nEnd banknote classification ")

# ==================================================================================

if __name__ == "__main__":
  main()
The first change to the program is the addition of a program-defined class_acc() function to compute the classification accuracy of a mini-batch. When using the two-node classification technique, you can use the CNTK built-in classification_error() function, but the one-node technique doesn’t support classification_error(), so you must implement a program-defined function yourself.
In the main() function, you change the number of output nodes to 1 because you’re using the one-node technique:
input_dim = 4
hidden_dim = 10
output_dim = 1 # NOTE: instead of 2
When creating the neural network, you change the output activation from None to sigmoid(), and you need just one neural network object that is used for both training and prediction:
with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
  hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
    name='hidLayer')(X)
  oLayer = C.layers.Dense(output_dim, activation=C.ops.sigmoid,
    name='outLayer')(hLayer)  # change from None
model = oLayer  # NOTE: use for both training and prediction
The logistic sigmoid activation function squashes the single output node value to a value between 0.0 and 1.0, which can be interpreted as the probability of class 1. This means that if the output value is less than 0.5, the prediction is class 0; otherwise, the prediction is class 1.
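If you want to see exactly what the output node computes, the logistic sigmoid of a value z is 1 / (1 + exp(-z)). Here is a minimal sketch; the input values are made up for illustration:

import numpy as np

def log_sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))  # result is strictly between 0.0 and 1.0

print(log_sigmoid(1.71))   # approx 0.8468 -> >= 0.5, so class 1 = forgery/fake
print(log_sigmoid(-0.50))  # approx 0.3775 -> < 0.5, so class 0 = authentic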
For the one-node binary classification technique, when you set up the training error function, you use binary_cross_entropy() instead of cross_entropy_with_softmax(), and you drop the classification_error() function:
print("Creating a binary cross-entropy batch=10 SGD LR=0.01 Trainer \n")
tr_loss = C.binary_cross_entropy(model, Y) # NOTE: use model
# tr_clas = C.classification_error(model, Y) # NOTE: not available for one-node
The binary_cross_entropy() function is used when you have a single output node with a value between 0.0 and 1.0. You could use squared_error(), but binary_cross_entropy() is more principled. If you include and then use classification_error() with the one-node technique, your program will run, but the function will give you meaningless results.
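For a single training item with correct label y (0 or 1) and predicted probability p, binary cross-entropy error is -[y * ln(p) + (1 - y) * ln(1 - p)]. A quick sketch with made-up values shows why confident wrong predictions are punished heavily:

import numpy as np

def bce(y, p):
  # binary cross-entropy for one item: -[y*ln(p) + (1-y)*ln(1-p)]
  return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(bce(1.0, 0.8468))  # approx 0.166 -- correct, confident prediction -> small error
print(bce(1.0, 0.1000))  # approx 2.303 -- wrong, confident prediction -> large error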
When you set up the trainer, you use the training error/loss function, but not the classification error:
trainer = C.Trainer(model, (tr_loss), [learner]) # NOTE: no classification error
The one-node training code is essentially the same as the two-node training code, except that the trainer’s previous_minibatch_evaluation_average value isn’t available because no classification_error() function was defined. So, if you want to monitor classification accuracy during training (optional, but recommended), you must call a program-defined function:
print("Starting training \n")
for i in range(0, max_iter):
curr_batch = rdr.next_minibatch(batch_size, input_map=banknote_input_map)
trainer.train_minibatch(curr_batch)
if i % 100 == 0:
mcee = trainer.previous_minibatch_loss_average # built-in
ca = class_acc(curr_batch, X, Y, model) # program-defined
print("batch %4d: mean loss = %0.4f accuracy = %0.2f%%" % (i, mcee, ca))
print("\nTraining complete")
The helper function class_acc() is at the top of Code Listing 5-3. The function accepts a mini-batch object for which you want to compute classification accuracy, a Variable object that holds the structure of the input values, a Variable object that holds the structure of the known correct output values, and a model object that is the neural network being trained:
def class_acc(mb, x_var, y_var, model):
  num_correct = 0; num_wrong = 0
  . . .
The x and y values are pulled out of the mini-batch object like this:
x_mat = mb[x_var].asarray() # batch_size x 1 x features_dim
y_mat = mb[y_var].asarray() # batch_size x 1 x 1
This code is not at all obvious, but can be considered boilerplate. A CNTK mini-batch object is implemented as a Python dictionary. The X and Y variable objects act as keys for the dictionary, but the dictionary values must be explicitly cast as NumPy array types. The resulting arrays have three dimensions, where the first is the number of training items in the mini-batch.
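For example, with the batch size of 10 used during training, the shape comments in Code Listing 5-3 imply arrays like these (the concrete shapes here are my reading of those comments, not program output):

x_mat = mb[x_var].asarray()  # shape (10, 1, 4) -- 10 items, 1 sequence step, 4 features
y_mat = mb[y_var].asarray()  # shape (10, 1, 1) -- 10 items, 1 sequence step, 1 target value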
Next, the function walks through each training item in the mini-batch:
for i in range(mb[x_var].shape[0]):  # each item in the batch
  p = model.eval(x_mat[i])  # 1 x 1
  y = y_mat[i]              # 1 x 1
  if p[0,0] < 0.5 and y[0,0] == 0.0 or p[0,0] >= 0.5 and y[0,0] == 1.0:
    num_correct += 1
  else:
    num_wrong += 1
return (num_correct * 100.0) / (num_correct + num_wrong)
The input values are fed to the model using the eval() function, and the computed output value is returned in a matrix p, which has dimensions 1×1, so the probability value is in p[0,0]. The known correct output value is also in a 1×1 matrix.
The one-node technique program concludes by making a prediction for an unknown banknote:
pred_prob = model.eval({X: unknown})
print("Prediction probability is: ")
print("%0.4f" % pred_prob[0,0])
if pred_prob[0,0] < 0.5:  # prob(forgery) < 0.5
  print("Prediction: authentic")
else:
  print("Prediction: fake")
The return value from the call to eval() is a 1×1 matrix with a value between 0.0 and 1.0. A value less than 0.5 maps to the class encoded as 0 (authentic banknote in this case), and a value greater than or equal to 0.5 maps to the class encoded as 1 (forgery/fake banknote).
Using either the two-node technique (Code Listing 5-2) or the one-node technique (Code Listing 5-3), create, train, and evaluate a neural network binary classifier for the (processed) Cleveland Heart Disease dataset. You can find the raw data in the UCI Machine Learning Repository.
There are 303 data items, six of which have a missing value. Each item has 13 features followed by a value from 0 to 4, where 0 means no heart disease, and values 1 through 4 mean presence of heart disease.
63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
. . .
I suggest removing the six items that have missing values. Because of the different scales of features, I recommend using some form of normalization—dividing by a power of 10 works reasonably well, but min-max works better. When dividing the normalized and encoded data into a training and a test set, I suggest using 80% of the data for training (about 238 items) and 20% for testing.
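As a starting point for the data preparation, here’s a minimal sketch of min-max normalization and binary re-encoding of the heart-disease labels. The file name and the |symptoms / |disease tag names are my assumptions, and the sketch presumes you’ve already removed the six items with missing values:

# prep_cleveland.py -- sketch only; file name and tag names are assumptions
import numpy as np

data = np.loadtxt(".\\cleveland_297.txt", delimiter=",")   # 297 items, missing values removed
x = data[:, 0:13]                          # 13 predictor columns
y = (data[:, 13] > 0).astype(np.float32)   # 0 = no disease; 1 through 4 -> 1 = disease

mins = x.min(axis=0); maxs = x.max(axis=0)
x_norm = (x - mins) / (maxs - mins)        # min-max normalization to [0.0, 1.0]

for i in range(len(y)):                    # emit CNTK one-node format for scraping
  feats = " ".join("%0.6f" % v for v in x_norm[i])
  print("|symptoms %s |disease %0.0f" % (feats, y[i]))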