CHAPTER 6

Neural Network Regression


The goal of a regression problem is to predict a numeric value from one or more predictor variables. For example, suppose you want to predict the median value of a house in one of 100 towns near Boston. You have data that includes a crime statistic for each town, the age of the houses in each town, a measure of the distance from each town to Boston, the pupil-to-teacher ratio in each town, a racial demographic statistic for each town, and the median house value in each town. Using the first five predictor variables, you want to create a model to predict median house value.

Figure 6-1: Median Home Value Regression

You could create a linear regression model along the lines of Y = a0 + (a1)(crime) + (a2)(age) + (a3)(distance) + (a4)(ratio) + (a5)(racial) where Y is the predicted median value, a0 is a constant, and a1 through a5 are constants associated with the five predictor variables. An alternative approach, which can often create a more accurate prediction model, is to use a neural network.
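For comparison, here is a minimal sketch of the linear approach using ordinary least squares with NumPy. It is not part of the demo program, and the array names train_x and train_y are placeholders for data loaded elsewhere:

# Sketch of the linear regression alternative (not part of the demo program).
# train_x (n x 5 predictors) and train_y (n x 1 values) are placeholder names.
import numpy as np

train_x = np.random.rand(80, 5).astype(np.float32)   # stand-in data
train_y = np.random.rand(80, 1).astype(np.float32)   # stand-in data

ones = np.ones((train_x.shape[0], 1), dtype=np.float32)
design = np.hstack((ones, train_x))       # prepend a column of 1s for a0
coefs, _, _, _ = np.linalg.lstsq(design, train_y, rcond=None)
# coefs[0] is a0; coefs[1] through coefs[5] are a1 through a5
predicted = design @ coefs                # predicted median values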

Preparing the Boston area house values data

The Boston area house values dataset is a well-known benchmark collection dating from a 1978 research paper. The full dataset has 14 attributes/variables, and 506 instances. You can find the full dataset here. For simplicity, the demo program shown in Figure 6-1 uses just six of the 14 attributes (five as predictors, one as a value-to-predict), and 100 instances (80 training and 20 test).

The value to predict is the median house price in a town or census tract. The first predictor is per capita crime in the town or United States census tract, so you’d expect smaller values to be associated with higher house values. The second predictor is the proportion of owner-occupied units built before 1940, so larger values mean older houses, but it’s not obvious whether older houses would be associated with higher or lower house values. The third predictor is a weighted distance from the town to five Boston employment centers. The fourth predictor is the area school pupil-to-teacher ratio. The fifth predictor is an indirect metric of the proportion of black residents in the town (computed as 1000 * (proportion_Black - 0.63)^2), so you’d expect higher values to be associated with lower house values. The values to predict are median house values that have been divided by 1,000, so, for example, 25.50 means $25,500.00 (homes were much less expensive in 1978).
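As a quick numeric check of the two scalings just described (illustrative values only, not demo code):

# Illustrative check of the racial metric and the price scaling.
prop_black = 0.98                        # hypothetical proportion for one town
racial = 1000 * (prop_black - 0.63)**2   # about 122.5
medval = 25.50                           # value as stored in the data file
dollars = medval * 1000                  # 25,500 dollars
print(racial, dollars)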

Using the first 80 items of the full dataset, I created an 80-item tab-delimited training data file that looks like this:

|predictors  1.612820  96.90  3.76  21.00  248.31  |medval  13.50
|predictors  0.064170  68.20  3.36  19.20  396.90  |medval  18.90
|predictors  0.097440  61.40  3.38  19.20  377.56  |medval  20.00
. . .

I also created a 20-item test set with the same format. You can find both datasets in the Appendix of this e-book. Because there are five predictor variables, it’s not possible to graph the dataset, but you can get an idea of the structure of the data by the graph in Figure 6-2. The graph plots the 80-item training set with the age attribute (proportion of houses built before 1940) on the x-axis, and median house value on the y-axis. Linear regression would fit a nearly horizontal line through the middle of the data, meaning the prediction equation would predict a median house value of about 22.50 regardless of age—and result in a poor prediction model.

For simplicity, I did not normalize the data, but this is an example of data that should definitely be normalized. Notice that the racial predictor variable, with values like 396.55, is much larger than the crime variable, with values like 0.0131. One simple approach would be to scale all predictor variables so that most are between 1.0 and 10.0—you could multiply crime by 10, divide age by 10, leave distance alone, divide ratio by 10, and divide racial by 100.

Another approach would be to use min-max normalization or z-score normalization. In fact, some experiments showed that all three approaches (order-of-magnitude scaling, min-max, and z-score normalization) gave a significantly better predictive model.
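For example, here is a minimal min-max normalization sketch, assuming the 80-by-5 matrix of raw predictor values has been loaded into a NumPy array (the name train_x is mine, not the demo program's):

# Min-max normalization sketch (not part of the demo program).
import numpy as np

train_x = np.random.rand(80, 5).astype(np.float32) * 100.0  # stand-in data

col_min = train_x.min(axis=0)                        # per-column minimums
col_max = train_x.max(axis=0)                        # per-column maximums
train_x = (train_x - col_min) / (col_max - col_min)  # every value now in [0, 1]

If you normalize, the same column minimums and maximums computed from the training data should also be used to normalize the test data and any new data you make predictions for.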

Figure 6-2: Partial Boston Area Median House Values Data

In situations in which you have a dataset with a large number of variables, some of those variables may not be useful for a prediction model, and including some variables may actually create a worse model than you’d get by leaving those variables out. Determining which predictor variables to use, and which not to use, is called feature selection.

The neural network regression program

The program code that generated the output shown in the screenshot in Figure 6-1 is presented in Code Listing 6-1. After importing the required NumPy and CNTK packages, the demo program defines a helper function to read data from a CNTK format file into a mini-batch object:

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  # rnd_order -> usually True for training
  # sweeps -> usually C.io.INFINITELY_REPEAT for training OR 1 for eval
  x_strm = C.io.StreamDef(field='predictors', shape=input_dim, is_sparse=False)
  y_strm = C.io.StreamDef(field='medval', shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  # streams = C.variables.Record(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
  return mb_src

You can consider the code in this helper function as boilerplate for neural network regression problems. Because CNTK is so new and evolving so quickly, sometimes the documentation gets out of sync with the code. For example, at the time I’m writing this e-book, the documentation makes no mention of the StreamDefs() function. It’s not clear whether StreamDefs() has been deprecated or is just missing from the documentation.

If you apply the type() function to the return result streams, you’ll see that it is type cntk.variables.Record, so you could write the statement as:

  streams = C.variables.Record(x_src=x_strm, y_src=y_strm)

Using the Python type() function to examine CNTK objects is an indispensable debugging technique.
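For example, you could temporarily add statements like these inside create_reader() (the commented output is what I’d expect; your exact type names may differ slightly across CNTK versions):

  # temporary debugging statements inside create_reader()
  print(type(streams))   # expected: <class 'cntk.variables.Record'>
  print(type(mb_src))    # expected: <class 'cntk.io.MinibatchSource'>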

Code Listing 6-1: Neural Network Regression

# boston_reg.py

# CNTK 2.3 with Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)

# Predict median value of a house in an area near Boston based on area's
# crime rate, age of houses, distance to Boston, pupil-teacher
# ratio, racial demographic statistic

# boston_train_cntk.txt - 80 items
# boston_test_cntk.txt - 20 items

import numpy as np
import cntk as C

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  # rnd_order -> usually True for training
  # sweeps -> usually C.io.INFINITELY_REPEAT for training OR 1 for eval
  x_strm = C.io.StreamDef(field='predictors', shape=input_dim, is_sparse=False)
  y_strm = C.io.StreamDef(field='medval', shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  # streams = C.variables.Record(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
  return mb_src

def mb_accuracy(mb, x_var, y_var, model, delta):
  num_correct = 0
  num_wrong = 0
  x_mat = mb[x_var].asarray()  # batch_size x 1 x features_dim
  y_mat = mb[y_var].asarray()  # batch_size x 1 x 1

  # for i in range(mb[x_var].shape[0]):  # each item in the batch
  for i in range(len(mb[x_var])):
    v = model.eval(x_mat[i])           # 1 x 1 predicted value
    y = y_mat[i]                       # 1 x 1 actual value
    if np.abs(v[0,0] - y[0,0]) < delta:  # close enough?
      num_correct += 1
    else:
      num_wrong += 1

  return (num_correct * 100.0) / (num_correct + num_wrong)

# ==================================================================================

def main():
  print("\nBegin median house value regression \n")
  print("Using CNTK version = " + str(C.__version__) + "\n")

  input_dim = 5  # crime, age, distance, pupil-teach, racial
  hidden_dim = 20
  output_dim = 1  # median value (x$1000)

  train_file = ".\\Data\\boston_train_cntk.txt"
  test_file = ".\\Data\\boston_test_cntk.txt"

  # data resembles:
  # |predictors  0.041130  33.50  5.40  19.00  396.90  |medval  28.00
  # |predictors  0.068600  62.50  3.50  18.00  393.53  |medval  33.20

  # 1. create network
  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)

  print("Creating a 5-20-1 tanh-none regression NN for partial Boston dataset ")
  with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
      name='hidLayer')(X)
    oLayer = C.layers.Dense(output_dim, activation=None,
      name='outLayer')(hLayer)
  model = C.ops.alias(oLayer)  # alias

  # 2. create learner and trainer
  print("Creating a squared error batch=5 variable SGD LR=0.02 Trainer \n")
  tr_loss = C.squared_error(model, Y)

  max_iter = 3000
  batch_size = 5
  base_learn_rate = 0.02
  sch = C.learning_parameter_schedule([base_learn_rate, base_learn_rate/2],
    minibatch_size=batch_size,
    epoch_size=int((max_iter*batch_size)/2))
  learner = C.sgd(model.parameters, sch)
  trainer = C.Trainer(model, (tr_loss), [learner])

  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  boston_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }

  # 4. train
  print("Starting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, input_map=boston_input_map)
    trainer.train_minibatch(curr_batch)
    if i % int(max_iter/10) == 0:
      mcee = trainer.previous_minibatch_loss_average
      acc = mb_accuracy(curr_batch, X, Y, model, delta=3.00)  # program-defined
      print("batch %4d: mean squared error = %8.4f  accuracy = %5.2f%%" \
        % (i, mcee, acc))
  print("\nTraining complete")

  # 5. evaluate test data (cannot use trainer.test_minibatch)
  print("\nEvaluating test data using program-defined mb_accuracy() \n")
  rdr = create_reader(test_file, input_dim, output_dim,
    rnd_order=False, sweeps=1)
  boston_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  num_test = 20
  all_test = rdr.next_minibatch(num_test, input_map=boston_input_map)
  acc = mb_accuracy(all_test, X, Y, model, delta=3.00)
  print("Prediction accuracy on the 20 test items = %0.2f%%" % acc)

  # (could save model here)

  # 6. use trained model to make prediction
  np.set_printoptions(precision=2, suppress=True)
  unknown = np.array([[0.09, 50.00, 4.5, 17.00, 350.00]], dtype=np.float32)
  print("\nPredicting area median home value for feature/predictor values: ")
  print(unknown[0])

  pred_value = model.eval({X: unknown})
  print("\nPredicted home value is: ")
  print("$%0.2f (x1000)" % pred_value[0,0])

  print("\nEnd median house value regression ")

# ==================================================================================

if __name__ == "__main__":
  main()

When working with neural network regression, you’ll need to define a custom accuracy function. With classification problems, a prediction is either correct or wrong. But with regression problems, you must define what it means for an output value to be correct—how close is close enough?

The demo program defines a helper function that accepts a CNTK mini-batch object and computes a custom accuracy metric. The definition begins:

def mb_accuracy(mb, x_var, y_var, model, delta):
  num_correct = 0
  num_wrong = 0

The function accepts a CNTK mini-batch object of training data, a model to evaluate, and a delta value that defines how close a predicted output value must be to the known correct value in order to be considered correct. The x_var and y_var parameters aren’t necessary from a conceptual point of view, but they’re required to access data in the mini-batch object:

  x_mat = mb[x_var].asarray()  # batch_size x 1 x features_dim
  y_mat = mb[y_var].asarray()  # batch_size x 1 x 1

A CNTK mini-batch object is essentially a Python dictionary that has CNTK Variable objects as keys. The values must be explicitly coerced into three-dimensional NumPy array objects using the asarray() function.

The first dimension of the array shape is the number of items in the mini-batch collection, so that value can be used to iterate through the collection. The input values from each item are fed to the regression model, and the output values are computed (using the current weights and bias values):

  for i in range(mb[x_var].shape[0]):  # each item in the batch
    v = model.eval(x_mat[i])           # 1 x 1 predicted value
    y = y_mat[i]                       # 1 x 1 actual value
  . . .

Because the len() function applied to an n-dimensional NumPy array returns the size of the first dimension, you can also iterate using for i in range(len(mb[x_var])) instead of using shape directly. The return value from the call to eval() is a 1x1 matrix. Information like this is not obvious and is best determined during program debugging by inserting statements that display objects’ shape properties.
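For example, temporary statements like these inside mb_accuracy() will reveal the shapes involved (the commented values assume a batch of five items and five predictors):

  # temporary shape checks inside mb_accuracy()
  print(x_mat.shape)                 # e.g., (5, 1, 5) = batch_size x 1 x input_dim
  print(y_mat.shape)                 # e.g., (5, 1, 1)
  print(model.eval(x_mat[0]).shape)  # (1, 1), a 1x1 matrix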

The mb_accuracy() function computes and returns the percentage of correct predictions:

. . .
    if np.abs(v[0,0] - y[0,0]) <  delta:  # close enough?
      num_correct += 1
    else:
      num_wrong += 1
  return (num_correct * 100.0) / (num_correct + num_wrong)

An alternative is to return the accuracy as a proportion such as 0.6500, rather than a percentage like 65.00%. Instead of counting a predicted output value as correct if it is within a fixed value of the correct output value, you can check if the computed value is within a specified percentage/proportion of the correct output value. For example, you might want to count a predicted area median house value correct if it’s within 10% of the true value.
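Here is a sketch of such a variant, assuming the same numpy import as the listing. It returns a proportion rather than a percentage and counts a prediction correct if it’s within pct (for example, 0.10 for 10 percent) of the actual value; it is not part of the demo program:

# Variant accuracy: correct if within pct of the actual value (sketch only).
def mb_accuracy_pct(mb, x_var, y_var, model, pct):
  num_correct = 0
  num_wrong = 0
  x_mat = mb[x_var].asarray()
  y_mat = mb[y_var].asarray()
  for i in range(len(mb[x_var])):
    v = model.eval(x_mat[i])                # 1 x 1 predicted value
    y = y_mat[i]                            # 1 x 1 actual value
    if np.abs(v[0,0] - y[0,0]) < np.abs(pct * y[0,0]):  # within pct of actual?
      num_correct += 1
    else:
      num_wrong += 1
  return (num_correct * 1.0) / (num_correct + num_wrong)  # proportion, e.g., 0.6500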

The demo regression program prepares to create a neural network like so:

  input_dim = 5  # crime, age, distance, pupil-teach, racial
  hidden_dim = 20
  output_dim = 1  # median value (x$1000)

  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim, np.float32)

There are no good rules of thumb for determining the number of hidden nodes to use. It is possible to create a regression model with two or more output nodes, but such problems are quite rare.

The model is created with these statements:

  with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh,
      name='hidLayer')(X)
    oLayer = C.layers.Dense(output_dim, activation=None,
     name='outLayer')(hLayer)
  model = C.ops.alias(oLayer)  # alias

Note that there is no activation function applied to the single output node. This is sometimes called applying the identity function, which is just f(x) = x.

The learner algorithm object and trainer object are created with a variable learning rate:

  tr_loss = C.squared_error(model, Y)
  max_iter = 3000
  batch_size = 5
  base_learn_rate = 0.02
  sch = C.learning_parameter_schedule([base_learn_rate, base_learn_rate/2],
    minibatch_size=batch_size,
    epoch_size=int((max_iter*batch_size)/2))
  learner = C.sgd(model.parameters, sch)
  trainer = C.Trainer(model, (tr_loss), [learner])

The demo program uses squared_error() because cross-entropy is not applicable for regression problems. It is possible to extend CNTK to use a custom error function, but that’s a topic outside of the scope of this e-book.

Instead of using a fixed learning rate, the demo creates a learning rate schedule that uses a learning rate of 0.02 for the first half of the 3,000 training iterations, and then a rate of 0.01 for the second half. In general, learning rate schedules aren’t necessary for simple neural networks, but they’re often useful for deep neural networks with many hidden layers.
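As a rough sketch of how the schedule arguments in the listing produce that behavior (my arithmetic, using the demo’s values):

  # how the demo's schedule values work out (sketch, not part of the program)
  max_iter = 3000
  batch_size = 5
  epoch_size = int((max_iter * batch_size) / 2)   # 7500 samples
  # each rate in the list [0.02, 0.01] is used for epoch_size samples:
  # 7500 samples / batch size of 5 = 1500 minibatches at 0.02, then 1500 at 0.01
  iters_at_first_rate = epoch_size // batch_size  # 1500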

A reader for the training data is created in the usual way:

  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  boston_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }

Instead of reusing a single rdr object for training and testing, some of my colleagues prefer creating two objects, such as rdr_train and rdr_test.
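That alternative is just two calls to the same helper function, along these lines:

  # alternative: separate reader objects for training and test data
  rdr_train = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  rdr_test = create_reader(test_file, input_dim, output_dim,
    rnd_order=False, sweeps=1)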

Training is performed with these statements:

  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, input_map=boston_input_map)
    trainer.train_minibatch(curr_batch)
    if i % int(max_iter/10) == 0:
      mcee = trainer.previous_minibatch_loss_average
      acc = mb_accuracy(curr_batch, X, Y, model, delta=3.00)  # program-defined
      print("batch %4d: mean squared error = %8.4f  accuracy = %5.2f%%" \
        % (i, mcee, acc))

Just as with classification, it’s important to monitor error/loss during regression training, because training can often fail in spectacular fashion. Here, the average squared error is displayed every 1/10 of the specified iterations. Mean squared error is a bit easier to interpret than cross-entropy error. For example, if the mean squared error is 25.00, then the root mean squared error is 5.00, which means an actual median house price of 30.00 ($30,000) is, roughly speaking, being predicted as about 25.00 or 35.00.

The demo program concludes by making a prediction for new, previously unseen predictor values. The prediction is prepared:

  unknown = np.array([[0.09, 50.00, 4.5, 17.00, 350.00]], dtype=np.float32)  # 1x5 
  print("\nPredicting area median home value for feature/predictor values: ")
  print(unknown[0])

The prediction is made like this:

  pred_value = model.eval({X: unknown}) 
  print("\nPredicted home value is: ")
  print("$%0.2f (x1000)" % pred_value[0,0])
  print("\nEnd median house value regression ")

Because of Python’s permissiveness, you can also pass the unknown matrix directly to eval():

  pred_value = model.eval(unknown)

The Python language’s lax handling of argument types isn’t always a benefit. It allows many different ways to write the same code, which is reflected in the somewhat inconsistent code examples you’ll find online.

Exercise

Using the program in Code Listing 6-1 as a guide, create, train, and evaluate a neural network regression model for the Yacht Hydrodynamics dataset. You can find the raw data here. There are 308 data items. Each item has six predictor values that describe the shape of a yacht hull, followed by the value to predict, which is residuary resistance. The data looks like this:

-2.3 0.568 4.78 3.99 3.17 0.125 0.11

-2.3 0.568 4.78 3.99 3.17 0.150 0.27

-2.3 0.568 4.78 3.99 3.17 0.175 0.47
. . .

I recommend normalizing the second and sixth predictors by multiplying by 10, or by normalizing all predictor values using min-max normalization. When you split the 308 items into a training set and a test set, I suggest using 80% for training (about 246 items), and the remaining 20% (62 items) for testing.
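As a starting point, here is a minimal sketch of one way to convert the raw space-delimited yacht data into CNTK CTF-format training and test files. The file names and the |resistance tag are my own arbitrary choices, not part of the exercise:

# Sketch: convert raw yacht data (6 predictors + 1 target per line) to CTF
# format and split roughly 80/20. File and tag names are arbitrary choices.
import numpy as np

raw = np.loadtxt("yacht_raw.txt", dtype=np.float32)   # 308 x 7
np.random.seed(1)
np.random.shuffle(raw)                                # shuffle rows in place
n_train = int(0.80 * len(raw))                        # about 246 items

def write_ctf(fname, data):
  with open(fname, "w") as f:
    for row in data:
      preds = " ".join("%0.6f" % v for v in row[0:6])
      f.write("|predictors %s |resistance %0.6f\n" % (preds, row[6]))

write_ctf("yacht_train_cntk.txt", raw[:n_train])
write_ctf("yacht_test_cntk.txt", raw[n_train:])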

