CHAPTER 3
The goal of a regression problem is to make a prediction where the variable to predict is a single numeric value. For example, you might want to predict the annual income of a person based on their age, sex, political-party affiliation, and so on.

Figure 3-1: Regression using Keras
The screenshot in Figure 3-1 shows a demonstration of regression using a deep neural network. The demo program begins by loading 506 data items into memory. Each item represents the median house price in one of 506 towns near Boston. After loading, the demo programmatically splits the dataset into a training set and a test set.
Behind the scenes, the demo program creates a 13-(10-10)-1 deep neural network. Then, the network is trained using 500 epochs. After training, the regression model has 68.56 percent accuracy on the training data, and 70.59 percent accuracy on the held-out test data.
The demo program concludes by making a prediction for a hypothetical, previously unseen town near Boston. The predicted median house price is $8,049.89 (the data comes from the 1970s when house prices were much lower than they are today).
The Boston Housing dataset is a fairly well-known benchmark for regression problems. There are a total of 506 items. The raw data looks like this:
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
Each line has 14 fields. The first 13 are the predictor values. The last value is the median home price in the town, divided by 1,000, so the first town shown has a median house price of $24,000.
The first three predictors, [0] to [2], are per capita crime rate, proportion of land zoned for large residential lots, and proportion of non-retail business acres. Predictor [3] is a Boolean that indicates whether the town borders the Charles River (0 = no, 1 = yes).
Briefly, the remaining predictors are: [4] = air pollution metric, [5] = average number of rooms per house, [6] = proportion of old houses, [7] = weighted distance to Boston, [8] = index of accessibility to highways, [9] = tax rate, [10] = pupil-teacher ratio, [11] = measure of the proportion of Black residents, and [12] = percentage of lower socio-economic status residents.
Because the data has 14 dimensions, it's not possible to easily visualize it. However, you can get a rough idea of the data by looking at the partial graph in Figure 3-2.

Figure 3-2: Partial Boston Housing Data
The graph plots just median price as a function of predictor [6], proportion of old houses, for the first 100 of the 506 data items. You can see from the graph that it's not possible to create an accurate prediction model based on just the old-houses proportion variable.
The raw data was preprocessed. The Boolean predictor [3] was converted from (0, 1) dummy encoding to (-1, +1) encoding. The 12 other (numeric) predictor variables were min-max normalized, resulting in all values being between 0.0 and 1.0. The target median house price variable, which was already divided by 1,000, was further divided by 10 so that all values are between 1.0 and 5.0. The resulting data looks like:
0.000000 0.180000 0.067815 -1 . . . 2.400000
0.000236 0.000000 0.242302 -1 . . . 2.160000
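To make the preprocessing concrete, here is a minimal sketch of how the min-max normalization and (-1, +1) encoding might be done with NumPy. The raw file name boston_raw.txt is an illustrative assumption, not a file used by the demo program:
import numpy as np

raw = np.loadtxt("boston_raw.txt", dtype=np.float32)  # hypothetical raw file: 506 rows, 14 columns
norm = raw.copy()
for j in range(13):                # predictor columns [0] through [12]
  if j == 3:                       # Boolean Charles River predictor
    norm[:, j] = np.where(raw[:, j] == 0, -1.0, 1.0)    # (0, 1) to (-1, +1)
  else:
    lo = raw[:, j].min(); hi = raw[:, j].max()
    norm[:, j] = (raw[:, j] - lo) / (hi - lo)            # min-max normalize to [0.0, 1.0]
norm[:, 13] = raw[:, 13] / 10.0    # price is already in thousands; divide by 10 more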
Somewhat fortunately, other than predictor [3], all the predictor variables are numeric, so there's no need for 1-of-(N-1) or possibly one-hot encoding of categorical variables.
The complete program that generated the output shown in Figure 3-1 is shown in Code Listing 3-1. The program begins with comments that record the program file name and the versions of Python, TensorFlow, and Keras that were used, and then imports the NumPy, Keras, TensorFlow, and OS packages:
# boston_reg.py
# Python 3.5.2, TensorFlow 1.7.0, Keras 2.1.5
import numpy as np
import keras as K
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
In a non-demo scenario, you'd want to include additional details in the comments. Because Keras and TensorFlow are under rapid development, you should always document which versions are being used. Version incompatibilities can be a significant problem when working with Keras and other open-source software.
Code Listing 3-1: Boston Housing Regression Program
# boston_reg.py
# Python 3.5.2, TensorFlow 1.7.0, Keras 2.1.5
# ==================================================================================

import numpy as np
import keras as K
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

class MyLogger(K.callbacks.Callback):
  def __init__(self, n, data_x, data_y, pct_close):
    self.n = n
    self.data_x = data_x
    self.data_y = data_y
    self.pct_close = pct_close

  def on_epoch_end(self, epoch, logs={}):
    if epoch % self.n == 0:
      curr_loss = logs.get('loss')
      total_acc = my_accuracy(self.model, self.data_x,
        self.data_y, self.pct_close)
      print("epoch = %4d  curr batch loss (mse) = %0.6f  overall acc = %0.2f%%" % \
        (epoch, curr_loss, total_acc * 100))

def my_accuracy(model, data_x, data_y, pct_close):
  num_correct = 0; num_wrong = 0
  n = len(data_x)
  for i in range(n):
    predicted = model.predict(np.array([data_x[i]], dtype=np.float32))  # [[x]]
    actual = data_y[i]
    if np.abs(predicted[0][0] - actual) < np.abs(pct_close * actual):
      num_correct += 1
    else:
      num_wrong += 1
  return (num_correct * 1.0) / (num_correct + num_wrong)

# ==================================================================================

def main():
  # 0. get started
  print("\nBoston Houses dataset regression example ")
  np.random.seed(2)
  tf.set_random_seed(3)
  kv = K.__version__
  print("Using Keras: ", kv, "\n")

  # 1. load data
  print("Loading Boston data into memory ")
  data_file = ".\\Data\\boston_mm_tab.txt"
  all_data = np.loadtxt(data_file, delimiter="\t",
    skiprows=0, dtype=np.float32)

  N = len(all_data)
  indices = np.arange(N)
  np.random.shuffle(indices)
  n_train = int(0.80 * N)

  print("Splitting data into training and test sets \n")
  data_x = all_data[indices,:-1]
  data_y = all_data[indices,-1]
  train_x = data_x[0:n_train,:]
  train_y = data_y[0:n_train]
  test_x = data_x[n_train:N,:]
  test_y = data_y[n_train:N]

  # 2. define model
  init = K.initializers.RandomUniform(seed=1)
  simple_sgd = K.optimizers.SGD(lr=0.010)
  model = K.models.Sequential()
  model.add(K.layers.Dense(units=10, input_dim=13,
    kernel_initializer=init, activation='tanh'))  # hidden layer
  model.add(K.layers.Dense(units=10,
    kernel_initializer=init, activation='tanh'))  # hidden layer
  model.add(K.layers.Dense(units=1,
    kernel_initializer=init, activation=None))    # output layer
  model.compile(loss='mean_squared_error',
    optimizer=simple_sgd, metrics=['mse'])

  # 3. train model
  batch_size = 8
  max_epochs = 500
  my_logger = MyLogger(int(max_epochs/5), train_x, train_y, 0.15)
  print("Starting training ")
  h = model.fit(train_x, train_y, batch_size=batch_size,
    epochs=max_epochs, verbose=0, callbacks=[my_logger])
  print("Training finished \n")

  # 4. evaluate model
  acc = my_accuracy(model, train_x, train_y, 0.15)
  print("Overall accuracy (within 15%%) on training data = %0.4f" % acc)
  acc = my_accuracy(model, test_x, test_y, 0.15)
  print("Overall accuracy on test data = %0.4f \n" % acc)
  eval = model.evaluate(train_x, train_y, verbose=0)
  print("Overall loss (mse) on training data = %0.6f" % eval[0])
  eval = model.evaluate(test_x, test_y, verbose=0)
  print("Overall loss (mse) on test data = %0.6f" % eval[0])

  # 5. save model
  print("\nSaving Boston model to disk \n")
  mp = ".\\Models\\boston_model.h5"
  model.save(mp)

  # 6. use model
  np.set_printoptions(precision=1)
  unknown = np.full(shape=(1,13), fill_value=0.6, dtype=np.float32)
  unknown[0][3] = -1.0  # binary feature
  predicted = model.predict(unknown)
  print("Using model to predict median house price for features: ")
  print(unknown)
  print("\nPredicted price is: ")
  print("$%0.2f" % (predicted * 10000))

# ==================================================================================

if __name__=="__main__":
  main()
The program imports the entire Keras package and assigns it the alias K. An alternative approach is to import just the modules you need, for example:
from keras.models import Sequential
from keras.layers import Dense, Activation
Even though Keras uses TensorFlow as its backend engine, you don't need to explicitly import TensorFlow, except in order to set its random seed. The OS package is imported only so that an annoying TensorFlow startup warning message will be suppressed.
The program structure consists of a single main function, plus the helper class MyLogger, and the helper function my_accuracy(). These are needed because in a regression problem, there's no inherent definition of accuracy—you must define how close a predicted value must be to a correct training value in order to be considered a correct prediction.
The MyLogger class initializer is defined:
class MyLogger(K.callbacks.Callback):
  def __init__(self, n, data_x, data_y, pct_close):
    self.n = n
    self.data_x = data_x
    self.data_y = data_y
    self.pct_close = pct_close
. . .
The class inherits from the Keras Callback base class. As you'll see shortly, a Callback object is something that can be invoked automatically during training via the fit() function. The MyLogger initializer function (similar to a constructor in other programming languages) accepts a pct_close parameter, which specifies how close a predicted value must be to a target value.
The MyLogger class functionality is defined like so:
def on_epoch_end(self, epoch, logs={}):
  if epoch % self.n == 0:
    curr_loss = logs.get('loss')
    total_acc = my_accuracy(self.model, self.data_x,
      self.data_y, self.pct_close)
    print("epoch = %4d  curr batch loss (mse) = %0.6f  overall acc = %0.2f%%" % \
      (epoch, curr_loss, total_acc * 100))
Note that Python uses the backslash character for line continuation. The on_epoch_end() method is inherited from the base Callback class. It will trigger automatically after each training epoch finishes. The program restricts output to every n epochs using the modulus (%) operator.
The method fetches the built-in loss metric value from the logs dictionary collection. Then, the method calls the program-defined my_accuracy() function to compute the prediction accuracy over the entire training data (not just the current batch). The logger object displays the accuracy metric as a percentage (such as 85.12%) rather than a proportion (such as 0.8512), but this is a subjective matter of personal preference.
The program-defined accuracy function is:
def my_accuracy(model, data_x, data_y, pct_close):
  num_correct = 0; num_wrong = 0
  n = len(data_x)
  for i in range(n):
    predicted = model.predict(np.array([data_x[i]], dtype=np.float32))  # [[x]]
    actual = data_y[i]
    if np.abs(predicted[0][0] - actual) < np.abs(pct_close * actual):
      num_correct += 1
    else:
      num_wrong += 1
  return (num_correct * 1.0) / (num_correct + num_wrong)
There are a couple of tricky syntax issues. The predict() method expects a matrix, but data_x[i] is a vector, so the values are passed as [data_x[i]]. When using Keras, you shouldn't underestimate the frequency of running into problems with the shape of various objects.
The predicted house median price return value from predict() is a single value stored in a NumPy array-of-arrays matrix, so the scalar value itself is at [0][0].
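To make the shapes concrete, here is a minimal sketch (the numeric values shown in the comments are illustrative):
x = np.array([train_x[0]], dtype=np.float32)  # shape (1, 13): a matrix with one row
pred = model.predict(x)                       # shape (1, 1), for example [[2.37]]
price = pred[0][0]                            # the scalar prediction, for example 2.37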
The demo program begins execution like so:
def main():
  # 0. get started
  print("\nBoston Houses dataset regression example ")
  np.random.seed(2)
  tf.set_random_seed(3)
  kv = K.__version__
  print("Using Keras: ", kv, "\n")
. . .
Setting the NumPy and TensorFlow global random seeds is an attempt to get reproducible results. The seed values, 2 and 3, are arbitrary. The demo prints the Keras version just to show how it's done. The entire 506-item normalized data is read into memory using these statements:
# 1. load data
print("Loading Boston data into memory ")
data_file = ".\\Data\\boston_mm_tab.txt"
all_data = np.loadtxt(data_file, delimiter="\t", skiprows=0, dtype=np.float32)
The NumPy loadtxt() function is simple and versatile. The strategy used by the demo program is to read all data into memory, and then split it into training and test matrices. The primary alternative is to split the dataset before loading into memory. Reading then splitting has the advantage of keeping your data file(s) simpler at the expense of slightly more program code.
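For comparison, here is a minimal sketch of the split-first alternative, assuming hypothetical pre-split files named boston_train.txt and boston_test.txt:
train = np.loadtxt(".\\Data\\boston_train.txt", delimiter="\t", dtype=np.float32)
test = np.loadtxt(".\\Data\\boston_test.txt", delimiter="\t", dtype=np.float32)
train_x = train[:, 0:13]; train_y = train[:, 13]
test_x = test[:, 0:13]; test_y = test[:, 13]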
The demo prepares the data split:
N = len(all_data)
indices = np.arange(N)
np.random.shuffle(indices)
n_train = int(0.80 * N)
The len() function, applied to an n-dimensional NumPy array, returns the number of items in the first dimension. In this case, that is the number of rows in the entire dataset. An alternative is to use the size() function and explicitly specify the dimension:
N = np.size(all_data, 0) # 0 = rows, 1 = cols
The call to arange(N) ("array-range," not "arrange") returns an array of integers from 0 to N-1 inclusive. The shuffle() function rearranges the values of its array argument in place, so you don't need to assign a return value. The number of training items is computed as 80 percent of the total, in this case int(0.80 * 506) = 404 items, leaving 102 items for test purposes.
Splitting the data is done by these statements:
data_x = all_data[indices,:-1]
data_y = all_data[indices,-1]
train_x = data_x[0:n_train,:]
train_y = data_y[0:n_train]
test_x = data_x[n_train:N,:]
test_y = data_y[n_train:N]
The indexing [indices,:-1] means all rows in scrambled order, and all columns except the last column. Indexing [indices,-1] means all rows in scrambled order, and just the last column. Indexing [0:n_train,:] means rows 0 to n_train-1 inclusive, and all columns. Indexing [0:n_train], applied to the one-dimensional data_y, means items 0 to n_train-1 inclusive. Indexing [n_train:N,:] means rows n_train to N-1 inclusive, and all columns. Indexing [n_train:N] means items n_train to N-1 inclusive. Whew!
Personally, I don't find NumPy-array indexing anywhere near intuitive, and when I work with indexing, I always have to look up the syntax rules in online documentation.
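If you find the indexing confusing too, a small worked example may help. Assuming a tiny 4x3 array, the slicing behaves like this:
a = np.arange(12).reshape(4, 3)  # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]
idx = np.array([2, 0, 3, 1])     # a scrambled row order
a[idx, :-1]   # scrambled rows, all columns but the last: [[6 7], [0 1], [9 10], [3 4]]
a[idx, -1]    # scrambled rows, just the last column: [8 2 11 5]
a[0:2, :]     # rows 0 and 1, all columns: [[0 1 2], [3 4 5]]
a[2:4, :]     # rows 2 and 3, all columns: [[6 7 8], [9 10 11]]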
The deep neural regression model is defined by this code:
init = K.initializers.RandomUniform(seed=1)
simple_sgd = K.optimizers.SGD(lr=0.010)
model = K.models.Sequential()
model.add(K.layers.Dense(units=10, input_dim=13, kernel_initializer=init,
activation='tanh'))
model.add(K.layers.Dense(units=10, kernel_initializer=init,
activation='tanh'))
model.add(K.layers.Dense(units=1, kernel_initializer=init,
activation=None))
model.compile(loss='mean_squared_error', optimizer=simple_sgd, metrics=['mse'])
The network model has 13 input nodes, two hidden layers, each with 10 nodes, and one output node. Therefore, the network has (13 * 10) + 10 + (10 * 10) + 10 + (10 * 1) + 1 = 261 weights and biases. You can get this information programmatically by calling the model.summary() function without any parameters.
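As a quick sanity check, you can display the layer structure and verify the parameter count after the model has been defined. A minimal sketch:
model.summary()              # prints each layer's output shape and parameter count
print(model.count_params())  # 261 for the 13-(10-10)-1 network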
The demo program prepares the model by setting up a weight initializer using the RandomUniform() function with a seed value of 1, plus default parameter values of minval = -0.05 and maxval = +0.05. The demo uses a stochastic gradient descent optimizer. The default learning rate is 0.01, so the demo code could have omitted the explicit assignment. Other default parameter values are momentum = 0.0 (no momentum), decay = 0.0 (no decay), and nesterov = False (no Nesterov momentum used).
The model is defined using the Sequential() layers syntax. There is no explicit input layer, so the first Dense() layer added is the first hidden layer. The activation on both hidden layers is tanh, which is often used for regression networks, but relu sometimes works better. Both hidden layers have 10 nodes. It's common practice, but not required, to specify the same number of nodes for all hidden layers in a regression network.
The output layer has a single node. Because the output node represents the median house price, it can take any value, so no activation is applied. An activation value of None is the default, so the demo code could have omitted the activation parameter.
After the model definition statements, the model is compiled. A loss parameter value is required, and the demo passes mean_squared_error, which is the most common choice for a regression network. A rare alternative is a cross-entropy loss, which is appropriate only when the output node value has been coerced to the range (0.0, 1.0), typically via sigmoid activation on the output node, and the training target values are also in (0.0, 1.0), typically via min-max normalization of the target (dependent) variable values.
The metrics parameter is optional; when used, it accepts a list of quantities to compute during training. The demo passes 'mse', which is a shortcut alias for mean_squared_error. Unlike a classification problem where you usually pass accuracy, in regression there's no inherent definition of accuracy.
Instead of using the Sequential() syntax, you can define layers individually and chain them together:
init = K.initializers.RandomUniform(seed=1)
simple_sgd = K.optimizers.SGD(lr=0.010)
X = K.layers.Input(shape=(13,))
net = K.layers.Dense(units=10, kernel_initializer=init,
activation='tanh')(X)
net = K.layers.Dense(units=10, kernel_initializer=init,
activation='tanh')(net)
net = K.layers.Dense(units=1, kernel_initializer=init,
activation=None)(net)
model = K.models.Model(X, net)
model.compile(loss='mean_squared_error', optimizer=simple_sgd, metrics=['mse'])
The two approaches create identical models and yield identical results, so the choice is purely one of personal coding style preference.
The demo program prepares training with these statements:
# 3. train model
batch_size= 8
max_epochs = 500
my_logger = MyLogger(int(max_epochs/5), train_x, train_y, 0.15)
The batch size and the maximum number of epochs to train are hyperparameters, and good values must be determined by trial and error. The my_logger object is instantiated to fire once every max_epochs / 5 = 500 / 5 = 100 epochs. Because of how the on_epoch_end() method is defined, this means that the current mean squared error loss and the prediction accuracy on all 404 training items will be displayed every 100 epochs. Recall that when accuracy is computed, a prediction is correct if it is within plus or minus 15 percent of the correct target value.
Training is performed by calling the fit() function:
print("Starting training ")
h = model.fit(train_x, train_y, batch_size=batch_size, epochs=max_epochs,
verbose=0, callbacks=[my_logger])
print("Training finished \n")
Notice that the demo uses the same argument value name, batch_size, as the parameter name. Some people use this style consistently, while others go out of their way to avoid using the same name.
Setting verbose = 0 suppresses all built-in progress messages during training, but because the callbacks list has the my_logger object, the demo will display the custom messages.
The fit() function returns a History object whose history attribute is a dictionary of metrics computed during training. The demo doesn't use the return value, but could have done so like this:
loss_list = h.history['loss']  # one loss value per epoch
print(loss_list)
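For example, here is a minimal sketch that plots the per-epoch loss values using the matplotlib package (matplotlib is not used by the demo program):
import matplotlib.pyplot as plt
loss_list = h.history['loss']
plt.plot(range(len(loss_list)), loss_list)
plt.xlabel("epoch")
plt.ylabel("mean squared error loss")
plt.show()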
After training, the demo program computes and prints the model's prediction accuracy:
# 4. evaluate model
acc = my_accuracy(model, train_x, train_y, 0.15)
print("Overall accuracy (within 15%%) on training data = %0.4f" % acc)
acc = my_accuracy(model, test_x, test_y, 0.15)
print("Overall accuracy on test data = %0.4f \n" % acc)
In general, the prediction accuracy on the test data should be roughly similar to the prediction accuracy on the training data. If accuracy on the test data is significantly less than accuracy on the training data, there's a good chance that you trained your model too aggressively and your model is overfitted.
In addition to displaying prediction accuracy, the demo program displays the loss values:
eval = model.evaluate(train_x, train_y, verbose=0)
print("Overall loss (mse) on training data = %0.6f" % eval[0])
eval = model.evaluate(test_x, test_y, verbose=0)
print("Overall loss (mse) on test data = %0.6f" % eval[0])
The evaluate() function returns a list of values. The first value, at index [0], is always the value of the (required) loss function specified in the compile() function, mean squared error in this case. The other values in the list are any optional metrics that were passed to the compile() function. For this example, and for regression in general, optional metrics usually aren't specified.
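If you want to see which returned value corresponds to which metric, one option is to pair the list with the model's metrics_names attribute. A minimal sketch:
ev = model.evaluate(test_x, test_y, verbose=0)
for name, val in zip(model.metrics_names, ev):  # for example 'loss', 'mean_squared_error'
  print("%s = %0.6f" % (name, val))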
Metrics include loss functions such as mean_squared_error and categorical_crossentropy, as well as five additional accuracy metrics for classification problems, shown in Table 3-1.
Table 3-1: Accuracy Metrics Functions
| Function | Description |
|---|---|
| binary_accuracy(y_true, y_pred) | For binary classification |
| categorical_accuracy(y_true, y_pred) | For multiclass classification |
| sparse_categorical_accuracy(y_true, y_pred) | Rarely used (see documentation) |
| top_k_categorical_accuracy(y_true, y_pred, k=5) | Rarely used (see documentation) |
| sparse_top_k_categorical_accuracy(y_true, y_pred, k=5) | Rarely used (see documentation) |
It's also possible to write custom, program-defined metric functions. Note that the shortcut alias acc can be used for either binary classification or multiclass classification.
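For example, here is a minimal sketch of a custom metric that mimics the demo's within-15-percent accuracy criterion; the function name near_acc is an illustrative assumption, and the metric is written with Keras backend functions so it can operate on batches of tensors during training:
from keras import backend as B

def near_acc(y_true, y_pred):
  # proportion of predictions within 15 percent of the target value
  return B.mean(B.cast(B.less(B.abs(y_pred - y_true),
    0.15 * B.abs(y_true)), 'float32'))

# model.compile(loss='mean_squared_error', optimizer=simple_sgd, metrics=[near_acc])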
The demo program saves the trained model like so:
# 5. save model
print("\nSaving Boston model to disk \n")
mp = ".\\Models\\boston_model.h5"
model.save(mp)
Keras saves models using the hierarchical data format (HDF) version 5. It's a binary format, so saved models can't be inspected with a text editor. In addition to saving an entire model, you can save just the model weights and biases, which is sometimes useful. You can also save the model architecture, but not the weights. Keras does not support saving models with the ONNX (open neural network exchange) format.
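Here is a minimal sketch of the weights-only and architecture-only alternatives (the file names are illustrative):
model.save_weights(".\\Models\\boston_weights.h5")  # weights and biases only
json_arch = model.to_json()                         # architecture only, as a JSON string
with open(".\\Models\\boston_arch.json", "w") as f:
  f.write(json_arch)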
A saved Keras model can be loaded like so:
print("Loading a saved model")
mp = ".\\Models\\boston_model.h5"
model = K.models.load_model(mp)
The demo program uses the trained model to predict the median house price for a hypothetical, previously unseen town near Boston:
# 6. use model
np.set_printoptions(precision=1)
unknown = np.full(shape=(1,13), fill_value=0.6, dtype=np.float32)
unknown[0][3] = -1.0 # binary feature
predicted = model.predict(unknown)
print("Using model to predict median house price for features: ")
print(unknown)
print("\nPredicted price is: ")
print("$%0.2f" % (predicted * 10000))
The code sets up 13 predictor variables. Recall that 12 of the 13 predictors were min-max normalized to values between 0.0 and 1.0, so when predicting, you must use min-max normalized values too. Predictor variable [3] is the Boolean next-to-river value, so it must be encoded as either -1 or +1.
The demo program uses 0.6 for all min-max normalized predictor values. In a non-demo scenario, you'd have to actually perform normalization on the raw input data, which means you need the minimum and maximum values for each normalized variable in the training data. This creates a tight coupling between the training data and the trained model. The point is that you must retain your training data.
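Here is a minimal sketch of what normalizing a new raw item might look like. The names raw_train (the retained raw training data) and raw_item (the 13 raw predictor values for the new town) are illustrative assumptions:
mins = raw_train[:, 0:13].min(axis=0)   # per-column minimums from the raw training data
maxs = raw_train[:, 0:13].max(axis=0)   # per-column maximums
unknown = (raw_item - mins) / (maxs - mins)      # min-max normalize the new item
unknown[3] = -1.0 if raw_item[3] == 0 else 1.0   # Boolean predictor uses (-1, +1) encoding
unknown = unknown.reshape(1, 13).astype(np.float32)
predicted = model.predict(unknown)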
When performing neural regression, you usually want to normalize your data. The number of hidden layers, and the number of nodes in each hidden layer, are hyperparameters that must be determined by trial and error. The two most common hidden layer activation functions are tanh and relu. The output layer should have a single node, and its activation function should be set to None, except in unusual scenarios.
Because there is no inherent definition of accuracy for a regression problem, you must define your own accuracy function. In order to monitor prediction accuracy during training, you can implement a custom callback class and pass it as an argument to the fit() function. Common training optimizer functions for regression problems are stochastic gradient descent and Adam, but other optimizers often perform better.
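For example, here is a minimal sketch of compiling the same model with the Adam optimizer instead of SGD:
adam = K.optimizers.Adam(lr=0.01)  # the Keras default learning rate for Adam is 0.001
model.compile(loss='mean_squared_error', optimizer=adam, metrics=['mse'])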
The 506-item normalized data used by the demo program can be found here.
The demo program uses a custom accuracy metric (for regression). You can find information about built-in accuracy metrics (for classification) here.