CHAPTER 5
A regression problem is one where the goal is to predict a single numeric value. For example, you might want to predict the college GPA of an incoming freshman based on their high school GPA, parent's annual income, and so on.
Designing and training a neural network for a regression problem is quite similar to designing and training one for a classification problem. The key differences are that a regression network has a single output node, uses identity activation rather than softmax activation, and requires a different approach to measuring accuracy.

Figure 5-1: Neural Network Regression on the Boston Housing Dataset
The screenshot in Figure 5-1 shows a demo of neural network regression. The goal is to predict the median house price of a town near Boston. The data comes from a 1978 research paper, so the median house prices are very small (between $5,000 and $50,000).
The Boston Dataset is a well-known collection containing 506 data items. Each item represents one of 506 towns near Boston. There are 13 predictor variables, including things such as the crime rate in the town and whether the town is next to a river (0 = no, 1 = yes).
The demo program sets up a 13-100-1 neural network. The network has just one output node because the goal is to predict a single numeric value. The network has 100 hidden nodes. The number of hidden nodes is a hyperparameter that must be determined by trial and error.
Before training, the 506-item data set was randomly split into a 406-item set for training and a 100-item set for testing. The training data set was normalized and encoded. The demo program uses the online back-propagation training technique for 5,000 epochs.
After training, the model accuracy on the training data was 86.45% (351 out of 406) and was 78.00% (78 out of 100) on the test data. Here, a correct median house price prediction is one that is within 15% of the true price. For example, if a town's median house price is $10,000, then any prediction from $8,500 to $11,500 would be considered correct.
The demo concludes by making a prediction for a new, previously unseen town. A set of 13 raw values was set up and then normalized using the normalization parameters from the training data. The predicted median house price of the new town is $21,279.11.
The Boston Dataset has a total of 14 variables: the 13 predictor variables, such as the per-capita crime rate, the percentage of land zoned for industrial use, and adjacency to a river, plus the median house price variable to predict.

Figure 5-2: Partial Boston House Data
Because there are 14 variables, it's not possible to visualize the data set, but you can get a rough idea of the data from the graph in Figure 5-2. The graph shows median house price in a town as a function of the percent area in the town zoned for industrial use.
The industry variable, by itself, cannot make a good prediction of the median house price. For example, if you knew a town had 18% of land area zoned for industrial use, the median house price could be anything between $5,000 and $25,000 (house prices were very low in the 1970s).
The raw data looks like the following.
0.19802 80 10.59 0 0.489 6.182 42.4 3.9454 4 277 18.6 393.63 9.47 25
There is no missing data. All of the 13 predictor variables, except for the binary adjacent-to-river variable, were min-max normalized. The adjacent-to-river variable was -1 +1 encoded (-1 = no, +1 = yes).
The variable to predict, median house price, was already normalized by dividing by 1,000. For example, in the sample raw data shown previously, the median house price is $25,000. I re-normalized median house price by dividing by 10 because smaller values are usually easier for neural regression to predict.
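To make the scaling concrete, the round trip for the target variable looks like this, and the multiplication by 10,000 is exactly how the demo converts a prediction back to dollars at the end of main().

// scaling round trip for the median house price variable:
let rawMedv = 25;                // raw value, in thousands of dollars ($25,000)
let normMedv = rawMedv / 10;     // 2.5 -- the value the network is trained on
let dollars = normMedv * 10000;  // 25000 -- how main() converts a prediction back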
After normalization and encoding, the data looks like the following.
0.0021 0.0534 0.3713 -1 0.2139 0.5022 0.4067 0.2560 0.1304 0.1717 0.6382 0.9917 0.2135 2.50
Because there is only one categorical predictor (adjacent-to-river), and it was -1 +1 encoded, the normalized data has 13 predictor variables, just like the raw data. After the 406-item training data was normalized and encoded, the 100-item test data was processed in the same way: it was normalized using the min-max parameters computed from the training data, its adjacent-to-river values were -1 +1 encoded, and its median house prices were divided by 10.
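As a concrete sketch of that normalization step, one predictor column can be min-max normalized like this. The key point is that the minimum and maximum come from the training data only and are then reused for the test data. The helper names here are illustrative and are not part of the demo's utilities library.

// illustrative min-max normalization of one predictor column;
// the min and max are computed from the training data only.
function columnMinMax(data, col) {
  let min = data[0][col], max = data[0][col];
  for (let i = 1; i < data.length; ++i) {
    if (data[i][col] < min) min = data[i][col];
    if (data[i][col] > max) max = data[i][col];
  }
  return [min, max];
}

// let [lo, hi] = columnMinMax(trainRaw, 2);               // training parameters
// trainRaw.forEach(r => r[2] = (r[2] - lo) / (hi - lo));  // normalize training data
// testRaw.forEach(r => r[2] = (r[2] - lo) / (hi - lo));   // reuse for test data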
The complete demo program shown in Figure 5-1 is presented in Code Listing 5-1. The demo assumes there is a top-level directory named JavaScript that contains subdirectories named Utilities and Boston. The Utilities directory contains the utilities_lib.js library file. The Boston directory contains the demo program boston_regression.js and subdirectories named Data and Models. The Data directory contains files boston_train.txt and boston_test.txt. The Models directory is used to store the trained model's weight and bias values.
Code Listing 5-1: The Boston Dataset Regression Demo Program Source Code
// boston_regression.js
// ES6

let U = require("../Utilities/utilities_lib.js");
let FS = require("fs");

// =============================================================================

class NeuralNet
{
  constructor(numInput, numHidden, numOutput, seed)
  {
    this.rnd = new U.Erratic(seed);

    this.ni = numInput;
    this.nh = numHidden;
    this.no = numOutput;

    this.iNodes = U.vecMake(this.ni, 0.0);
    this.hNodes = U.vecMake(this.nh, 0.0);
    this.oNodes = U.vecMake(this.no, 0.0);

    this.ihWeights = U.matMake(this.ni, this.nh, 0.0);
    this.hoWeights = U.matMake(this.nh, this.no, 0.0);

    this.hBiases = U.vecMake(this.nh, 0.0);
    this.oBiases = U.vecMake(this.no, 0.0);

    this.initWeights();
  }

  initWeights()
  {
    let lo = -0.01;
    let hi = 0.01;
    for (let i = 0; i < this.ni; ++i) {
      for (let j = 0; j < this.nh; ++j) {
        this.ihWeights[i][j] = (hi - lo) * this.rnd.next() + lo;
      }
    }
    for (let j = 0; j < this.nh; ++j) {
      for (let k = 0; k < this.no; ++k) {
        this.hoWeights[j][k] = (hi - lo) * this.rnd.next() + lo;
      }
    }
  }

  eval(X)
  {
    // regression: no output activation.
    let hSums = U.vecMake(this.nh, 0.0);
    let oSums = U.vecMake(this.no, 0.0);

    this.iNodes = X;

    for (let j = 0; j < this.nh; ++j) {
      for (let i = 0; i < this.ni; ++i) {
        hSums[j] += this.iNodes[i] * this.ihWeights[i][j];
      }
      hSums[j] += this.hBiases[j];
      this.hNodes[j] = U.hyperTan(hSums[j]);
    }

    for (let k = 0; k < this.no; ++k) {
      for (let j = 0; j < this.nh; ++j) {
        oSums[k] += this.hNodes[j] * this.hoWeights[j][k];
      }
      oSums[k] += this.oBiases[k];
    }

    // this.oNodes = U.softmax(oSums);
    for (let k = 0; k < this.no; ++k) {  // aka "Identity"
      this.oNodes[k] = oSums[k];
    }

    let result = [];
    for (let k = 0; k < this.no; ++k) {
      result[k] = this.oNodes[k];
    }
    return result;
  } // eval()

  setWeights(wts)
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let p = 0;

    for (let i = 0; i < this.ni; ++i) {
      for (let j = 0; j < this.nh; ++j) {
        this.ihWeights[i][j] = wts[p++];
      }
    }

    for (let j = 0; j < this.nh; ++j) {
      this.hBiases[j] = wts[p++];
    }

    for (let j = 0; j < this.nh; ++j) {
      for (let k = 0; k < this.no; ++k) {
        this.hoWeights[j][k] = wts[p++];
      }
    }

    for (let k = 0; k < this.no; ++k) {
      this.oBiases[k] = wts[p++];
    }
  } // setWeights()

  getWeights()
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let numWts = (this.ni * this.nh) + this.nh + (this.nh * this.no) + this.no;
    let result = U.vecMake(numWts, 0.0);
    let p = 0;

    for (let i = 0; i < this.ni; ++i) {
      for (let j = 0; j < this.nh; ++j) {
        result[p++] = this.ihWeights[i][j];
      }
    }

    for (let j = 0; j < this.nh; ++j) {
      result[p++] = this.hBiases[j];
    }

    for (let j = 0; j < this.nh; ++j) {
      for (let k = 0; k < this.no; ++k) {
        result[p++] = this.hoWeights[j][k];
      }
    }

    for (let k = 0; k < this.no; ++k) {
      result[p++] = this.oBiases[k];
    }
    return result;
  } // getWeights()

  shuffle(v)
  {
    // Fisher-Yates
    let n = v.length;
    for (let i = 0; i < n; ++i) {
      let r = this.rnd.nextInt(i, n);
      let tmp = v[r];
      v[r] = v[i];
      v[i] = tmp;
    }
  }

  train(trainX, trainY, lrnRate, maxEpochs)
  {
    // regression: no output activation => f(x)=x => f'(x)=1
    let hoGrads = U.matMake(this.nh, this.no, 0.0);
    let obGrads = U.vecMake(this.no, 0.0);
    let ihGrads = U.matMake(this.ni, this.nh, 0.0);
    let hbGrads = U.vecMake(this.nh, 0.0);

    let oSignals = U.vecMake(this.no, 0.0);
    let hSignals = U.vecMake(this.nh, 0.0);

    let n = trainX.length;  // 406
    let indices = U.arange(n);  // [0,1,..,405]
    let freq = Math.trunc(maxEpochs / 10);

    for (let epoch = 0; epoch < maxEpochs; ++epoch) {
      this.shuffle(indices);

      for (let ii = 0; ii < n; ++ii) {  // each item
        let idx = indices[ii];
        let X = trainX[idx];
        let Y = trainY[idx];
        this.eval(X);  // output stored in this.oNodes.

        // compute output node signals.
        for (let k = 0; k < this.no; ++k) {
          // let derivative = (1 - this.oNodes[k]) * this.oNodes[k];  // softmax
          let derivative = 1;  // identity activation
          oSignals[k] = derivative * (this.oNodes[k] - Y[k]);  // E=(t-o)^2
        }

        // compute hidden-to-output weight gradients using output signals.
        for (let j = 0; j < this.nh; ++j) {
          for (let k = 0; k < this.no; ++k) {
            hoGrads[j][k] = oSignals[k] * this.hNodes[j];
          }
        }

        // compute output node bias gradients using output signals.
        for (let k = 0; k < this.no; ++k) {
          obGrads[k] = oSignals[k] * 1.0;  // 1.0 dummy input can be dropped.
        }

        // compute hidden node signals.
        for (let j = 0; j < this.nh; ++j) {
          let sum = 0.0;
          for (let k = 0; k < this.no; ++k) {
            sum += oSignals[k] * this.hoWeights[j][k];
          }
          let derivative = (1 - this.hNodes[j]) * (1 + this.hNodes[j]);  // tanh
          hSignals[j] = derivative * sum;
        }

        // compute input-to-hidden weight gradients using hidden signals.
        for (let i = 0; i < this.ni; ++i) {
          for (let j = 0; j < this.nh; ++j) {
            ihGrads[i][j] = hSignals[j] * this.iNodes[i];
          }
        }

        // compute hidden node bias gradients using hidden signals.
        for (let j = 0; j < this.nh; ++j) {
          hbGrads[j] = hSignals[j] * 1.0;  // 1.0 dummy input can be dropped.
        }

        // update input-to-hidden weights.
        for (let i = 0; i < this.ni; ++i) {
          for (let j = 0; j < this.nh; ++j) {
            let delta = -1.0 * lrnRate * ihGrads[i][j];
            this.ihWeights[i][j] += delta;
          }
        }

        // update hidden node biases.
        for (let j = 0; j < this.nh; ++j) {
          let delta = -1.0 * lrnRate * hbGrads[j];
          this.hBiases[j] += delta;
        }

        // update hidden-to-output weights.
        for (let j = 0; j < this.nh; ++j) {
          for (let k = 0; k < this.no; ++k) {
            let delta = -1.0 * lrnRate * hoGrads[j][k];
            this.hoWeights[j][k] += delta;
          }
        }

        // update output node biases.
        for (let k = 0; k < this.no; ++k) {
          let delta = -1.0 * lrnRate * obGrads[k];
          this.oBiases[k] += delta;
        }
      } // ii

      if (epoch % freq == 0) {
        let mse = this.meanSqErr(trainX, trainY).toFixed(4);
        let acc = this.accuracy(trainX, trainY, 0.15).toFixed(4);

        let s1 = "epoch: " + epoch.toString();
        let s2 = " MSE = " + mse.toString();
        let s3 = " acc = " + acc.toString();

        console.log(s1 + s2 + s3);
      }
    } // epoch
  } // train()

  // cross-entropy error not applicable to regression problems.

  meanSqErr(dataX, dataY)
  {
    let sumSE = 0.0;
    for (let i = 0; i < dataX.length; ++i) {  // each data item
      let X = dataX[i];
      let y = dataY[i];  // target output like [2.3]
      let oupt = this.eval(X);  // computed output like [2.07]
      let err = y[0] - oupt[0];
      sumSE += err * err;
    }
    return sumSE / dataX.length;
  }

  accuracy(dataX, dataY, pctClose)
  {
    // correct if predicted is within pctClose of target.
    let nc = 0; let nw = 0;
    for (let i = 0; i < dataX.length; ++i) {  // each data item
      let X = dataX[i];
      let y = dataY[i];  // target output
      let oupt = this.eval(X);  // computed output
      if (Math.abs(oupt[0] - y[0]) < Math.abs(pctClose * y[0])) {
        ++nc;
      }
      else {
        ++nw;
      }
    }
    return nc / (nc + nw);
  }

  saveWeights(fn)
  {
    let wts = this.getWeights();
    let n = wts.length;
    let s = "";
    for (let i = 0; i < n - 1; ++i) {
      s += wts[i].toString() + ",";
    }
    s += wts[n - 1];
    FS.writeFileSync(fn, s);
  }

  loadWeights(fn)
  {
    let n = (this.ni * this.nh) + this.nh + (this.nh * this.no) + this.no;
    let wts = U.vecMake(n, 0.0);
    let all = FS.readFileSync(fn, "utf8");
    let strVals = all.split(",");
    let nn = strVals.length;
    if (n != nn) {
      throw ("Size error in NeuralNet.loadWeights()");
    }
    for (let i = 0; i < n; ++i) {
      wts[i] = parseFloat(strVals[i]);
    }
    this.setWeights(wts);
  }
} // NeuralNet

// =============================================================================

function main()
{
  process.stdout.write("\033[0m");  // reset
  process.stdout.write("\x1b[1m" + "\x1b[37m");  // bright white
  console.log("\nBegin Boston Data regression ");

  // 1. load data
  let trainX = U.loadTxt(".\\Data\\boston_train.txt", "\t",
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
  let trainY = U.loadTxt(".\\Data\\boston_train.txt", "\t", [13]);
  let testX = U.loadTxt(".\\Data\\boston_test.txt", "\t",
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
  let testY = U.loadTxt(".\\Data\\boston_test.txt", "\t", [13]);

  // 2. create network
  console.log("\nCreating a 13-100-1 tanh, Identity NN for Boston dataset");
  let seed = 0;
  let nn = new NeuralNet(13, 100, 1, seed);

  // 3. train network
  let lrnRate = 0.01;
  let maxEpochs = 5000;
  console.log("\nStarting training with learning rate = 0.01 ");
  nn.train(trainX, trainY, lrnRate, maxEpochs);
  console.log("Training complete");

  // 4. evaluate model
  let trainAcc = nn.accuracy(trainX, trainY, 0.15);
  let testAcc = nn.accuracy(testX, testY, 0.15);
  console.log("\nAccuracy on training data = " + trainAcc.toFixed(4).toString());
  console.log("Accuracy on test data = " + testAcc.toFixed(4).toString());

  // 5. save model
  let fn = ".\\Models\\boston_wts.txt";
  console.log("\nSaving model weights and biases to: ");
  console.log(fn);
  nn.saveWeights(fn);

  // 6. use trained model
  let unknownRaw = [0.04819, 80, 3.64, 0, 0.392, 6.108, 32, 9.2203, 1, 315,
    16.4, 392.89, 6.57];
  let unknownNorm = [0.000471, 0.800000, 0.116569, -1, 0.014403, 0.488025,
    0.299691, 0.735726, 0.000000, 0.244275, 0.404255, 0.989889, 0.133554];
  let predicted = nn.eval(unknownNorm);

  console.log("\nRaw features of town to predict: ");
  U.vecShow(unknownRaw, 4, 7);
  console.log("\nNormalized features of town to predict: ");
  U.vecShow(unknownNorm, 4, 7);
  console.log("\nPredicted median house price of town: ");
  U.vecShow(predicted, 6, 1);  // predicted is a vector.
  let predPrice = predicted[0] * 10000;
  console.log("( $" + predPrice.toFixed(2).toString() + " )");

  process.stdout.write("\033[0m");  // reset
  console.log("\nEnd demo");
} // main()

main();
All of the control logic is contained in a top-level main() function. Program execution begins by loading the training and test data into memory.
// 1. load data
let trainX = U.loadTxt(".\\Data\\boston_train.txt", "\t",
[0,1,2,3,4,5,6,7,8,9,10,11,12]);
let trainY = U.loadTxt(".\\Data\\boston_train.txt", "\t", [13]);
let testX = U.loadTxt(".\\Data\\boston_test.txt", "\t",
[0,1,2,3,4,5,6,7,8,9,10,11,12]);
let testY = U.loadTxt(".\\Data\\boston_test.txt", "\t", [13]);
. . .
The loadTxt() function is defined in the utilities_lib.js library file, which is located in the Utilities directory. Notice that the data files are tab-delimited.
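The implementation of loadTxt() isn't shown in this chapter. A minimal sketch of how such a function could be written (not necessarily how the actual utilities_lib.js code works) is the following.

// minimal sketch of a loadTxt()-style function (the real utilities_lib.js
// implementation may differ): read a delimited text file and return the
// requested columns as a matrix of numbers.
function loadTxtSketch(fn, delimit, usecols) {
  let FS = require("fs");
  let all = FS.readFileSync(fn, "utf8");
  let lines = all.trim().split("\n");
  let result = [];
  for (let i = 0; i < lines.length; ++i) {
    let tokens = lines[i].trim().split(delimit);
    let row = [];
    for (let j = 0; j < usecols.length; ++j) {
      row.push(parseFloat(tokens[usecols[j]]));
    }
    result.push(row);
  }
  return result;
}

Next, a neural network is created.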
// 2. create network
console.log("\nCreating a 13-100-1 tanh, Identity NN for Boston dataset");
let seed = 0;
let nn = new NeuralNet(13, 100, 1, seed);
. . .
The normalized and encoded data has 13 predictor variables. There is just one output node because the goal is to predict a single numeric value. The number of hidden nodes is a hyperparameter that must be determined by trial and error. In general, more hidden nodes can produce a more accurate model at the expense of an increased risk of model overfitting.
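Because the network is fully connected, the total number of weights and biases follows directly from the 13-100-1 architecture, using the same formula that appears in the getWeights() and loadWeights() methods.

// total number of weights and biases for the 13-100-1 network,
// using the same formula as getWeights() and loadWeights():
let ni = 13, nh = 100, no = 1;
let numWts = (ni * nh) + nh + (nh * no) + no;  // 1300 + 100 + 100 + 1
console.log(numWts);  // 1501

Next, the network is trained.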
// 3. train network
let lrnRate = 0.01;
let maxEpochs = 5000;
console.log("\nStarting training with learning rate = 0.01 ");
nn.train(trainX, trainY, lrnRate, maxEpochs);
console.log("Training complete");
. . .
The learning rate and maximum number of training epochs are hyperparameters. Next, the trained model is evaluated.
// 4. evaluate model
let trainAcc = nn.accuracy(trainX, trainY, 0.15);
let testAcc = nn.accuracy(testX, testY, 0.15);
console.log("\nAccuracy on training data = " +
trainAcc.toFixed(4).toString());
console.log("Accuracy on test data = " +
testAcc.toFixed(4).toString());
. . .
The 0.15 argument passed to the accuracy() method specifies how close a predicted median house price must be to the true target price in order to be counted as correct. The closeness parameter will vary from problem to problem.
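For example, the closeness test inside the accuracy() method, applied to a single prediction, works like this (the values shown are illustrative only).

// illustrative closeness check, mirroring the test inside accuracy():
let pctClose = 0.15;
let target = 2.50;     // normalized target, $25,000
let predicted = 2.30;  // normalized prediction, $23,000
let correct = Math.abs(predicted - target) < Math.abs(pctClose * target);
console.log(correct);  // true, because |2.30 - 2.50| = 0.20 < 0.375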
Next, the trained model's weights and biases are saved to a text file.
// 5. save model
let fn = ".\\Models\\boston_wts.txt";
console.log("\nSaving model weights and biases to: ");
console.log(fn);
nn.saveWeights(fn);
. . .
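The saved weight and bias values can be loaded back into a network later using the loadWeights() method of the NeuralNet class. A short usage sketch, assuming the test data has been loaded as in step 1:

// later, or in a different program: re-create the network and restore weights.
// the 13-100-1 architecture must match the one that was saved.
let nn2 = new NeuralNet(13, 100, 1, 0);
nn2.loadWeights(".\\Models\\boston_wts.txt");
let acc2 = nn2.accuracy(testX, testY, 0.15);  // same test accuracy as before saving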
The demo program concludes by making a prediction for a new, previously unseen hypothetical town near Boston.
// 6. use trained model
let unknownRaw = [0.04819, 80, 3.64, 0, 0.392, 6.108, 32, 9.2203, 1, 315,
16.4, 392.89, 6.57];
let unknownNorm = [0.000471, 0.800000, 0.116569, -1, 0.014403, 0.488025,
0.299691, 0.735726, 0.000000, 0.244275, 0.404255, 0.989889, 0.133554];
let predicted = nn.eval(unknownNorm);
console.log("\nRaw features of town to predict: ");
U.vecShow(unknownRaw, 4, 7);
console.log("\nNormalized features of town to predict: ");
U.vecShow(unknownNorm, 4, 7);
console.log("\nPredicted median house price of town: ");
U.vecShow(predicted, 6, 1);
let predPrice = predicted[0] * 10000;
console.log("( $" + predPrice.toFixed(2).toString() + " )" );
The demo program normalizes and encodes the new town item offline. An alternative that is useful when making many predictions is to write a problem-specific function to normalize and encode. For example, using such a function might look like the following.
let unknownRaw = [0.04819, 80, 3.64, 0, 0.392, (etc.)];
let unkNorm = normAndEncode(unknownRaw);
let predicted = nn.eval(unkNorm);
Note that a program-defined function to normalize and encode raw data needs to know the normalization parameters, such as min and max, if min-max normalization is used.
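A minimal sketch of such a normAndEncode() function for the Boston data could look like the following. The trainMins and trainMaxs values shown are hypothetical placeholders; the real values must be the per-column minimums and maximums computed from the 406-item training data.

// minimal sketch of a problem-specific normAndEncode() function.
// the trainMins and trainMaxs values are hypothetical placeholders.
function normAndEncode(raw) {
  let trainMins = [0.006, 0, 0.5, 0, 0.38, 3.5, 2.9,
    1.1, 1, 187, 12.6, 0.3, 1.7];          // placeholders
  let trainMaxs = [89.0, 100, 27.7, 1, 0.87, 8.8, 100,
    12.1, 24, 711, 22.0, 396.9, 38.0];     // placeholders
  let result = [];
  for (let j = 0; j < raw.length; ++j) {
    if (j == 3) {  // adjacent-to-river: -1 +1 encode
      result[j] = (raw[j] == 0) ? -1 : 1;
    }
    else {  // min-max normalize using training-data parameters
      result[j] = (raw[j] - trainMins[j]) / (trainMaxs[j] - trainMins[j]);
    }
  }
  return result;
}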
Because the predicted output value is a house price divided by 10,000, the demo program wraps up by displaying the predicted price in a slightly more friendly format.
A neural network classifier has multiple output nodes and uses softmax activation on the output nodes so that their values sum to 1.0 and can be loosely interpreted as probabilities. In back-propagation training, the derivative of the function used for output-layer activation is used to compute the gradients, which in turn are used to update the network's weights and biases.
A neural network for regression uses just a single output node and uses the identity function for output layer activation. The identity function is just f(x) = x, or in other words, the identity function doesn't do anything. So it's also somewhat correct to say that a neural network for regression has no output layer activation function.
If y = softmax(x), then the calculus derivative is y' = y * (1 - y). If y = x (the identity function), then the calculus derivative is the constant value 1.
In method train(), the output node signals are computed using the following statements.
// compute output node signals.
for (let k = 0; k < this.no; ++k) {
// let derivative = (1 - this.oNodes[k]) * this.oNodes[k]; // softmax
let derivative = 1; // identity activation
oSignals[k] = derivative * (this.oNodes[k] - Y[k]); // E=(t-o)^2
}
I left the statement used for a neural network classifier in as a comment so you can see the difference between softmax activation and identity activation.
In the eval() method, the code for regression is the following.
. . .
// compute output node before activation.
for (let k = 0; k < this.no; ++k) {
for (let j = 0; j < this.nh; ++j) {
oSums[k] += this.hNodes[j] * this.hoWeights[j][k];
}
oSums[k] += this.oBiases[k];
}
// this.oNodes = U.softmax(oSums); // for classifier
for (let k = 0; k < this.no; ++k) { // aka "Identity"
this.oNodes[k] = oSums[k]; // copy as-is
}
No activation function is applied when performing regression. There are some very rare situations in which you might want to apply an activation function on the output node for a regression problem. For example, if you applied y = x^2 as an output activation, then the calculus derivative is y' = 2x, and you'd apply that derivative in the training method. Additionally, note that, except for rare situations, neural networks for regression cannot use cross-entropy error because the output is not a probability distribution.
Momentum is a technique intended to speed up training. Momentum can be applied to a neural network classification problem or a regression problem. The screenshot in Figure 5-3 shows an example of training with momentum applied to the Boston Dataset problem.
If you compare the training progress accuracy without momentum (from Figure 5-1) and with momentum, this is the result.
epoch    no momentum    with momentum
-------------------------------------
    0       0.4360         0.5099
  500       0.7759         0.7931
 1000       0.7833         0.8079
 1500       0.7882         0.8128
 . . .
You'll notice that the prediction accuracy for training with momentum improves slightly faster than training without momentum.
Recall that during training, each weight in the network is iteratively updated a little bit so that the computed output values get closer and closer to the known correct output values contained in the training data. The idea of momentum is that, on each update, a small bonus based on the previous update is added, so a weight that keeps moving in the same direction tends to move faster, while a weight whose update direction reverses is slowed down. The demo program implements momentum by adding a momentum rate parameter to the train() method.
train(trainX, trainY, lrnRate, maxEpochs, momentRate)
{
let hoGrads = U.matMake(this.nh, this.no, 0.0);
. . .
The train() method maintains the value of the weight deltas from the previous iteration.
// for momentum
let ihPrevWtsDeltas = U.matMake(this.ni, this.nh, 0.0);
let hPrevBiasDeltas = U.vecMake(this.nh, 0.0);
let hoPrevWtsDeltas = U.matMake(this.nh, this.no, 0.0);
let oPrevBiasDeltas = U.vecMake(this.no, 0.0);

Figure 5-3: Momentum Training Demo Run
Then, on each update, a small portion of the previous delta is added as a bonus. For example, the input-to-hidden weights are updated like the following.
// update input-to-hidden weights.
for (let i = 0; i < this.ni; ++i) {
for (let j = 0; j < this.nh; ++j) {
let delta = -1.0 * lrnRate * ihGrads[i][j];
this.ihWeights[i][j] += delta;
this.ihWeights[i][j] += momentRate * ihPrevWtsDeltas[i][j]; // add a bonus.
ihPrevWtsDeltas[i][j] = delta; // save the delta for next iteration.
}
}
The call to the train() method looks like the following.
// 3. train network
let lrnRate = 0.015;
let maxEpochs = 5000;
let momentRate = 0.20; // momentum typically 0.90.
console.log("\nStarting training with learn rate = 0.015 momentum = 0.20");
nn.train(trainX, trainY, lrnRate, maxEpochs, momentRate);
console.log("Training complete");
When momentum works, it can speed up training and improve accuracy. A disadvantage is that it introduces a new hyperparameter to deal with: the momentum rate. The demo uses a momentum rate of 0.20, but a value like 0.90 is more typical. Many deep neural network libraries support an advanced form of momentum called Nesterov momentum.
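Nesterov momentum isn't implemented by the demo program. Conceptually, it evaluates the gradient at a look-ahead position instead of at the current weights. A minimal single-weight sketch (minimizing a toy function, not training the Boston network) is the following.

// toy single-weight example of Nesterov-style momentum (not the demo code):
// minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
function gradient(w) { return 2.0 * (w - 3.0); }

let w = 0.0;       // the weight being optimized
let v = 0.0;       // velocity, i.e., the previous delta
let mu = 0.90;     // momentum rate
let lr = 0.05;     // learning rate
for (let it = 0; it < 100; ++it) {
  // classical momentum would use gradient(w); Nesterov uses the
  // gradient at the look-ahead point w + mu * v.
  v = mu * v - lr * gradient(w + mu * v);
  w += v;
}
console.log(w.toFixed(4));  // close to 3.0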
Batch training is a technique that used to be the most common approach for neural network training, but is now used less often. The screenshot in Figure 5-4 shows an example of batch training applied to the Boston Dataset problem.
The most common technique for neural network training is called online training. Online training updates weights after processing each training item. Batch training processes all training items, and then updates weights. Batch training can be used for classification or regression problems.
In pseudo-code, the two training approaches are the following.
// online training
for-each training item
compute the gradients
use gradients from item to update weights
end-loop
// batch training
for-each training item
compute the gradients
accumulate the gradients
end-loop
use average of accumulated gradients from all items to update weights
In the early days of neural networks, batch training was very common because it is more principled. The online training approach uses a single training item to estimate the true gradient. The batch approach uses all training items to estimate the true gradient.

Figure 5-4: Batch Training on the Boston Dataset
But in practice, online training usually, but not always, proved to work better than batch training for simple, single-hidden-layer neural networks. However, a variation of batch training, called mini-batch training, is now the most common technique used for deep neural networks.
The demo program implements batch training using the same method signature as online training.
batchTrain(trainX, trainY, lrnRate, maxEpochs)
{
let oSignals = U.vecMake(this.no, 0.0);
let hSignals = U.vecMake(this.nh, 0.0);
let n = trainX.length; // 406
let indices = U.arange(n); // [0,1,..,405]
let freq = Math.trunc(maxEpochs / 10);
. . .
In online training, you declare a method-scope set of matrices and vectors to hold the gradients for the weights and biases. But in batch training, the matrices and vectors that hold the accumulated gradients are instantiated inside the epoch loop, just before the loop that iterates through each training item.
for (let epoch = 0; epoch < maxEpochs; ++epoch) {
// this.shuffle(indices); // no shuffle needed for batch training.
let ihWtsAccGrads = U.matMake(this.ni, this.nh, 0.0); // accumulated grads
let hBiasAccGrads = U.vecMake(this.nh, 0.0);
let hoWtsAccGrads = U.matMake(this.nh, this.no, 0.0);
let oBiasAccGrads = U.vecMake(this.no, 0.0);
. . .
You could declare the storage for accumulated gradients outside the epoch loop, but then you'd have to zero out all of the values at the start of each epoch. Because batch training processes all training items before updating any weights, it's not necessary to visit the training items in random order; any visiting order produces identical accumulated gradients.
Batch training begins with the forward pass in the same way as online training.
for (let ii = 0; ii < n; ++ii) { // each item
let idx = indices[ii];
let X = trainX[idx];
let Y = trainY[idx];
this.eval(X); // output stored in this.oNodes.
. . .
After the eval() method is called, although gradients are computed in the usual way, they are then accumulated rather than used immediately. For example, the hidden-to-output gradients are accumulated like in the following.
// compute output node signals.
for (let k = 0; k < this.no; ++k) {
let derivative = 1; // identity activation.
oSignals[k] = derivative * (this.oNodes[k] - Y[k]); // E=(t-o)^2
}
// compute and accumulate hidden-to-output weight gradients.
for (let j = 0; j < this.nh; ++j) {
for (let k = 0; k < this.no; ++k) {
let grad = oSignals[k] * this.hNodes[j];
hoWtsAccGrads[j][k] += grad; // accumulate -- don't use yet.
}
}
After the hidden-to-output weight gradients, the output node bias gradients, the input-to-hidden weight gradients, and the hidden node bias gradients have been computed and accumulated, the loop iterating through all training items terminates, and then the accumulated gradients are used to update the weights and biases. For example, the input-to-hidden weights are updated like in the following.
. . .
} // ii end-each item
// update input-to-hidden weights.
for (let i = 0; i < this.ni; ++i) {
for (let j = 0; j < this.nh; ++j) {
let delta = -1.0 * lrnRate * (ihWtsAccGrads[i][j] / n); // average grad
this.ihWeights[i][j] += delta;
}
}
In the code, the variable n is an alias for the number of training items and is used to compute the average gradient over all training items. Some neural network library implementations use the raw accumulated gradients (not averaged), but using the average is better because it doesn't introduce a dependency on the size of the training data.
In addition to online and batch training, there is a third technique called mini-batch training. Mini-batch training is a variation of (full) batch training. A mini-batch is just a subset of the training data, typically something like 10 or 16 items. In pseudo-code, mini-batch training looks like the following.
for-each epoch
for-each mini-batch
fetch next mini-batch
use batch training technique on the mini-batch subset
end each mini-batch
end each-epoch
Although mini-batch training is conceptually simple, the implementation details are a bit messy. The source code repository has an implementation of mini-batch training on the Boston Dataset problem that you can examine and run.
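As a rough illustration (not the repository implementation), the shuffled training indices can be split into mini-batches like this, with each mini-batch then processed exactly as in full-batch training.

// minimal sketch: split shuffled training indices into mini-batches.
// the batch size of 16 is illustrative; the repository code may differ.
function makeBatches(indices, batchSize) {
  let batches = [];
  for (let start = 0; start < indices.length; start += batchSize) {
    batches.push(indices.slice(start, start + batchSize));
  }
  return batches;  // the last batch may be smaller than batchSize
}

// inside each epoch: shuffle the indices, form the batches, and then, for
// each batch, accumulate gradients over just that batch's items and update
// the weights using the averaged gradients, as in full-batch training.
// let batches = makeBatches(indices, 16);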
1. True or false?