CHAPTER 3

The Perceptron


Presentation

The Perceptron is an algorithm invented in 1957 by Frank Rosenblatt, a few years before the first SVM. It is widely known because it is the building block of a simple neural network: the multilayer perceptron. The goal of the Perceptron is to find a hyperplane that can separate a linearly separable data set. Once the hyperplane is found, it is used to perform binary classification.

Given augmented vectors $\mathbf{x}=(x_0,x_1,\dots,x_n)$ and $\mathbf{w}=(w_0,w_1,\dots,w_n)$, the Perceptron uses the same hypothesis function we saw in the previous chapter to classify a data point $\mathbf{x}_i$:

\[
h(\mathbf{x}_i)=\text{sign}(\mathbf{w}\cdot\mathbf{x}_i)
\]
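As a quick illustration (a minimal sketch with made-up numbers, not taken from the book's dataset), the augmented component $x_0=1$ lets the bias term live inside $\mathbf{w}$ as $w_0$, so the whole hypothesis reduces to a single dot product:

import numpy as np

# A 2-D point, a weight vector, and a bias term b.
x = np.array([3.0, 4.0])
weights = np.array([2.0, 1.0])
b = -8.0

# Augmented vectors: prepend 1 to x and b to the weights.
x_augmented = np.insert(x, 0, 1.0)        # (1, 3, 4)
w_augmented = np.insert(weights, 0, b)    # (-8, 2, 1)

# Both formulations give the same classification.
print(np.sign(np.dot(weights, x) + b))            # 1.0
print(np.sign(np.dot(w_augmented, x_augmented)))  # 1.0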

The Perceptron learning algorithm

Given a training set $\mathcal{D}$ of $m$ $n$-dimensional training examples $(\mathbf{x}_i,y_i)$, the Perceptron Learning Algorithm (PLA) tries to find a hypothesis function $h$ that predicts the label $y_i$ of every $\mathbf{x}_i$ correctly.

The hypothesis function of the Perceptron is $h(\mathbf{x})=\text{sign}(\mathbf{w}\cdot\mathbf{x})$, and we saw that $\mathbf{w}\cdot\mathbf{x}$ is just the equation of a hyperplane. We can then say that the set $\mathcal{H}$ of hypothesis functions is the set of $(n-1)$-dimensional hyperplanes ($n-1$ because a hyperplane has one dimension less than its ambient space).

What is important to understand here is that the only unknown value is $\mathbf{w}$: the goal of the algorithm is to find a value for $\mathbf{w}$. Once you find $\mathbf{w}$, you have a hyperplane. There is an infinite number of hyperplanes (you can give any value to $\mathbf{w}$), so there is an infinity of hypothesis functions.

This can be written more formally:

Given a training set:

\[
\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^n,\, y_i \in \{-1,1\}\right\}_{i=1}^m
\]

and a set $\mathcal{H}$ of hypothesis functions.

Find $h \in \mathcal{H}$ such that $h(\mathbf{x}_i) = y_i$ for every $\mathbf{x}_i$.

This is equivalent to:

Given a training set:

\[
\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^n,\, y_i \in \{-1,1\}\right\}_{i=1}^m
\]

and a set $\mathcal{H}$ of hypothesis functions.

Find $\mathbf{w}=(w_0,w_1,\dots,w_n)$ such that $\text{sign}(\mathbf{w}\cdot\mathbf{x}_i) = y_i$ for every $\mathbf{x}_i$.
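This condition is easy to check in code. Here is a minimal sketch (the function name and the toy values are my own, and X is assumed to contain augmented vectors, as in the listings below):

import numpy as np

def is_separating_hypothesis(w, X, y):
    # True when sign(w . x_i) equals y_i for every training example.
    predictions = np.sign(X.dot(w))
    return np.all(predictions == y)

X = np.array([[1, 2, 1], [1, -1, -2]])   # two augmented toy examples
y = np.array([1, -1])
print(is_separating_hypothesis(np.array([0.0, 1.0, 1.0]), X, y))  # True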

The PLA is a very simple algorithm and can be summarized this way:

  1. Start with a random hyperplane (defined by a vector $\mathbf{w}$) and use it to classify the data.
  2. Pick a misclassified example and select another hyperplane by updating the value of $\mathbf{w}$, hoping it will work better at classifying this example (this is called the update rule).
  3. Classify the data with this new hyperplane.
  4. Repeat steps 2 and 3 until there is no misclassified example.

Once the process is over, you have a hyperplane that separates the data.
The algorithm is shown in Code Listing 11.

Code Listing 11

import numpy as np

def perceptron_learning_algorithm(X, y):
    w = np.random.rand(3)   # can also be initialized at zero.
    misclassified_examples = predict(hypothesis, X, y, w)

    while misclassified_examples.any():
        x, expected_y = pick_one_from(misclassified_examples, X, y)
        w = w + x * expected_y  # update rule
        misclassified_examples = predict(hypothesis, X, y, w)

    return w

Let us look at the code in detail.

The perceptron_learning_algorithm uses several functions (Code Listing 12). The hypothesis function is just $h(\mathbf{x})$ written in Python code; as we saw before, it is the function that returns the label $y_i$ predicted for an example $\mathbf{x}_i$ when classifying with the hyperplane defined by $\mathbf{w}$. The predict function applies the hypothesis for every example and returns the ones that are misclassified.

Code Listing 12

def hypothesis(x, w):
    return np.sign(np.dot(w, x))


# Make predictions on all data points
# and return the ones that are misclassified.
def predict(hypothesis_function, X, y, w):
    predictions = np.apply_along_axis(hypothesis_function, 1, X, w)
    misclassified = X[y != predictions]
    return misclassified

Once we have made predictions with predict, we know which examples are misclassified, so we use the function pick_one_from to select one of them randomly (Code Listing 13).

Code Listing 13

# Pick one misclassified example randomly
# and return it with its expected label.
def pick_one_from(misclassified_examples, X, y):
    np.random.shuffle(misclassified_examples)
    x = misclassified_examples[0]
    index = np.where(np.all(X == x, axis=1))
    return x, y[index]

We then arrive at the heart of the algorithm: the update rule. For now, just remember that it changes the value of $\mathbf{w}$. Why it does this will be explained in detail later. We once again use the predict function, but this time, we give it the updated $\mathbf{w}$. It allows us to see if we have classified all data points correctly, or if we need to repeat the process until we do.

Code Listing 14 demonstrates how we can use the perceptron_learning_algorithm function with a toy data set. Note that we need the $\mathbf{w}$ and $\mathbf{x}$ vectors to have the same dimension, so we convert every $\mathbf{x}$ vector into an augmented vector before giving it to the function.

Code Listing 14

# See Appendix A for more information about the dataset

import numpy as np
from succinctly.datasets import get_dataset, linearly_separable as ls

np.random.seed(88)

X, y = get_dataset(ls.get_training_examples)

# transform X into an array of augmented vectors.
X_augmented = np.c_[np.ones(X.shape[0]), X]

w = perceptron_learning_algorithm(X_augmented, y)

print(w) # [-44.35244895   1.50714969   5.52834138]

Understanding the update rule

Why do we use this particular update rule? Recall that we picked a misclassified example at random. Now we would like to make the Perceptron correctly classify this example. To do so, we decide to update the vector $\mathbf{w}$. The idea here is simple. Since the sign of the dot product between $\mathbf{w}$ and $\mathbf{x}$ is incorrect, by changing the angle between them we can make it correct (recall that $\mathbf{w}\cdot\mathbf{x}=\|\mathbf{w}\|\|\mathbf{x}\|\cos(\theta)$, so the sign of the dot product is the sign of $\cos(\theta)$):

  • If the predicted label is 1, the angle between $\mathbf{w}$ and $\mathbf{x}$ is smaller than $90^{\circ}$, and we want to increase it.
  • If the predicted label is -1, the angle between $\mathbf{w}$ and $\mathbf{x}$ is bigger than $90^{\circ}$, and we want to decrease it.

Figure 15: Two vectors

Let’s see what happens with two vectors, $\mathbf{w}$ and $\mathbf{x}$, having an angle $\theta$ between them (Figure 15). On the one hand, adding them creates a new vector $\mathbf{w}+\mathbf{x}$, and the angle $\beta$ between $\mathbf{x}$ and $\mathbf{w}+\mathbf{x}$ is smaller than $\theta$ (Figure 16).

Figure 16: The addition creates a smaller angle

On the other hand, subtracting them creates a new vector $\mathbf{w}-\mathbf{x}$, and the angle $\beta$ between $\mathbf{x}$ and $\mathbf{w}-\mathbf{x}$ is bigger than $\theta$ (Figure 17).

Figure 17: The subtraction creates a bigger angle

We can use these two observations to adjust the angle:

  • If the predicted label is 1, the angle is smaller than $90^{\circ}$. We want to increase the angle, so we set $\mathbf{w}=\mathbf{w}-\mathbf{x}$.
  • If the predicted label is -1, the angle is bigger than $90^{\circ}$. We want to decrease the angle, so we set $\mathbf{w}=\mathbf{w}+\mathbf{x}$.
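We can check these two observations numerically. Here is a minimal sketch (the angle_between helper and the vector values are illustrative, not from the book):

import numpy as np

def angle_between(a, b):
    # Angle in degrees between two vectors.
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(cos_theta))

w = np.array([4.0, 1.0])
x = np.array([1.0, 3.0])

print(angle_between(x, w))      # about 57.5 degrees
print(angle_between(x, w + x))  # about 32.9 degrees (smaller)
print(angle_between(x, w - x))  # about 105.3 degrees (bigger)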

As we are doing this only on misclassified examples, the expected label is always the opposite of the predicted label. This means we can rewrite the previous statement:

  • If the expected label is -1, we want to increase the angle, so we set $\mathbf{w}=\mathbf{w}-\mathbf{x}$.
  • If the expected label is +1, we want to decrease the angle, so we set $\mathbf{w}=\mathbf{w}+\mathbf{x}$.

When translated into Python, this gives us Code Listing 15, and we can see that it is strictly equivalent to Code Listing 16, which is the update rule.

Code Listing 15

def update_rule(expected_y, w, x):
    if expected_y == 1:
        w = w + x
    else:
        w = w - x
    return w

Code Listing 16

def update_rule(expected_y, w, x):
    w = w + x * expected_y
    return w

We can verify that the update rule works as we expect by checking the value of the hypothesis before and after applying it (Code Listing 17).

Code Listing 17

import numpy as np

def hypothesis(x, w):
    return np.sign(np.dot(w, x))

x = np.array([1, 2, 7])
expected_y = -1
w = np.array([4, 5, 3])

print(hypothesis(x, w))             # The predicted y is 1.

w = update_rule(expected_y, w, x)   # we apply the update rule.

print(hypothesis(x, w))             # The predicted y is -1.

Note that the update rule does not necessarily change the sign of the hypothesis for the example the first time. Sometimes it is necessary to apply the update rule several times before it happens, as shown in Code Listing 18. This is not a problem: as we are looping over misclassified examples, we will continue to use the update rule until the example is correctly classified. What matters here is that each time we use the update rule, we change the value of the angle in the right direction (increasing or decreasing it).

Code Listing 18

import numpy as np

x = np.array([1, 3])
expected_y = -1
w = np.array([5, 3])

print(hypothesis(x, w))            # The predicted y is 1.

w = update_rule(expected_y, w, x)  # we apply the update rule.

print(hypothesis(x, w))            # The predicted y is 1.

w = update_rule(expected_y, w, x)  # we apply the update rule once again.

print(hypothesis(x, w))            # The predicted y is -1.

Also note that sometimes updating the value of $\mathbf{w}$ for a particular example $\mathbf{x}$ changes the hyperplane in such a way that another example $\mathbf{x}^*$ that was previously correctly classified becomes misclassified. So, the hypothesis might become worse at classifying after being updated. This is illustrated in Figure 18, which shows us the number of correctly classified examples at each iteration step. One way to avoid this problem is to keep a record of the value of $\mathbf{w}$ before making the update and to use the updated $\mathbf{w}$ only if it reduces the number of misclassified examples. This modification of the PLA is known as the Pocket algorithm (because we keep $\mathbf{w}$ in our pocket).

Figure 18: The PLA update rule oscillates
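Here is a rough sketch of this Pocket variant. It is my own illustrative code, not the book's: it reuses the hypothesis, predict, and pick_one_from functions from the earlier listings and adds an iteration cap, since with this strategy we usually also want to stop after a fixed number of updates.

import numpy as np

def pocket_perceptron_learning_algorithm(X, y, max_iterations=1000):
    w = np.random.rand(X.shape[1])
    best_w = np.copy(w)
    best_error_count = len(predict(hypothesis, X, y, w))

    for _ in range(max_iterations):
        misclassified_examples = predict(hypothesis, X, y, w)
        if len(misclassified_examples) == 0:
            return w
        x, expected_y = pick_one_from(misclassified_examples, X, y)
        w = w + x * expected_y                      # same update rule as the PLA
        error_count = len(predict(hypothesis, X, y, w))
        if error_count < best_error_count:          # keep the best w "in our pocket"
            best_w = np.copy(w)
            best_error_count = error_count

    return best_w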

Convergence of the algorithm

We said that we keep updating the vector $\mathbf{w}$ with the update rule until there is no misclassified point. But how can we be so sure that this will ever happen? Luckily for us, mathematicians have studied this problem, and we can be very sure because the Perceptron convergence theorem guarantees that if the two sets P and N (of positive and negative examples, respectively) are linearly separable, the vector $\mathbf{w}$ is updated only a finite number of times. This result was first proved by Novikoff in 1963 (Rojas, 1996).
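The theorem also bounds the number of updates. In its usual form from the literature (stated here for reference, not reproduced from this book): if every example satisfies $\|\mathbf{x}_i\| \le R$ and there exist a unit vector $\mathbf{w}^*$ and a margin $\gamma > 0$ such that $y_i(\mathbf{w}^*\cdot\mathbf{x}_i) \ge \gamma$ for every $i$, then the Perceptron makes at most

\[
\left(\frac{R}{\gamma}\right)^2
\]

updates before it stops.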

Understanding the limitations of the PLA

One thing to understand about the PLA is that because weights are randomly initialized and misclassified examples are randomly chosen, it is possible the algorithm will return a different hyperplane each time we run it. Figure 19 shows the result of running the PLA on the same dataset four times. As you can see, the PLA finds four different hyperplanes.

Figure 19: The PLA finds a different hyperplane each time

At first, this might not seem like a problem. After all, the four hyperplanes perfectly classify the data, so they might be equally good, right? However, when using a machine learning algorithm such as the PLA, our goal is not to find a way to perfectly classify the data we have right now. Our goal is to find a way to correctly classify new data we will receive in the future.

Let us introduce some terminology to be clear about this. To train a model, we pick a sample of existing data and call it the training set. We train the model, and it comes up with a hypothesis (a hyperplane in our case). We can measure how well the hypothesis performs on the training set: we call this the in-sample error (also called training error). Once we are satisfied with the hypothesis, we decide to use it on unseen data (the test set) to see if it indeed learned something. We measure how well the hypothesis performs on the test set, and we call this the out-of-sample error (also called the generalization error).
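As a small sketch of how these two errors could be measured (the error_rate helper and the variable names are illustrative assumptions, not the book's code):

import numpy as np

def error_rate(w, X, y):
    # Fraction of examples whose predicted label differs from the true label.
    predictions = np.sign(X.dot(w))
    return np.mean(predictions != y)

# in_sample_error = error_rate(w, X_train_augmented, y_train)
# out_of_sample_error = error_rate(w, X_test_augmented, y_test)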

Our goal is to have the smallest out-of-sample error.

In the case of the PLA, all hypotheses in Figure 19 perfectly classify the data: their in-sample error is zero. But we are really concerned about their out-of-sample error. We can use a test set such as the one in Figure 20 to check their out-of-sample errors.

Figure 20: A test dataset

As you can see in Figure 21, the two hypotheses on the right, despite perfectly classifying the training dataset, are making errors with the test dataset.

Now we can better understand why this is problematic. When using the Perceptron with a linearly separable dataset, we have the guarantee of finding a hypothesis with zero in-sample error, but we have no guarantee about how well it will generalize to unseen data (if an algorithm generalizes well, its out-of-sample error will be close to its in-sample error). How can we choose a hyperplane that generalizes well? As we will see in the next chapter, this is one of the goals of SVMs.

Figure 21: Not all hypotheses have perfect out-of-sample error

Summary

In this chapter, we have learned what a Perceptron is. We then saw in detail how the Perceptron Learning Algorithm works and what the motivation behind the update rule is. After learning that the PLA is guaranteed to converge, we saw that not all hypotheses are equal, and that some of them will generalize better than others. Finally, we saw that the Perceptron is unable to select the hypothesis with the smallest out-of-sample error; it simply picks, at random, one hypothesis with the lowest in-sample error.
