CHAPTER 4
The Perceptron has several advantages: it is a simple model, the algorithm is very easy to implement, and we have a theoretical proof that it will find a hyperplane that separates the data. However, its biggest weakness is that it will not find the same hyperplane every time. Why do we care? Because not all separating hyperplanes are equal. If the Perceptron gives you a hyperplane that is very close to all the data points from one class, you have a right to believe that it will generalize poorly when given new data.
SVMs do not have this problem. Indeed, instead of looking for a hyperplane, an SVM tries to find the hyperplane. We will call this the optimal hyperplane, and we will say that it is the one that best separates the data.
Because we cannot choose the optimal hyperplane based on our feelings, we need some sort of metric that will allow us to compare two hyperplanes and say which one is superior to all others.
In this section, we will try to discover how we can compare two hyperplanes. In other words, we will search for a way to compute a number that allows us to tell which hyperplane separates the data the best. We will look at methods that seem to work, but then we will see why they do not work and how we can correct their limitations. Let us try with a simple attempt to compare two hyperplanes using only the equation of the hyperplane.
Given an example $(\mathbf{x}, y)$ and a hyperplane, we wish to know how the example relates to the hyperplane.

One key element we already know is that if the value of $\mathbf{x}$ satisfies the equation of a line, then it means it is on the line. It works in the same way for a hyperplane: for a data point $\mathbf{x}$ and a hyperplane defined by a vector $\mathbf{w}$ and bias $b$, we will get $\mathbf{w} \cdot \mathbf{x} + b = 0$ if $\mathbf{x}$ is on the hyperplane.
But what if the point is not on the hyperplane?
Let us see what happens with an example. In Figure 22, the line is defined by a vector $\mathbf{w}$ and a bias $b$. When we use the equation of the hyperplane to compute $\mathbf{w} \cdot \mathbf{x} + b$ for the two points A and B, we get a bigger number for A than for B.
Figure 22: The equation returns a bigger number for A than for B
As you can see, when the point is not on the hyperplane we get a number different from zero. In fact, if we use a point far away from the hyperplane, we will get a bigger number than if we use a point closer to the hyperplane.
Another thing to notice is that the sign of the number returned by the equation tells us where the point stands with respect to the line. Using the equation of the line displayed in Figure 23 with the point C, which lies on the other side of the line, we get a negative number.
Figure 23: The equation returns a negative number for C
If the equation returns a positive number, the point is below the line, while if it is a negative number, it is above. Note that it is not necessarily visually above or below, because if you have a line like the one in Figure 24, it will be left or right, but the same logic applies. The sign of the number returned by the equation of the hyperplane allows us to tell if two points lie on the same side. In fact, this is exactly what the hypothesis function we defined in Chapter 2 does.

Figure 24: A line can separate the space in different ways
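As a reminder, such a hypothesis function can be written in a few lines of Python. This is only a minimal sketch, assuming w and x are NumPy arrays; the exact definition from Chapter 2 may differ slightly:

import numpy as np

# Return +1 or -1 depending on which side of the hyperplane x lies on.
def hypothesis(w, b, x):
    return 1 if np.dot(w, x) + b >= 0 else -1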
We now have the beginning of a solution for comparing two hyperplanes.
Given a training example $(\mathbf{x}, y)$ and a hyperplane defined by a vector $\mathbf{w}$ and bias $b$, we compute the number $\mathbf{w} \cdot \mathbf{x} + b$ to know how far the point is from the hyperplane.

Given a data set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, we compute this number for each training example, and say that the number $F$ is the smallest one we encounter:

$$F = \min_{i=1 \dots m} \left( \mathbf{w} \cdot \mathbf{x}_i + b \right)$$

If we need to choose between two hyperplanes, we will then select the one for which $F$ is the largest. To be clear, this means that if we have $k$ hyperplanes, we will compute $F_1, F_2, \dots, F_k$ and select the hyperplane having the largest of these values.
Unfortunately, using the result of the hyperplane equation has its limitations. The problem is that taking the minimum value does not work for examples on the negative side (the ones for which the equation returns a negative value).
Remember that we always wish to take the value of the point closest to the hyperplane. Computing the minimum with examples on the positive side actually does this: between two points with values 5 and 1, we pick the one having the smallest number, so we choose 1. However, between two examples having values -5 and -1, this rule will pick -5 because -5 is smaller than -1, but the closest point is actually the one with value -1.
One way to fix this problem is to consider the absolute value of the number returned by the equation.

Given a data set $\mathcal{D}$, we compute this value for each example and say that $F$ is the one having the smallest absolute value:

$$F = \min_{i=1 \dots m} \left| \mathbf{w} \cdot \mathbf{x}_i + b \right|$$
Computing the number $F$ allows us to select a hyperplane. However, using only this value, we might pick the wrong one. Consider the case in Figure 25: the examples are correctly classified, and the value of $F$ computed using the last formula is 2.

Figure 25: A hyperplane correctly classifying the data
In Figure 26, the examples are incorrectly classified, and the value of $F$ is also 2. This is problematic because using $F$, we do not know which hyperplane is better. In theory, they look equally good, but in reality, we want to pick the one from Figure 25.

Figure 26: A hyperplane that does not classify the data correctly
How can we adjust our formula to meet this requirement?
Well, there is one component of our training example $(\mathbf{x}, y)$ that we did not use: the label $y$!

If we multiply the number $\mathbf{w} \cdot \mathbf{x} + b$ by the value of $y$, we change its sign when needed. Let us call this new number $f$:

$$f = y(\mathbf{w} \cdot \mathbf{x} + b)$$

The sign of $f$ will always be positive if the example is correctly classified, and negative if it is misclassified.
Given a data set $\mathcal{D}$, we can compute:

$$f_i = y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$$

$$F = \min_{i=1 \dots m} f_i$$
With this formula, when comparing two hyperplanes, we will still select the one for which $F$ is the largest. The added bonus is that in special cases like the ones in Figure 25 and Figure 26, we will always pick the hyperplane that classifies the data correctly (because its $F$ will have a positive value, while the value will be negative for the other hyperplane).
In the literature, the number $f$ has a name: it is called the functional margin of an example. Its value can be computed in Python, as shown in Code Listing 19. Similarly, the number $F$ is known as the functional margin of the data set $\mathcal{D}$.
# Compute the functional margin of an example (x,y)
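# A minimal sketch consistent with the text; it assumes w and x are
# NumPy arrays, y is a label equal to +1 or -1, and the function
# names are illustrative.
import numpy as np

def example_functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

# Compute the functional margin of a data set: the smallest
# functional margin among its examples.
def functional_margin(w, b, X, y):
    return np.min([example_functional_margin(w, b, x, y[i])
                   for i, x in enumerate(X)])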
Using this formula, we find that the functional margin of the hyperplane in Figure 25 is +2, while in Figure 26 it is -2. Because it has a bigger margin, we will select the first one.
Tip: Remember, we wish to choose the hyperplane with the largest margin.
It looks like we found a good way to compare two hyperplanes this time. However, there is a major problem with the functional margin: it is not scale invariant.
Given a vector $\mathbf{w}_1$ and bias $b_1$, if we multiply them by 10, we get $\mathbf{w}_2 = 10\,\mathbf{w}_1$ and $b_2 = 10\,b_1$. We say we rescaled them.
The vectors $\mathbf{w}_1$ and $\mathbf{w}_2$ represent the same hyperplane because they have the same unit vector. The hyperplane being a plane orthogonal to a vector $\mathbf{w}$, it does not matter how long the vector is. The only thing that matters is its direction, which, as we saw in the first chapter, is given by its unit vector. Moreover, when tracing the hyperplane on a graph, the coordinate of the intersection between the vertical axis and the hyperplane stays the same (both $b$ and the components of $\mathbf{w}$ are multiplied by the same factor), so the hyperplane does not change because of the rescaling of $b$, either.
The problem, as we can see in Code Listing 20, is that when we compute the functional margin with $\mathbf{w}_2$ and $b_2$, we get a number ten times bigger than with $\mathbf{w}_1$ and $b_1$. This means that given any hyperplane, we can always find one that will have a larger functional margin, just by rescaling $\mathbf{w}$ and $b$.
x = np.array([1, 1])
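y = 1

# Illustrative values for a hyperplane (w_1, b_1) and its rescaled
# version (w_2, b_2) = (10 * w_1, 10 * b_1).
b_1 = 5
w_1 = np.array([2, 1])

w_2 = w_1 * 10
b_2 = b_1 * 10

# Both pairs define the same hyperplane, but the functional margin
# of the rescaled one is ten times bigger.
print(example_functional_margin(w_1, b_1, x, y))  # 8
print(example_functional_margin(w_2, b_2, x, y))  # 80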
To solve this problem, we only need to make a small adjustment. Instead of using the vector $\mathbf{w}$, we will use its unit vector. To do so, we will divide $\mathbf{w}$ by its norm. In the same way, we will divide $b$ by the norm of $\mathbf{w}$ to make it scale invariant as well.

Recall the formula of the functional margin:

$$f = y(\mathbf{w} \cdot \mathbf{x} + b)$$

We modify it and obtain a new number $\gamma$:

$$\gamma = y\left(\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x} + \frac{b}{\|\mathbf{w}\|}\right)$$
As before, given a data set $\mathcal{D}$, we can compute:

$$\gamma_i = y_i\left(\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x}_i + \frac{b}{\|\mathbf{w}\|}\right)$$

$$M = \min_{i=1 \dots m} \gamma_i$$
The advantage of $\gamma$ is that it gives us the same number regardless of the size of the vector $\mathbf{w}$ we choose. The number $\gamma$ also has a name: it is called the geometric margin of a training example, while $M$ is the geometric margin of the data set. A Python implementation is shown in Code Listing 21.
# Compute the geometric margin of an example (x,y)
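# A minimal sketch: the geometric margin is the functional margin
# computed with the unit vector of w and the rescaled bias.
def example_geometric_margin(w, b, x, y):
    norm = np.linalg.norm(w)
    return y * (np.dot(w / norm, x) + b / norm)

# Compute the geometric margin of a data set: the smallest
# geometric margin among its examples.
def geometric_margin(w, b, X, y):
    return np.min([example_geometric_margin(w, b, x, y[i])
                   for i, x in enumerate(X)])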
We can verify that the geometric margin behaves as expected. In Code Listing 22, the function returns the same value for the vector $\mathbf{w}_1$ and for its rescaled version $\mathbf{w}_2$.
x = np.array([1, 1])
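y = 1

# Illustrative hyperplane and its rescaled version.
b_1 = 5
w_1 = np.array([2, 1])

w_2 = w_1 * 10
b_2 = b_1 * 10

# This time, rescaling w and b does not change the result:
# both calls print the same value (about 3.58 for these values).
print(example_geometric_margin(w_1, b_1, x, y))
print(example_geometric_margin(w_2, b_2, x, y))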
It is called the geometric margin because we can retrieve this formula using simple geometry. It measures the distance between the point $\mathbf{x}$ and the hyperplane.

In Figure 27, we see that the point $\mathbf{x}'$ is the orthogonal projection of $\mathbf{x}$ onto the hyperplane. We wish to find the distance $d$ between $\mathbf{x}$ and $\mathbf{x}'$.

Figure 27: The geometric margin is the distance d between the point X and the hyperplane
The vector $\mathbf{k}$ has the same direction as the vector $\mathbf{w}$, so they share the same unit vector $\frac{\mathbf{w}}{\|\mathbf{w}\|}$. We know that the norm of $\mathbf{k}$ is $d$, so the vector $\mathbf{k}$ can be defined by $\mathbf{k} = d\frac{\mathbf{w}}{\|\mathbf{w}\|}$.

Moreover, we can see that $\mathbf{x}' = \mathbf{x} - \mathbf{k}$, so if we substitute for $\mathbf{k}$ and do a little bit of algebra, we get:

$$\mathbf{x}' = \mathbf{x} - d\frac{\mathbf{w}}{\|\mathbf{w}\|}$$
Now, the point $\mathbf{x}'$ is on the hyperplane. It means that $\mathbf{x}'$ satisfies the equation of the hyperplane, and we have:

$$\mathbf{w} \cdot \mathbf{x}' + b = 0$$

$$\mathbf{w} \cdot \left(\mathbf{x} - d\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + b = 0$$

$$\mathbf{w} \cdot \mathbf{x} - d\frac{\mathbf{w} \cdot \mathbf{w}}{\|\mathbf{w}\|} + b = 0$$

$$\mathbf{w} \cdot \mathbf{x} - d\frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} + b = 0$$

$$\mathbf{w} \cdot \mathbf{x} - d\|\mathbf{w}\| + b = 0$$

$$d\|\mathbf{w}\| = \mathbf{w} \cdot \mathbf{x} + b$$

$$d = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x} + \frac{b}{\|\mathbf{w}\|}$$
Eventually, as we did before, we multiply by $y$ to ensure that we select a hyperplane that correctly classifies the data, and this gives us the geometric margin formula we saw earlier:

$$\gamma = y\left(\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x} + \frac{b}{\|\mathbf{w}\|}\right)$$
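We can also verify this result numerically. In the sketch below, the values of w, b, and x are arbitrary; we check that the projection x' lies on the hyperplane and that the distance between x and x' matches the formula:

import numpy as np

w = np.array([2.0, 1.0])
b = -4.0
x = np.array([3.0, 4.0])

norm = np.linalg.norm(w)
d = np.dot(w / norm, x) + b / norm   # distance given by the formula
x_prime = x - d * w / norm           # orthogonal projection of x onto the hyperplane

print(np.isclose(np.dot(w, x_prime) + b, 0.0))          # True: x' is on the hyperplane
print(np.isclose(np.linalg.norm(x - x_prime), abs(d)))  # True: ||x - x'|| equals |d|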
Now that we have defined the geometric margin, let us see how it allows us to compare two hyperplanes. The hyperplane in Figure 28 is much closer to the blue star examples than to the red triangle ones, compared with the hyperplane in Figure 29, so we expect its geometric margin to be smaller. Code Listing 23 uses the function defined in Code Listing 21 to compute the geometric margin of each hyperplane. As expected from Figure 29, the geometric margin of the second hyperplane is larger (0.64 > 0.18). Between the two, we would select the second one.
# Compare two hyperplanes using the geometric margin.
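# Illustrative data and hyperplanes (not the ones from Figures 28
# and 29, so the printed values differ from 0.18 and 0.64).
X = np.array([[1, 6], [2, 3], [7, 3], [8, 4]])
y = np.array([1, 1, -1, -1])

w = np.array([-1.0, -1.0])
b_1 = 8.0
b_2 = 8.5

# The second hyperplane has the larger geometric margin for this data.
print(geometric_margin(w, b_1, X, y))  # about 0.71
print(geometric_margin(w, b_2, X, y))  # about 1.06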
We see that to compute the geometric margin for another hyperplane, we just need to modify the value of $\mathbf{w}$ or $b$. We could try changing them by small increments to see whether the margin gets larger, but that would be rather random and would take a lot of time. Our objective is to find the optimal hyperplane for a data set among all possible hyperplanes, and there are infinitely many of them.
Tip: Finding the optimal hyperplane is just a matter of finding the values of w and b for which we get the largest geometric margin.
How can we find the values of $\mathbf{w}$ and $b$ that produce the largest geometric margin? Luckily for us, mathematicians have designed tools to solve such problems. To find $\mathbf{w}$ and $b$, we need to solve what is called an optimization problem. Before looking at what the optimization problem for SVMs is, let us do a quick review of what an optimization problem is.
The goal of an optimization problem is to minimize or maximize a function with respect to some variable $x$ (that is, to find the value of $x$ for which the function returns its minimum or maximum value). For instance, the problem in which we want to find the minimum of the function $f(x) = x^2$ is written:

$$\min_{x} \; x^2$$

Or, alternatively:

$$\min_{x} \; f(x)$$
In this case, we are free to search amongst all possible values of $x$. We say that the problem is unconstrained. As we can see in Figure 30, the minimum of the function is zero, at $x = 0$.
Figure 30: Without constraint, the minimum is zero
Figure 31: Because of the constraint x - 2 = 0, the minimum is 4
Sometimes we are not interested in the minimum of the function by itself, but rather in its minimum when some constraints are met. In such cases, we write the problem and add the constraints preceded by "subject to," which is often abbreviated "s.t." For instance, if we wish to know the minimum of $f$ but restrict the value of $x$ to a specific value, we can write:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x = 2$$

This example is illustrated in Figure 31. In general, constraints are written by keeping zero on the right side of the equality, so the problem can be rewritten:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x - 2 = 0$$
Using this notation, we clearly see that the constraint is an affine function, while the objective function $x^2$ is a quadratic function. Thus we call this problem a quadratic optimization problem, or a Quadratic Programming (QP) problem.
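As an illustration, a small problem like this one can be solved numerically, for instance with SciPy's general-purpose minimizer (this is only a sketch; it is not the method we will use for SVMs):

from scipy.optimize import minimize

# Minimize f(x) = x^2 subject to the constraint x - 2 = 0.
result = minimize(fun=lambda x: x[0] ** 2,
                  x0=[0.0],
                  constraints=[{'type': 'eq', 'fun': lambda x: x[0] - 2}])

print(result.x)    # approximately [2.]
print(result.fun)  # approximately 4.0

As Figure 31 shows, the constrained minimum is 4, reached at x = 2.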
The set of variables that satisfies the problem constraints is called the feasible set (or feasible region). When solving the optimization problem, the solution will be picked from the feasible set. In Figure 31, the feasible set contains only one value, so the problem is trivial. However, when we manipulate functions of several variables, such as $f(x, y) = x^2 + y^2$, the feasible set tells us from which values we are trying to pick a minimum (or maximum).
For example:

$$\min_{x, y} \; x^2 + y^2 \quad \text{subject to} \quad x + y - 1 = 0$$

In this problem, the feasible set is the set of all pairs of points $(x, y)$ such that $x + y - 1 = 0$.
We can add as many constraints as we want. Here is an example of a problem with three constraints for the function $f(x, y) = x^2 + y^2$:

$$\min_{x, y} \; x^2 + y^2 \quad \text{subject to} \quad g_1(x, y) = 0, \; g_2(x, y) = 0, \; g_3(x, y) = 0$$
When we have several variables, we can switch to vector notation to improve readability. For the vector $\mathbf{x} = (x_1, x_2)$, the function becomes $f(\mathbf{x})$, and the problem is written:

$$\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) = 0, \; i = 1, 2, 3$$
When adding constraints, keep in mind that doing so reduces the feasible set. For a solution to be accepted, all constraints must be satisfied.
For instance, let us look at the following problem:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x = 2, \; x = 8$$
We could think that $x = 2$ and $x = 8$ are solutions, but this is not the case. When $x = 2$, the constraint $x = 8$ is not met; and when $x = 8$, the constraint $x = 2$ is not met. The problem is infeasible.
Tip: If you add too many constraints to a problem, it can become infeasible.
If you change an optimization problem by adding a constraint, you make the optimum worse, or, at best, you leave it unchanged (Gershwin, 2010).
We can also use inequalities as constraints:

$$\min_{x} \; x^2 \quad \text{subject to} \quad x \geq 2$$
And we can combine equality constraints and inequality constraints:

$$\min_{x, y} \; x^2 + y^2 \quad \text{subject to} \quad x - 2 = 0, \; y \geq 3$$
Several methods exist to solve each type of optimization problem, but presenting them is outside the scope of this book. The interested reader can consult Optimization Models and Applications (El Ghaoui, 2015) and Convex Optimization (Boyd & Vandenberghe, 2004), two good books for getting started on the subject, both available online for free (see Bibliography for details). We will instead focus on SVMs again and derive an optimization problem that allows us to find the optimal hyperplane. How to solve the SVM optimization problem will be explained in detail in the next chapter.
Given a linearly separable training set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ and a hyperplane with a normal vector $\mathbf{w}$ and bias $b$, recall that the geometric margin $M$ of the hyperplane is defined by:

$$M = \min_{i=1 \dots m} \gamma_i$$

where $\gamma_i = y_i\left(\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \mathbf{x}_i + \frac{b}{\|\mathbf{w}\|}\right)$ is the geometric margin of the training example $(\mathbf{x}_i, y_i)$.
The optimal separating hyperplane is the hyperplane defined by the normal vector $\mathbf{w}$ and bias $b$ for which the geometric margin $M$ is the largest.
To find $\mathbf{w}$ and $b$, we need to solve the following optimization problem, with the constraint that the geometric margin of each example should be greater than or equal to $M$:

$$\max_{\mathbf{w}, b} \; M \quad \text{subject to} \quad \gamma_i \geq M, \; i = 1, \dots, m$$
There is a relationship between the geometric margin and the functional margin:

$$\gamma_i = \frac{f_i}{\|\mathbf{w}\|}$$
So, using $M = \frac{F}{\|\mathbf{w}\|}$, we can rewrite the problem:

$$\max_{\mathbf{w}, b} \; \frac{F}{\|\mathbf{w}\|} \quad \text{subject to} \quad \frac{f_i}{\|\mathbf{w}\|} \geq \frac{F}{\|\mathbf{w}\|}, \; i = 1, \dots, m$$
We can then simplify the constraint by removing the norm on both sides of the inequality:

$$f_i \geq F, \; i = 1, \dots, m$$
Recall that we are trying to maximize the geometric margin and that the scale of $\mathbf{w}$ and $b$ does not matter. We can choose to rescale $\mathbf{w}$ and $b$ as we want, and the geometric margin will not change. As a result, we decide to scale $\mathbf{w}$ and $b$ so that $F = 1$. It will not affect the result of the optimization problem.
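We can convince ourselves of this with a quick numerical check, reusing the functional_margin and geometric_margin helpers sketched for Code Listings 19 and 21 (the data and hyperplane values below are made up): rescaling w and b by 1/F sets the functional margin of the data set to 1 and leaves the geometric margin unchanged.

import numpy as np

X = np.array([[1, 6], [2, 3], [7, 3], [8, 4]])
y = np.array([1, 1, -1, -1])
w, b = np.array([-2.0, -2.0]), 16.0

F = functional_margin(w, b, X, y)   # 2.0 for these values
w_s, b_s = w / F, b / F             # rescaled hyperplane

print(functional_margin(w_s, b_s, X, y))  # 1.0
print(geometric_margin(w, b, X, y))       # about 0.71 ...
print(geometric_margin(w_s, b_s, X, y))   # ... and the same value here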
The problem becomes:

$$\max_{\mathbf{w}, b} \; \frac{F}{\|\mathbf{w}\|} \quad \text{subject to} \quad f_i \geq F, \; i = 1, \dots, m$$
Because $f_i = y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$, it is the same as:

$$\max_{\mathbf{w}, b} \; \frac{F}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq F, \; i = 1, \dots, m$$
And because we decided to set $F = 1$, this is equivalent to:

$$\max_{\mathbf{w}, b} \; \frac{1}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \; i = 1, \dots, m$$
This maximization problem is equivalent to the following minimization problem:

$$\min_{\mathbf{w}, b} \; \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \; i = 1, \dots, m$$
Tip: You can also read an alternate derivation of this optimization problem on this page, where I use geometry instead of the functional and geometric margins.
This minimization problem gives the same result as the following one:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \; i = 1, \dots, m$$
The factor $\frac{1}{2}$ has been added for later convenience (when we use a QP solver to solve the problem), and squaring the norm has the advantage of removing the square root.
Eventually, here is the optimization problem as you will see it written in most of the literature:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0, \; i = 1, \dots, m$$
Why did we take the pain of rewriting the problem like this? Because the original optimization problem was difficult to solve. Instead, we now have a convex quadratic optimization problem which, although it may not be obvious at first, is much simpler to solve.
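To make this formulation concrete, here is a minimal sketch (the function names are illustrative) that evaluates the objective and checks whether the constraints hold for a candidate hyperplane defined by w and b:

import numpy as np

def objective(w):
    # 1/2 * ||w||^2
    return 0.5 * np.dot(w, w)

def constraints_satisfied(w, b, X, y):
    # True if y_i * (w . x_i + b) - 1 >= 0 for every example.
    return bool(np.all(y * (np.dot(X, w) + b) - 1 >= 0))

A solver's job is then to find, among all the pairs (w, b) that satisfy the constraints, the one with the smallest objective.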
First, we assumed that some hyperplanes are better than others: they will perform better with unseen data. Among all possible hyperplanes, we decided to call the “best” hyperplane the optimal hyperplane. To find the optimal hyperplane, we searched for a way to compare two hyperplanes, and we ended up with a number allowing us to do so. We realized that this number also has a geometrical meaning and is called the geometric margin.
We then stated that the optimal hyperplane is the one with the largest geometric margin and that we can find it by maximizing the margin. To make things easier, we noted that we could instead minimize the norm of $\mathbf{w}$, the vector normal to the hyperplane, and still be sure that the result is the $\mathbf{w}$ of the optimal hyperplane (because, if you recall, $\|\mathbf{w}\|$ is used in the formula for computing the geometric margin).