CHAPTER 2
This chapter introduces some basics you need to know in order to understand SVMs better. We will first see what vectors are and look at some of their key properties. Then we will learn what it means for data to be linearly separable before introducing a key component: the hyperplane.
The name Support Vector Machine contains the word vector. It is important to know some basics about vectors in order to understand SVMs and how to use them.
A vector is a mathematical object that can be represented by an arrow (Figure 1).

Figure 1: Representation of a vector
When we do calculations, we denote a vector with the coordinates of its endpoint (the point where the tip of the arrow is). In Figure 1, the point A has the coordinates (4,3). We can write:
$$\overrightarrow{OA} = (4,3)$$
If we want to, we can give another name to the vector, for instance, $u$:
$$u = (4,3)$$
From this point, one might be tempted to think that a vector is defined by its coordinates. However, if I give you a sheet of paper with only a horizontal line and ask you to trace the same vector as the one in Figure 1, you can still do it.
You need only two pieces of information: how long the vector is, and the angle it makes with the horizontal line.
This leads us to the following definition of a vector:
A vector is an object that has both a magnitude and a direction.
Let us take a closer look at each of these components.
The magnitude, or length, of a vector $u$ is written $\|u\|$ and is called its norm.

Figure 2: The magnitude of this vector is the length of the segment OA
In Figure 2, we can calculate the norm $\|\overrightarrow{OA}\|$ of the vector $\overrightarrow{OA}$ by using the Pythagorean theorem:
$$\|\overrightarrow{OA}\|^2 = OB^2 + AB^2$$
$$\|\overrightarrow{OA}\|^2 = 4^2 + 3^2$$
$$\|\overrightarrow{OA}\|^2 = 25$$
$$\|\overrightarrow{OA}\| = \sqrt{25}$$
$$\|\overrightarrow{OA}\| = 5$$
In general, we compute the norm of a vector $x = (x_1, \ldots, x_n)$ by using the Euclidean norm formula:
$$\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
In Python, computing the norm can easily be done by calling the norm function provided by the numpy module, as shown in Code Listing 1.
import numpy as np

x = [3, 4]
np.linalg.norm(x)  # 5.0
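If you prefer to see the Euclidean norm formula at work, here is a minimal from-scratch version; the function name euclidean_norm is our own and is not part of numpy:

import math

def euclidean_norm(x):
    # square each component, sum, and take the square root
    return math.sqrt(sum(x_i ** 2 for x_i in x))

print(euclidean_norm([3, 4]))  # 5.0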
The direction is the second component of a vector. By definition, it is a new vector whose coordinates are the coordinates of our initial vector divided by its norm.
The direction of a vector $u = (u_1, u_2)$ is the vector:
$$w = \left( \frac{u_1}{\|u\|}, \frac{u_2}{\|u\|} \right)$$
It can be computed in Python using the code in Code Listing 2.
import numpy as np

def direction(x):
    return x / np.linalg.norm(x)
Where does this definition come from? Geometry. Figure 3 shows us a vector $u$ and its angles with respect to the horizontal and vertical axes. There is an angle $\theta$ (theta) between $u$ and the horizontal axis, and there is an angle $\alpha$ (alpha) between $u$ and the vertical axis.

Figure 3: A vector u and its angles with respect to the axis
Using elementary geometry, we see that $\cos(\theta) = \frac{u_1}{\|u\|}$ and $\cos(\alpha) = \frac{u_2}{\|u\|}$, which means that $w$ can also be defined by:
$$w = (\cos(\theta), \cos(\alpha))$$
The coordinates of $w$ are defined by cosines. As a result, if the angle between $u$ and an axis changes, which means the direction of $u$ changes, $w$ will also change. That is why we call this vector the direction of vector $u$. We can compute the value of $w$ (Code Listing 3), and we find that its coordinates are $(0.6, 0.8)$.
u = np.array([3, 4])
w = direction(u)

print(w)  # [0.6 0.8]
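To connect this with the cosines above, we can compute the two angles directly and compare. This small check is our own addition and assumes the same vector $u = (3,4)$:

import math
import numpy as np

u = np.array([3, 4])

# angle between u and the horizontal axis
theta = math.atan2(u[1], u[0])
# angle between u and the vertical axis
alpha = math.pi / 2 - theta

print(math.cos(theta), math.cos(alpha))  # approximately 0.6 0.8
print(u / np.linalg.norm(u))             # [0.6 0.8]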
It is interesting to note that if two vectors have the same direction, they will have the same direction vector (Code Listing 4).
u_1 = np.array([3, 4])
u_2 = np.array([30, 40])  # u_2 points in the same direction as u_1

print(direction(u_1))  # [0.6 0.8]
print(direction(u_2))  # [0.6 0.8]
Moreover, the norm of a direction vector is always 1. We can verify that with the vector $w = (0.6, 0.8)$ (Code Listing 5).
np.linalg.norm(np.array([0.6, 0.8]))  # 1.0
It makes sense, as the sole objective of this vector is to describe the direction of other vectors: by having a norm of 1, it stays as simple as possible. As a result, a direction vector such as $w$ is often referred to as a unit vector.
Note that the order in which the numbers are written is important. As a result, we say that an $n$-dimensional vector is a tuple of $n$ real-valued numbers.
For instance, $u = (3,4)$ is a two-dimensional vector; we often write $u \in \mathbb{R}^2$ ($u$ belongs to $\mathbb{R}^2$). Similarly, a vector $v = (v_1, v_2, v_3)$ is a three-dimensional vector, and $v \in \mathbb{R}^3$.
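As a quick illustration (the three-dimensional vector below is an arbitrary example), the order of the components matters, and the dimension of a vector is simply its number of components:

import numpy as np

u = np.array([3, 4])     # a two-dimensional vector: u belongs to R^2
v = np.array([1, 2, 3])  # a three-dimensional vector: v belongs to R^3

print(np.array_equal(u, np.array([4, 3])))  # False: (3,4) and (4,3) are different vectors
print(len(u), len(v))                       # 2 3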
The dot product is an operation performed on two vectors that returns a number. A number is sometimes called a scalar; that is why the dot product is also called a scalar product.
People often have trouble with the dot product because it seems to come out of nowhere. What is important is that it is an operation performed on two vectors and that its result gives us some insights into how the two vectors relate to each other. There are two ways to think about the dot product: geometrically and algebraically.
Geometrically, the dot product is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.

This means that if we have two vectors, $x$ and $y$, with an angle $\theta$ between them (Figure 4), their dot product is:
$$x \cdot y = \|x\| \|y\| \cos(\theta)$$
By looking at this formula, we can see that the dot product is strongly influenced by the angle $\theta$:
- When $\theta = 0$, we have $\cos(\theta) = 1$ and the dot product is simply the product of the norms.
- When $\theta = 90^\circ$, we have $\cos(\theta) = 0$ and the dot product is zero.
- When $\theta = 180^\circ$, we have $\cos(\theta) = -1$ and the dot product is the negative of the product of the norms.
Keep this in mind—it will be useful later when we study the Perceptron learning algorithm.
We can write a simple Python function to compute the dot product using this definition (Code Listing 6) and use it to get the value of the dot product in Figure 4 (Code Listing 7).
import math
import numpy as np

def geometric_dot_product(x, y, theta):
    x_norm = np.linalg.norm(x)
    y_norm = np.linalg.norm(y)
    return x_norm * y_norm * math.cos(math.radians(theta))
However, we need to know the value of $\theta$ to be able to compute the dot product.
theta = 45
x = [3, 5]
y = [8, 2]  # y is chosen so that the angle between x and y is 45 degrees

print(geometric_dot_product(x, y, theta))  # 34.0
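To see the influence of the angle more concretely, we can reuse the geometric_dot_product function from Code Listing 6 and vary only the angle (the loop below is our own illustration): the result is largest at 0 degrees, drops to zero at 90 degrees, and becomes negative beyond that.

x = [3, 5]
y = [8, 2]

# same pair of norms, different hypothetical angles between the vectors
for angle in [0, 45, 90, 135, 180]:
    print(angle, round(geometric_dot_product(x, y, angle), 2))
# 0 48.08, 45 34.0, 90 0.0, 135 -34.0, 180 -48.08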

Figure 5: Using these three angles will allow us to simplify the dot product
In Figure 5, we can see the relationship between the three angles $\theta$, $\beta$ (beta), and $\alpha$ (alpha):
$$\theta = \beta - \alpha$$
This means computing $\cos(\theta)$ is the same as computing $\cos(\beta - \alpha)$.
Using the difference identity for cosine, we get:
$$\cos(\beta - \alpha) = \cos(\beta)\cos(\alpha) + \sin(\beta)\sin(\alpha)$$
$$\cos(\beta - \alpha) = \frac{x_1}{\|x\|} \frac{y_1}{\|y\|} + \frac{x_2}{\|x\|} \frac{y_2}{\|y\|}$$
$$\cos(\beta - \alpha) = \frac{x_1 y_1 + x_2 y_2}{\|x\| \|y\|}$$
If we multiply both sides by $\|x\| \|y\|$, we get:
$$\|x\| \|y\| \cos(\theta) = x_1 y_1 + x_2 y_2$$
We already know that:
$$\|x\| \|y\| \cos(\theta) = x \cdot y$$
This means the dot product can also be written:
$$x \cdot y = x_1 y_1 + x_2 y_2$$
Or:
$$x \cdot y = \sum_{i=1}^{2} x_i y_i$$
In a more general way, for $n$-dimensional vectors, we can write:
$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$
This formula is the algebraic definition of the dot product.
def dot_product(x, y):
    result = 0
    for i in range(len(x)):
        result = result + x[i] * y[i]
    return result
This definition is advantageous because we do not have to know the angle $\theta$ to compute the dot product. We can write a function to compute its value (Code Listing 8) and get the same result as with the geometric definition (Code Listing 9).
x = [3, 5]
y = [8, 2]

print(dot_product(x, y))  # 34
Of course, we can also use the dot function provided by numpy (Code Listing 10).
import numpy as np

x = np.array([3, 5])
y = np.array([8, 2])

print(np.dot(x, y))  # 34
We spent quite some time understanding what the dot product is and how it is computed. This is because the dot product is a fundamental notion that you should be comfortable with in order to figure out what is going on in SVMs. We will now see another crucial aspect, linear separability.
In this section, we will use a simple example to introduce linear separability.
Imagine you are a wine producer. You sell wine coming from two different production batches: an expensive, high-end wine, and a cheaper one.
Recently, you started to receive complaints from clients who bought an expensive bottle. They claim that their bottle contains the cheap wine. This results in a major reputation loss for your company, and customers stop ordering your wine.
You decide to find a way to distinguish the two wines. You know that one of them contains more alcohol than the other, so you open a few bottles, measure the alcohol concentration, and plot it.

Figure 6: An example of linearly separable data
In Figure 6, you can clearly see that the expensive wine contains less alcohol than the cheap one. In fact, you can find a point that separates the data into two groups. This data is said to be linearly separable. For now, you decide to measure the alcohol concentration of your wine automatically before filling an expensive bottle. If it is greater than 13 percent, the production chain stops and one of your employees must make an inspection. This improvement dramatically reduces complaints, and your business is flourishing again.
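A decision rule this simple translates directly into code. The sketch below is our own illustration of the rule just described; the function name and the sample measurements are made up:

def stop_production_chain(alcohol_by_volume):
    # the separating "point" for this one-dimensional data is 13 percent
    return alcohol_by_volume > 13.0

print(stop_production_chain(12.2))  # False: fill the expensive bottle
print(stop_production_chain(14.1))  # True: stop and inspect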
This example is too easy: in reality, data seldom behaves so nicely. In fact, some scientists did measure the alcohol concentration of real wines, and the plot they obtained is shown in Figure 7. This is an example of non-linearly separable data. Even though data will not be linearly separable most of the time, it is fundamental that you understand linear separability well. In most cases, we will start from the linearly separable case (because it is the simpler one) and then derive the non-separable case.
Similarly, in most problems we will not work with only one dimension, as in Figure 6. Real-life problems are more challenging than toy examples, and some of them can have thousands of dimensions, which makes working with them more abstract. However, the added abstraction does not make them conceptually more complex. Most examples in this book will be two-dimensional. They are simple enough to be visualized easily, and we can do some basic geometry on them, which will allow you to understand the fundamentals of SVMs.

Figure 7: Plotting alcohol by volume from a real dataset
In our example of Figure 6, there is only one dimension: that is, each data point is represented by a single number. When there are more dimensions, we will use vectors to represent each data point. Every time we add a dimension, the object we use to separate the data changes. Indeed, while we can separate the data with a single point in Figure 6, as soon as we go into two dimensions we need a line (a set of points), and in three dimensions we need a plane (which is also a set of points).
To summarize, data is linearly separable when:
- in one dimension, you can find a point separating the two classes, as in Figure 6;
- in two dimensions, you can find a line separating the two classes;
- in three dimensions, you can find a plane separating the two classes.
A small sketch below shows how such a check can be written in the two-dimensional case.
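In two dimensions, checking whether a given line separates two sets of points amounts to checking on which side of the line each point falls. Here is a minimal sketch; the line coefficients and the sample points are arbitrary illustrations:

import numpy as np

# a candidate separating line a*x1 + b*x2 + c = 0 (coefficients chosen arbitrarily)
a, b, c = 1.0, -1.0, 0.0

def side(point):
    # positive on one side of the line, negative on the other
    return np.sign(a * point[0] + b * point[1] + c)

first_class = [(3, 1), (4, 2), (5, 1)]
second_class = [(1, 3), (2, 4), (1, 5)]

separable = (all(side(p) > 0 for p in first_class) and
             all(side(p) < 0 for p in second_class))
print(separable)  # True: each class lies entirely on its own side of the line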
Similarly, when data is non-linearly separable, we cannot find a separating point, line, or plane. Figure 10 and Figure 11 show examples of non-linearly separable data in two and three dimensions.
What do we use to separate the data when there are more than three dimensions? We use what is called a hyperplane.
In geometry, a hyperplane is a subspace of one dimension less than its ambient space.
This definition, albeit true, is not very intuitive. Instead of using it, we will try to understand what a hyperplane is by first studying what a line is.
If you recall mathematics from school, you probably learned that a line has an equation of the form $y = ax + b$, that the constant $a$ is known as the slope, and that $b$ is the intercept with the y-axis. There are several pairs $(x, y)$ for which this formula is true, and we say that the set of the solutions is a line.
What is often confusing is that if you study the function $f(x) = ax + b$ in a calculus course, you will be studying a function of one variable.
However, it is important to note that the linear equation $y = ax + b$ has two variables, respectively $x$ and $y$, and we can name them as we want.
For instance, we can rename $x$ as $x_1$ and $y$ as $x_2$, and the equation becomes:
$$x_2 = a x_1 + b$$
This is equivalent to $a x_1 - x_2 + b = 0$.
If we define the two-dimensional vectors $x = (x_1, x_2)$ and $w = (a, -1)$, we obtain another notation for the equation of a line (where $w \cdot x$ is the dot product of $w$ and $x$):
$$w \cdot x + b = 0$$
What is nice about this last equation is that it uses vectors. Even though we derived it using two-dimensional vectors, it works for vectors of any dimension. It is, in fact, the equation of a hyperplane.
From this equation, we can gain another insight into what a hyperplane is: it is the set of points satisfying $w \cdot x + b = 0$. And, if we keep just the essence of this definition: a hyperplane is a set of points.
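Because the equation only involves a dot product, the same membership check works in any number of dimensions. The sketch below is our own illustration; the vectors and bias values are arbitrary:

import numpy as np

def on_hyperplane(w, x, b):
    # a point x belongs to the hyperplane when w . x + b equals zero
    return np.isclose(np.dot(w, x) + b, 0.0)

# a line in two dimensions: 2*x1 - x2 + 1 = 0
print(on_hyperplane(np.array([2, -1]), np.array([0, 1]), 1))  # True
print(on_hyperplane(np.array([2, -1]), np.array([1, 0]), 1))  # False

# a plane in three dimensions: x1 + x2 + x3 - 6 = 0
print(on_hyperplane(np.array([1, 1, 1]), np.array([1, 2, 3]), -6))  # True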
If we have been able to deduce the hyperplane equation from the equation of a line, it is because a line is a hyperplane. You can convince yourself by reading the definition of a hyperplane again: a line is a one-dimensional subspace inside a two-dimensional ambient space, so it has exactly one dimension less than the space surrounding it. Similarly, points and planes are hyperplanes, too.
We derived the equation of a hyperplane from the equation of a line. Doing the opposite is interesting, as it shows us more clearly the relationship between the two.
Given a two-dimensional vector $w = (w_0, w_1)$, a vector $x = (x, y)$, and a number $b$, we can define a hyperplane having the equation:
$$w \cdot x + b = 0$$
This is equivalent to:
$$w_0 x + w_1 y + b = 0$$
$$w_1 y = -w_0 x - b$$
We isolate $y$ to get:
$$y = -\frac{w_0}{w_1} x - \frac{b}{w_1}$$
If we define $a = -\frac{w_0}{w_1}$ and $c = -\frac{b}{w_1}$, we get:
$$y = ax + c$$
which is the familiar equation of a line.
We see that the intercept $c$ of the line equation is equal to the bias $b$ of the hyperplane equation only when $w_1 = -1$. So you should not be surprised if $b$ is not the intersection with the vertical axis when you see a plot of a hyperplane (this will be the case in our next example). Moreover, if $w_0$ and $w_1$ have the same sign, the slope $a$ will be negative.
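The correspondence between the two forms is easy to check numerically. In this sketch, the values of w and b are arbitrary examples, not the ones used in the figures:

import numpy as np

w = np.array([2.0, 4.0])  # w0 and w1 have the same sign
b = 8.0

a = -w[0] / w[1]          # slope of the corresponding line
c = -b / w[1]             # intercept of the corresponding line

print(a, c)               # -0.5 -2.0: the slope is negative, and c differs from b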

Figure 12: A linearly separable dataset
Given the linearly separable data of Figure 12, we can use a hyperplane to perform binary classification.
For instance, with a suitable choice of the vector $w$ and of the bias $b$, we get the hyperplane shown in Figure 13.

Figure 13: A hyperplane separates the data
We associate each vector $x_i$ with a label $y_i$, which can have the value $+1$ or $-1$ (respectively the triangles and the stars in Figure 13).
We define a hypothesis function $h$:
$$h(x_i) = \begin{cases} +1 & \text{if } w \cdot x_i + b \geq 0 \\ -1 & \text{if } w \cdot x_i + b < 0 \end{cases}$$
which is equivalent to:
$$h(x_i) = \operatorname{sign}(w \cdot x_i + b)$$
It uses the position of $x_i$ with respect to the hyperplane to predict a value for the label $y_i$. Every data point on one side of the hyperplane will be assigned one label, and every data point on the other side will be assigned the other label.
For instance, take a point $A$ that lies above the hyperplane. When we do the calculation, we find that $w \cdot A + b$ is positive, so $h(A) = +1$. Similarly, for a point $B$ below the hyperplane, $h$ will return $-1$ because $w \cdot B + b < 0$.
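A direct translation of the hypothesis function into Python could look like the sketch below; the weight vector and the bias are arbitrary placeholders, not the values behind Figure 13:

import numpy as np

def hypothesis(w, b, x):
    # returns +1 or -1 depending on which side of the hyperplane x lies on
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([0.5, 1.0])  # hypothetical weight vector
b = -4.0                  # hypothetical bias

print(hypothesis(w, b, np.array([2, 6])))  # 1: the point is above the hyperplane
print(hypothesis(w, b, np.array([2, 1])))  # -1: the point is below the hyperplane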
Because it uses the equation of the hyperplane, which produces a linear combination of the values, the function $h$ is called a linear classifier.
With one more trick, we can make the formula of $h$ even simpler by removing the constant $b$. First, we add a component $x_0 = 1$ to the vector $x$. We get the vector $\hat{x}$ (it reads “$x$ hat” because we put a hat on $x$). Similarly, we add a component $w_0 = b$ to the vector $w$, which becomes $\hat{w}$.
Note: In the rest of the book, we will call a vector to which we add an artificial coordinate an augmented vector.
When we use augmented vectors, the hypothesis function becomes:
$$h(\hat{x}_i) = \operatorname{sign}(\hat{w} \cdot \hat{x}_i)$$
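We can verify that the augmented form computes exactly the same quantity as before; the values below are arbitrary placeholders:

import numpy as np

w = np.array([0.5, 1.0])    # hypothetical weight vector
b = -4.0                    # hypothetical bias
x = np.array([2.0, 6.0])    # hypothetical data point

w_hat = np.concatenate(([b], w))    # augmented weight vector (b, w1, w2)
x_hat = np.concatenate(([1.0], x))  # augmented data point (1, x1, x2)

print(np.dot(w, x) + b)      # 3.0
print(np.dot(w_hat, x_hat))  # 3.0: the same value, without a separate b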
If we have a hyperplane that separates the data set like the one in Figure 13, by using the hypothesis function $h$ we are able to predict the label of every point perfectly.
The main question is: how do we find such a hyperplane?
Recall that the equation of the hyperplane is $\hat{w} \cdot \hat{x} = 0$ in augmented form. It is important to understand that the only value that impacts the shape of the hyperplane is $\hat{w}$. To convince yourself, we can come back to the two-dimensional case, in which a hyperplane is just a line. When we create the augmented three-dimensional vectors, we obtain $\hat{w} = (b, a, -1)$ and $\hat{x} = (1, x, y)$. You can see that the vector $\hat{w}$ contains both $a$ and $b$, which are the two main components defining the look of the line. Changing the value of $\hat{w}$ gives us different hyperplanes (lines), as shown in Figure 14.

Figure 14: Different values of w will give you different hyperplanes
After introducing vectors and linear separability, we learned what a hyperplane is and how we can use it to classify data. We then saw that the goal of a learning algorithm trying to learn a linear classifier is to find a hyperplane separating the data. Finally, we discovered that finding a hyperplane is equivalent to finding a vector $\hat{w}$.
We will now examine which approaches learning algorithms use to find a hyperplane that separates the data. Before looking at how SVMs do this, we will first look at one of the simplest learning models: the Perceptron.