SVM From Scratch
Introduction
In this article I will walk you through every detail of the linear SVM classifier, from theory to implementation. The Jupyter Notebook can be found HERE.
Support Vector Machine (SVM) is a supervised machine learning algorithm usually employed in binary classification problems. Given a dataset of labeled examples (Xi, yi), where Xi is a feature vector and yi its label (-1 or +1), SVM finds the hyperplane that best separates the data points with label -1 from the data points with label +1 (Figure 1).
Figure 1. SVM summarized.
The hyperplane is identified by the weight vector w and the bias b, and it is chosen so that its distance to the nearest data point on each side is maximized. The hyperplane satisfies the equation:
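w · x + b = 0        (1)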
The two data points closest to the hyperplane, one on each side, are called support vectors: the left support vector and the right support vector. The left and right support vectors satisfy equations (2) and (3), respectively.
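w · x + b = -1        (2)
w · x + b = +1        (3)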
Data points lying on the left support vector and on the left side of it belong to class -1; data points lying on the right support vector and on its right side belong to class +1. This is the linear SVM classifier in a nutshell.
Comparison with logistic regression algorithm
If you are familiar with logistic regression, which also finds a hyperplane separating the two classes, you may wonder how SVM is different. The main difference is that SVM won't settle for just any hyperplane separating the two classes: it will find the hyperplane with the maximum margin between the classes. This concept is depicted below.
Figure 2. Examples of separating hyperplanes
In Figure 2, three hyperplanes are shown. H1 does not separate the two classes: it could represent an untrained (or under-trained) algorithm. H2 separates the two classes with 100% accuracy. However, H2 is dangerously close to one green and one red data point: if a new data point close to the hyperplane is collected, the algorithm will likely misclassify it. A fully trained logistic regression could generate a hyperplane such as H2. H3 is the ideal hyperplane: it separates the two clusters of data points so that the margin between them is as large as possible. The advantage of finding the maximum-margin hyperplane such as H3 is that the model tends to do better on the validation and test sets, and has lower variance.
The SVM algorithm - hyperplane constraints
The SVM hyperplane w satisfies equation (1). Furthermore, for every data point, the hyperplane needs to satisfy:
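w · Xi + b ≤ -1    if yi = -1
w · Xi + b ≥ +1    if yi = +1        (4)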
(4) can be rearranged into one single equation:
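yi (w · Xi + b) ≥ 1        (5)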
Equation (5) tells us that if a data point is correctly classified, the product between its label and its prediction will be equal to or greater than 1.
The SVM algorithm - the cost function
The cost function of SVM is composed of two terms: the hinge loss and a regularization term.
1 - The hinge loss is given by the equation:
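L(Xi, yi) = max(0, 1 - yi (w · Xi + b))        (6)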
which is equal to:
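L(Xi, yi) = 0                          if yi (w · Xi + b) ≥ 1
L(Xi, yi) = 1 - yi (w · Xi + b)        otherwise        (7)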
Figure 3 depicts the shape of the hinge loss function.
Figure 3. Hinge Loss
When a data point is correctly classified (label*prediction >= 1) the penalty is zero. Meanwhile, for misclassified data points the loss grows linearly as the label*prediction product decreases.
2 - The regularization term of the cost function is given by:
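𝜆 ||w||²        (8)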
The goal of the regularization term is to maximize the margin of the SVM classifier. Since ||w|| appears in the denominator of the margin formula 2/||w|| (Figure 1), it makes sense that we want to minimize it to widen the margin as much as possible.
Finally, combining (6) and (8) gives the complete cost function for the SVM classifier:
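J = 𝜆 ||w||² + (1/n) Σi max(0, 1 - yi (w · Xi + b))        (9)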
which is equivalent to:
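Ji = 𝜆 ||w||²                              if yi (w · Xi + b) ≥ 1
Ji = 𝜆 ||w||² + 1 - yi (w · Xi + b)        otherwise        (10)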
𝜆 is a regularization parameter and dictates the trade-off between the two terms in the cost function. A larger 𝜆 leads to a wider margin, while a smaller 𝜆 gives a narrower margin.
The SVM algorithm - the gradients
The next thing we need to build a linear SVM classifier is to compute the gradients of the cost function w.r.t. the model's parameters.
If the data point is correctly classified (thus equation (5) holds), the gradients of the cost function w.r.t. w and b are:
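∂J/∂w = 2 𝜆 w        (11)
∂J/∂b = 0        (12)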
Otherwise:
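∂J/∂w = 2 𝜆 w - yi Xi        (13)
∂J/∂b = -yi        (14)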
The SVM algorithm - model's parameters update
The last thing we need is the update rule for the model's parameters. We will use gradient descent, thus:
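w = w - 𝛼 ∂J/∂w        (15)
b = b - 𝛼 ∂J/∂b        (16)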
where 𝛼 is the learning rate, which typically ranges between 0.000001 and 0.01.
(15) and (16) imply that the value of the model's parameters at the next iteration is equal to their value at the current iteration minus the gradient multiplied by the learning rate.
Building the model
We have everything we need to train a linear SVM classifier. The figure below shows the data we will work with.
Figure 4. Observed data
Note that the data is clearly (easily) linearly separable. During training we will use helper functions to make the prediction, calculate the hinge loss, and calculate the gradients:
Figure 5. Helper functions
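A minimal sketch of such helpers, assuming the prediction is w · x + b and the gradients of equations (11)-(14) (function and variable names are illustrative), could look like this:

import numpy as np

def predict(x, w, b):
    # Raw score of a single data point: w . x + b
    return np.dot(w, x) + b

def hinge_loss(x, y, w, b):
    # Hinge loss of equations (6)-(7): zero when y * (w . x + b) >= 1, linear otherwise
    return max(0.0, 1.0 - y * predict(x, w, b))

def gradients(x, y, w, b, lambd):
    # Per-sample gradients of the cost w.r.t. w and b, equations (11)-(14)
    if y * predict(x, w, b) >= 1:
        dw = 2 * lambd * w          # only the regularization term contributes
        db = 0.0
    else:
        dw = 2 * lambd * w - y * x  # regularization term plus hinge term
        db = -y
    return dw, db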
Next we randomly initialize w and b, and define the hyperparameters' values:
Figure 6. w and b initialization and hyperparameters
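A sketch of this step; the random seed and the learning rate value are illustrative choices, while n_iter = 500 and the 𝜆 values match the experiments shown below:

import numpy as np

rng = np.random.default_rng(42)   # seed chosen only for reproducibility

# Randomly initialize the model's parameters (two features in this dataset)
w = rng.normal(size=2)
b = rng.normal()

# Hyperparameters
learning_rate = 0.0001   # alpha, an illustrative value within the 0.000001-0.01 range
lambd = 0.00001          # regularization parameter lambda (0.1 is used for Figure 9)
n_iter = 500             # number of training iterations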
We train the model for n_iter iterations, looping over the data row by row:
Figure 7. Training loop
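A sketch of the loop, building on the snippets above; X is assumed to hold the feature vectors (one row per data point) and Y the corresponding -1/+1 labels:

for _ in range(n_iter):
    for x_i, y_i in zip(X, Y):
        # Per-sample gradients, equations (11)-(14)
        dw, db = gradients(x_i, y_i, w, b, lambd)
        # Gradient descent updates, equations (15) and (16)
        w = w - learning_rate * dw
        b = b - learning_rate * db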
After 500 iterations the model identified the hyperplane and margins that best separate the two clusters. The very small 𝜆 led to the narrowest possible margin:
Figure 8. Result with 𝜆=0.00001
Meanwhile, when 𝜆 is high (0.1) the margin is much wider:
Figure 9. Result with 𝜆=0.1
Concluding remarks
This is the end of this article. We went through each step of the linear SVM classifier algorithm and we learned how the hyperparameter 𝜆 affects the width of the margin. In this tutorial we used the cost function (9). For the sake of completeness, I want you to know that another version of cost function (9) exists and is the following:
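J = (1/2) ||w||² + C Σi max(0, 1 - yi (w · Xi + b))        (17)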
(9) and (17) are equivalent and you can use either one: just keep in mind what the regularization parameter does and tune it appropriately. 𝜆 in (9) is equivalent to 1/C in (17).
One last thing: if you are wondering where the formula for the margin width (2/||w||, Figure 1) comes from, click HERE.