It is impossible to explain what the support vector machine does better than by showing it in action.
Having learned that Habr allows embedding media elements, I made a small demo (if it suddenly doesn't work, you can still try your luck with the version on github [1]). Place several red and blue points on the plane (a space of two features, X and Y): this is our dataset, and the machine performs the classification (each point of the background is colored according to the class the corresponding query would receive). Move the points around, change the kernel (I recommend trying Radial Basis Functions) and the hardness of the boundary (the constant C). My apologies for the terrible JS code; I have written in it only a few times in my life. To understand the algorithm, use the Python code later in the article.
Content
- A reminder of what the support vector machine is: the separating hyperplane, the soft margin, the dual problem and the kernel trick. If you already know all this, feel free to skip ahead; for more detail see [2] [3].
- The section "SMO Algorithm" describes the SMO algorithm, the simplifications I make to it, and the derivation of the update formulas.
- The section "Implementation" gives the simplified algorithm in Python.
- The section "Comparison with sklearn.svm.SVC" tests the algorithm on 2D datasets and with a confusion matrix on MNIST.
- The "Conclusion" sums things up.
Support Vector Machine (SVM) is a supervised machine learning method for classification, regression, anomaly detection and so on. We will consider it on the example of binary classification. The training sample is a set of feature vectors $x_i$, each assigned to one of two classes $y_i = \pm 1$. A classification query is a vector $x$ to which we must assign a class, $+1$ or $-1$.

Suppose the classes can be separated by a straight line (in higher dimensions, a hyperplane), as in the figure. How do we choose the best one? Intuitively, we want the line to pass right in the middle between the classes. To do this, write the equation of the line as $x \cdot w + b = 0$ and scale the parameters so that the dataset points closest to the line satisfy $x \cdot w + b = \pm 1$ (plus or minus depending on the class); these points are called support vectors.

The width of the strip between the support vectors of the two classes is then $2/\lVert w \rVert$, so maximizing it leads to the optimization problem
$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to: } y_i (x_i \cdot w + b) - 1 \ge 0.$$
Once it is solved, the classification of a query $x$ is performed as follows:
$$\text{class}(x) = \text{sign}(x \cdot w + b).$$
This is the simplest support vector machine.
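To make the decision rule concrete, here is a tiny sketch (the values of `w` and `b` are made up for illustration, not taken from a trained model): the class is just the sign of $x \cdot w + b$, and the margin width equals $2/\lVert w \rVert$.

```python
import numpy as np

# hypothetical parameters of an already trained hard-margin SVM in two dimensions
w = np.array([1.0, -2.0])
b = 0.5

def classify(points, w, b):
    # class(x) = sign(x . w + b), applied row-wise
    return np.sign(points @ w + b)

points = np.array([[2.0, 0.0], [-1.0, 1.5], [0.0, 0.3]])
print(classify(points, w, b))   # [ 1. -1. -1.]
print(2 / np.linalg.norm(w))    # width of the margin between the support vectors
```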
And what do we do when the points of the two classes overlap, as in the figure? In that case we introduce slack variables $\xi_i$, which allow individual points to violate the margin, and penalize them in the objective:

$$\min \Big( \frac{1}{2}\lVert w \rVert^2 + C\sum_i \xi_i \Big) \quad \text{subject to: } \xi_i + y_i (x_i \cdot w + b) - 1 \ge 0, \quad \xi_i \ge 0,$$
and the classification procedure stays the same. Here the hyperparameter $C$ controls the strength of regularization, that is, how strictly we require the points to respect the boundary: the larger $C$, the more of the $\xi_i$ vanish and the fewer points violate the margin. The support vectors in this case are the points with $\xi_i > 0$.
But what if the training set resembles the logo of The Who and the points can never be separated by a straight line? In this case we pass to the dual optimization problem [5]:

$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j (x_i \cdot x_j)\, \lambda_i \lambda_j, \quad \text{subject to: } 0 \le \lambda_i \le C \ \text{ for } i = 1, 2, \dots, n, \quad \sum_{i=1}^n y_i \lambda_i = 0,$$
where $\lambda_i$ are the dual variables. After solving the maximization problem we also need to compute the parameter $b$, which does not enter the dual problem but is needed by the classifier:
$$b = \mathbb{E}_{k:\,\xi_k \neq 0}\Big[\, y_k - \sum_i \lambda_i y_i (x_i \cdot x_k) \Big].$$
The classifier can (and should) be rewritten in terms of dual variables
$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i (x_i \cdot x) + b.$$
What is the advantage of this formulation? Note that all the vectors from the training set enter it exclusively through dot products $(x_i \cdot x_j)$. This is what enables the kernel trick [4]: replace the dot product with a kernel function $K(x_i; x_j)$, that is, first map the points onto a surface in a higher-dimensional space and only then compute the dot product of their images in that new space. Why this helps can be seen from the figure.
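To see the kernel trick at work on a toy case (a sketch that is not part of the article's code): for the homogeneous polynomial kernel of degree 2 on 2D points, $K(a, b) = (a \cdot b)^2$ coincides with the ordinary dot product of the explicitly mapped images $\varphi(a) = (a_1^2, \sqrt{2}\,a_1 a_2, a_2^2)$, so the classifier never has to form $\varphi$ at all.

```python
import numpy as np

def phi(p):
    # explicit map of a 2D point into 3D feature space
    return np.array([p[0]**2, np.sqrt(2) * p[0] * p[1], p[1]**2])

def poly2_kernel(a, b):
    # degree-2 polynomial kernel, computed directly from the original 2D points
    return np.dot(a, b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(a, b))                                      # 1.0
print(np.isclose(np.dot(phi(a), phi(b)), poly2_kernel(a, b)))  # True
```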

You may have seen a video where gravity is explained with a stretched rubber sheet shaped like a funnel [7]. This works because the motion of a point over a curved surface in a high-dimensional space is equivalent to the motion of its image in a lower-dimensional space equipped with a nontrivial metric. In effect, the kernel bends space.
SMO Algorithm
So, we are at the goal, it remains to solve the dual problem posed in the previous section
$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i; x_j)\, \lambda_i \lambda_j, \quad \text{subject to: } 0 \le \lambda_i \le C \ \text{ for } i = 1, 2, \dots, n, \quad \sum_{i=1}^n y_i \lambda_i = 0,$$
then find the parameter
$$b = \mathbb{E}_{k:\,\xi_k \neq 0}\Big[\, y_k - \sum_i \lambda_i y_i K(x_i; x_k) \Big],$$
and the classifier will take the following form
$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i K(x_i; x) + b.$$
The SMO algorithm (Sequential Minimal Optimization, [8]) for solving the dual problem works as follows. In a loop, a pair of dual variables $\lambda_M$ and $\lambda_L$ is chosen using a rather involved heuristic ([9]), and the objective function is then maximized over them under the condition that the sum $y_M\lambda_M + y_L\lambda_L$ stays constant and that $0 \le \lambda_M \le C$, $0 \le \lambda_L \le C$ (which sets the hardness of the boundary). The sum condition keeps $\sum_i y_i \lambda_i$ unchanged (after all, we did not touch the other lambdas). The algorithm stops when it detects that the so-called KKT conditions (Karush-Kuhn-Tucker, [10]) are satisfied sufficiently well.
I'm going to make some simplifications.
- I will drop the complicated index-selection heuristic (this is also done in the Stanford University course [11]): I will iterate over the first index and pick the second at random.
- I will not check the KKT conditions and will simply perform a fixed number of iterations specified in advance.
- In the optimization procedure itself, unlike the classical work [9] or the Stanford approach [11], I will use the vector equation of a straight line. This greatly simplifies the calculations (compare the length of [12] and [13]).
Now for the details. Information from the training sample can be written in the form of a matrix
$$K = \begin{pmatrix} y_1 y_1 K(x_1;x_1) & y_1 y_2 K(x_1;x_2) & \dots & y_1 y_N K(x_1;x_N) \\ y_2 y_1 K(x_2;x_1) & y_2 y_2 K(x_2;x_2) & \dots & y_2 y_N K(x_2;x_N) \\ \cdots & \cdots & \cdots & \cdots \\ y_N y_1 K(x_N;x_1) & y_N y_2 K(x_N;x_2) & \dots & y_N y_N K(x_N;x_N) \end{pmatrix}.$$
In what follows, I will use the notation with two indices ($K_{i,j}$) to refer to an element of this matrix and with one index ($K_k$) to denote its $k$-th column. Collect the dual variables into a column vector $\lambda$. We are interested in
$$\max_\lambda \underbrace{\sum_{i=1}^n \lambda_i - \frac{1}{2}\lambda^T K \lambda}_{\mathcal{L}}.$$
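As an aside, both the matrix $K$ and the objective $\mathcal{L}$ are one-liners in NumPy. The sketch below is purely illustrative (random data, an RBF kernel and arbitrary lambdas are my own choices); the same construction of $K$ reappears later in the fit method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # 5 points with 2 features
y = np.array([1, 1, -1, -1, 1])          # labels +-1
lambdas = rng.uniform(0, 1, size=5)      # some values of the dual variables

gamma = 1.0
rbf = lambda x, z: np.exp(-gamma * np.sum((z - x[:, np.newaxis])**2, axis=-1))

K = rbf(X, X) * y[:, np.newaxis] * y                 # K_{i,j} = y_i * y_j * K(x_i; x_j)
L = np.sum(lambdas) - 0.5 * lambdas @ K @ lambdas    # the objective
print(L)
```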
Suppose that at the current iteration we want to maximize the objective function with respect to the indices $L$ and $M$. We will be taking derivatives, so it is convenient to separate out the terms containing these indices. This is easy for the $\sum_i \lambda_i$ part, but the quadratic form requires a few transformations.
When computing $\lambda^T K \lambda$, the summation runs over two indices, say $i$ and $j$. Let us single out the index pairs that contain $L$ or $M$.

Let us rewrite the problem, collecting everything that contains neither $\lambda_L$ nor $\lambda_M$ into a constant. To keep track of the indices, refer to the structure of $K$ shown in the figure.
$$\begin{aligned} \mathcal{L} &= \lambda_M + \lambda_L - \sum_j \lambda_M \lambda_j K_{M,j} - \sum_i \lambda_L \lambda_i K_{L,i} + \frac{1}{2}\lambda_M^2 K_{M,M} + \lambda_M \lambda_L K_{M,L} + \frac{1}{2}\lambda_L^2 K_{L,L} + \text{const} \\ &= \lambda_M\Big(1 - \sum_j \lambda_j K_{M,j}\Big) + \lambda_L\Big(1 - \sum_i \lambda_i K_{L,i}\Big) + \frac{1}{2}\Big(\lambda_M^2 K_{M,M} + 2\lambda_M \lambda_L K_{M,L} + \lambda_L^2 K_{L,L}\Big) + \text{const} \\ &= k_0^T v_0 + \frac{1}{2} v_0^T Q v_0 + \text{const}, \end{aligned}$$
where const denotes the terms that do not depend on $\lambda_L$ or $\lambda_M$. In the last line I used the notation
$$v_0 = (\lambda_M, \lambda_L)^T, \quad k_0 = \big(1 - \lambda^T K_M,\; 1 - \lambda^T K_L\big)^T, \quad Q = \begin{pmatrix} K_{M,M} & K_{M,L} \\ K_{L,M} & K_{L,L} \end{pmatrix}, \quad u = (-y_L, y_M)^T.$$
Note that $k_0 + Qv_0$ depends neither on $\lambda_L$ nor on $\lambda_M$:
$$k_0 = \begin{pmatrix} 1 - \lambda_M K_{M,M} - \lambda_L K_{M,L} - \sum_{i \neq M,L} \lambda_i K_{M,i} \\ 1 - \lambda_M K_{L,M} - \lambda_L K_{L,L} - \sum_{i \neq M,L} \lambda_i K_{L,i} \end{pmatrix} = \begin{pmatrix} 1 - \sum_{i \neq M,L} \lambda_i K_{M,i} \\ 1 - \sum_{i \neq M,L} \lambda_i K_{L,i} \end{pmatrix} - Q v_0.$$
The kernel is symmetric, so $Q^T = Q$, and we can write
$$\mathcal{L} = (k_0 + Qv_0 - Qv_0)^T v_0 + \frac{1}{2} v_0^T Q v_0 + \text{const} = (k_0 + Qv_0)^T v_0 - \frac{1}{2} v_0^T Q v_0 + \text{const}.$$
We want to perform the maximization so that $y_L\lambda_L + y_M\lambda_M$ remains constant. For this, the new values must lie on the straight line
$$\big(\lambda_M^{\text{new}},\; \lambda_L^{\text{new}}\big)^T = v(t) = v_0 + t u.$$
It is easy to verify that for any $t$
$$y_M \lambda_M^{\text{new}} + y_L \lambda_L^{\text{new}} = y_M \lambda_M + y_L \lambda_L + t(-y_M y_L + y_L y_M) = y_M \lambda_M + y_L \lambda_L.$$
In this case, we must maximize
$$\mathcal{L}(t) = (k_0 + Qv_0)^T v(t) - \frac{1}{2}\, v^T(t)\, Q\, v(t) + \text{const},$$
which is easy to do by taking the derivative
$$\frac{d\mathcal{L}(t)}{dt} = (k_0 + Qv_0)^T \frac{dv}{dt} - \frac{1}{2}\left(\frac{d(v^T Q v)}{dv}\right)^T \frac{dv}{dt} = k_0^T u + \underbrace{v_0^T Q^T u - v^T Q^T u}_{(v_0^T - v^T)Qu} = k_0^T u - t\, u^T Q u.$$

Setting the derivative to zero, we obtain the optimal step
$$t^* = \frac{k_0^T u}{u^T Q u}.$$
One more thing: the optimal step may take us outside the square $[0, C] \times [0, C]$, as in the picture. In that case we take a step back and return to the boundary of the square, obtaining a restricted step $t^*_{\text{restr}}$:
$$\big(\lambda_M^{\text{new}},\; \lambda_L^{\text{new}}\big) = v_0 + t^*_{\text{restr}}\, u.$$
This completes the iteration; new indices are then selected.
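Before moving on, here is a quick numerical sanity check of the step formula (a synthetic sketch with made-up values, not part of the original article): for a random positive-definite $Q$ we maximize $\mathcal{L}(t)$ on a grid and compare the best grid point with $t^* = k_0^T u / (u^T Q u)$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(2, 2))
Q = B @ B.T + 1e-3 * np.eye(2)        # random symmetric positive-definite 2x2 matrix
k0 = rng.normal(size=2)
v0 = rng.uniform(0, 1, size=2)
u = np.array([-1.0, 1.0])             # direction (-y_L, y_M) for some pair of labels

def L_of_t(t):
    # L(t) = (k0 + Q v0)^T v(t) - 1/2 v(t)^T Q v(t), dropping the constant
    v = v0 + t * u
    return (k0 + Q @ v0) @ v - 0.5 * v @ Q @ v

t_star = (k0 @ u) / (u @ Q @ u)
ts = np.linspace(t_star - 1.0, t_star + 1.0, 20001)
t_grid = ts[np.argmax([L_of_t(t) for t in ts])]
print(t_star, t_grid)                 # the two values agree up to the grid spacing (~1e-4)
```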
Implementation
The overall scheme of training the simplified support vector machine can be summarized as follows: build the matrix $K$, run several passes of pairwise SMO updates of the lambdas, and finally compute $b$ from the support vectors.

Let's take a look at the code in a real programming language. If you don't like reading code in articles, you can study it on github [14].
import numpy as np

class SVM:
    def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
        self.kernel = {'poly'  : lambda x, y: np.dot(x, y.T) ** degree,
                       'rbf'   : lambda x, y: np.exp(-gamma * np.sum((y - x[:, np.newaxis]) ** 2, axis=-1)),
                       'linear': lambda x, y: np.dot(x, y.T)}[kernel]
        self.C = C
        self.max_iter = max_iter

    # restrict the step t so that the point v0 + t*u stays inside the square [0, C] x [0, C]
    def restrict_to_square(self, t, v0, u):
        t = (np.clip(v0 + t * u, 0, self.C) - v0)[1] / u[1]
        return (np.clip(v0 + t * u, 0, self.C) - v0)[0] / u[0]

    def fit(self, X, y):
        self.X = X.copy()
        # transform classes 0,1 into -1,+1; done for compatibility with sklearn
        self.y = y * 2 - 1
        self.lambdas = np.zeros_like(self.y, dtype=float)
        # the matrix K_{i,j} = y_i * y_j * K(x_i; x_j)
        self.K = self.kernel(self.X, self.X) * self.y[:, np.newaxis] * self.y

        # perform self.max_iter passes over the dataset
        for _ in range(self.max_iter):
            # iterate over the first index of the pair
            for idxM in range(len(self.lambdas)):
                # pick the second index idxL at random
                idxL = np.random.randint(0, len(self.lambdas))
                # the 2x2 submatrix Q for the chosen pair
                Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
                # the vector v0 of the current pair of lambdas
                v0 = self.lambdas[[idxM, idxL]]
                # the vector k0
                k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
                # the direction u along which sum(y_i * lambda_i) is preserved
                u = np.array([-self.y[idxL], self.y[idxM]])
                # the optimal step t*; the small constant protects against division by zero when idxM == idxL
                t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
                self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)

        # support vectors: points with nonzero lambdas
        idx, = np.nonzero(self.lambdas > 1E-15)
        # the parameter b, averaged over the support vectors
        self.b = np.mean((1.0 - np.sum(self.K[idx] * self.lambdas, axis=1)) * self.y[idx])

    def decision_function(self, X):
        return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b

    def predict(self, X):
        # transform -1,+1 back into 0,1; done for compatibility with sklearn
        return (np.sign(self.decision_function(X)) + 1) // 2
When creating an object of the SVM class you can specify the hyperparameters. Training is done by calling the fit method; the classes must be given as 0 and 1 (internally they are converted to −1 and +1, for better compatibility with sklearn), and the feature vectors may have arbitrary dimension. The predict method is used for classification.
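For example, a minimal usage sketch might look like this (the toy dataset and the parameter values are made up for illustration; the SVM class defined above is assumed to be in scope):

```python
import numpy as np

# a tiny toy dataset: two well-separated clusters, labels 0 and 1
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3],
              [2.0, 2.0], [2.2, 1.9], [1.8, 2.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVM(kernel='rbf', C=10, max_iter=60, gamma=1)
model.fit(X, y)

queries = np.array([[0.1, 0.2], [2.1, 2.0]])
print(model.predict(queries))            # [0. 1.]
print(model.decision_function(queries))  # f(x) for each query; the sign determines the class
```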
Comparison with sklearn.svm.SVC
Not that this comparison makes much sense, since we are talking about an extremely simplified algorithm written solely for teaching purposes, but still. For testing (and to see how to use all of this), you can do the following (this code is also available on github [14]).
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs, make_circles
from matplotlib.colors import ListedColormap
def test_plot(X, y, svm_model, axes, title):
    plt.axes(axes)
    xlim = [np.min(X[:, 0]), np.max(X[:, 0])]
    ylim = [np.min(X[:, 1]), np.max(X[:, 1])]
    xx, yy = np.meshgrid(np.linspace(*xlim, num=700), np.linspace(*ylim, num=700))
    rgb = np.array([[210, 0, 0], [0, 0, 150]]) / 255.0

    svm_model.fit(X, y)
    z_model = svm_model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plt.contour(xx, yy, z_model, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    plt.contourf(xx, yy, np.sign(z_model), alpha=0.3, levels=2, cmap=ListedColormap(rgb), zorder=1)
    plt.title(title)
X, y = make_circles(100, factor=.1, noise=.1)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='rbf', C=10, max_iter=60, gamma=1), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='rbf', C=10, gamma=1), axs[1], 'sklearn.svm.SVC')
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=1.4)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='linear', C=10, max_iter=60), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='linear', C=10), axs[1], 'sklearn.svm.SVC')
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='poly', C=5, max_iter=60, degree=3), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='poly', C=5, degree=3), axs[1], 'sklearn.svm.SVC')
After running the script, the pictures are generated; since the algorithm is randomized, they come out slightly different on each run. Here is an example of how the simplified algorithm performs (from left to right: linear, polynomial and RBF kernels):
(figures: decision regions of the simplified algorithm with linear, polynomial and RBF kernels)
And this is the result of the industrial version of the support vector machine.
(figures: decision regions of sklearn.svm.SVC with linear, polynomial and RBF kernels)
If dimension 2 seems too small to you, you can also test on MNIST (the code below uses the small digits dataset bundled with sklearn):
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
class_A = 3
class_B = 8
digits = datasets.load_digits()
mask = (digits.target == class_A) | (digits.target == class_B)
data = digits.images.reshape((len(digits.images), -1))[mask]
target = digits.target[mask] // max([class_A, class_B]) # rescale to 0,1
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.5, shuffle=True)
def plot_confusion(clf):
    clf.fit(X_train, y_train)
    y_fit = clf.predict(X_test)
    mat = confusion_matrix(y_test, y_fit)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
                xticklabels=[class_A, class_B], yticklabels=[class_A, class_B])
    plt.xlabel('true label')
    plt.ylabel('predicted label')
    plt.show()
print('sklearn:')
plot_confusion(svm.SVC(C=1.0, kernel='rbf', gamma=0.001))
print('custom svm:')
plot_confusion(SVM(kernel='rbf', C=1.0, max_iter=60, gamma=0.001))
For MNIST, I tried several random pairs of classes (the simplified algorithm only supports binary classification) and found no noticeable difference in quality between the simplified algorithm and sklearn. The confusion matrices below give an idea of the quality.

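If a single number is preferable to a matrix, accuracy can be read directly off the confusion matrix. The snippet below is an optional addition (not in the original code) and reuses X_train, X_test, y_train, y_test from the split above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def accuracy_from_confusion(mat):
    # fraction of correctly classified samples: the diagonal over the total
    return np.trace(mat) / np.sum(mat)

clf = SVM(kernel='rbf', C=1.0, max_iter=60, gamma=0.001)
clf.fit(X_train, y_train)
mat = confusion_matrix(y_test, clf.predict(X_test))
print(accuracy_from_confusion(mat))
```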
Conclusion
Thanks to everyone who read to the end. In this article we looked at a simplified, tutorial implementation of the support vector machine. It certainly cannot compete with an industrial implementation, but thanks to its extreme simplicity, its compact Python code, and the fact that all the basic ideas of SMO are preserved, this version of SVM can well take its place in the classroom. Note that this algorithm is simpler not only than the rather intricate original algorithm [9], but even than its simplified version from Stanford University [11]. After all, a support vector machine in 30 lines is beautiful.
List of references
- https://fbeilstein.github.io/simplest_smo_ever/
- page at http://www.machinelearning.ru
- "Beginnings of Machine Learning", KAU
- https://en.wikipedia.org/wiki/Kernel_method
- https://en.wikipedia.org/wiki/Duality_(optimization)
- http://www.machinelearning.ru
- https://www.youtube.com/watch?v=MTY1Kje0yLg
- https://en.wikipedia.org/wiki/Sequential_minimal_optimization
- Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
- https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
- http://cs229.stanford.edu/materials/smo.pdf
- https://www.researchgate.net/publication/344460740_Yet_more_simple_SMO_algorithm
- http://fourier.eng.hmc.edu/e176/lectures/ch9/node9.html
- https://github.com/fbeilstein/simplest_smo_ever