Support Vector Machine in 30 Lines

In this article I will show you how to write a very simple support vector machine, without scikit-learn or any other library with a ready-made implementation, in just 30 lines of Python. If you have always wanted to understand the SMO algorithm but it seemed too complicated, this article may be useful to you.



It is impossible to explain what the Matrix (I mean, the support vector machine) is ... You have to see it for yourself.





Having learned that Habr allows embedding media elements, I created a small demo (if it suddenly does not work, you can still try your luck with the version on github [1]). Place several red and blue points on the plane (the space of the two features X and Y); this is our dataset, and the machine will perform the classification (each point of the background is colored according to how the corresponding query would be classified). Move the points around, change the kernel (I recommend trying Radial Basis Functions) and the hardness of the border (the constant C). My apologies for the terrible JS code; I have written in it only a few times in my life. To understand the algorithm, use the Python code later in the article.












The support vector machine is a supervised machine learning method for solving classification, regression, anomaly detection, and other problems. We will consider it using the example of a binary classification problem. Our training sample is a set of feature vectors $x_i$, each assigned to one of two classes $y_i = \pm 1$. A classification query is a vector $x$ to which we must assign the class $+1$ or $-1$.



In the simplest case, the classes of the training sample can be separated by a single straight line, as in the figure (with more features it would be a hyperplane). Then, when a query arrives to classify some point $x$, it is reasonable to assign it to the class on whose side of the line it falls.



How do we choose the best straight line? Intuitively, we want the line to pass down the middle between the classes. To do this, write the equation of the line as $xw + b = 0$ and scale the parameters so that the dataset points closest to the line satisfy $xw + b = \pm 1$ (plus or minus depending on the class); these points are called support vectors.



In this case, the distance between the boundary points of the classes is $2/|w|$. Obviously, we want to maximize this value in order to separate the classes as well as possible. The latter is equivalent to minimizing $\frac{1}{2}|w|^2$, and the whole optimization problem is written as





$$\min \frac{1}{2}|w|^2 \quad \text{subject to: } y_i(x_i w + b) - 1 \geq 0.$$





If we solve it, then the classification of a query $x$ is performed as follows:





$$\text{class}(x) = \text{sign}(xw + b).$$





This is the simplest support vector machine.



And what do we do when the points of different classes overlap, as in the figure?



We can no longer solve the previous optimization problem: there are no parameters that satisfy its conditions. We can, however, allow the points to violate the boundary by an amount $\xi_i \geq 0$, while requiring that there be as few such violators as possible. This is achieved by adding an extra term to the objective function ($L_1$ regularization):





$$\min\left(\frac{1}{2}|w|^2 + C\sum_i \xi_i\right) \quad \text{subject to: } \xi_i + y_i(x_i w + b) - 1 \geq 0, \quad \xi_i \geq 0,$$





and the classification procedure stays the same as before. Here the hyperparameter $C$ controls the strength of the regularization, that is, how strictly we require points to respect the boundary: the larger $C$ is, the more of the $\xi_i$ vanish and the fewer points violate the boundary. The support vectors in this case are the points with $\xi_i > 0$.
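As a quick aside (using sklearn's SVC, which we will compare against at the end of the article), one way to feel the effect of $C$ is to train the same linear model with different values and watch how many support vectors remain; typically the count drops as $C$ grows, since fewer points are allowed to violate the margin:

from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# a small illustration of the role of C, not part of the algorithm itself
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=1.4)
for C in (0.01, 1.0, 100.0):
  model = SVC(kernel='linear', C=C).fit(X, y)
  print(C, len(model.support_))   # the number of support vectors typically shrinks as C grows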



But what if the training set looks like the logo of The Who and the points can never be separated by a straight line?



Here a clever technique, the kernel trick [4], comes to the rescue. To apply it, however, one must pass to the so-called dual Lagrange problem. A detailed description of it can be found on Wikipedia [5] or in the sixth lecture of the course [3]. The variables in which the new problem is solved are called dual variables or Lagrange multipliers. The dual problem is often simpler than the original one and has nice additional properties; for example, it is concave even when the original problem is not convex. Although its solution does not always coincide with the solution of the original problem (a duality gap), there are theorems which, under certain conditions, guarantee such a coincidence (strong duality). This happens to be our case, so we can safely pass to the dual problem





$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j (x_i \cdot x_j)\,\lambda_i\lambda_j, \quad \text{subject to: } 0 \leq \lambda_i \leq C \text{ for } i=1,2,\ldots,n, \quad \sum_{i=1}^n y_i\lambda_i = 0,$$





where the $\lambda_i$ are the dual variables. After solving the maximization problem we also need to compute the parameter $b$, which does not appear in the dual problem but is needed by the classifier:





$$b = \mathbb{E}_{k,\,\xi_k \neq 0}\left[y_k - \sum_i \lambda_i y_i (x_i \cdot x_k)\right].$$





The classifier can (and should) be rewritten in terms of the dual variables:





$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i (x_i \cdot x) + b.$$





What is the advantage of this formulation? Note that all the vectors from the training set enter it exclusively through dot products $(x_i \cdot x_j)$. We could first map the points onto a surface in a higher-dimensional space and only then compute the dot products of their images in the new space. Why we would want to do this can be seen from the figure.



If the mapping is chosen well, the images of the points become separable by a hyperplane! In fact, things are even better: we do not need the mapping itself, because we only care about the dot product, not the specific coordinates of the points. So the whole procedure can be emulated by replacing the dot product with a function $K(x_i, x_j)$ called the kernel. Of course, not every function can be a kernel: there must at least hypothetically exist a mapping $\varphi$ such that $K(x_i, x_j) = \varphi(x_i)\cdot\varphi(x_j)$. The necessary conditions are given by Mercer's theorem [6]. The Python implementation will provide the linear kernel ($K(x_i, x_j) = x_i^T x_j$), the polynomial kernel ($K(x_i, x_j) = (x_i^T x_j)^d$), and the radial basis function kernel ($K(x_i, x_j) = e^{-\gamma |x_i - x_j|^2}$). As the examples show, kernels can bring their own hyperparameters into the algorithm, which also affect how it behaves.
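To make this concrete, here is a small sketch (my own aside, assuming NumPy) of the three kernels in the same form they take in the implementation below, plus a quick check that the degree-2 polynomial kernel really is an ordinary dot product of an explicit feature map $\varphi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:

import numpy as np

# the three kernels, written exactly as in the SVM class later in the article
linear = lambda x, y: np.dot(x, y.T)
poly   = lambda x, y, d=2: np.dot(x, y.T)**d
rbf    = lambda x, y, gamma=1: np.exp(-gamma*np.sum((y - x[:, np.newaxis])**2, axis=-1))

# sanity check for two 2D points: (x . y)^2 equals phi(x) . phi(y)
phi = lambda x: np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly(x, y), np.dot(phi(x), phi(y)))   # both print 1.0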



You may have seen a video in which gravity is explained using a stretched rubber sheet shaped like a funnel [7]. This works because the motion of a point along a curved surface in a high-dimensional space is equivalent to the motion of its image in a lower-dimensional space equipped with a nontrivial metric. In effect, the kernel bends space.





SMO Algorithm



So, we are almost at the goal; it remains to solve the dual problem posed in the previous section,





$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i, x_j)\,\lambda_i\lambda_j, \quad \text{subject to: } 0 \leq \lambda_i \leq C \text{ for } i=1,2,\ldots,n, \quad \sum_{i=1}^n y_i\lambda_i = 0,$$





then find the parameter





$$b = \mathbb{E}_{k,\,\xi_k \neq 0}\left[y_k - \sum_i \lambda_i y_i K(x_i, x_k)\right], \tag{1}$$





and the classifier will take the following form





$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i K(x_i, x) + b. \tag{2}$$
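As a bridge to the code, here is a minimal NumPy sketch (my notation, not a fixed API) of how $b$ and the decision function $f(x)$ can be computed once the $\lambda$ are known; the same lines appear in the fit and decision_function methods of the implementation below:

import numpy as np

def offset_b(K_signed, lambdas, y):
  # K_signed[i, j] = y_i * y_j * K(x_i, x_j); see formula (3) below
  idx, = np.nonzero(lambdas > 1E-15)      # support vectors: nonzero lambdas
  return np.mean((1.0 - np.sum(K_signed[idx] * lambdas, axis=1)) * y[idx])

def decision_function(X_new, X, y, lambdas, b, kernel):
  # f(x) = sum_i lambda_i * y_i * K(x_i, x) + b, evaluated for every row of X_new
  return np.sum(kernel(X_new, X) * y * lambdas, axis=1) + b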





The SMO algorithm (Sequential Minimal Optimization, [8]) for solving the dual problem works as follows. In a loop, a pair of dual variables $\lambda_M$ and $\lambda_L$ is selected using an elaborate heuristic ([9]), and the objective function is then maximized over them under the condition that the sum $y_M\lambda_M + y_L\lambda_L$ stays constant and the constraints $0 \leq \lambda_M \leq C$, $0 \leq \lambda_L \leq C$ (which set the hardness of the border) hold. The condition on the sum keeps the sum of all $y_i\lambda_i$ unchanged (after all, we did not touch the other lambdas). The algorithm stops when it detects sufficiently good satisfaction of the so-called KKT conditions (Karush-Kuhn-Tucker [10]).



I am going to make a few simplifications:



  • I will abandon the complicated heuristic for choosing the indices (this is what is done in the Stanford University course [11]): I will iterate over one index and pick the second one at random.
  • I will skip checking the KKT conditions and simply perform a predetermined number of iterations.
  • In the optimization procedure itself, unlike the classical paper [9] or the Stanford approach [11], I will use the vector equation of a straight line. This greatly simplifies the calculations (compare the sizes of [12] and [13]).


Now for the details. Information from the training sample can be written in the form of a matrix





$$K = \begin{pmatrix} y_1 y_1 K(x_1, x_1) & y_1 y_2 K(x_1, x_2) & \cdots & y_1 y_N K(x_1, x_N) \\ y_2 y_1 K(x_2, x_1) & y_2 y_2 K(x_2, x_2) & \cdots & y_2 y_N K(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ y_N y_1 K(x_N, x_1) & y_N y_2 K(x_N, x_2) & \cdots & y_N y_N K(x_N, x_N) \end{pmatrix}. \tag{3}$$
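In NumPy this entire matrix can be built with one line of broadcasting; here is a tiny sketch on a toy dataset (the same expression appears in the fit method below):

import numpy as np

kernel = lambda a, b: np.dot(a, b.T)               # linear kernel, just for illustration
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, 1])
K = kernel(X, X) * y[:, np.newaxis] * y            # K[i, j] = y_i * y_j * K(x_i, x_j)
print(K)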





In what follows I will use two indices ($K_{i,j}$) to denote an element of this matrix and one index ($K_k$) to denote its $k$-th column. Collect the dual variables into a column vector $\boldsymbol{\lambda}$. We are interested in





$$\max_\lambda \mathscr{L}, \qquad \mathscr{L} = \sum_{i=1}^n \lambda_i - \frac{1}{2}\boldsymbol{\lambda}^T K \boldsymbol{\lambda}.$$





Suppose that at the current iteration we want to maximize the objective function with respect to the indices $L$ and $M$. We will be taking derivatives, so it is convenient to single out the terms containing the indices $L$ and $M$. This is easy to do in the sum of the $\lambda_i$, but the quadratic form requires a few transformations.



When computing $\boldsymbol{\lambda}^T K \boldsymbol{\lambda}$, the summation runs over two indices, say $i$ and $j$. Let us highlight the pairs of indices that contain $L$ or $M$.





Let us rewrite the problem, lumping together everything that contains neither $\lambda_L$ nor $\lambda_M$. To make it easier to keep track of the indices, refer to the picture of $K$ above.

$$\begin{aligned} \mathscr{L} &= \lambda_M + \lambda_L - \sum_{j}\lambda_M\lambda_j K_{M,j} - \sum_{i}\lambda_L\lambda_i K_{L,i} + \text{const} \\ &\quad + \underbrace{\frac{1}{2}\lambda_M^2 K_{M,M} + \lambda_M\lambda_L K_{M,L} + \frac{1}{2}\lambda_L^2 K_{L,L}}_{\text{compensation for double counting}} \\ &= \lambda_M\left(1 - \sum_{j}\lambda_j K_{M,j}\right) + \lambda_L\left(1 - \sum_{i}\lambda_i K_{L,i}\right) \\ &\quad + \frac{1}{2}\left(\lambda_M^2 K_{M,M} + 2\lambda_M\lambda_L K_{M,L} + \lambda_L^2 K_{L,L}\right) + \text{const} \\ &= \boldsymbol{k}_0^T\boldsymbol{v}_0 + \frac{1}{2}\boldsymbol{v}_0^{T}\boldsymbol{Q}\,\boldsymbol{v}_0 + \text{const}, \end{aligned}$$





where const denotes the terms that depend neither on $\lambda_L$ nor on $\lambda_M$. In the last line I used the notation

$$\begin{align} \boldsymbol{v}_0 &= (\lambda_M, \lambda_L)^T, \tag{4a}\\ \boldsymbol{k}_0 &= \left(1 - \boldsymbol{\lambda}^T \boldsymbol{K}_{M},\ 1 - \boldsymbol{\lambda}^T \boldsymbol{K}_{L}\right)^T, \tag{4b}\\ \boldsymbol{Q} &= \begin{pmatrix} K_{M,M} & K_{M,L} \\ K_{L,M} & K_{L,L} \end{pmatrix}, \tag{4c}\\ \boldsymbol{u} &= (-y_L, y_M)^T. \tag{4d} \end{align}$$





Note that $\boldsymbol{k}_0 + \boldsymbol{Q}\boldsymbol{v}_0$ depends neither on $\lambda_L$ nor on $\lambda_M$:

$$\boldsymbol{k}_0 = \begin{pmatrix} 1 - \lambda_M K_{M,M} - \lambda_L K_{M,L} - \sum_{i \neq M,L}\lambda_i K_{M,i} \\ 1 - \lambda_M K_{L,M} - \lambda_L K_{L,L} - \sum_{i \neq M,L}\lambda_i K_{L,i} \end{pmatrix} = \begin{pmatrix} 1 - \sum_{i \neq M,L}\lambda_i K_{M,i} \\ 1 - \sum_{i \neq M,L}\lambda_i K_{L,i} \end{pmatrix} - \boldsymbol{Q}\boldsymbol{v}_0.$$





The kernel is symmetric, so $\boldsymbol{Q}^T = \boldsymbol{Q}$, and we can write





$$\mathscr{L} = (\boldsymbol{k}_0 + \boldsymbol{Q}\boldsymbol{v}_0 - \boldsymbol{Q}\boldsymbol{v}_0)^T\boldsymbol{v}_0 + \frac{1}{2}\boldsymbol{v}_0^{T}\boldsymbol{Q}\,\boldsymbol{v}_0 + \text{const} = (\boldsymbol{k}_0 + \boldsymbol{Q}\boldsymbol{v}_0)^T\boldsymbol{v}_0 - \frac{1}{2}\boldsymbol{v}_0^{T}\boldsymbol{Q}\,\boldsymbol{v}_0 + \text{const}.$$





We want to perform the maximization so that $y_L\lambda_L + y_M\lambda_M$ remains constant. To do this, the new values must lie on the straight line

$$(\lambda_M^\text{new}, \lambda_L^\text{new})^T = \boldsymbol{v}(t) = \boldsymbol{v}_0 + t\,\boldsymbol{u}.$$





It is easy to check that for any $t$

$$y_M\lambda_M^\text{new} + y_L\lambda_L^\text{new} = y_M\lambda_M + y_L\lambda_L + t(-y_M y_L + y_L y_M) = y_M\lambda_M + y_L\lambda_L.$$





In this case, we must maximize





$$\mathscr{L}(t) = (\boldsymbol{k}_0 + \boldsymbol{Q}\boldsymbol{v}_0)^T\boldsymbol{v}(t) - \frac{1}{2}\boldsymbol{v}^T(t)\,\boldsymbol{Q}\,\boldsymbol{v}(t) + \text{const},$$





which is easy to do by taking the derivative





$$\frac{d\mathscr{L}(t)}{dt} = (\boldsymbol{k}_0 + \boldsymbol{Q}\boldsymbol{v}_0)^T\frac{d\boldsymbol{v}}{dt} - \frac{1}{2}\left(\frac{d(\boldsymbol{v}^T\boldsymbol{Q}\boldsymbol{v})}{d\boldsymbol{v}}\right)^T\frac{d\boldsymbol{v}}{dt} = \boldsymbol{k}_0^T\boldsymbol{u} + \boldsymbol{v}_0^T\boldsymbol{Q}^T\boldsymbol{u} - \boldsymbol{v}^T\boldsymbol{Q}^T\boldsymbol{u} = \boldsymbol{k}_0^T\boldsymbol{u} - (\boldsymbol{v}^T - \boldsymbol{v}_0^T)\boldsymbol{Q}\boldsymbol{u} = \boldsymbol{k}_0^T\boldsymbol{u} - t\,\boldsymbol{u}^T\boldsymbol{Q}\boldsymbol{u}.$$





Equating the derivative to zero, we get





$$t_* = \frac{\boldsymbol{k}_0^T\boldsymbol{u}}{\boldsymbol{u}^T\boldsymbol{Q}\boldsymbol{u}}. \tag{5}$$
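A throwaway numerical check of this formula (my own aside, with random data): at $t_*$ the derivative of $\mathscr{L}(t)$ should vanish.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2)); Q = A + A.T                  # a random symmetric 2x2 matrix
k0, v0, u = rng.normal(size=2), rng.normal(size=2), np.array([-1.0, 1.0])
L = lambda t: np.dot(k0 + Q @ v0, v0 + t*u) - 0.5*(v0 + t*u) @ Q @ (v0 + t*u)
t_star = np.dot(k0, u) / (u @ Q @ u)
eps = 1e-6
print((L(t_star + eps) - L(t_star - eps)) / (2*eps))      # numerically ~0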





And one more thing: we might step further than necessary and end up outside the square $[0, C]\times[0, C]$, as in the picture. In that case we need to step back and return to its boundary,

$$(\lambda_M^\text{new}, \lambda_L^\text{new}) = \boldsymbol{v}_0 + t_*^{\text{restr}}\,\boldsymbol{u}.$$





This completes the iteration, and a new pair of indices is selected.
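In NumPy a single update of the chosen pair then looks roughly like this (a sketch in the notation (4a)-(4d); the implementation below does exactly this inside fit and restrict_to_square, with a tiny constant added to the denominator to guard against division by zero when the two indices coincide):

import numpy as np

def restrict_to_square(t, v0, u, C):
  # shrink t so that v0 + t*u stays inside the square [0, C] x [0, C]
  t = (np.clip(v0 + t*u, 0, C) - v0)[1] / u[1]
  return (np.clip(v0 + t*u, 0, C) - v0)[0] / u[0]

def update_pair(K, lambdas, y, idxM, idxL, C):
  # K is the signed kernel matrix from formula (3), lambdas the current dual variables
  Q  = K[np.ix_([idxM, idxL], [idxM, idxL])]              # 2x2 matrix from (4c)
  v0 = lambdas[[idxM, idxL]]                              # (4a)
  k0 = 1 - np.sum(lambdas * K[[idxM, idxL]], axis=1)      # (4b)
  u  = np.array([-y[idxL], y[idxM]])                      # (4d)
  t  = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)  # optimal step from (5)
  lambdas[[idxM, idxL]] = v0 + u * restrict_to_square(t, v0, u, C)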





Implementation





The overall scheme for training the simplified support vector machine can be written as follows.
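  • Build the matrix $K$ (formula (3)) from the training set and set all $\lambda_i$ to zero.
  • Repeat max_iter times: for each index $M$, pick an index $L$ at random, form $\boldsymbol{Q}$, $\boldsymbol{v}_0$, $\boldsymbol{k}_0$, $\boldsymbol{u}$ (formulas (4a)-(4d)), compute the step $t_*$ (formula (5)), shrink it so that the updated pair of lambdas stays inside the square $[0, C]\times[0, C]$, and update $\lambda_M$, $\lambda_L$.
  • Finally, compute $b$ by averaging over the support vectors (formula (1)).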







Let's take a look at the code in a real programming language. If you do not like reading code in articles, you can study it on github [14].



Simplified support vector machine source code
import numpy as np

class SVM:
  def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
    self.kernel = {'poly'  : lambda x,y: np.dot(x, y.T)**degree,
         'rbf': lambda x,y: np.exp(-gamma*np.sum((y-x[:,np.newaxis])**2,axis=-1)),
         'linear': lambda x,y: np.dot(x, y.T)}[kernel]
    self.C = C
    self.max_iter = max_iter

  # shrink the step t so that v0 + t*u stays inside the square [0, C] x [0, C]
  def restrict_to_square(self, t, v0, u): 
    t = (np.clip(v0 + t*u, 0, self.C) - v0)[1]/u[1]
    return (np.clip(v0 + t*u, 0, self.C) - v0)[0]/u[0]

  def fit(self, X, y):
    self.X = X.copy()
    # convert classes 0,1 to -1,+1; done for better compatibility with sklearn
    self.y = y * 2 - 1 
    self.lambdas = np.zeros_like(self.y, dtype=float)
    # the matrix from formula (3)
    self.K = self.kernel(self.X, self.X) * self.y[:,np.newaxis] * self.y
    
    # perform self.max_iter iterations
    for _ in range(self.max_iter):
      # iterate over all lambdas (index M)
      for idxM in range(len(self.lambdas)):                                    
        # choose idxL at random
        idxL = np.random.randint(0, len(self.lambdas))                         
        # the 2x2 matrix Q from formula (4c)
        Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]] 
        # the vector v0 from formula (4a)
        v0 = self.lambdas[[idxM, idxL]]                                        
        # the vector k0 from formula (4b)
        k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)           
        # the direction u from formula (4d)
        u = np.array([-self.y[idxL], self.y[idxM]])                            
        # the optimal step from formula (5); the tiny constant guards against division by zero when idxM = idxL
        t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15) 
        self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)
    
    # support vectors are the points with nonzero lambdas
    idx, = np.nonzero(self.lambdas > 1E-15) 
    # compute b from formula (1)
    self.b = np.mean((1.0-np.sum(self.K[idx]*self.lambdas, axis=1))*self.y[idx]) 
  
  def decision_function(self, X):
    return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b

  def predict(self, X): 
    # convert classes -1,+1 back to 0,1; done for compatibility with sklearn
    return (np.sign(self.decision_function(X)) + 1) // 2










When creating an object of the SVM class you can specify the hyperparameters. Training is done by calling the fit method; the classes must be given as 0 and 1 (internally they are converted to -1 and +1, which is done for better compatibility with sklearn), and the feature vectors may have any dimension. The predict method is used for classification.
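For example, a minimal usage sketch (with a made-up toy dataset):

import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 0, 1, 1])                       # classes are given as 0 and 1

model = SVM(kernel='rbf', C=10, max_iter=60, gamma=1)
model.fit(X, y)
print(model.predict(np.array([[0.1, 0.1], [1.1, 0.9]])))   # should print [0. 1.]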





Comparison with sklearn.svm.SVC



Not that this comparison makes much sense, since we are talking about an extremely simplified algorithm written solely for teaching purposes, but still. For testing (and to see how to use all of this), you can do the following (this code is also available on github [14]).



Comparison with sklearn.svm.SVC on a simple 2D dataset
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs, make_circles
from matplotlib.colors import ListedColormap

def test_plot(X, y, svm_model, axes, title):
  plt.axes(axes)
  xlim = [np.min(X[:, 0]), np.max(X[:, 0])]
  ylim = [np.min(X[:, 1]), np.max(X[:, 1])]
  xx, yy = np.meshgrid(np.linspace(*xlim, num=700), np.linspace(*ylim, num=700))
  rgb=np.array([[210, 0, 0], [0, 0, 150]])/255.0
  
  svm_model.fit(X, y)
  z_model = svm_model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
  
  plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
  plt.contour(xx, yy, z_model, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
  plt.contourf(xx, yy, np.sign(z_model.reshape(xx.shape)), alpha=0.3, levels=2, cmap=ListedColormap(rgb), zorder=1)
  plt.title(title)

X, y = make_circles(100, factor=.1, noise=.1)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='rbf', C=10, max_iter=60, gamma=1), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='rbf', C=10, gamma=1), axs[1], 'sklearn.svm.SVC')

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=1.4)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='linear', C=10, max_iter=60), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='linear', C=10), axs[1], 'sklearn.svm.SVC')

fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='poly', C=5, max_iter=60, degree=3), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='poly', C=5, degree=3), axs[1], 'sklearn.svm.SVC')








After running this, the pictures will be generated; since the algorithm is randomized, they will differ slightly from run to run. Here is an example of how the simplified algorithm performs (from left to right: linear, polynomial, and rbf kernels).



And this is the result of the industrial version of the support vector machine.



If two dimensions seem too few, you can also test on MNIST.



Comparison with sklearn.svm.SVC on 2 classes from MNIST
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

class_A = 3
class_B = 8

digits = datasets.load_digits()
mask = (digits.target == class_A) | (digits.target == class_B)
data = digits.images.reshape((len(digits.images), -1))[mask]
target = digits.target[mask] // max([class_A, class_B]) # rescale to 0,1
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.5, shuffle=True)

def plot_confusion(clf):
  clf.fit(X_train, y_train)
  y_fit = clf.predict(X_test)

  mat = confusion_matrix(y_test, y_fit)
  sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, xticklabels=[class_A,class_B], yticklabels=[class_A,class_B])
  plt.xlabel('true label')
  plt.ylabel('predicted label');
  plt.show()

print('sklearn:')
plot_confusion(svm.SVC(C=1.0, kernel='rbf', gamma=0.001))
print('custom svm:')
plot_confusion(SVM(kernel='rbf', C=1.0, max_iter=60, gamma=0.001))








For MNIST I tried several random pairs of classes (the simplified algorithm supports only binary classification) and found no difference in quality between the simplified algorithm and sklearn. The following confusion matrix gives an idea of the quality.







Conclusion



Thanks to everyone who read to the end. In this article we looked at a simplified, tutorial implementation of the support vector machine. It obviously cannot compete with an industrial-grade implementation, but thanks to its extreme simplicity, its compact Python code, and the fact that all the basic ideas of SMO are preserved, this version of SVM may well find its place in the classroom. It is worth noting that this algorithm is simpler not only than the rather intricate original algorithm [9] but even than its simplified version from Stanford University [11]. A support vector machine in 30 lines is a beautiful thing, after all.



List of references



  1. https://fbeilstein.github.io/simplest_smo_ever/
  2. page at http://www.machinelearning.ru
  3. "Beginnings of Machine Learning", KAU
  4. https://en.wikipedia.org/wiki/Kernel_method
  5. https://en.wikipedia.org/wiki/Duality_(optimization)
  6. http://www.machinelearning.ru
  7. https://www.youtube.com/watch?v=MTY1Kje0yLg
  8. https://en.wikipedia.org/wiki/Sequential_minimal_optimization
  9. Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
  10. https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
  11. http://cs229.stanford.edu/materials/smo.pdf
  12. https://www.researchgate.net/publication/344460740_Yet_more_simple_SMO_algorithm
  13. http://fourier.eng.hmc.edu/e176/lectures/ch9/node9.html
  14. https://github.com/fbeilstein/simplest_smo_ever


