It is impossible to explain what the support vector machine does better than by showing it in action.
Having learned that Habr allows embedding media elements, I made a small demo (if it suddenly doesn't work, you can still try your luck with the version on github [1]). Place several red and blue points on the plane (a space of two features, X and Y): this is our dataset, and the machine performs the classification (each point of the background is colored according to the class the corresponding query would receive). Move the points around, change the kernel (I recommend trying Radial Basis Functions) and the hardness of the boundary (the constant C). My apologies for the terrible JS code; I have written in it only a few times in my life. To understand the algorithm, use the Python code later in the article.
Content
- A reminder of what the support vector machine is: the separating hyperplane, the soft margin, the dual problem and the kernel trick. If you already know all this, feel free to skip ahead; for more detail see [2] [3].
- The section "SMO Algorithm" describes the SMO algorithm, the simplifications I make to it, and the derivation of the update formulas.
- The section "Implementation" gives the simplified algorithm in Python.
- The section "Comparison with sklearn.svm.SVC" tests the algorithm on 2D datasets and with a confusion matrix on MNIST.
- The "Conclusion" sums things up.
Support Vector Machine (SVM) is a supervised machine learning method for classification, regression, anomaly detection and so on. We will consider it on the example of binary classification. The training sample is a set of feature vectors $x_i$, each assigned to one of two classes $y_i = \pm 1$. A classification query is a vector $x$ to which we must assign a class, $+1$ or $-1$.

Suppose the classes can be separated by a straight line (in higher dimensions, a hyperplane), as in the figure. How do we choose the best one? Intuitively, we want the line to pass right in the middle between the classes. To do this, write the equation of the line as $x \cdot w + b = 0$ and scale the parameters so that the dataset points closest to the line satisfy $x \cdot w + b = \pm 1$ (plus or minus depending on the class); these points are called support vectors.

The width of the strip between the support vectors of the two classes is then $2/\lVert w \rVert$, so maximizing it leads to the optimization problem
$$\min \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to: } y_i (x_i \cdot w + b) - 1 \ge 0.$$
Once it is solved, the classification of a query $x$ is performed as follows:
$$\text{class}(x) = \text{sign}(x \cdot w + b).$$
This is the simplest support vector machine.
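To make the decision rule concrete, here is a tiny sketch (the values of `w` and `b` are made up for illustration, not taken from a trained model): the class is just the sign of $x \cdot w + b$, and the margin width equals $2/\lVert w \rVert$.

```python
import numpy as np

# hypothetical parameters of an already trained hard-margin SVM in two dimensions
w = np.array([1.0, -2.0])
b = 0.5

def classify(points, w, b):
    # class(x) = sign(x . w + b), applied row-wise
    return np.sign(points @ w + b)

points = np.array([[2.0, 0.0], [-1.0, 1.5], [0.0, 0.3]])
print(classify(points, w, b))   # [ 1. -1. -1.]
print(2 / np.linalg.norm(w))    # width of the margin between the support vectors
```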
And what do we do when the points of the two classes overlap, as in the figure? In that case we introduce slack variables $\xi_i$, which allow individual points to violate the margin, and penalize them in the objective:

$$\min \Big( \frac{1}{2}\lVert w \rVert^2 + C\sum_i \xi_i \Big) \quad \text{subject to: } \xi_i + y_i (x_i \cdot w + b) - 1 \ge 0, \quad \xi_i \ge 0,$$
and the classification procedure stays the same. Here the hyperparameter $C$ controls the strength of regularization, that is, how strictly we require the points to respect the boundary: the larger $C$, the more of the $\xi_i$ vanish and the fewer points violate the margin. The support vectors in this case are the points with $\xi_i > 0$.
But what if the training set resembles the logo of The Who and the points can never be separated by a straight line? In this case we pass to the dual optimization problem [5]:

$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j (x_i \cdot x_j)\, \lambda_i \lambda_j, \quad \text{subject to: } 0 \le \lambda_i \le C \ \text{ for } i = 1, 2, \dots, n, \quad \sum_{i=1}^n y_i \lambda_i = 0,$$
where $\lambda_i$ are the dual variables. After solving the maximization problem we also need to compute the parameter $b$, which does not enter the dual problem but is needed by the classifier:
$$b = \mathbb{E}_{k:\,\xi_k \neq 0}\Big[\, y_k - \sum_i \lambda_i y_i (x_i \cdot x_k) \Big].$$
The classifier can (and should) be rewritten in terms of dual variables
$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i (x_i \cdot x) + b.$$
What is the advantage of this formulation? Note that all the vectors from the training set enter it exclusively through dot products $(x_i \cdot x_j)$. This is what enables the kernel trick [4]: replace the dot product with a kernel function $K(x_i; x_j)$, that is, first map the points onto a surface in a higher-dimensional space and only then compute the dot product of their images in that new space. Why this helps can be seen from the figure.
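To see the kernel trick at work on a toy case (a sketch that is not part of the article's code): for the homogeneous polynomial kernel of degree 2 on 2D points, $K(a, b) = (a \cdot b)^2$ coincides with the ordinary dot product of the explicitly mapped images $\varphi(a) = (a_1^2, \sqrt{2}\,a_1 a_2, a_2^2)$, so the classifier never has to form $\varphi$ at all.

```python
import numpy as np

def phi(p):
    # explicit map of a 2D point into 3D feature space
    return np.array([p[0]**2, np.sqrt(2) * p[0] * p[1], p[1]**2])

def poly2_kernel(a, b):
    # degree-2 polynomial kernel, computed directly from the original 2D points
    return np.dot(a, b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(a, b))                                      # 1.0
print(np.isclose(np.dot(phi(a), phi(b)), poly2_kernel(a, b)))  # True
```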

You may have seen a video where gravity is explained with a stretched rubber sheet shaped like a funnel [7]. This works because the motion of a point over a curved surface in a high-dimensional space is equivalent to the motion of its image in a lower-dimensional space equipped with a nontrivial metric. In effect, the kernel bends space.
SMO Algorithm
So, we are at the goal, it remains to solve the dual problem posed in the previous section
$$\max_\lambda \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i; x_j)\, \lambda_i \lambda_j, \quad \text{subject to: } 0 \le \lambda_i \le C \ \text{ for } i = 1, 2, \dots, n, \quad \sum_{i=1}^n y_i \lambda_i = 0,$$
then find the parameter
$$b = \mathbb{E}_{k:\,\xi_k \neq 0}\Big[\, y_k - \sum_i \lambda_i y_i K(x_i; x_k) \Big],$$
and the classifier will take the following form
$$\text{class}(x) = \text{sign}(f(x)), \qquad f(x) = \sum_i \lambda_i y_i K(x_i; x) + b.$$
The SMO algorithm (Sequential Minimal Optimization, [8]) for solving the dual problem works as follows. In a loop, a pair of dual variables $\lambda_M$ and $\lambda_L$ is chosen using a rather involved heuristic ([9]), and the objective function is then maximized over them under the condition that the sum $y_M\lambda_M + y_L\lambda_L$ stays constant and that $0 \le \lambda_M \le C$, $0 \le \lambda_L \le C$ (which sets the hardness of the boundary). The sum condition keeps $\sum_i y_i \lambda_i$ unchanged (after all, we did not touch the other lambdas). The algorithm stops when it detects that the so-called KKT conditions (Karush-Kuhn-Tucker, [10]) are satisfied sufficiently well.
I'm going to make some simplifications.
- I will drop the complicated index-selection heuristic (this is also done in the Stanford University course [11]): I will iterate over the first index and pick the second at random.
- I will not check the KKT conditions and will simply perform a fixed number of iterations specified in advance.
- In the optimization procedure itself, unlike the classical work [9] or the Stanford approach [11], I will use the vector equation of a straight line. This greatly simplifies the calculations (compare the length of [12] and [13]).
Now for the details. Information from the training sample can be written in the form of a matrix
$$K = \begin{pmatrix} y_1 y_1 K(x_1;x_1) & y_1 y_2 K(x_1;x_2) & \dots & y_1 y_N K(x_1;x_N) \\ y_2 y_1 K(x_2;x_1) & y_2 y_2 K(x_2;x_2) & \dots & y_2 y_N K(x_2;x_N) \\ \cdots & \cdots & \cdots & \cdots \\ y_N y_1 K(x_N;x_1) & y_N y_2 K(x_N;x_2) & \dots & y_N y_N K(x_N;x_N) \end{pmatrix}.$$
In what follows, I will use the notation with two indices ($K_{i,j}$) to refer to an element of this matrix and with one index ($K_k$) to denote its $k$-th column. Collect the dual variables into a column vector $\lambda$. We are interested in
$$\max_\lambda \underbrace{\sum_{i=1}^n \lambda_i - \frac{1}{2}\lambda^T K \lambda}_{\mathcal{L}}.$$
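As an aside, both the matrix $K$ and the objective $\mathcal{L}$ are one-liners in NumPy. The sketch below is purely illustrative (random data, an RBF kernel and arbitrary lambdas are my own choices); the same construction of $K$ reappears later in the fit method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # 5 points with 2 features
y = np.array([1, 1, -1, -1, 1])          # labels +-1
lambdas = rng.uniform(0, 1, size=5)      # some values of the dual variables

gamma = 1.0
rbf = lambda x, z: np.exp(-gamma * np.sum((z - x[:, np.newaxis])**2, axis=-1))

K = rbf(X, X) * y[:, np.newaxis] * y                 # K_{i,j} = y_i * y_j * K(x_i; x_j)
L = np.sum(lambdas) - 0.5 * lambdas @ K @ lambdas    # the objective
print(L)
```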
Suppose that at the current iteration we want to maximize the objective function with respect to the indices $L$ and $M$. We will be taking derivatives, so it is convenient to separate out the terms containing these indices. This is easy for the $\sum_i \lambda_i$ part, but the quadratic form requires a few transformations.
When computing $\lambda^T K \lambda$, the summation runs over two indices, say $i$ and $j$. Let us single out the index pairs that contain $L$ or $M$.

Let us rewrite the problem, collecting everything that contains neither $\lambda_L$ nor $\lambda_M$ into a constant. To keep track of the indices, refer to the structure of $K$ shown in the figure.
$$\begin{aligned} \mathcal{L} &= \lambda_M + \lambda_L - \sum_j \lambda_M \lambda_j K_{M,j} - \sum_i \lambda_L \lambda_i K_{L,i} + \frac{1}{2}\lambda_M^2 K_{M,M} + \lambda_M \lambda_L K_{M,L} + \frac{1}{2}\lambda_L^2 K_{L,L} + \text{const} \\ &= \lambda_M\Big(1 - \sum_j \lambda_j K_{M,j}\Big) + \lambda_L\Big(1 - \sum_i \lambda_i K_{L,i}\Big) + \frac{1}{2}\Big(\lambda_M^2 K_{M,M} + 2\lambda_M \lambda_L K_{M,L} + \lambda_L^2 K_{L,L}\Big) + \text{const} \\ &= k_0^T v_0 + \frac{1}{2} v_0^T Q v_0 + \text{const}, \end{aligned}$$
where const denotes the terms that do not depend on $\lambda_L$ or $\lambda_M$. In the last line I used the notation
$$v_0 = (\lambda_M, \lambda_L)^T, \quad k_0 = \big(1 - \lambda^T K_M,\; 1 - \lambda^T K_L\big)^T, \quad Q = \begin{pmatrix} K_{M,M} & K_{M,L} \\ K_{L,M} & K_{L,L} \end{pmatrix}, \quad u = (-y_L, y_M)^T.$$
Note that $k_0 + Qv_0$ depends neither on $\lambda_L$ nor on $\lambda_M$:
$$k_0 = \begin{pmatrix} 1 - \lambda_M K_{M,M} - \lambda_L K_{M,L} - \sum_{i \neq M,L} \lambda_i K_{M,i} \\ 1 - \lambda_M K_{L,M} - \lambda_L K_{L,L} - \sum_{i \neq M,L} \lambda_i K_{L,i} \end{pmatrix} = \begin{pmatrix} 1 - \sum_{i \neq M,L} \lambda_i K_{M,i} \\ 1 - \sum_{i \neq M,L} \lambda_i K_{L,i} \end{pmatrix} - Q v_0.$$
The kernel is symmetric, so $Q^T = Q$, and we can write
$$\mathcal{L} = (k_0 + Qv_0 - Qv_0)^T v_0 + \frac{1}{2} v_0^T Q v_0 + \text{const} = (k_0 + Qv_0)^T v_0 - \frac{1}{2} v_0^T Q v_0 + \text{const}.$$
We want to perform the maximization so that $y_L\lambda_L + y_M\lambda_M$ remains constant. For this, the new values must lie on the straight line
$$\big(\lambda_M^{\text{new}},\; \lambda_L^{\text{new}}\big)^T = v(t) = v_0 + t u.$$
It is easy to verify that for any $t$
$$y_M \lambda_M^{\text{new}} + y_L \lambda_L^{\text{new}} = y_M \lambda_M + y_L \lambda_L + t(-y_M y_L + y_L y_M) = y_M \lambda_M + y_L \lambda_L.$$
In this case, we must maximize
$$\mathcal{L}(t) = (k_0 + Qv_0)^T v(t) - \frac{1}{2}\, v^T(t)\, Q\, v(t) + \text{const},$$
which is easy to do by taking the derivative
$$\frac{d\mathcal{L}(t)}{dt} = (k_0 + Qv_0)^T \frac{dv}{dt} - \frac{1}{2}\left(\frac{d(v^T Q v)}{dv}\right)^T \frac{dv}{dt} = k_0^T u + \underbrace{v_0^T Q^T u - v^T Q^T u}_{(v_0^T - v^T)Qu} = k_0^T u - t\, u^T Q u.$$

Setting the derivative to zero, we obtain the optimal step
$$t^* = \frac{k_0^T u}{u^T Q u}.$$
One more thing: the optimal step may take us outside the square $[0, C] \times [0, C]$, as in the picture. In that case we take a step back and return to the boundary of the square, obtaining a restricted step $t^*_{\text{restr}}$:
$$\big(\lambda_M^{\text{new}},\; \lambda_L^{\text{new}}\big) = v_0 + t^*_{\text{restr}}\, u.$$
This completes the iteration; new indices are then selected.
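Before moving on, here is a quick numerical sanity check of the step formula (a synthetic sketch with made-up values, not part of the original article): for a random positive-definite $Q$ we maximize $\mathcal{L}(t)$ on a grid and compare the best grid point with $t^* = k_0^T u / (u^T Q u)$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(2, 2))
Q = B @ B.T + 1e-3 * np.eye(2)        # random symmetric positive-definite 2x2 matrix
k0 = rng.normal(size=2)
v0 = rng.uniform(0, 1, size=2)
u = np.array([-1.0, 1.0])             # direction (-y_L, y_M) for some pair of labels

def L_of_t(t):
    # L(t) = (k0 + Q v0)^T v(t) - 1/2 v(t)^T Q v(t), dropping the constant
    v = v0 + t * u
    return (k0 + Q @ v0) @ v - 0.5 * v @ Q @ v

t_star = (k0 @ u) / (u @ Q @ u)
ts = np.linspace(t_star - 1.0, t_star + 1.0, 20001)
t_grid = ts[np.argmax([L_of_t(t) for t in ts])]
print(t_star, t_grid)                 # the two values agree up to the grid spacing (~1e-4)
```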
Implementation
The overall scheme of training the simplified support vector machine can be summarized as follows: build the matrix $K$, run several passes of pairwise SMO updates of the lambdas, and finally compute $b$ from the support vectors.

Let's take a look at the code in a real programming language. If you don't like reading code in articles, you can study it on github [14].
import numpy as np

class SVM:
    def __init__(self, kernel='linear', C=10000.0, max_iter=100000, degree=3, gamma=1):
        self.kernel = {'poly'  : lambda x, y: np.dot(x, y.T) ** degree,
                       'rbf'   : lambda x, y: np.exp(-gamma * np.sum((y - x[:, np.newaxis]) ** 2, axis=-1)),
                       'linear': lambda x, y: np.dot(x, y.T)}[kernel]
        self.C = C
        self.max_iter = max_iter

    # restrict the step t so that the point v0 + t*u stays inside the square [0, C] x [0, C]
    def restrict_to_square(self, t, v0, u):
        t = (np.clip(v0 + t * u, 0, self.C) - v0)[1] / u[1]
        return (np.clip(v0 + t * u, 0, self.C) - v0)[0] / u[0]

    def fit(self, X, y):
        self.X = X.copy()
        # transform classes 0,1 into -1,+1; done for compatibility with sklearn
        self.y = y * 2 - 1
        self.lambdas = np.zeros_like(self.y, dtype=float)
        # the matrix K_{i,j} = y_i * y_j * K(x_i; x_j)
        self.K = self.kernel(self.X, self.X) * self.y[:, np.newaxis] * self.y

        # perform self.max_iter passes over the dataset
        for _ in range(self.max_iter):
            # iterate over the first index of the pair
            for idxM in range(len(self.lambdas)):
                # pick the second index idxL at random
                idxL = np.random.randint(0, len(self.lambdas))
                # the 2x2 submatrix Q for the chosen pair
                Q = self.K[[[idxM, idxM], [idxL, idxL]], [[idxM, idxL], [idxM, idxL]]]
                # the vector v0 of the current pair of lambdas
                v0 = self.lambdas[[idxM, idxL]]
                # the vector k0
                k0 = 1 - np.sum(self.lambdas * self.K[[idxM, idxL]], axis=1)
                # the direction u along which sum(y_i * lambda_i) is preserved
                u = np.array([-self.y[idxL], self.y[idxM]])
                # the optimal step t*; the small constant protects against division by zero when idxM == idxL
                t_max = np.dot(k0, u) / (np.dot(np.dot(Q, u), u) + 1E-15)
                self.lambdas[[idxM, idxL]] = v0 + u * self.restrict_to_square(t_max, v0, u)

        # support vectors: points with nonzero lambdas
        idx, = np.nonzero(self.lambdas > 1E-15)
        # the parameter b, averaged over the support vectors
        self.b = np.mean((1.0 - np.sum(self.K[idx] * self.lambdas, axis=1)) * self.y[idx])

    def decision_function(self, X):
        return np.sum(self.kernel(X, self.X) * self.y * self.lambdas, axis=1) + self.b

    def predict(self, X):
        # transform -1,+1 back into 0,1; done for compatibility with sklearn
        return (np.sign(self.decision_function(X)) + 1) // 2
When creating an object of the SVM class you can specify the hyperparameters. Training is done by calling the fit method; the classes must be given as 0 and 1 (internally they are converted to −1 and +1, for better compatibility with sklearn), and the feature vectors may have arbitrary dimension. The predict method is used for classification.
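For example, a minimal usage sketch might look like this (the toy dataset and the parameter values are made up for illustration; the SVM class defined above is assumed to be in scope):

```python
import numpy as np

# a tiny toy dataset: two well-separated clusters, labels 0 and 1
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3],
              [2.0, 2.0], [2.2, 1.9], [1.8, 2.1]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVM(kernel='rbf', C=10, max_iter=60, gamma=1)
model.fit(X, y)

queries = np.array([[0.1, 0.2], [2.1, 2.0]])
print(model.predict(queries))            # [0. 1.]
print(model.decision_function(queries))  # f(x) for each query; the sign determines the class
```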
Comparison with sklearn.svm.SVC
Not that this comparison makes much sense, since we are talking about an extremely simplified algorithm written solely for teaching purposes, but still. For testing (and to see how to use all of this), you can do the following (this code is also available on github [14]).
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs, make_circles
from matplotlib.colors import ListedColormap
def test_plot(X, y, svm_model, axes, title):
    plt.axes(axes)
    xlim = [np.min(X[:, 0]), np.max(X[:, 0])]
    ylim = [np.min(X[:, 1]), np.max(X[:, 1])]
    xx, yy = np.meshgrid(np.linspace(*xlim, num=700), np.linspace(*ylim, num=700))
    rgb = np.array([[210, 0, 0], [0, 0, 150]]) / 255.0

    svm_model.fit(X, y)
    z_model = svm_model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plt.contour(xx, yy, z_model, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
    plt.contourf(xx, yy, np.sign(z_model), alpha=0.3, levels=2, cmap=ListedColormap(rgb), zorder=1)
    plt.title(title)
X, y = make_circles(100, factor=.1, noise=.1)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='rbf', C=10, max_iter=60, gamma=1), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='rbf', C=10, gamma=1), axs[1], 'sklearn.svm.SVC')
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=1.4)
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='linear', C=10, max_iter=60), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='linear', C=10), axs[1], 'sklearn.svm.SVC')
fig, axs = plt.subplots(nrows=1,ncols=2,figsize=(12,4))
test_plot(X, y, SVM(kernel='poly', C=5, max_iter=60, degree=3), axs[0], 'OUR ALGORITHM')
test_plot(X, y, SVC(kernel='poly', C=5, degree=3), axs[1], 'sklearn.svm.SVC')
After running the script, the pictures are generated; since the algorithm is randomized, they come out slightly different on each run. Here is an example of how the simplified algorithm performs (from left to right: linear, polynomial and RBF kernels):
(figures: decision regions of the simplified algorithm with linear, polynomial and RBF kernels)
And this is the result of the industrial version of the support vector machine.
(figures: decision regions of sklearn.svm.SVC with linear, polynomial and RBF kernels)
If dimension 2 seems too small to you, you can also test on MNIST (the code below uses the small digits dataset bundled with sklearn):
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
class_A = 3
class_B = 8
digits = datasets.load_digits()
mask = (digits.target == class_A) | (digits.target == class_B)
data = digits.images.reshape((len(digits.images), -1))[mask]
target = digits.target[mask] // max([class_A, class_B]) # rescale to 0,1
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.5, shuffle=True)
def plot_confusion(clf):
    clf.fit(X_train, y_train)
    y_fit = clf.predict(X_test)
    mat = confusion_matrix(y_test, y_fit)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
                xticklabels=[class_A, class_B], yticklabels=[class_A, class_B])
    plt.xlabel('true label')
    plt.ylabel('predicted label')
    plt.show()
print('sklearn:')
plot_confusion(svm.SVC(C=1.0, kernel='rbf', gamma=0.001))
print('custom svm:')
plot_confusion(SVM(kernel='rbf', C=1.0, max_iter=60, gamma=0.001))
For MNIST, I tried several random pairs of classes (the simplified algorithm only supports binary classification) and found no noticeable difference in quality between the simplified algorithm and sklearn. The confusion matrices below give an idea of the quality.

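If a single number is preferable to a matrix, accuracy can be read directly off the confusion matrix. The snippet below is an optional addition (not in the original code) and reuses X_train, X_test, y_train, y_test from the split above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def accuracy_from_confusion(mat):
    # fraction of correctly classified samples: the diagonal over the total
    return np.trace(mat) / np.sum(mat)

clf = SVM(kernel='rbf', C=1.0, max_iter=60, gamma=0.001)
clf.fit(X_train, y_train)
mat = confusion_matrix(y_test, clf.predict(X_test))
print(accuracy_from_confusion(mat))
```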
Conclusion
Thanks to everyone who read to the end. In this article we looked at a simplified, tutorial implementation of the support vector machine. It certainly cannot compete with an industrial implementation, but thanks to its extreme simplicity, its compact Python code, and the fact that all the basic ideas of SMO are preserved, this version of SVM can well take its place in the classroom. Note that this algorithm is simpler not only than the rather intricate original algorithm [9], but even than its simplified version from Stanford University [11]. After all, a support vector machine in 30 lines is beautiful.
List of references
- https://fbeilstein.github.io/simplest_smo_ever/
- page at http://www.machinelearning.ru
- "Beginnings of Machine Learning", KAU
- https://en.wikipedia.org/wiki/Kernel_method
- https://en.wikipedia.org/wiki/Duality_(optimization)
- http://www.machinelearning.ru
- https://www.youtube.com/watch?v=MTY1Kje0yLg
- https://en.wikipedia.org/wiki/Sequential_minimal_optimization
- Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
- https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions
- http://cs229.stanford.edu/materials/smo.pdf
- https://www.researchgate.net/publication/344460740_Yet_more_simple_SMO_algorithm
- http://fourier.eng.hmc.edu/e176/lectures/ch9/node9.html
- https://github.com/fbeilstein/simplest_smo_ever