Linear Regression Basics

Hello, Habr!



The purpose of this article is to talk about linear regression: to collect and show the formulations and interpretations of the regression problem in terms of mathematical analysis, statistics, linear algebra, and probability theory. Although textbooks cover this topic rigorously and exhaustively, another popular-science article will not hurt.



Watch out, heavy traffic! The article contains a considerable number of illustrations, some in GIF format.










Introduction



There are three similar concepts, three sisters: interpolation, approximation and regression.

They have a common goal: from a family of functions, choose the one that has a certain property.





Interpolation is a way to select from a family of functions the one that passes through given points. The function is then typically used to compute values at intermediate points. For example, we manually set the color at several points and want the colors of the remaining points to form smooth transitions between the given ones. Or we set keyframes for an animation and want smooth transitions between them. Classic examples: Lagrange polynomial interpolation, spline interpolation, multidimensional interpolation (bilinear, trilinear, nearest neighbor, etc.). There is also the related concept of extrapolation: predicting the behavior of a function outside an interval. For example, predicting the dollar exchange rate based on its previous fluctuations is extrapolation.
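Just to make this concrete, here is a minimal sketch (not from the original course materials; the node values are made up for illustration) of one-dimensional interpolation in Python:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# A few "key" points set by hand (made-up values for illustration).
x_nodes = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_nodes = np.array([0.0, 0.8, 0.9, 0.1, -0.7])

x_new = np.linspace(0, 4, 9)

# Piecewise-linear interpolation: straight segments between the nodes.
y_linear = np.interp(x_new, x_nodes, y_nodes)

# Cubic spline: a smooth curve that still passes exactly through the nodes.
spline = CubicSpline(x_nodes, y_nodes)
y_smooth = spline(x_new)

print(y_linear)
print(y_smooth)
```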



Approximation is a way to choose from a family of "simple" functions an approximation to a "complex" function on a segment, such that the error does not exceed a given limit. Approximation is used when you need a function similar to a given one but more convenient for calculations and manipulations (differentiation, integration, etc.). It is often used when optimizing critical sections of code: if the value of a function is computed many times per second and absolute accuracy is not needed, you can make do with a simpler approximant that is cheaper to compute. Classic examples include the Taylor series on a segment, approximation by orthogonal polynomials, Padé approximation, Bhaskara's sine approximation, etc.



Regression is a way to choose from a family of functions the one that minimizes a loss function. The latter characterizes how much the trial function deviates from the values at the given points. If the points were obtained in an experiment, they inevitably contain measurement error and noise, so it is more reasonable to require that the function convey the general trend rather than pass exactly through all the points. In a sense, regression is an "interpolating fit": we want to draw the curve as close to the points as possible while keeping it as simple as possible, to capture the overall trend. The loss function (in the English literature, "loss function" or "cost function") is responsible for the balance between these conflicting desires.



In this article we'll look at linear regression. This means that the family of functions from which we choose consists of linear combinations of predetermined basis functions $f_i$:

$$f = \sum_i w_i f_i.$$

The goal of regression is to find the coefficients of this linear combination, and thereby determine the regression function $f$ (also called the model). Note that linear regression is called linear precisely because of the linear combination of basis functions; it has nothing to do with the basis functions themselves (they may be linear or not).



Regression has been with us for a long time: the method was first published by Legendre in 1805, although Gauss came to it earlier and used it successfully to predict the orbit of the "comet" (actually a dwarf planet) Ceres. There are many variations and generalizations of linear regression: LAD, Least Squares, Ridge Regression, Lasso Regression, ElasticNet, and many others.



GIF demonstration (interactive version available on Google Colab and GitHub).





Least squares method



Let's start with the simplest two-dimensional case. Suppose we are given points on the plane $\{(x_1, y_1), \dots, (x_N, y_N)\}$ and we are looking for an affine function

$$f(x) = a + bx,$$

such that its graph is closest to the points. Thus, our basis consists of a constant function and a linear one: $(1, x)$.



As you can see from the illustration, the distance from a point to a straight line can be understood in different ways, for example geometrically, as the length of the perpendicular. However, in the context of our problem we need a functional distance, not a geometric one. We are interested in the difference between the experimental value and the model prediction for each $x_i$, so the distance must be measured along the $y$ axis.



The first thing that comes to mind is to try, as a loss function, an expression that depends on the absolute values of the differences $|f(x_i) - y_i|$. The simplest option, the sum of absolute deviations $\sum_i |f(x_i) - y_i|$, leads to Least Absolute Deviations (LAD) regression.



However, the more popular loss function is the sum of squared deviations of the regressand from the model. In the English literature it is called the Sum of Squared Errors (SSE):

$$\mathrm{SSE}(a, b) = \mathrm{SS}_{res[iduals]} = \sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 = \sum_{i=1}^{N} \left(y_i - a - b x_i\right)^2.$$

Ordinary Least Squares (OLS) is linear regression with $\mathrm{SSE}(a, b)$ as the loss function.



This choice is, first of all, convenient: the derivative of a quadratic function is a linear function, and linear equations are easy to solve. However, further on I will give other arguments in favor of $\mathrm{SSE}(a, b)$.
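To make the two loss functions concrete, here is a small sketch (with made-up data, not taken from the article's demos) that computes both the LAD and the SSE losses for a candidate line $f(x) = a + bx$:

```python
import numpy as np

# Made-up experimental points for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

def lad_loss(a, b):
    """Sum of absolute deviations: sum_i |y_i - a - b*x_i| (LAD)."""
    return np.sum(np.abs(y - a - b * x))

def sse_loss(a, b):
    """Sum of squared errors: sum_i (y_i - a - b*x_i)^2 (OLS)."""
    return np.sum((y - a - b * x) ** 2)

print(lad_loss(0.0, 1.0), sse_loss(0.0, 1.0))
```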

GIF demonstration (interactive version available on Google Colab and GitHub).





Mathematical analysis



The simplest way to find $\arg\min_{a,b} \mathrm{SSE}(a, b)$ is to compute the partial derivatives with respect to $a$ and $b$, set them equal to zero, and solve the resulting system of linear equations

$$\frac{\partial}{\partial a}\mathrm{SSE}(a, b) = -2\sum_{i=1}^{N}\left(y_i - a - b x_i\right), \qquad \frac{\partial}{\partial b}\mathrm{SSE}(a, b) = -2\sum_{i=1}^{N}\left(y_i - a - b x_i\right)x_i.$$

The values ​​of the parameters that minimize the loss function satisfy the equations

$$0 = -2\sum_{i=1}^{N}\left(y_i - a - b x_i\right), \qquad 0 = -2\sum_{i=1}^{N}\left(y_i - a - b x_i\right)x_i,$$

which are easy to solve

$$a = \frac{\sum_i y_i}{N} - b\,\frac{\sum_i x_i}{N}, \qquad b = \frac{\dfrac{\sum_i x_i y_i}{N} - \dfrac{\sum_i x_i}{N}\,\dfrac{\sum_i y_i}{N}}{\dfrac{\sum_i x_i^2}{N} - \left(\dfrac{\sum_i x_i}{N}\right)^2}.$$

We got unwieldy, unstructured expressions. Now let's tidy them up and give them meaning.
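A direct translation of these formulas into NumPy might look like this (a sketch with made-up data; `np.polyfit` is used only as an independent check):

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])
N = len(x)

# Closed-form OLS estimates obtained from the partial derivatives above.
b_hat = (np.sum(x * y) / N - np.sum(x) / N * np.sum(y) / N) / \
        (np.sum(x ** 2) / N - (np.sum(x) / N) ** 2)
a_hat = np.sum(y) / N - b_hat * np.sum(x) / N

print(a_hat, b_hat)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; should agree
```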



Statistics



The resulting formulas can be written compactly using statistical estimators: the mean $\bar{x}$, the standard deviation $\sigma_x$, the covariance $\sigma(x, y)$, and the correlation $\rho(x, y)$:

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x}, \qquad \hat{b} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}.$$

Let's rewrite $\hat{b}$ as

$$\hat{b} = \frac{\sigma(x, y)}{\sigma_x^2},$$

where $\sigma_x$ is the uncorrected (biased) sample standard deviation and $\sigma(x, y)$ is the covariance. Now recall the correlation coefficient (Pearson correlation coefficient)

$$\rho(x, y) = \frac{\sigma(x, y)}{\sigma_x \sigma_y}$$

and write

$$\hat{b} = \rho(x, y)\,\frac{\sigma_y}{\sigma_x}.$$
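These identities are easy to check numerically (a sketch on made-up data; note the biased estimators, `ddof=0`, to match the formulas above):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up data
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # biased covariance
var_x  = np.var(x)                                  # biased variance (ddof=0)
rho    = cov_xy / (np.std(x) * np.std(y))           # Pearson correlation

b1 = cov_xy / var_x                 # b = sigma(x, y) / sigma_x^2
b2 = rho * np.std(y) / np.std(x)    # b = rho * sigma_y / sigma_x

print(b1, b2)                       # the two expressions coincide
print(np.corrcoef(x, y)[0, 1])      # the same rho computed by NumPy
```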





Now we can appreciate all the elegance of descriptive statistics by writing the regression line equation like this



$$y - \bar{y} = \rho(x, y)\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}).$$

First, this equation immediately indicates two properties of the regression line:

  • the line passes through the center of mass $(\bar{x}, \bar{y})$;
  • if we choose $\sigma_x$ as the unit of length along the $x$ axis and $\sigma_y$ along the $y$ axis, then the slope of the line will be between $-45^\circ$ and $45^\circ$. This follows from the fact that $-1 \le \rho(x, y) \le 1$.


Secondly, it now becomes clear why the regression method is called that way. In units of standard deviation, $y$ deviates from its mean less than $x$ does, because $|\rho(x, y)| \le 1$. This is called regression (from the Latin regressus, "return") with respect to the mean. The phenomenon was described by Sir Francis Galton at the end of the 19th century in his article "Regression Towards Mediocrity in Hereditary Stature". The article shows that traits (such as height) that deviate greatly from the mean are rarely inherited to the same degree. The characteristics of the offspring seem to tend toward the mean: as the saying goes, nature takes a rest on the children of geniuses.



By squaring the correlation coefficient we obtain the coefficient of determination $R^2 = \rho^2$. This statistical measure shows how well the regression model fits the data. $R^2$ equal to $1$ means that the function passes perfectly through all the points: the data are perfectly correlated. It can be shown that $R^2$ indicates what share of the variance in the data is explained by the best linear model. To understand what this means, we introduce the definitions

$$\mathrm{Var}_{data} = \frac{1}{N}\sum_i \left(y_i - \bar{y}\right)^2, \qquad \mathrm{Var}_{res} = \frac{1}{N}\sum_i \left(y_i - \mathrm{model}(x_i)\right)^2, \qquad \mathrm{Var}_{reg} = \frac{1}{N}\sum_i \left(\mathrm{model}(x_i) - \bar{y}\right)^2.$$

$\mathrm{Var}_{data}$ is the variance of the original data (the variance of the points $y_i$).

$\mathrm{Var}_{res}$ is the variance of the residuals, that is, the variance of the deviations from the regression model: subtract the model prediction from $y_i$ and compute the variance.

$\mathrm{Var}_{reg}$ is the variance of the regression, that is, the variance of the model predictions at the points $x_i$ (note that the mean of the model predictions coincides with $\bar{y}$).



The point is that the variance of the original data decomposes into the sum of two other variances: the variance explained by the model and the variance of the random noise (residuals),

$$\mathrm{Var}_{data} = \mathrm{Var}_{res} + \mathrm{Var}_{reg},$$

or

$$\sigma_{data}^2 = \sigma_{res}^2 + \sigma_{reg}^2.$$

As you can see, the standard deviations form a right-angled triangle.





We strive to get rid of the variability associated with noise and keep only the variability explained by the model: we want to separate the wheat from the chaff. How well the best of the linear models succeeds is measured by $R^2$, equal to one minus the share of the residual variance in the total variance,

$$R^2 = \frac{\mathrm{Var}_{data} - \mathrm{Var}_{res}}{\mathrm{Var}_{data}} = 1 - \frac{\mathrm{Var}_{res}}{\mathrm{Var}_{data}},$$

or, equivalently, the proportion of variance explained (the share of the regression variance in the total variance),

$$R^2 = \frac{\mathrm{Var}_{reg}}{\mathrm{Var}_{data}}.$$

$R$ equals the cosine of the angle in the right triangle with sides $(\sigma_{data}, \sigma_{reg}, \sigma_{res})$. By the way, the fraction of unexplained variance $FUV = 1 - R^2$ is sometimes introduced; it equals the squared sine of the same angle. If the coefficient of determination is small, perhaps we chose unsuitable basis functions, or linear regression is not applicable at all, etc.
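A quick numerical check of the decomposition and of the two expressions for $R^2$ (a sketch on made-up data; the model is the ordinary OLS line):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up data
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

b, a = np.polyfit(x, y, 1)       # OLS fit: slope, intercept
y_model = a + b * x

var_data = np.var(y)             # total variance
var_res  = np.var(y - y_model)   # residual variance
var_reg  = np.var(y_model)       # regression variance

print(var_data, var_res + var_reg)                 # the decomposition holds
print(1 - var_res / var_data, var_reg / var_data)  # two forms of R^2
print(np.corrcoef(x, y)[0, 1] ** 2)                # equals rho^2
```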



Probability theory



Earlier we arrived at the loss function $\mathrm{SSE}(a, b)$ from considerations of convenience, but it can also be reached via probability theory and the method of maximum likelihood (MLE). Let me briefly recall its essence. Suppose we have $N$ independent identically distributed random variables (in our case, measurement results). We know the form of the distribution (for example, the normal distribution) but want to determine the parameters that enter it (for example, $\mu$ and $\sigma$). To do this, we calculate the probability of obtaining the $N$ data points under the assumption of constant but as yet unknown parameters. Thanks to the independence of the measurements, we get the product of the probabilities of the individual measurements. If we regard the resulting value as a function of the parameters (the likelihood function) and find its maximum, we obtain an estimate of the parameters. Often, instead of the likelihood function, its logarithm is used: it is easier to differentiate, and the result is the same.



Let's go back to the simple regression problem. Suppose we know the values of $x$ exactly, but the measurements of $y$ contain random noise (the weak exogeneity property). Moreover, we assume that all deviations from the straight line (the linearity property) are caused by noise with a fixed distribution. Then

$$y = a + bx + \epsilon,$$

where $\epsilon$ is a normally distributed random variable,

$$\epsilon \sim \mathcal{N}(0, \sigma^2), \qquad p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\epsilon^2}{2\sigma^2}}.$$







Based on the assumptions above, we write down the likelihood function

$$\mathcal{L}(a, b \mid \mathbf{y}) = P(\mathbf{y} \mid a, b) = \prod_i P(y_i \mid a, b) = \prod_i p(y_i - a - b x_i) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - a - b x_i)^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{\sum_i (y_i - a - b x_i)^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{\mathrm{SSE}(a, b)}{2\sigma^2}}$$

and its logarithm

$$\ell(a, b \mid \mathbf{y}) = \log \mathcal{L}(a, b \mid \mathbf{y}) = -\frac{\mathrm{SSE}(a, b)}{2\sigma^2} + \mathrm{const}.$$

Thus, the maximum of the likelihood is achieved at the minimum of SSE,

$$(\hat{a}, \hat{b}) = \arg\max_{a, b}\, \ell(a, b \mid \mathbf{y}) = \arg\min_{a, b}\, \mathrm{SSE}(a, b),$$

which gives reason to accept it as a loss function. By the way, if

$$\epsilon \sim \mathrm{Laplace}(0, \alpha), \qquad p_L(\epsilon; \mu, \alpha) = \frac{\alpha}{2}\, e^{-\alpha|\epsilon - \mu|},$$

we get the LAD regression loss function

$$E_{LAD}(a, b) = \sum_i |y_i - a - b x_i|,$$

which we mentioned earlier.
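The equivalence can also be checked numerically: minimizing the Gaussian negative log-likelihood with a generic optimizer gives, up to numerical tolerance, the same $\hat{a}, \hat{b}$ as the closed-form OLS solution. A sketch with made-up data and an assumed noise level $\sigma = 1$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)                     # made-up data
y = 1.0 + 2.0 * x + rng.normal(0, 1, x.size)

def neg_log_likelihood(params, sigma=1.0):
    a, b = params
    resid = y - a - b * x
    # Gaussian NLL = SSE / (2 sigma^2) + const; the constant is dropped.
    return np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0])
b_ols, a_ols = np.polyfit(x, y, 1)

print(mle.x)          # (a_hat, b_hat) from maximum likelihood
print(a_ols, b_ols)   # the same values from ordinary least squares
```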



The approach used in this section is just one of several possible ones. The same result can be reached from more general assumptions. In particular, the requirement of a fixed noise distribution can be weakened and replaced by independence, constant variance (homoscedasticity), and the absence of multicollinearity. Also, instead of maximum likelihood estimation, one can use other methods, for example, linear MMSE estimation.



Multilinear regression



So far we have considered the regression problem for a single scalar feature $x$; however, the regressor is usually an $n$-dimensional vector $\mathbf{x}$. In other words, for each measurement we record $n$ features and combine them into a vector. In this case it is natural to adopt a model with $n+1$ independent basis functions of the vector argument: $n$ degrees of freedom correspond to the $n$ features and one more to the regressand $y$. The simplest choice is linear basis functions $(1, x_1, \dots, x_n)$. For $n = 1$ we recover the already familiar basis $(1, x)$.



So we want to find a vector of coefficients $\mathbf{w}$ such that

$$\sum_{j=0}^{n} w_j x_j^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} \approx y_i, \qquad i = 1, \dots, N.$$

The sign $\approx$ means that we are looking for a solution that minimizes the sum of squared errors

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \left(y_i - \mathbf{w} \cdot \mathbf{x}^{(i)}\right)^2.$$

The last expression can be rewritten in a more convenient way. To do this, we place the $\mathbf{x}^{(i)}$ in the rows of a matrix (the information matrix)

$$X = \begin{pmatrix} \text{--- } \mathbf{x}^{(1)} \text{ ---} \\ \vdots \\ \text{--- } \mathbf{x}^{(N)} \text{ ---} \end{pmatrix} = \begin{pmatrix} | & | & & | \\ \mathbf{x}_0 & \mathbf{x}_1 & \cdots & \mathbf{x}_n \\ | & | & & | \end{pmatrix} = \begin{pmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \cdots & x_n^{(N)} \end{pmatrix}.$$

The columns of this matrix, $\mathbf{x}_i$, then correspond to the measurements of the $i$-th feature. It is important not to get confused here: $N$ is the number of measurements, $n$ is the number of features that we record. The system can be written as

$$X\mathbf{w} \approx \mathbf{y}.$$

The square of the norm of the difference between the vectors on the right and left sides of the equation forms the loss function

$$\mathrm{SSE}(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2, \qquad \mathbf{w} \in \mathbb{R}^{n+1}, \quad \mathbf{y} \in \mathbb{R}^{N},$$

which we intend to minimize

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \mathrm{SSE}(\mathbf{w}) = \arg\min_{\mathbf{w}} \left(\mathbf{y} - X\mathbf{w}\right)^\top \left(\mathbf{y} - X\mathbf{w}\right) = \arg\min_{\mathbf{w}} \left(\mathbf{y}^\top\mathbf{y} - 2\,\mathbf{w}^\top X^\top \mathbf{y} + \mathbf{w}^\top X^\top X\, \mathbf{w}\right).$$

Let's differentiate the final expression with respect to $\mathbf{w}$ (if you have forgotten how to do this, take a look at the Matrix Cookbook),

$$\frac{\partial\, \mathrm{SSE}(\mathbf{w})}{\partial \mathbf{w}} = -2\, X^\top \mathbf{y} + 2\, X^\top X \mathbf{w},$$

equate the derivative to $0$, and obtain the so-called normal equations

$$X^\top X\, \hat{\mathbf{w}} = X^\top \mathbf{y}.$$

If the columns of the information matrix $X$ are linearly independent (there are no perfectly correlated features), then the matrix $X^\top X$ is invertible (a proof can be found, for example, in a Khan Academy video). Then we can write

$$\hat{\mathbf{w}} = \left(X^\top X\right)^{-1} X^\top \mathbf{y} = X^{+}\mathbf{y},$$

where

$$X^{+} = \left(X^\top X\right)^{-1} X^\top$$

is the pseudoinverse of $X$. The concept of a pseudoinverse matrix was introduced by Fredholm in 1903, and it played an important role in the works of Moore and Penrose.



Let me remind you that inverting $X^\top X$ and finding $X^{+}$ is possible only if the columns of $X$ are linearly independent. However, if the columns of $X$ are close to linear dependence, the computation of $\left(X^\top X\right)^{-1}$ already becomes numerically unstable. The degree of linear dependence of the features in $X$, or, as they say, the multicollinearity of the matrix $X^\top X$, can be measured by the condition number: the ratio of the largest eigenvalue to the smallest. The larger it is, the closer $X^\top X$ is to being singular and the more unstable the computation of the pseudoinverse.
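In code, all of this reduces to a few lines of NumPy (a sketch with made-up features; `np.linalg.lstsq` is shown as the more numerically robust alternative to forming the pseudoinverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 3
features = rng.normal(size=(N, n))            # made-up features
X = np.hstack([np.ones((N, 1)), features])    # information matrix with a column of ones
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, N)        # noisy regressand

# Solve the normal equations via the pseudoinverse.
w_pinv = np.linalg.pinv(X) @ y

# Equivalent, but more stable in practice: a least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_pinv)
print(w_lstsq)
print(np.linalg.cond(X.T @ X))  # condition number: large values signal multicollinearity
```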





Linear algebra



The solution to the problem of multilinear regression can be reached quite naturally with the help of linear algebra and geometry, because even the fact that the norm of the error vector appears in the loss function already hints that the problem has a geometric side. We have seen that an attempt to find a linear model describing the experimental points leads to the equation

$$X\mathbf{w} \approx \mathbf{y}.$$

If the number of equations equals the number of unknowns and the equations are linearly independent, the system has a unique solution. However, if the number of measurements exceeds the number of features, that is, there are more equations than unknowns, the system becomes inconsistent and overdetermined. In this case the best we can do is choose a vector $\mathbf{w}$ whose image $X\mathbf{w}$ is closest to $\mathbf{y}$. Let me remind you that the set of images, or column space $C(X)$, consists of the linear combinations of the column vectors of the matrix $X$:

$$\begin{pmatrix} | & | & & | \\ \mathbf{x}_0 & \mathbf{x}_1 & \cdots & \mathbf{x}_n \\ | & | & & | \end{pmatrix}\mathbf{w} = w_0\mathbf{x}_0 + w_1\mathbf{x}_1 + \cdots + w_n\mathbf{x}_n.$$

$C(X)$ is an $(n+1)$-dimensional linear subspace (we assume the features are linearly independent), the linear span of the column vectors of $X$. So if $\mathbf{y}$ belongs to $C(X)$, we can find an exact solution; if not, we will look for, so to speak, the best among the non-solutions.



If, in addition to the vectors of $C(X)$, we consider all vectors perpendicular to them, we obtain one more subspace, and any vector from $\mathbb{R}^N$ can be decomposed into two components, each living in its own subspace. The second, perpendicular subspace can be characterized as follows (we will need this later). Let $\mathbf{v} \in \mathbb{R}^N$; then

$$X^\top \mathbf{v} = \begin{pmatrix} \text{--- } \mathbf{x}_0 \text{ ---} \\ \vdots \\ \text{--- } \mathbf{x}_n \text{ ---} \end{pmatrix}\mathbf{v} = \begin{pmatrix} \mathbf{x}_0 \cdot \mathbf{v} \\ \vdots \\ \mathbf{x}_n \cdot \mathbf{v} \end{pmatrix}$$

is zero if and only if $\mathbf{v}$ is perpendicular to all $\mathbf{x}_i$, and hence to the whole of $C(X)$. Thus, we have found two perpendicular linear subspaces whose vectors together, with no gaps, "cover" all of $\mathbb{R}^N$. This is sometimes written using the orthogonal direct sum symbol





$$\mathbb{R}^N = C(X) \oplus \ker\left(X^\top\right),$$

where $\ker\left(X^\top\right) = \{\mathbf{v} \mid X^\top\mathbf{v} = \mathbf{0}\}$. Each of the subspaces can be reached using the corresponding projection operator, but more on that below.



Now let's write out the decomposition of $\mathbf{y}$,

$$\mathbf{y} = \mathbf{y}_{proj} + \mathbf{y}_{\perp}, \qquad \mathbf{y}_{proj} \in C(X), \quad \mathbf{y}_{\perp} \in \ker\left(X^\top\right).$$





If we are looking for a solution $\hat{\mathbf{w}}$, it is natural to require that $\|\mathbf{y} - X\mathbf{w}\|$ be minimal, since this is the length of the residual vector. Taking into account the perpendicularity of the subspaces and the Pythagorean theorem,

$$\arg\min_{\mathbf{w}} \|\mathbf{y} - X\mathbf{w}\| = \arg\min_{\mathbf{w}} \|\mathbf{y}_{\perp} + \mathbf{y}_{proj} - X\mathbf{w}\| = \arg\min_{\mathbf{w}} \sqrt{\|\mathbf{y}_{\perp}\|^2 + \|\mathbf{y}_{proj} - X\mathbf{w}\|^2},$$

but since, by choosing a suitable $\mathbf{w}$, we can obtain any vector of the column space, the problem reduces to

$$X\hat{\mathbf{w}} = \mathbf{y}_{proj},$$

and $\mathbf{y}_{\perp}$ will remain as an unavoidable error. Any other choice of $\hat{\mathbf{w}}$ would only make the error larger.



If we now recall that $X^\top \mathbf{y}_{\perp} = \mathbf{0}$, it is easy to see that

$$X^\top X\,\hat{\mathbf{w}} = X^\top\mathbf{y}_{proj} = X^\top\mathbf{y}_{proj} + X^\top\mathbf{y}_{\perp} = X^\top\mathbf{y},$$

which is very convenient, since we do not have $\mathbf{y}_{proj}$, but we do have $\mathbf{y}$. Recall from the previous section that $X^\top X$ has an inverse provided the features are linearly independent, and write the solution

$$\hat{\mathbf{w}} = \left(X^\top X\right)^{-1} X^\top \mathbf{y} = X^{+}\mathbf{y},$$

where $X^{+}$ is the already familiar pseudoinverse matrix. If we are interested in the projection $\mathbf{y}_{proj}$, we can write

$$\mathbf{y}_{proj} = X\hat{\mathbf{w}} = X X^{+}\mathbf{y} = \mathrm{Proj}_X\, \mathbf{y},$$

where $\mathrm{Proj}_X$ is the operator of projection onto the column space.
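A sketch of this picture in NumPy: build $\mathrm{Proj}_X = X X^{+}$, split $\mathbf{y}$ into $\mathbf{y}_{proj} + \mathbf{y}_{\perp}$, and check that the perpendicular part is invisible to $X^\top$ (made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n))])  # made-up information matrix
y = rng.normal(size=N)

Proj = X @ np.linalg.pinv(X)   # projection operator onto the column space C(X)
y_proj = Proj @ y
y_perp = y - y_proj

print(np.allclose(X.T @ y_perp, 0))                           # y_perp lies in ker(X^T)
print(np.allclose(X.T @ X @ np.linalg.pinv(X) @ y, X.T @ y))  # the normal equations hold
print(np.allclose(Proj @ Proj, Proj))                         # a projector is idempotent
```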



Let's find out the geometric meaning of the coefficient of determination.



Note that the purple vector $\bar{y}\,\mathbf{1} = \bar{y}\,(1, 1, \dots, 1)^\top$ is proportional to the first column of the information matrix $X$, which consists entirely of ones by our choice of basis functions. In the RGB triangle (with $\hat{\mathbf{y}} = X\hat{\mathbf{w}}$ denoting the vector of model predictions)

$$\mathbf{y} - \bar{y}\,\mathbf{1} = \left(\mathbf{y} - \hat{\mathbf{y}}\right) + \left(\hat{\mathbf{y}} - \bar{y}\,\mathbf{1}\right).$$

Since this triangle is right-angled, the Pythagorean theorem gives

$$\|\mathbf{y} - \bar{y}\,\mathbf{1}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \|\hat{\mathbf{y}} - \bar{y}\,\mathbf{1}\|^2.$$

This is a geometric interpretation of the already known fact that

$$\mathrm{Var}_{data} = \mathrm{Var}_{res} + \mathrm{Var}_{reg}.$$

We know that

$$R^2 = \frac{\mathrm{Var}_{reg}}{\mathrm{Var}_{data}},$$

which means

$$R = \cos\theta.$$

Nice, isn't it?





Arbitrary basis



As we know, regression is performed with respect to basis functions $f_i$, and its result is the model

$$f = \sum_i w_i f_i,$$

but so far we have used the simplest $f_i$, which simply passed the original features through unchanged, except that we supplemented them with the constant feature $f_0(\mathbf{x}) = 1$. In fact, neither the form of the $f_i$ nor their number is restricted in any way; the main requirement is that the functions in the basis be linearly independent. Usually the choice is made on the basis of assumptions about the nature of the process being modeled. If we have reason to believe that the points $\{(x_1, y_1), \dots, (x_N, y_N)\}$ lie on a parabola rather than on a straight line, then it is worth choosing the basis $(1, x, x^2)$. The number of basis functions can be either smaller or larger than the number of original features.

GIF demonstration with scikit-learn (interactive version available on Google Colab and GitHub).



Once we have decided on the basis, we proceed as follows: we form the information matrix

$$\Phi = \begin{pmatrix} \text{--- } \mathbf{f}^{(1)} \text{ ---} \\ \vdots \\ \text{--- } \mathbf{f}^{(N)} \text{ ---} \end{pmatrix} = \begin{pmatrix} f_0(\mathbf{x}^{(1)}) & f_1(\mathbf{x}^{(1)}) & \cdots & f_n(\mathbf{x}^{(1)}) \\ \vdots & \vdots & \ddots & \vdots \\ f_0(\mathbf{x}^{(N)}) & f_1(\mathbf{x}^{(N)}) & \cdots & f_n(\mathbf{x}^{(N)}) \end{pmatrix},$$

write the loss function

$$E(\mathbf{w}) = \|\boldsymbol{\epsilon}(\mathbf{w})\|^2 = \|\mathbf{y} - \Phi\mathbf{w}\|^2$$

and find its minimum, for example, using the pseudoinverse matrix

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} E(\mathbf{w}) = \left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{y} = \Phi^{+}\mathbf{y}$$

or another method.
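For example, a quadratic basis $(1, x, x^2)$ for a scalar $x$ might be set up like this (a sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)                                   # made-up scalar feature
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.3, x.size)  # noisy parabola

# Information matrix Phi for the basis (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Minimize E(w) = ||y - Phi w||^2 via the pseudoinverse.
w_hat = np.linalg.pinv(Phi) @ y
print(w_hat)            # should be close to (1.0, -0.5, 2.0)

y_pred = Phi @ w_hat    # model predictions
```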





Concluding remarks



Dimension selection problem



In practice, you often have to build the model of a phenomenon yourself, that is, determine how many and which basis functions to take. The first impulse to "grab more" can backfire: the model will be too sensitive to noise in the data (overfitting). On the other hand, if the model is overly restricted, it will be too crude (underfitting).



There are two ways out of this situation. The first is to increase the number of basis functions step by step, checking the quality of the regression and stopping in time. The second is to choose a loss function that determines the number of degrees of freedom automatically. As a criterion of regression quality one can use the coefficient of determination already mentioned above; the problem, however, is that $R^2$ grows monotonically as the dimension of the basis increases. Therefore the adjusted coefficient is introduced,

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\left[\frac{N - 1}{N - (n + 1)}\right],$$

where $N$ is the sample size and $n$ is the number of independent variables. By tracking $\bar{R}^2$, we can stop in time and refrain from adding extra degrees of freedom.
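A small helper for the adjusted coefficient might look like this (a sketch; the argument names `r2`, `n_samples`, and `n_features` are hypothetical):

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - (n + 1))."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - (n_features + 1))

# The same R^2 = 0.9 is penalized more heavily as the basis grows.
print(adjusted_r2(0.90, n_samples=50, n_features=2))
print(adjusted_r2(0.90, n_samples=50, n_features=20))
```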



The second group of approaches is regularization, the most famous examples being Ridge ($L_2$/ridge/Tikhonov regularization), Lasso ($L_1$ regularization), and Elastic Net (Ridge + Lasso). The main idea of these methods is to modify the loss function with additional terms that do not allow the coefficient vector $\mathbf{w}$ to grow without bound and thereby prevent overfitting:

$$\begin{aligned} E_{Ridge}(\mathbf{w}) &= \mathrm{SSE}(\mathbf{w}) + \alpha\sum_i |w_i|^2 = \mathrm{SSE}(\mathbf{w}) + \alpha\,\|\mathbf{w}\|_{L_2}^2, \\ E_{Lasso}(\mathbf{w}) &= \mathrm{SSE}(\mathbf{w}) + \beta\sum_i |w_i| = \mathrm{SSE}(\mathbf{w}) + \beta\,\|\mathbf{w}\|_{L_1}, \\ E_{EN}(\mathbf{w}) &= \mathrm{SSE}(\mathbf{w}) + \alpha\,\|\mathbf{w}\|_{L_2}^2 + \beta\,\|\mathbf{w}\|_{L_1}, \end{aligned}$$

where $\alpha$ and $\beta$ are parameters that control the "strength" of the regularization. This is a vast topic with beautiful geometry that deserves a separate discussion. By the way, I will mention that in the two-variable case, using the probabilistic interpretation, one can obtain Ridge and Lasso regression by a suitable choice of the prior distribution for the coefficient $b$:

$$y = a + bx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \qquad \begin{cases} b \sim \mathcal{N}(0, \tau^2) & \Rightarrow \text{Ridge}, \\ b \sim \mathrm{Laplace}(0, \alpha) & \Rightarrow \text{Lasso}. \end{cases}$$
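In practice one rarely implements these by hand. A hedged sketch of how it might look with scikit-learn (the `alpha`/`l1_ratio` values are arbitrary and should be tuned, e.g. by cross-validation; note also that scikit-learn's objectives differ from the formulas above by scaling factors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                        # made-up features
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(0, 0.1, 100)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))  # Lasso/ElasticNet drive some coefficients to zero
```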









Numerical methods



Let me say a few words about how the loss function is minimized in practice. SSE is an ordinary quadratic function parameterized by the input data, so in principle it can be minimized by steepest descent or other optimization methods. Of course, the best results are obtained by algorithms that take the structure of the SSE function into account, for example, stochastic gradient descent. The Lasso regression implementation in scikit-learn uses coordinate descent.



You can also solve the normal equations using methods of numerical linear algebra. An efficient method, which scikit-learn uses for OLS, is to find the pseudoinverse via the singular value decomposition. The margins of this article are too narrow to cover this topic; for details I recommend the lecture course by K.V. Vorontsov.
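A minimal sketch of the SVD route (not scikit-learn's actual implementation, just the idea): build the pseudoinverse from the singular value decomposition and compare with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # made-up information matrix
y = rng.normal(size=50)

# Thin SVD: X = U diag(s) V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Pseudoinverse X^+ = V diag(1/s) U^T (tiny singular values could be thresholded here).
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
w_svd = X_pinv @ y

w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_svd, w_lstsq))  # True: both give the least-squares solution
```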



Advertising and conclusion



This article is an abridged retelling of one of the chapters of a course on classical machine learning at the Kiev Academic University (successor to the Kiev branch of the Moscow Institute of Physics and Technology, KO MIPT), which the author of this article helped create. Technically, the course runs on the Google Colab platform, which makes it possible to combine LaTeX-formatted formulas, executable Python code, and interactive demonstrations in Python + JavaScript, so that students can work with the course materials and run the code from any computer with a browser. The home page contains links to the lecture notes, practice notebooks, and additional resources. The course is built on the following principles:



  • all materials must be available to students from the very first class;
  • a lecture is for understanding, not for note-taking (the notes are already prepared; there is no need to write your own unless you want to);
  • a lecture is more than a lecture (the notes contain more material than is presented in the lecture; in fact, the notes are a full-fledged textbook);
  • clarity and interactivity (illustrations, photos, demos, GIFs, code, YouTube videos).


If you want to see the result, take a look at the course page on GitHub.



I hope you were interested, thank you for your attention.


