🐭 ☁️ 👩🏼‍⚕️ Machine Learning. Neural Networks (Part 1): The Perceptron Learning Process 📼 🧛🏻 🙈

I want to note in advance that those who know how the perceptron learns will hardly find something new in this article. You can safely skip it. Why I decided to write this - I would like to write a series of articles related to neural networks and the use of TensorFlow.js, therefore I could not omit general theoretical excerpts. Therefore, I ask you to treat the final idea with great patience and understanding.

In classical programming, the developer describes in a specific programming language a certain rigidly specified set of rules, which was determined on the basis of his knowledge in a specific subject area and which, as a first approximation, describes the processes occurring in the human brain when solving a similar problem.

For example, a strategy for playing tic-tac-toe, chess, and more can be programmed (Figure 1).

Figure 1 - The classical approach to solving problems

Whereas machine learning algorithms can define a set of rules for solving problems without the participation of the developer, but only based on the availability of a training dataset.

A training set is some kind of set of inputs associated with a set of expected outcomes (responses, outputs). At each step of training, the model, by changing the internal state, will optimize and reduce the error between the actual output of the model and the expected result (Figure 2).

Figure 2 - Machine learning

Neural networks

For a long time, scientists, inspired by the processes occurring in our brain, tried to reverse-engineer the central nervous system and try to imitate the work of the human brain. Thanks to this, a whole direction in machine learning was born - neural networks.

In Figure 3, you can see the similarities between the design of a biological neuron and the mathematical representation of a neuron used in machine learning.

Figure 3 - Mathematical representation of a neuron

In a biological neuron, a neuron receives electrical signals from dendrites, modulating electrical signals with different strengths, which can excite the neuron when a certain threshold value is reached, which in turn will lead to the transmission of an electrical signal to other neurons through synapses.

Perceptron

Mathematical model of a neural network, consisting of one neuron, which performs two sequential operations (Figure 4):

calculates the sum of the input signals taking into account their weights (conductance or resistance) of the connection
${s u m = \vec{X}}^{T} \vec{W} + \vec{B} = \sum_{i = 1}^{n} x_{i} w_{i} + b$
applies the activation function to the total sum of the input signals.
$o u t = φ (s u m)$

Figure 4 - Mathematical model of the perceptron

Any differentiable function can be used as an activation function, the most commonly used ones are shown in Table 1. The choice of the activation function falls on the shoulders of the engineer, and usually this choice is based either on the already existing experience in solving similar problems, or simply by the method selection.

The note

– , ReLU , .

1 -

Name	Formula	Schedule
Linear function	$φ (x) = x$	...
Sigmoid function	$φ (x) = \frac{1}{1 + e^{- x}}$
Softmax function	$φ (x_{j}) = \frac{e^{x_{j}}}{\sum_{i} e^{x_{i}}}$	$φ ([\begin{matrix} 1.2 \\ 0.9 \\ 0.4 \end{matrix}]) = [\begin{matrix} 0.46 \\ 0.34 \\ 0.20 \end{matrix}]$ ( 2)
Hyperbolic Tangent function	$φ (x) = \frac{e^{x} - e^{- x}}{e^{x} - e^{- x}}$	[-1, 1]. , ,
Rectified Linear Unit (ReLU)	$φ (x) = max (0, x)$	, , sigmoid tanh
Leaky ReLU	$φ (x) = max (0.01 x, x)$	ReLU , 0

The learning process consists of several steps. For greater clarity, we will consider a certain fictional problem that we will solve with a neural network consisting of one neuron with a linear activation function (this is essentially a perceptron without an activation function at all), and to simplify the task, we will exclude the displacement node b in the neuron (Figure 5) ...

Figure 5 - The training dataset and the state of the neural network at the previous training step

At this stage, we have a neural network in a certain state with certain connection weights that were calculated at the previous stage of training the model, or if this is the first iteration of training, then the values of the connection weights are selected in random order.

So, let's imagine that we have some set of training data, the values of each element from the set are represented by a vector of input data (input data), containing 2 parameters (feature)

x_{1}, x_{2}

... Under

x_{1}, x_{2}

in the model, depending on the domain in question, anything can be implied: the number of rooms in the house, the distance of the house from the sea, well, or we are just trying to train the neural network of the logical operation AND, or OR.

Each input vector in the training set is mapped to an expected output vector. In this case, the output data vector contains only one parameter, which, again, depending on the selected subject area, can mean anything - the price of a house, the result of performing a logical AND or OR operation.

STEP 1 - Feedforward process

At this step, we calculate the sum of the input signals taking into account the weight of each bond and apply the activation function (in our case, there is no activation function). Let's do the calculations for the first element in the training set:

y_{p r e d i c t e d} = \sum_{i = 1}^{n} x_{i} w_{i} = 1 \cdot 0.1 + 0.5 \cdot 0.2 = 0.2

Figure 6 - Forward propagation of the error

Note that the above formula is a simplified mathematical equation for the special case of tensor operations.

A tensor is essentially a data container that can have N axes and an arbitrary number of elements along each of the axes. Most with tensors are familiar with mathematics - vectors (tensor with one axis), matrices (tensor with two axes - rows, columns).

The formula can be written in the following form, where you will see the familiar matrices (tensors) and their multiplication, and also understand what kind of simplification was discussed above:

{\vec{Y}}_{p r e d i c t e d} = {\vec{X}}^{T} \vec{W} = {[\begin{matrix} x_{1} \\ x_{2} \end{matrix}]}^{T} \cdot [\begin{matrix} w_{1} \\ w_{2} \end{matrix}] = [\begin{matrix} x_{1} & x_{2} \end{matrix}] \cdot [\begin{matrix} w_{1} \\ w_{2} \end{matrix}] = [x_{1} w_{1} + x_{2} w_{2}]

STEP 2 - Calculate the error

function The error function is a metric that reflects the discrepancy between the expected and received output. The following error functions are commonly used:

- Mean Squared Error (MSE) - this error function is especially sensitive to outliers in the training set, since it uses the square of the difference between the actual and expected values (an outlier is a value that is very far from other values in dataset, which can sometimes appear due to data errors, such as mixing data with different units of measure or poor sensor readings):

L = \frac{1}{N} \sum_{i = 1}^{N} {(y_{p r e d i c t e d (i)} - y_{e x p e c t e d (i)})}^{2}

- root mean square deviation (Root MSE) - in fact it is the same as the root mean square error in the context of neural networks, but it can reflect the real physical unit of measurement, for example, if in the neural network the output parameters of the neural network is the price of a house expressed in dollars, then the unit of measurement the mean square error will be the square dollar (

$^{2}

), and for the standard deviation it is dollar ($), which naturally slightly simplifies the task of human analysis:

L = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{p r e d i c t e d (i)} - y_{e x p e c t e d (i)})}^{2}}

- the mean deviation (Mean Absolute Error, MAE) - in contrast to the above two values, is not so sensitive to outliers:

L = \frac{1}{N} \sum_{i = 1}^{N} | y_{p r e d i c t e d (i)} - y_{e x p e c t e d (i)} |

- cross entropy (Cross entropy) - uses for classification tasks:

L = - \sum_{i = 1}^{N} \sum_{j = 1}^{M} y_{e x p e c t e d (i j)} \log (y_{p r e d i c t e d (i j)})

Where

N

- the number of copies in the training set

M

- the number of classes when solving classification problems

y_{e x p e c t e d}

- expected output value

y_{p r e d i c t e d}

- the actual output value of the trained model

For our particular case, we will use MSE:

L = \frac{1}{N} \sum_{i = 1}^{N} {(y_{p r e d i c t e d (i)} - y_{e x p e c t e d (i)})}^{2} = {(0.2 - 1)}^{2} = 0.64

STEP 3 - Backpropagation

The goal of training the neural network is simple - it is to minimize the error function:

L \to m i n

One way to find the minimum of a function is to modify the connection weights in the direction opposite to the gradient vector at each next step of learning - the gradient descent method, and it looks like this mathematically:

{\vec{w}}^{(k + 1)} = {\vec{w}}^{k} - μ \nabla L ({\vec{w}}^{k})

Where

k

- k-th iteration of neural network training;

μ

- the learning rate is set by the engineer, usually it can be 0.1; 0.01 (about how the learning step affects the learning convergence process, note a little later)

\nabla L

- the gradient of the error function

To find the gradient, we use partial derivatives with respect to custom arguments

w_{1}, w_{2}

\nabla L (\vec{w}) = [\begin{matrix} \frac{\partial L}{\partial w_{1}} \\ ⋮ \\ \frac{\partial L}{\partial w_{N}} \end{matrix}]

In our particular case, taking into account all the simplifications, the error function takes the form:

L (w_{1}, w_{2}) = {(y_{p r e d i c t e d} - y_{e x p e c t e d})}^{2} = {(x_{1} w_{1} + x_{2} w_{2} - y_{e x p e c t e d})}^{2} =

= {(1 \cdot w_{1} + 0.5 \cdot w_{2} - 1)}^{2}

Derivative formulas memo

,

$\frac{d}{d x} c = 0; c = c o n s t$
$\frac{d}{d x} [c f (x)] = c f^{'} (x); c = c o n s t$
$\frac{d}{d x} x^{n} = n x^{n - 1}$

$\frac{d}{d x} [f (x) \pm g (x)] = f^{'} (x) \pm g^{'} (x)$
$\frac{d}{d x} [f (x) g (x)] = f^{'} (x) g (x) + g^{'} (x) f (x)$
$\frac{d}{d x} f (g (x)) = f^{'} (g (x)) g^{'} (x)$

Let's find the following partial derivatives:

\frac{\partial}{\partial w_{1}} {(w_{1} + 0.5 w_{2} - 1)}^{2} = 2 \cdot (w_{1} + 0.5 w_{2} - 1) \frac{\partial}{\partial w_{1}} (w_{1} + 0.5 w_{2} - 1) =

= 2 \cdot (w_{1} + 0.5 w_{2} - 1) \cdot 1 = 2 (0.1 + 0.5 \cdot 0.2 - 1) = - 1.6

\frac{\partial}{\partial w_{2}} {(w_{1} + 0.5 w_{2} - 1)}^{2} = 2 \cdot (w_{1} + 0.5 w_{2} - 1) \frac{\partial}{\partial w_{2}} (w_{1} + 0.5 w_{2} - 1) =

= 2 \cdot (w_{1} + 0.5 w_{2} - 1) \cdot 0.5 = 2 (0.1 + 0.5 \cdot 0.2 - 1) \cdot 0.5 = - 0.8

Then the process of back propagation of the error is movement along the model from the output towards the input with modification of the model weights in the direction opposite to the gradient vector. Setting the learning step 0.1 (learning rate) we have (Figure 7):

w_{1}^{(k + 1)} = w_{1}^{(k)} - μ \frac{\partial L (w_{1}, w_{2})}{\partial w_{1}} = 0.1 - 0.1 \cdot (- 1.6) = 0.26

w_{2}^{(k + 1)} = w_{2}^{(k)} - μ \frac{\partial L (w_{1}, w_{2})}{\partial w_{2}} = 0.2 - 0.1 \cdot (- 0.8) = 0.28

Figure 7 - Backpropagation of the error

Thus, we have completed k + 1 training steps to make sure that the error has decreased, and the output from the model with new weights has become closer to the expected, we will perform the process of forward propagation of the error over the model with new weights (see STEP 1) :

y_{p r e d i c t e d} = x_{1} w_{1} + x_{2} w_{2} = 1 \cdot 0.26 + 0.5 \cdot 0.28 = 0.4

As you can see, the output value has increased by 0.2 units in the right direction towards the expected result - one (1). The error will then be:

L = {(0.4 - 1)}^{2} = 0.36

As you can see, at the previous training step, the error was 0.64, and with the new weights - 0.36, therefore, we adjusted the model in the right direction.

Next part of the article:

Machine learning. Neural networks (part 2): OR modeling; XOR with TensorFlow.js

Machine Learning. Neural Networks (Part 3) - Convolutional Network under the microscope. Exploring the Tensorflow.js API

Machine Learning. Neural Networks (Part 1): The Perceptron Learning Process

Neural networks

Perceptron

The note

More articles: