Introduction to various reinforcement learning algorithms (Q-Learning, SARSA, DQN, DDPG)






Reinforcement learning (RL) refers to a class of machine learning methods in which an agent receives a delayed reward at the next time step as an evaluation of its previous action. RL was mostly used in games (e.g. Atari, Mario), with performance on par with or even exceeding humans. Recently, as algorithms have evolved in combination with neural networks, RL has become capable of solving more complex tasks.





Because there are a large number of RL algorithms, it is not possible to compare all of them with one another. Therefore, this article briefly discusses only a few well-known ones.





1. Reinforcement learning





A typical RL setup consists of two components: the Agent and the Environment.





The Environment is the object the Agent acts on (for example, the game itself in an Atari game), while the Agent represents the RL algorithm. The Environment starts by sending a state (state = s) to the Agent, which, based on its knowledge, takes an action (action = a) in response. After that, the Environment sends the Agent the next state (state' = s') and a reward (reward = r). The Agent updates its knowledge using the reward returned by the Environment to evaluate its last action. The loop continues until the Environment sends a terminal state, which ends the episode.
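To make this loop concrete, here is a minimal sketch in Python; the five-state chain environment and the random agent are hypothetical, invented purely for illustration:

```python
import random

class ToyEnvironment:
    """Hypothetical 5-state chain: the agent moves left/right, reward at the right end."""

    def reset(self):
        self.state = 0                 # environment starts by producing a state s
        return self.state

    def step(self, action):            # action a: 0 = left, 1 = right
        self.state = max(0, self.state - 1) if action == 0 else self.state + 1
        done = self.state == 4         # terminal state ends the episode
        reward = 1.0 if done else 0.0  # reward r evaluates the previous action
        return self.state, reward, done

env = ToyEnvironment()
s, done = env.reset(), False
while not done:                        # the agent-environment loop described above
    a = random.choice([0, 1])          # the agent (here: random) responds with an action
    s, r, done = env.step(a)           # the environment returns s' and r
```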





Most RL algorithms follow this pattern. Below I briefly introduce some terms used in RL to simplify the discussion in the next section.





Definitions:





1. Action (A, a): all the possible moves that the agent can take.





2. State (S, s): the current situation returned by the environment.





3. Reward (R, r): an immediate return sent by the environment to evaluate the last action.





4. Policy (π): the strategy the agent employs to determine the next action (a′) based on the current state.





5. Value (V): the expected long-term return with discount, as opposed to the short-term reward R. Vπ(s) is defined as the expected long-term return for the current state s under policy π.





6. Q-value (Q): similar to Value, but with an extra parameter, the current action a. Qπ(s, a) refers to the long-term return for the current state s when action a is taken under policy π.





* Translator's note: MCTS (Monte Carlo Tree Search); on-policy (an algorithm in which the agent learns from actions derived from its current policy); off-policy (an algorithm in which the agent learns from actions obtained from another policy).





Model-based vs. model-free. A model is a simulation of the environment's dynamics: it learns the transition probability T(s1|(s0, a)) from the pair of current state s0 and action a to the next state s1. If the transition probability is successfully learned, the agent knows how likely it is to end up in a particular state given the current state and action. However, model-based algorithms become impractical as the state and action spaces grow, since a tabular representation requires S*S*A values.





Model-free algorithms, by contrast, rely on trial and error to update their knowledge. As a result, they do not need space to store all combinations of states and actions.





2. Overview of the algorithms





2.1.    Q-learning





Q-learning is an off-policy, model-free RL algorithm built on the well-known Bellman equation:

V(s) = E[rt+1 + γV(st+1) | st = s]





E in the equation above denotes the expected value, and γ is the discount factor.

We can rewrite the same equation in terms of the Q-value:

Qπ(st, at) = E[rt+1 + γQπ(st+1, at+1) | st, at]





The optimal Q-value, denoted Q*, can be expressed as:

Q*(st, at) = E[rt+1 + γ maxa Q*(st+1, a) | st, at]





The goal is to maximize the Q-value. Before diving into the method for optimizing the Q-value, I would like to discuss two value-update methods that are closely related to Q-learning.









Policy iteration runs a loop that alternates between policy evaluation and policy improvement.





Policy evaluation estimates the value function V using the greedy policy obtained from the previous policy improvement. Policy improvement, in turn, updates the policy with the action (action = a) that maximizes V for each state. The update equations are based on the Bellman equation. The process iterates until convergence.





Value iteration (V)





Value iteration contains only one component: it updates the value function V based on the optimal Bellman equation:

Vk+1(s) = maxa E[rt+1 + γVk(st+1)]





After the iterations converge, the optimal policy is derived directly by applying an argmax function over all states.
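As an illustration, a value-iteration sketch over a tiny made-up MDP might look like this (the transition tensor T and rewards R are random placeholders); note that it consumes the full model T, which is exactly the requirement discussed next:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# Hypothetical model: T[s, a, s'] is the transition probability, R[s, a] the reward.
T = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.random.rand(n_states, n_actions)

V = np.zeros(n_states)
while True:
    Q = R + gamma * (T @ V)              # Q[s, a] = R[s, a] + γ Σ_s' T[s, a, s'] V(s')
    V_new = Q.max(axis=1)                # optimal Bellman update: V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8: # iterate until convergence
        break
    V = V_new

policy = Q.argmax(axis=1)                # derive the optimal policy with an argmax per state
```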





Note that both of these methods require knowledge of the transition probability p, which means they are model-based algorithms. As mentioned earlier, however, model-based algorithms suffer from scalability problems. So how does Q-learning get around this?





Q-learning updates the Q-value directly from samples:

Q(st,at) ← Q(st,at) + α[rt+1 + γ maxa Q(st+1,a) − Q(st,at)]

Here α (alpha) is the learning rate (i.e., how fast we move toward the goal). The idea behind Q-learning relies heavily on value iteration (V), but the value update is replaced with the formula above. As a result, we no longer need to worry about the transition probability (p).





Note that the next action a′ is chosen so as to maximize the Q-value of the next state rather than by following the current policy. As a result, Q-learning falls into the off-policy category.
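A minimal tabular Q-learning sketch, reusing the hypothetical ToyEnvironment from the first code sketch (the hyperparameters are illustrative):

```python
import random
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))    # the Q-table, updated from samples

env = ToyEnvironment()                 # hypothetical environment from the earlier sketch
for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy behaviour policy: explore occasionally, otherwise act greedily
        a = random.randrange(n_actions) if random.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # off-policy target: bootstrap from max_a' Q(s', a'), not from the action taken
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```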





2.2.    State-Action-Reward-State-Action (SARSA)





SARSA very much resembles Q-learning. The key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm. This means SARSA learns the Q-value based on the action performed by the current policy rather than the greedy policy.





Compare the Q-value update equations:





Q-learning: Q(st,at) ← Q(st,at) + α[rt+1 + γ maxa Q(st+1,a) − Q(st,at)]





SARSA: Q(st,at) ← Q(st,at) + α[rt+1 + γQ(st+1,at+1) − Q(st,at)]





Here at+1 is the action actually performed in the next state st+1 under the current policy.





As you can see from the equations, Q-learning updates its Q-value using the action a that maximizes the Q-value of the next state, Q(st+1, a).





SARSA instead follows the current policy (for example, epsilon-greedy) to select the action a, observe the reward, and then select the next action at+1; the Q-value is updated using Q(st+1, at+1). (Hence the name SARSA: State-Action-Reward-State-Action.)





In other words, SARSA is an on-policy algorithm because the action at+1 used in the update is the one the agent actually takes next. This is what distinguishes it from Q-learning.





Q-learning places no constraint on the next action a: whatever the agent actually does next, the update in state st+1 uses the action a that maximizes Q(st+1, a). So even when Q-learning behaves exploratively (for example, epsilon-greedy), it still learns the Q-values of the greedy policy.





To summarize, the two algorithms differ in whose behavior they evaluate. Q-learning learns the Q-values of the greedy policy while behaving according to a different policy, which makes it off-policy. SARSA learns the Q-values of the very policy it follows, which makes it on-policy.
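In code, the on-policy difference amounts to changing what the update bootstraps from. A SARSA sketch under the same hypothetical setup as the Q-learning sketch above:

```python
import random
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def eps_greedy(s):
    # the same policy both selects actions and supplies the update target
    return random.randrange(n_actions) if random.random() < epsilon else int(Q[s].argmax())

env = ToyEnvironment()                  # hypothetical environment from the first sketch
for episode in range(500):
    s, done = env.reset(), False
    a = eps_greedy(s)
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)     # the second "A" in State-Action-Reward-State-Action
        # on-policy target: bootstrap from Q(s', a'), the action actually taken next
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```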





2.3.    Deep Q Network (DQN)





Q-learning is a very powerful algorithm, but its main weakness is a lack of generality. If you think of Q-learning as updating numbers in a two-dimensional array (action space * state space), it in fact resembles dynamic programming. For states it has never seen, a Q-learning agent has no idea which action to take. In other words, a Q-learning agent cannot estimate the value of unseen states. To deal with this problem, DQN replaces the two-dimensional array with a neural network.





DQN uses a neural network to approximate the Q-value function. The input of the network is the current state, and the output is the Q-value for each possible action.





In 2013, DeepMind applied DQN to the Atari games. The input is the raw image of the current game screen; it passes through several layers, including convolutional and fully connected ones. The output is the Q-value for each action the agent can take.





The key question is: how do we train this network?





The answer: we train it based on the Q-learning update. Recall that the target Q-value for Q-learning is:

yj = rj + γ maxa′ Q(φj+1, a′; θ)





Here φ is equivalent to the state s, and θ denotes the parameters of the Q-network. So the loss function for the network is defined as the squared error between the target Q-value and the Q-value output by the network.





Two more techniques are also essential for training DQN:





1. Experience replay: in a typical RL setup, consecutive training samples are highly correlated and less data-efficient, which makes convergence harder for the network. A way to solve this sample-distribution problem is experience replay: sampled transitions are stored and then drawn at random from the "transition pool" to update the knowledge.





2. Separate target network: the target Q network has the same structure as the network that estimates the Q-value. Every C steps the target network is reset to the main one, i.e. its weights are copied over. This makes the fluctuations less severe and the training more stable. Both techniques appear in the sketch below.
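A condensed sketch of one DQN training step showing both techniques; PyTorch is assumed, and the network sizes, hyperparameters, and the premise that transitions have already been collected into the replay pool are all illustrative:

```python
import random
from collections import deque

import torch
import torch.nn as nn

n_states, n_actions, gamma, C = 4, 2, 0.99, 1000

q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())       # separate target network (technique 2)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                       # the "transition pool" (technique 1)

def train_step(step, batch_size=32):
    transitions = random.sample(replay, batch_size)  # random sampling breaks correlation
    s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*transitions))
    with torch.no_grad():                            # target y = r + γ max_a' Q_target(s', a')
        y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)              # squared error between target and estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:                                # every C steps, sync the target network
        target_net.load_state_dict(q_net.state_dict())
```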





2.4.    Deep Deterministic Policy Gradient (DDPG)





Although DQN achieved huge success in higher-dimensional problems such as the Atari games, its action space remains discrete. However, many tasks of interest, especially physical control tasks, have continuous action spaces. Discretizing the action space too finely makes it enormous. For example, suppose the system has 10 degrees of freedom. If each degree is divided into 4 parts, there are 4¹⁰ = 1048576 possible actions, and it is extremely hard to converge over such a huge action space.





DDPG relies on the «actor-critic» architecture, with two eponymous components. What do they do? The actor tunes the parameters of the policy function, i.e. decides the best action for a given state, while the critic evaluates the policy function estimated by the actor.
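A sketch of the two networks in PyTorch (dimensions and layer sizes are invented for illustration): the actor maps a state directly to a continuous action, while the critic scores a state-action pair:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2                        # illustrative sizes

class Actor(nn.Module):
    """Deterministic policy u(s): state -> continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())   # Tanh bounds the action range

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value Q(s, a): evaluates the action chosen by the actor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))  # score the state-action pair
```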





The critic's evaluation is based on the temporal difference (TD) error:

TD error = rt+1 + γQ(st+1, u(st+1)) − Q(st, at)





Here u denotes the deterministic policy. Does this equation look familiar? Yes! It looks just like the Q-learning update. TD-learning is a way of learning to predict a value from the future values of a given state, and Q-learning is a special case of TD-learning for learning the Q-value.





DDPG also borrows the ideas of experience replay and the separate target network from DQN. Another problem for DDPG is that it rarely explores the action space. A remedy is to add noise to the parameter space or to the action space.
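A sketch of both options, continuing the hypothetical Actor defined above (the noise scales are illustrative):

```python
import torch

actor = Actor()                          # hypothetical Actor from the sketch above
s = torch.randn(1, 8)                    # a dummy state of state_dim = 8

# Option 1: action-space noise, i.e. perturb the chosen action directly.
a = actor(s) + 0.1 * torch.randn(1, 2)

# Option 2: parameter-space noise, i.e. perturb the policy weights themselves,
# so the perturbed policy explores consistently for a whole episode.
with torch.no_grad():
    for p in actor.parameters():
        p.add_(0.01 * torch.randn_like(p))
a = actor(s)                             # the perturbed policy now acts differently
```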





On the left, noise is added to the actions; on the right, to the parameters.

According to an article by OpenAI, adding noise to the parameter space works better than adding it to the action space.







