(Q-learning, SARSA, DQN, DDPG)
Reinforcement learning (RL) is a type of machine learning in which an agent receives a delayed reward at the next time step that evaluates its previous action. It has mainly been used in games (e.g. Atari, Mario), with performance on par with or even superior to humans. More recently, combined with neural networks, RL algorithms have become capable of solving more complex problems.
Because there are a large number of RL algorithms, it is not possible to compare all of them with each other. This article therefore briefly discusses only a few well-known ones.
1. Reinforcement learning
A typical RL setup consists of two components: the agent and the environment.
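The environment is what the agent acts on (for example, the game itself), while the agent is the RL algorithm. The environment first sends the agent a state (state = s); based on what it knows, the agent responds with an action (action = a). The environment then returns the next state (state' = s') and a reward (reward = r), which the agent uses to evaluate its previous action and update its knowledge. The loop repeats until the episode ends.
Most RL algorithms follow this pattern.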
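The interaction loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `GuessEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names invented here, and the environment is a toy in which the agent guesses a hidden bit.

```python
import random


class GuessEnv:
    """Hypothetical toy environment: the agent guesses a hidden bit at each step."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial state s (here simply the step counter)

    def step(self, action):
        hidden = self.rng.randint(0, 1)
        reward = 1.0 if action == hidden else 0.0  # reward r for the action just taken
        self.t += 1
        done = self.t >= self.episode_length       # the episode ends after a fixed number of steps
        return self.t, reward, done                # next state s', reward r, done flag


def run_episode(env, policy):
    """Run state -> action -> (next state, reward) until the episode ends."""
    s = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        a = policy(s)             # the agent picks an action for the current state
        s, r, done = env.step(a)  # the environment answers with s' and r
        total_reward += r
        steps += 1
    return total_reward, steps


total, steps = run_episode(GuessEnv(), policy=lambda s: 0)
print(steps, total)
```

Here the "agent" is a fixed policy that always answers 0; a learning agent would also use `r` to update its knowledge inside the loop.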
Definitions:
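1. Action (A, a): all the possible moves that the agent can take
2. State (S, s): the current situation returned by the environment
3. Reward (R, r): an immediate return sent back by the environment to evaluate the last action
4. Policy (π): the strategy the agent uses to determine the next action (a') from the current state
5. Value (V), or Estimate (E): the expected long-term (discounted) return, as opposed to the short-term reward R. Eπ(s), also written Vπ(s), is the expected long-term return from the current state s under policy π.
6. Q-value (Q): the Q-value is similar to V, except that it takes an extra parameter, the current action a. Qπ(s, a) is the long-term return from the current state s when taking action a under policy π.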
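As a sketch in standard notation (an assumption, since the definitions above are given only in words): with a discount factor γ between 0 and 1 and r_{t+1} the reward received after step t, Value and Q-value can be written as expected discounted returns, and the two are linked through the policy.

```latex
V_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]

Q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]

V_\pi(s) = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a)
```

The last line says that the value of a state is the policy-weighted average of the Q-values of the actions available in it.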
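A further distinction concerns the model of the environment. A model captures the environment's dynamics as a transition probability T(s1|(s0, a)): the probability of moving from state s0 to state s1 after taking action a. If this probability is learned, the agent knows how likely each next state is for a given state and action. Such model-based approaches become impractical as the state and action spaces grow, since a tabular model needs on the order of S*S*A entries.
Model-free algorithms, by contrast, learn by trial and error and do not need to store a transition table over all state/action combinations.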
2. Algorithms
2.1. Q-learning
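Q-learning is an off-policy, model-free RL algorithm based on the well-known Bellman equation:
$$V(s) = \mathbb{E}\left[ r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s \right]$$
Rewritten in terms of the Q-value:
$$Q_\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma Q_\pi(s_{t+1}, a_{t+1}) \mid s_t = s,\ a_t = a \right]$$
The optimal Q-value, denoted Q*, can be expressed as:
$$Q^{*}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\ a_t = a \right]$$
The goal is to maximize the Q-value. Before turning to how the Q-value is optimized, it is worth discussing two value-update methods that are closely related to Q-learning.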
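Policy iteration
Policy iteration alternates between policy evaluation and policy improvement. Policy evaluation estimates the value function V under the «greedy» policy obtained from the previous improvement step. Policy improvement, in turn, updates the policy with the action (action a) that maximizes V for each state. The update equations are based on the Bellman equation, and the loop repeats until convergence.
Value iteration (V)
Value iteration has only one component: it updates the value function V based on the optimal Bellman equation.
Once the iteration converges, the optimal policy follows directly by taking, in every state, the action that maximizes the value.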
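Note that both methods require knowledge of the transition probability p, which makes them model-based. Model-based algorithms, however, scale poorly as the state and action spaces grow. So how does Q-learning get around this problem?
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Here α (alpha) is the learning rate (i.e. how quickly we approach the goal). Q-learning builds heavily on value iteration (V), but the update is replaced by the formula above, so we no longer need to worry about the transition probability (p).
Note that the next action a' is chosen to maximize the Q-value of the next state rather than by following the current policy. Q-learning is therefore an off-policy method.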
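A minimal tabular sketch of Q-learning may help make the update concrete. Everything about the environment is an assumption made for illustration: a hypothetical five-state corridor in which the agent starts at the left end and receives a reward of 1 for reaching the right end. The hyperparameters (`alpha`, `gamma`, `epsilon`) are arbitrary example values.

```python
import random

N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic dynamics; reward 1 only on reaching the right end."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table indexed as q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy, choosing at random when the values tie
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[s][act])
            s_next, r, done = step(s, a)
            # Off-policy target: the best next action, whatever is actually taken next
            best_next = 0.0 if done else max(q[s_next])
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s_next
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(N_STATES - 1)]
print(greedy)
```

No transition probabilities appear anywhere in the code: the agent only needs sampled transitions `(s, a, r, s_next)`.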
2.2. State-Action-Reward-State-Action (SARSA)
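SARSA closely resembles Q-learning. The key difference between SARSA and Q-learning is that SARSA is an on-policy algorithm. That is, SARSA learns Q-values from the action actually taken by the current policy, rather than from the greedy action.
Q-value update rules: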
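Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
SARSA: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$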
Here a_{t+1} is the action taken in state s_{t+1}.
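The two rules look almost the same, except that in Q-learning we update the Q-function assuming we take the action a that maximizes the next state's Q-value, Q(s_{t+1}, a).
In SARSA we use the same policy (for example, epsilon-greedy) that produced the previous action a_t to produce the next action a_{t+1}, and use its Q-value, Q(s_{t+1}, a_{t+1}), for the update. (This is why the algorithm is called SARSA, State-Action-Reward-State-Action.)
Intuitively, SARSA is an on-policy algorithm because the same policy generates both the current action a_t and the next action a_{t+1}. We then evaluate that policy's choices and improve it by improving the Q-value estimates.
Q-learning places no constraint on how the next action a is chosen: it simply takes the optimistic view that from every state s the best action will be taken, and so uses the a that maximizes Q(s_{t+1}, a). This means Q-learning can be fed data generated by any policy (expert, random, even bad ones) and, given enough samples, should still learn the optimal Q-values.
In other words, the two differ in the action used for the update. Q-learning is off-policy because it updates Q with the greedy action in the next state, whereas SARSA updates with the action its own policy actually takes, which makes it on-policy.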
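The difference comes down to a single term in the update target, which the following sketch isolates. The function names and the tiny Q-table are hypothetical, chosen only to show the two targets side by side.

```python
def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: uses the greedy (maximizing) action in s_next."""
    return r + gamma * max(q[s_next])


def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a_next actually selected in s_next."""
    return r + gamma * q[s_next][a_next]


def td_update(q, s, a, target, alpha):
    """Shared update step: move Q(s, a) a fraction alpha toward the target."""
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical Q-table with two states and two actions, indexed as q[state][action]
q_table = [[0.0, 0.0], [1.0, 5.0]]
ql = q_learning_target(q_table, r=0.0, s_next=1, gamma=0.9)
sa = sarsa_target(q_table, r=0.0, s_next=1, a_next=0, gamma=0.9)
print(ql, sa)
```

If the policy happens to pick the exploratory action 0 in the next state, SARSA's target reflects that choice, while Q-learning's target still assumes the better action 1.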
2.3. Deep Q Network (DQN)
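Although Q-learning is a very powerful algorithm, its main weakness is a lack of generality. If you view Q-learning as updating numbers in a two-dimensional array (a table of size action space * state space), it in fact resembles dynamic programming. This means that for states the agent has never seen, Q-learning has no idea which action to take. In other words, a Q-learning agent cannot estimate values for unseen states. To deal with this problem, DQN replaces the two-dimensional array with a neural network.
DQN uses a neural network to approximate the Q-function. The input to the network is the current state, and the output is the Q-value of each action.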
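In 2013 DeepMind applied DQN to Atari games. The input is the raw image of the current game screen. It passes through several layers, including convolutional and fully connected layers. The output is a Q-value for each action the agent can take.
The question is: how do we train the network?
The answer is to train it on the Q-learning update. Recall the target Q-value in Q-learning:
$$y = r_{t+1} + \gamma \max_{a} Q(\phi_{t+1}, a; \theta)$$
Here φ corresponds to the state s, and θ denotes the parameters of the neural network. The loss function of the network is therefore the squared error between the target Q-value and the Q-value predicted by the network.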
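Two further techniques are essential for training DQN:
1. Experience replay: training samples in a typical RL setup are highly correlated and used inefficiently, which makes it harder for the network to converge. Experience replay addresses this sample-distribution problem. In essence, transitions are stored and then drawn at random from a «transition pool» to update the network.
2. Separate target network: the target Q-network has the same structure as the network that estimates the values. Every C steps, the target network is reset to a copy of the current one. As a result, the targets fluctuate less, and training becomes more stable.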
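Both techniques can be sketched without any deep learning library. The sketch below is an assumption-laden illustration, not DeepMind's implementation: `ReplayBuffer` and `maybe_sync_target` are hypothetical names, and network parameters are stood in for by plain lists of floats.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size pool of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive transitions
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def maybe_sync_target(step, sync_every, online_params, target_params):
    """Copy the online parameters into the target every `sync_every` steps (C in the text)."""
    if step % sync_every == 0:
        target_params[:] = online_params  # in-place copy; the target stays frozen in between
        return True
    return False


buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
batch = buf.sample(2)
print(len(buf), batch)
```

In a full DQN, each sampled batch would be used to compute targets with the frozen target parameters, and the loss would be minimized with respect to the online parameters only.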
2.4. Deep Deterministic Policy Gradient (DDPG)
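Although DQN has achieved great success on high-dimensional problems such as Atari games, its action space is still discrete. In many tasks of interest, especially physical control, the action space is continuous. If you discretize it too finely, the action space becomes too large. For example, suppose the system has 10 degrees of freedom and each one is divided into 4 parts. You end up with 4¹⁰ = 1048576 actions. With an action space that large, convergence is extremely difficult.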
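The count grows exponentially with the number of degrees of freedom, which a one-line calculation makes visible. The helper name below is hypothetical.

```python
def discretized_action_count(bins_per_dimension, dimensions):
    """Number of joint actions when every dimension is split into the same number of bins."""
    return bins_per_dimension ** dimensions


print(discretized_action_count(4, 10))  # 1048576, the figure quoted above
```

Doubling the resolution to 8 bins per dimension would multiply this count by 2¹⁰ = 1024.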
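DDPG relies on the «actor-critic» architecture, which has two components of the same names. What do they do? The actor chooses an action for a given state, and the critic evaluates that choice.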
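Temporal-difference (TD) error
$$\delta_t = r_{t+1} + \gamma V^{u}(s_{t+1}) - V^{u}(s_t)$$
Here u denotes the policy chosen by the actor. Does this look familiar? Yes! It looks just like the Q-learning update. TD learning is a way of learning to predict a value from the future values of a given state. Q-learning is a special case of TD learning for learning Q-values.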
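To make the connection explicit, the Q-learning update from section 2.1 can be rewritten around a TD error. This is a sketch in standard notation; the symbol δ is an assumption introduced here for the bracketed term of that update.

```latex
\delta_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t
```

The update moves the current estimate a fraction α toward a one-step lookahead target, which is the defining pattern of TD learning.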
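DDPG also borrows experience replay and the separate target network from DQN. Another issue with DDPG is that it rarely explores. One remedy is to add noise in the parameter space or in the action space.
According to an article by OpenAI, adding noise in the parameter space works better than adding it in the action space.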
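Action-space noise is the simpler of the two options to illustrate. The sketch below assumes independent Gaussian noise and a box-shaped action range; `noisy_action` and its parameters are hypothetical. The original DDPG paper used a temporally correlated Ornstein-Uhlenbeck process, so this is a simplification.

```python
import random


def noisy_action(action, sigma, low, high, rng):
    """Add zero-mean Gaussian noise to each action component, then clip to [low, high]."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


rng = random.Random(0)
deterministic = [0.0, 0.9, -0.9]  # hypothetical output of the actor for one state
explored = noisy_action(deterministic, sigma=0.5, low=-1.0, high=1.0, rng=rng)
print(explored)
```

Because the actor is deterministic, it returns the same action for the same state; the added noise is what lets the agent try nearby actions during training.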