It's all about the virtual "carrot": Uber has created an algorithm that beats humans at Atari games



Uber AI Labs has created a new family of algorithms called Go-Explore. It is based on reinforcement learning, and in tests on classic 1980s Atari games Go-Explore outperforms most existing approaches.



In total, Uber's AI has played through 11 of the toughest games, including Montezuma's Revenge and Pitfall. In terms of points scored, it outperformed human players. The algorithm is not being developed for the sake of games: in the near future it could be applied to robotics, natural language processing, drug discovery, and more. So what is the algorithm built on?



Reinforcement learning



Let's start by recalling what reinforcement learning is and why it has high potential.



This is a well-known form of training neural networks. Its key element is the agent: it does not operate in isolation but learns by interacting with an environment. The environment reacts to the agent's actions by producing a reward signal.



The AI is chasing a virtual carrot: it chooses actions based on how likely they are to bring a reward. If an action brings no reward, it is considered less desirable the next time around.



Given a definition of what counts as useful, reinforcement learning maximizes the resulting reward.
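To make this loop concrete, here is a minimal sketch of the interaction cycle described above, written against the Gymnasium API. The environment name and the random action choice are purely illustrative; a real agent would pick actions it expects to maximize future reward.

```python
# Minimal sketch of the agent-environment loop (illustrative, not Uber's code).
import gymnasium as gym

env = gym.make("CartPole-v1")            # the environment the agent interacts with
observation, info = env.reset(seed=0)

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()   # placeholder: a trained agent would
                                          # choose actions expected to pay off
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                # the "virtual carrot" accumulated so far
    if terminated or truncated:
        observation, info = env.reset()

env.close()
print("Cumulative reward:", total_reward)
```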



What algorithm did Uber create?



The main distinguishing feature of Uber's algorithm is that it remembers previously visited promising states. The algorithm can not only return to them but also explore further from them, as if asking over and over: "What if?" and searching for a new, better answer. This loop makes learning noticeably more efficient.



This ability to remember promising states is central to the family of algorithms from Uber AI Labs. Over many small iterations, the algorithm builds an archive of states, and those states are grouped into cells.
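A rough sketch of such an archive is shown below. The article does not specify how Go-Explore defines a cell, so the cell key used here, a downscaled and coarsely quantized game frame, is an assumption for illustration only, as are the helper names.

```python
# Hypothetical sketch of a state archive grouped into cells.
import numpy as np

def cell_key(frame: np.ndarray, size=(11, 8), levels=8) -> bytes:
    """Map a game frame to a coarse cell so that similar states share a key."""
    h, w = frame.shape[:2]
    small = frame[:: h // size[0] + 1, :: w // size[1] + 1]      # crude downscale
    quantized = (small.astype(np.float32) / 255 * (levels - 1)).astype(np.uint8)
    return quantized.tobytes()

archive = {}  # cell key -> best representative state found for that cell

def maybe_add(frame, simulator_state, score, trajectory):
    """Keep one representative per cell, preferring higher-scoring states."""
    key = cell_key(frame)
    entry = archive.get(key)
    if entry is None or score > entry["score"]:
        archive[key] = {"state": simulator_state, "score": score,
                        "trajectory": list(trajectory), "visits": 0}
```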



Figure: the Go-Explore workflow, showing the exploration and robustification phases.



Go-Explore solves two important problems of reinforcement learning.



The first problem: the algorithm loses interest in previously visited states. Some of those states may still be promising, yet the algorithm starts to ignore them.



The second problem: the exploration process makes it hard to roll back to earlier states. Instead of going back, the AI wanders too far from its starting point and piles on random actions.



Testing the algorithm on games



Uber turned to classic Atari games to make sure its development was effective, picking the games that are hardest for a computer to handle. The difficulty comes from rewards that are too sparse: hundreds of actions can separate the algorithm's decision from a successful outcome, which makes it hard to determine exactly which actions earned the virtual reward.





So how does Uber's algorithm deal with this? It groups similar states into a single cell. Each cycle begins by selecting a state from the archive, with states weighted so that recently found ones, from which new areas can be explored, are preferred. The archive is then updated. In this way Go-Explore works through as many options as possible and, most importantly, does not miss the most interesting states.
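The sketch below shows how one such cycle might look: pick a cell (favoring rarely visited, i.e. recently found, ones), return to its saved state, explore with random actions, and record any new cells. It reuses the hypothetical archive and maybe_add helper from the earlier sketch; the use of ALE's cloneState/restoreState for emulator snapshots is an assumption about the setup, not a description of Uber's implementation.

```python
# Hedged sketch of one "go, then explore" iteration over the archive.
import random

def select_cell(archive):
    """Weight cells by 1/(visits+1) so freshly found cells are picked more often."""
    keys = list(archive)
    weights = [1.0 / (archive[k]["visits"] + 1) for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]

def go_explore_iteration(env, archive, steps=100):
    key = select_cell(archive)
    entry = archive[key]
    entry["visits"] += 1

    ale = env.unwrapped.ale                       # ALE emulator behind the env
    ale.restoreState(entry["state"])              # "go": jump back to that state
    score, trajectory = entry["score"], list(entry["trajectory"])

    for _ in range(steps):                        # "explore": act randomly from there
        action = env.action_space.sample()
        _, reward, terminated, truncated, _ = env.step(action)
        score += reward
        trajectory.append(action)
        frame = ale.getScreenGrayscale()          # frame used to compute the cell
        maybe_add(frame, ale.cloneState(), score, trajectory)
        if terminated or truncated:
            break
```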



The algorithm can then robustify the solutions it has found and filter out external interference. This step reduces the noise level in the discovered trajectories. Why is it needed? In Atari, the environment and actions are well defined: specific moves lead to expected results. To reduce this determinism, artificial delays were introduced into the games, so that the algorithm does not merely replay previously verified actions but learns under conditions closer to real ones.
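One common way to inject this kind of randomness into an Atari environment is "sticky actions", where the previous action is occasionally repeated instead of the new one. Whether this is exactly the delay mechanism the article refers to is an assumption; the wrapper below is only meant to illustrate the idea.

```python
# Sketch of a wrapper that makes an otherwise deterministic game stochastic.
import random
import gymnasium as gym

class StickyActions(gym.Wrapper):
    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        # Occasionally ignore the new action and repeat the previous one,
        # so memorized action sequences no longer succeed automatically.
        if random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```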



As a result, Go-Explore showed good results in two settings:



  1. Without domain knowledge, when the algorithm lacks prior information about the game.
  2. With domain knowledge, when the algorithm knows the relevant inputs (coordinates, keys, and so on).


In the second case, as expected, the results are higher: in Montezuma's Revenge, Go-Explore beat the human player's result by 42.5%, scoring 1.73 million points.



Go-Explore for robots





In addition to Atari games, the family of algorithms was tested on a robotic arm. In a simulator, Go-Explore successfully controlled the arm in a task that required rearranging items on shelves. The arm could not only move the items but also retrieve them from behind latched doors.





