Read about what kind of competition it is and how we managed to win it (more than 2,000 solutions from 51 countries were submitted) under the cut.
Our team
The JBR_HSE team consisted of five members: Konstantin Makhnev (a fourth-year student in the bachelor's program "Applied Mathematics and Computer Science"), Oleg Svidchenko and Vladimir Egorov (both students in the master's program "Programming and Data Analysis"), PhD student Dmitry Ivanov, and Alexey Shpilman, head of the Center for Data Analysis and Machine Learning at NRU HSE - St. Petersburg. Everyone except Konstantin also works in the Agent Systems and Reinforcement Learning lab at JetBrains Research, hence the team's name. During the competition, Kostya was doing an internship in the same lab.
NeurIPS 2020: Flatland - what is it?
Flatland is a competition that ran from July 10 to November 15 on the AIcrowd platform with the support of the renowned international conference NeurIPS. The goal of the competition is to develop an algorithm that manages rail traffic as well as possible. An important limitation was that decisions had to be made within a limited time (5 seconds per simulator step), which made it impossible to simply search for the optimal actions.
NeurIPS 2020: Flatland had two tracks: an overall track and a Reinforcement Learning (RL) track. The first was open to any solutions, the second only to those using reinforcement learning.
The organizers provided the participants with the Flatland simulator: a two-dimensional grid in which each cell has its own type: straight track, turn, fork, or impassable terrain. Each train occupies exactly one cell of the grid and has a direction in which it is currently moving. The simulation proceeds step by step; at each step, for each train, you need to decide which action to take: stop, take the leftmost track, or take the rightmost track. Since the current implementation has no full intersections, i.e. there is no cell from which you can go forward, left, and right at the same time, a separate "go forward" action is not needed. When there are no turns, the second and third actions are equivalent.

Trains can collide with each other, which results in a deadlock. The goal of the competition is to get all trains to their destinations as quickly as possible. Solutions are judged by the total time the trains took to reach their goals; if a train does not reach its destination, its time is set to the maximum simulation time.
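For readers who have not seen the simulator, here is a minimal interaction loop. It is a sketch that assumes the flatland-rl 2.x API that was current during the 2020 competition; constructor arguments and return values may differ in other versions of the library.

```python
# Minimal Flatland interaction loop (assumes flatland-rl 2.x; details may
# differ in other versions of the library).
from flatland.envs.rail_env import RailEnv, RailEnvActions
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.schedule_generators import sparse_schedule_generator
from flatland.envs.observations import TreeObsForRailEnv

env = RailEnv(
    width=30,
    height=30,
    rail_generator=sparse_rail_generator(),
    schedule_generator=sparse_schedule_generator(),
    number_of_agents=5,
    obs_builder_object=TreeObsForRailEnv(max_depth=2),
)
obs, info = env.reset()

done = {"__all__": False}
while not done["__all__"]:
    # One action per train at every simulation step.
    actions = {h: RailEnvActions.MOVE_FORWARD for h in range(env.get_num_agents())}
    obs, rewards, done, info = env.step(actions)
```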
Solution: how the agent interacts with the simulator
There have already been quite a few posts about reinforcement learning on Habr, so we will not dwell on the basic concepts. Let's just say that in the case of Flatland, the railway simulation acts as the environment and the trains act as the agents. It is important to note that since many trains participate in the simulation, the environment is multi-agent.
When solving the problem, we significantly changed how the agent interacts with the simulator, which allowed us to substantially increase the efficiency of our algorithm. In general, such modifications often have a large effect on the result, but they require task-specific knowledge.
One of the most significant changes was that we decided not to give the agent freedom of action when it is not near a fork. Thus, an agent can make decisions only when it is at a fork or directly in front of one; in all other cases it simply moves forward along the only possible path. This greatly shortens the duration of an episode and therefore makes the task much easier. We also decided that an agent's episode ends either when it reaches its goal or when it gets into a deadlock and cannot move (in the simulator, the episode continues in such cases). Later we also decided that not all episodes should start at once; more on this below.
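A rough sketch of this interaction scheme is below. The helper `is_at_or_before_fork` and the `policy` object are hypothetical placeholders for a check on the rail geometry and for our trained actor, respectively; this is not the exact code of the solution.

```python
from flatland.envs.rail_env import RailEnvActions

def choose_actions(env, observations, policy, is_at_or_before_fork):
    """Query the learned policy only at decision points; elsewhere
    the train just follows the only available path."""
    actions = {}
    for handle in range(env.get_num_agents()):
        if is_at_or_before_fork(env, handle):
            # The agent decides: stop, take the leftmost track, or the rightmost one.
            actions[handle] = policy.act(observations[handle])
        else:
            # No fork ahead: keep moving forward, no decision requested.
            actions[handle] = RailEnvActions.MOVE_FORWARD
    return actions
```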
An agent's observation consists of two parts: features tied to a specific part of the path, and features that are not. Here, a part of the path means a section of road connecting two forks.
The first part of the observation can be represented as a tree whose vertices are forks and whose edges are the roads between them. With each edge and the vertex it leads to, we associate a vector of features computed for that section of the path (for example, the length of the section, the distance from the end of the edge to the destination, etc.). We fixed the tree depth, thereby limiting the number of feature vectors obtained from this part of the observation. Below is an example of how this part of the observation is constructed: the colored lines are road sections that become edges of the tree.
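Schematically, this part of the observation can be stored as a fixed-depth tree of per-segment feature vectors and then flattened into a single vector. The structure below is purely illustrative: the field names and the branching factor are our assumptions, and the real feature set is richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SegmentFeatures:
    path_length: float        # length of the road section (the edge)
    dist_to_target: float     # distance from the end of the edge to the destination
    # ... further per-segment features go here

@dataclass
class ObsNode:
    features: SegmentFeatures
    children: List["ObsNode"] = field(default_factory=list)

def flatten(node: Optional[ObsNode], depth: int, out: List[float]) -> None:
    """Turn the fixed-depth tree into a flat vector, padding missing
    branches with zeros so the vector length is always the same."""
    if depth == 0:
        return
    if node is None:
        out.extend([0.0, 0.0])
        children: List[Optional[ObsNode]] = [None, None]
    else:
        out.extend([node.features.path_length, node.features.dist_to_target])
        kids = node.children[:2]
        children = list(kids) + [None] * (2 - len(kids))
    for child in children:
        flatten(child, depth - 1, out)
```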
The second part of the observation includes features such as the minimum distance to the destination, whether the agent is at a fork or in front of one, whether the agent is in a deadlock, etc. It is also worth noting that at the beginning of the episode each agent is assigned an identifier (a random number from 0 to 1), which stays with it for the rest of the episode and helps it coordinate with the other agents. Features that depend on agent identifiers appear in both parts of the observation.
It remains only to define the agent's reward function. After some experimentation, we settled on a rather simple one: $0.01 \cdot \Delta d - 5 \cdot \text{is\_deadlocked} + 10 \cdot \text{is\_succeed}$, i.e. the reward reflects how much the distance $d$ to the destination has changed since the decision was made, with an additional reward if the episode is completed successfully and a penalty if the agent ends up in a deadlock.
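In code, the per-decision reward looks roughly like this (a minimal sketch; the sign convention for $\Delta d$ and the variable names are our assumptions):

```python
def compute_reward(prev_dist: float, cur_dist: float,
                   is_deadlocked: bool, is_succeed: bool) -> float:
    # delta_d: how much the distance to the destination changed since the
    # last decision (assumed positive when the train got closer).
    delta_d = prev_dist - cur_dist
    return 0.01 * delta_d - 5.0 * float(is_deadlocked) + 10.0 * float(is_succeed)
```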
PPO algorithm
There are many reinforcement learning algorithms designed for multi-agent problems. However, we decided to use PPO (Proximal Policy Optimization) as the baseline algorithm, since its code could be reused to implement multi-agent RL algorithms (for example, COMA). Later it turned out that PPO (with some modifications) solves the problem well on its own, while the multi-agent methods trained much worse, so PPO remained the core of our final solution.
The classic PPO algorithm consists of two parts: an actor and a critic. The critic approximates the value function, and the actor learns a stochastic policy $\pi_\theta(a|s)$ that maximizes the expected total reward.
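For reference, the actor is trained with the standard clipped surrogate objective from the original PPO paper (this is the textbook formula, not a competition-specific modification):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

where $\hat{A}_t$ is the advantage estimate computed using the critic's value function and $\epsilon$ is the clipping parameter.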
One of the important modifications we made is a module added to the actor architecture that allows an agent to communicate with nearby agents. The communication process is quite simple: agents simultaneously generate messages and send them to all trains in their vicinity, then, having received messages from their neighbours, combine them into a single message by weighted averaging. A simple self-attention mechanism determines the weights used in this averaging. With this construction, there is no need to modify the training procedure: it is enough to let the gradients flow through the messages during backpropagation.
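A minimal PyTorch sketch of such a communication module is below. The layer sizes, module names, and masking scheme are our illustrative choices, not the exact architecture used in the solution.

```python
import torch
import torch.nn as nn

class MessageAggregator(nn.Module):
    """Each agent emits a message; incoming messages from nearby agents are
    combined by a weighted average whose weights come from a simple
    dot-product attention. Gradients flow through the messages, so no extra
    training machinery is required."""

    def __init__(self, hidden_dim: int, msg_dim: int):
        super().__init__()
        self.to_message = nn.Linear(hidden_dim, msg_dim)
        self.to_query = nn.Linear(hidden_dim, msg_dim)
        self.to_key = nn.Linear(msg_dim, msg_dim)

    def forward(self, hidden: torch.Tensor, neighbour_mask: torch.Tensor) -> torch.Tensor:
        # hidden: [n_agents, hidden_dim]
        # neighbour_mask: [n_agents, n_agents], 1 where two agents are near each other
        messages = self.to_message(hidden)                         # [n_agents, msg_dim]
        scores = self.to_query(hidden) @ self.to_key(messages).T   # [n_agents, n_agents]
        scores = scores.masked_fill(neighbour_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        weights = torch.nan_to_num(weights)                        # agents with no neighbours get a zero message
        return weights @ messages                                  # aggregated incoming message per agent
```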
We decided to train PPO on a small map with a small number of agents, assuming that since an agent observes only its immediate surroundings, it would also work well in larger environments after training.
How do we decide which trains to start, and when?
After trying to apply the PPO algorithm directly, we quickly concluded that most deadlocks occur precisely because trains start moving at the same time and get in each other's way. We solved this problem as follows: we let only a few agents run at a time, and when one of them finished its episode, another agent was allowed to start moving. With this approach, it is important to understand in what order the trains should be launched.
The first approach we tried was to launch the agents that are closest to their goal. In a small environment, with short roads and few agents, it performed well enough, but in large environments it did much worse. We therefore used this approach only during training, and at inference time (i.e., in the submission to the evaluation system) we used a different algorithm. The idea is to train a classifier that predicts, for each agent that has not yet started moving, whether it will reach its destination. If the predicted probability of success is high enough, the agent starts moving; otherwise it waits until the probability becomes sufficient.
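Schematically, the launch logic at inference time looks like this; `success_classifier`, its interface, and the threshold value are hypothetical placeholders rather than the exact submission code.

```python
LAUNCH_THRESHOLD = 0.9  # illustrative value, not the one used in the submission

def agents_to_launch(waiting_agents, observations, success_classifier):
    """Allow an agent to depart only if the classifier estimates a high enough
    probability that it will reach its destination."""
    launched = []
    for handle in waiting_agents:
        p_success = success_classifier.predict_proba(observations[handle])
        if p_success >= LAUNCH_THRESHOLD:
            launched.append(handle)      # start moving now
        # otherwise the agent keeps waiting and is re-checked at the next step
    return launched
```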
Competition results
As a result, our solution took first place in the RL track and eighth place overall. This means that non-RL approaches are still stronger for now, but it shows that reinforcement learning has potential here. Our approach still has quite a few weaknesses and unresolved issues (for example, serious problems with scaling to large environments), so we have plenty to work on. Together with the organizers of the competition, we are now preparing a paper for the NeurIPS 2020 competition proceedings.