Animal-AI Olympics

A reinforcement learning competition inspired by animal cognition.

Lucas Tindall
Gab41

--

The competition

A few members of Lab41 participated in the Animal-AI Olympics competition held at NeurIPS 2019. This competition tested the ability of autonomous agents to perform a variety of complex tasks. While other competitions focused on playing computer games, controlling robots, and autonomous driving, this one centered on “tests inspired by animal cognition.”

The competition’s autonomous agents are placed in a 3D environment where they can move around and interact with various objects. Their goal is to locate food by exploring the environment and navigating complex obstacles. None of the test scenarios were available during training, so agents had to adapt to and learn in environments they had never seen.

One of the more interesting elements of the competition was the memory and planning required to complete certain tasks. Some environments were carefully constructed so that the agent needed to observe and process the geometry of the scene before acting, since a wrong move could prevent it from ever reaching the end goal. In other environments, the visual observations would be randomly disabled, forcing the agent to navigate from memory alone.

Example of a possible configuration.

Our approach

To solve these complex environments, we trained a reinforcement learning (RL) agent, a popular approach for interacting with dynamic environments. An RL agent learns to take actions that maximize its cumulative reward. The variety of states and actions the agent encounters can make this a very difficult problem. At the center of this problem is the trade-off between exploring the environment further in the hope of finding a higher reward and exploiting existing knowledge. Despite this difficulty, RL has been shown to be successful in many environments and games, such as chess, shogi, and Go in the case of DeepMind’s AlphaZero, or Dota 2 with OpenAI Five. Our base approach relied on asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO), two common RL training algorithms.

A3C is an actor-critic method that simultaneously estimates a policy (actor) and a value (critic) function. The policy function controls the actions of the actor, while the value function measures how good the states and actions are. The asynchronous part of A3C comes from a set of independent agents taking actions in their own individual environments and updating a shared set of global parameters asynchronously. Estimating the value function directly can be very unstable, so A3C uses the advantage function instead, which estimates the improvement of a single action over the average action at a given state. To further improve stability during training, we used PPO to update our policy network. PPO constrains the policy update so that the policy does not change too dramatically between update steps.
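
To make the clipped update concrete, here is a minimal sketch of the surrogate loss that PPO optimizes, written in PyTorch; it is illustrative rather than our exact training code, and the advantages would come from an estimator built on the critic’s values.

    # Minimal sketch of the PPO clipped surrogate loss (illustrative, not our exact code).
    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        """Clipped surrogate objective from Schulman et al. (2017)."""
        ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Taking the minimum keeps the update conservative when the ratio drifts too far.
        return -torch.min(unclipped, clipped).mean()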

The agent was simulated inside a simple 3D environment built on the Unity game engine. The input to the model was the agent’s front-facing visual observation at each time step (an 84x84 pixel image). Our feature extraction model consisted of a convolutional neural network module followed by a long short-term memory (LSTM) module.
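
As a rough illustration of that kind of feature extractor, the sketch below runs each frame through a small convolutional stack and feeds the resulting features to an LSTM; the layer sizes are hypothetical and not our exact architecture.

    # Rough sketch of a CNN + LSTM feature extractor for 84x84 observations.
    # Layer sizes are illustrative, not our exact architecture.
    import torch
    import torch.nn as nn

    class VisualEncoder(nn.Module):
        def __init__(self, hidden_size=256):
            super().__init__()
            self.cnn = nn.Sequential(                       # input: (batch, 3, 84, 84)
                nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
            self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)

        def forward(self, frames, state=None):
            # frames: (batch, time, 3, 84, 84)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.reshape(b * t, *frames.shape[2:]))
            out, state = self.lstm(feats.reshape(b, t, -1), state)
            return out, state                               # per-step features for the actor and critic heads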

In RL, the reward is a value the agent receives for taking certain actions in each state, and the agent’s overall goal is to maximize its cumulative reward. Unlike supervised learning, RL has no labeled input/output pairs, so the agent must base all of its learning on the reward signal. This also means that many RL methods are not guaranteed to converge. As a result, the design of the reward function has a significant effect on the performance of the agent. This setting provided a sparse reward signal, making exploration and learning difficult. To address this problem, we augmented the reward to favor behaviors we thought would be helpful for reaching the goal. This is similar to behavior shaping in animal cognition tasks, where the trainer gives incremental rewards for subtasks. For example, we experimented with rewarding the agent based on its distance and angle relative to the goal, and we encouraged exploration by rewarding the agent for traveling to new areas of the environment. To learn a condensed representation of the visual observations, we used the “world models” technique introduced by Ha and Schmidhuber.
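
The sketch below illustrates the reward-shaping idea as a Gym-style wrapper. It is hypothetical: the agent and goal positions (read here from a made-up info dictionary) are not exposed directly by the competition environment, so in practice they have to be estimated.

    # Hypothetical sketch of the reward-shaping idea as a Gym-style wrapper.
    # "agent_position" and "goal_position" are stand-ins for quantities we estimated.
    import numpy as np
    import gym

    class ShapedRewardWrapper(gym.Wrapper):
        def __init__(self, env, distance_weight=0.01, explore_bonus=0.05):
            super().__init__(env)
            self.distance_weight = distance_weight
            self.explore_bonus = explore_bonus
            self.visited = set()

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            pos = np.asarray(info["agent_position"])   # assumed to be estimated/tracked
            goal = np.asarray(info["goal_position"])   # assumed to be estimated/tracked
            # Dense penalty that shrinks as the agent moves closer to the food.
            reward -= self.distance_weight * np.linalg.norm(pos - goal)
            # Exploration bonus for visiting a new grid cell of the arena.
            cell = tuple(np.floor(pos).astype(int))
            if cell not in self.visited:
                self.visited.add(cell)
                reward += self.explore_bonus
            return obs, reward, done, info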

Results

In the end, our best-performing model ranked 34th out of 62 on the leaderboard. It was able to score points in all but two of the test categories, and it did well in the internal models category, which included tests where the lights would go out and the agent’s visual observations would disappear. We submitted several models; the best-performing one was trained with a reward that combined the base reward for finding the food with an additional reward based on the agent’s distance to the food.

Training these models proved quite difficult. In other training runs, the same agent would quickly get stuck against an obstacle and simply wait for the trial to end. Even once the agent could reliably locate the food in a simple environment, it remained a challenge for it to continue learning and performing well in more complex settings. The introduction of many walls, ramps, and other objects made it difficult for agents to generalize.

The winning methods

The first-place winner of the competition was Denys Makoviichuk, a software engineer at Snap Inc. For his winning submission, he used PPO with an LSTM, plus a set of modifications for training stability (generalized advantage estimation, a clipped value loss, and gradient clipping). His CNN architecture used a residual network similar to IMPALA’s, along with channel attention.
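
Generalized advantage estimation is common in PPO implementations, so a generic sketch of the idea may be useful; this is not the winning submission’s code.

    # Generic sketch of generalized advantage estimation (GAE), not the winner's code.
    import numpy as np

    def gae(rewards, values, dones, gamma=0.99, lam=0.95):
        """values carries one extra bootstrap entry: len(values) == len(rewards) + 1."""
        advantages = np.zeros(len(rewards), dtype=np.float32)
        last = 0.0
        for t in reversed(range(len(rewards))):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            last = delta + gamma * lam * nonterminal * last
            advantages[t] = last
        return advantages  # add values[:-1] to these to obtain value targets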

The winning model architecture.

A review of the competition results is available online as part of the recorded talks from the NeurIPS conference.

Lessons learned

We learned several lessons about submitting to RL competitions in this process:

  • Expect to spend some time learning the new RL simulation environment. Almost all RL competitions rely on a custom-designed environment with nontrivial requirements for installation and use, and some competitors may find it useful to modify the environment’s behavior and output during training. To produce the visual output and physics modelling, the organizers built the environment on the Unity game engine. In situations like this it is useful to have extra time to learn the simulation package and how to modify it for training your models. For example, the environment provided by the organizers used an interface similar to OpenAI Gym, which returned the agent’s visual observation along with its velocity and reward (see the sketch after this list). While we were able to estimate the position of the agent from its movements, future work would benefit from modifying the Unity environment to return more detailed information for use in training.
  • Training is a long and difficult process. This is a common concern of many ML practitioners and is especially true for RL competitions. The design of a useful reward function can be surprisingly difficult and result in unexpected behavior. Furthermore, RL algorithms have been shown to be quite brittle, often getting stuck in local optima and producing inconsistent results. For a thorough analysis of the problems facing RL we recommend reading Alex Irpan’s blog post Deep Reinforcement Learning Doesn’t Work Yet. Although it was written in 2018, the points mentioned in the blog still hold true today.
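
For reference, the basic interaction pattern behind that Gym-like interface looks like the loop sketched below; make_env and policy are placeholders rather than the actual Animal-AI API.

    # Rough sketch of a Gym-style episode loop; make_env and policy are placeholders,
    # not the actual Animal-AI API.
    def run_episode(make_env, policy):
        env = make_env()
        obs = env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = policy(obs)  # obs: visual frame (plus velocity in this competition)
            obs, reward, done, info = env.step(action)
            total_reward += reward
        env.close()
        return total_reward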

Future work

We believe that reinforcement learning is an important area of research because of its ability to mimic complex decision-making processes in complicated environments. Real-life scenarios and settings require interaction not only with the environment but also with other individuals. In the future we will explore multi-agent RL methods for use in realistic environments and tasks. The multi-agent setup creates interesting opportunities for learning cooperative and adversarial communication behaviors in settings that may be only partially observable.

Lab41 is a Silicon Valley challenge lab where experts from the U.S. Intelligence Community (IC), academia, industry, and In-Q-Tel come together to gain a better understanding of how to work with — and ultimately use — big data.

Learn more at lab41.org and follow us on Twitter: @_lab41
