# OpenAI: Requests for Research

## Requests for Research

It’s easy to get started in deep learning, with many resources to learn the latest techniques. But it’s harder to know what problems are worth working on.

We’re publishing a living collection of important and fun problems to help new people enter the field, and for enthusiastic practitioners to hone their skills. Many will require inventing new ideas.

Once solved, you can submit a pull request linking to your solution. We’ll accept multiple interesting solutions to each problem.

Sequence-to-sequence models with attention have been enormously successful. They made it possible for neural networks to reach new levels of state of the art in machine translation , speech recognition , and syntactic parsing . Thanks to this work, neural network can now consume inputs of arbitrary shapes, and output sequences of variable length, without much effort on the practitioner’s side.

Implement an attention model that takes an image of a PDF math formula, and outputs the characters of the LaTeX source that generates the formula.

### Getting Started

To get started, proceed according to the following stages:

• Download a large number papers from arXiv . There is a collection of 29,000 arXiv papers that you could get started with. It is likely that this set of 29,000 papers may contain several hundred thousand formulas, which is more than enough for getting started. As the bandwidth of arXiv is limited , it is important to be mindful of their constraints and to not write crawlers to download all the papers of arxiv.
• Use a heuristic to find all the LaTeX formulas in the LaTeX source. It can be done by looking for the text that lies between /begin{equation} and /end{equation} . Here is a list of some of the places where equations can appear in latex files. Additional examples can be found here . It is likely that even a simple heuristic for extracting latex formulas should produce in excess of 100,000 equations; if not, keep refining the heuristic.
• Compile images of all the formulas. To keep track of the correspondence between the latex formulas and their images, it is easiest to place exactly one formula on each page. Then, when processing the latex file, it is easy to keep track of the pages. Be sure to not render formulas so large that they exceed an entire page. Also, be sure to render the formulas in several fonts.
• Train a visual attention sequence-to-sequence model (as in the Show, Attend, and Tell paper , or perhaps a different variant of visual attention) that would take an image of a formula as input, and output the latex source of the formula, one character at a time. A Theano implementation of the Show, Attend, and Tell paper can help you get started. If you wish to implement your model from scratch, TensorFlow can be a good starting point.
• It takes some effort to correctly implement a sequence-to-sequence model with attention. To debug your model, we recommend that you start with a toy synthetic OCR problem, where the inputs are long images that are obtained by concatenating sequences of images of MNIST digits, and the labels should be a sequence of their classifications. While this problem can be solved without an attention model, it is useful as a sanity check, to ensure that the implementation is not badly broken.
• We recommend trying the Adam optimizer .

### Notes

A success here would be a very cool result and could be used to build a useful online tool.

While this is a very non-trivial project, we’ve marked it with a one-star difficulty rating because we know it’s solvable using current methods. It is still very challenging to really do it, as it requires getting several ML components together correctly.

2

## The Inverse DRAW model

Investigate an “Inverse DRAW” model.

The DRAW model is a generative model of natural images that operates by making a large number of small contributions to an additive canvas using an attention model. The attention model used by the DRAW model identifies a small area in the image and "writes" to it.

In the inverse DRAW model, there is a stochastic hidden variable and an attention model that reads from these hidden variables. The outputs of the attention model are provided to an LSTM that produces the observation one dimension (or one group of dimensions) at a time. Thus, while the DRAW model uses attention to decide where to write on the output canvas, the inverse DRAW uses attention to choose the latent variable to be used at a given timestep. The Inverse DRAW model can be seen as a Neural Turing Machine generative model that emits one dimension at a time, where the memory is a read-only latent variable.

The Inverse DRAW model is an interesting concept to explore because the dimensionality of the hidden variable is decoupled from the length of the input. In more detail, the Inverse DRAW model is a variational autoencoder , whose p-model emits the observation one dimension at a time, using attention to choose the appropriate latent variable for each visible dimension. There is a fair bit of choice in the architecture of the approximate posterior. A natural choice is to use the same architecture for the posterior, where the observation will be playing the role of the latent variables.

A useful property of the Inverse DRAW model is that its latent variables may operate at a rate that is different from the observation. This is the case because each dimension of the observation gets assigned to one hidden state. If this model were to successfully be made deep, we would get a hierarchy of representation, where each representation is operating at a variable rate, which is trained to be as well-suited as possible for the current dataset.

It would be interesting to apply this model to a text dataset, and to visualize the latent variables, as well as the precise way in which the model assigns words to latent variables.

### Notes

It is a hard project as it is not clear that a models like this can be made to work with current techniques. However, it makes success all the more impressive.

The inverse DRAW model may have a cost function that’s very difficult to optimize, so expect a struggle.

3

## Q-learning on the RAM variant of Atari

The Q-learning algorithm had a lot of success on learning to play Atari games when the inputs to the model are pixels. Atari games are designed to run on a computer that has a very small amount of RAM, so it can be interesting to try to learn to play Atari games when the input to the neural network is the RAM state of the Atari emulator. Surprisingly, getting Q-learning to work well when the inputs are RAM states has been unexpectedly challenging.

Thus, your goal is to develop a Q-learning implementation that can solve many Atari games when the input to the neural network is the RAM state, using the same setting of hyperparameters on all tasks. In your experiments, use the GymAtari environments that are presented in theRAM-way, where the inputs are the complete RAM state of the Atari computer.

The hope here is that in order to succeed, you’d need to invent techniques for Q-learning that are generally applicable, which will be useful.

### Notes

This project might not be solvable. It would be surprising if it were to turn out that Q-learning would never succeed on the RAM variants of Atari, but there is some chance that it will turn out to be challenging.

4

## Improved Q-learning with continuous actions

Q-learning is one of the oldest and general reinforcement learning (RL) algorithms. It works by estimating the long term future expected return for each state-action pair. Essentially, the goal of the Q-learning algorithm is fold the long term outcome of each state-action pair into a single scalar that tells us how good the state-action pair combination is; then, we could maximize our reward by picking the action with the greatest value of the Q-function. The Q-learning algorithm has been the basis of the DQN algorithm that demonstrated that the combination of RL and deep learning is a fruitful one.

Your goal is to create a robust Q-learning implementation that can solve allGym environments with continuous action spaces without changing hyperparameters.

You may want to use the Normalized Advantage Function (NAF) model as a starting point. It is especially interesting to experiment with variants of the NAF model: for example, try it with a diagonal covariance. It can be also interesting to explore an advantage function that uses the maximum of several quadratics, which is a convenient function because their argmax is easy to compute.

### Notes

This project is mainly concerned with reimplementing an existing algorithm. However, there is significant value in obtaining a very robust implementation, and there is a decent chance that new ideas will end up being required to get it working reliably, across multiple tasks.

5

## Difference of value functions

Bertsekas wrote an interesting paper arguing why it might be better to learn functions that measure the difference in value between states, rather than the value of states. Implement this algorithm with neural networks and apply it to challenging Gym environments.

### Notes

This idea may turn out to not be fruitful, and getting a good result may prove to be impossible.

As it is always desirable to train larger models on harder domains, one important area of research is parallelization. Parallelization has played an important role in deep learning, and has been especially successful in reinforcement learning. The successful development of algorithms that parallelize well will make it possible to train larger models faster, which will advance the field.

The goal of this project is to implement the Trust Region Policy Optimization (TRPO) algorithm so that it would use multiple computers to achieve 15x lower wall-clock time than joschu’s single-threaded implementation on the MuJoCo or AtariGym environments. Given that TRPO is a highly stable algorithm that is extremely easy ot use, a well-tuned parallel implementation could have a lot of practical significance.

You may worry that in order to solve this problem, you would need access to a large number of computers. However, it is not so, as it is straightforward to simulate a set of parallel computers using a single core.

### Notes

It is known that RL algorithms can be parallelized well, so we expect it to be possible to improve upon the basic implementation. What is less obvious is whether it is possible to get 15x speedup using, say, only 20x more nodes.

7

## Multitask RL with continuous actions.

At present, most machine learning algorithms are trained to solve one task and one task only. But we do not necessarily train models on only one task at a time because we believe that it is the best approach in the long term; on the contrary, while we would like to use multitask learning in as many problems as possible, the multitask learning algorithms are not yet at a stage where they provide a robust and a sizeable improvement across a wide range of domains.

This sort of multitask learning should be particularly important in reinforcement learning settings, since in the long run, experience will be very expensive relative to computation and possibly supervised data. For this reason, it is worthwhile to investigate the feasibility of multitask learning using the RL algorithms that have been developed so far.

Thus the goal is to train a single neural network that can simultaneously solve a collection ofMuJoCo environments in Gym. The current enviroments are dissimilar enough that it is unlikely that information can be shared between them. Therefore, your job is to create a set of similar environments that will serve as a good testbed for multi-task learning. Some possibilities include (1) bipedally walking with different limb dimensions and masses, (2) reaching with octopus arms that have different numbers of links, (3) using the same robot model for walking, jumping, and standing.

At the end of learning, the trained neural network should be told (via an additional input) which task it’s running on, and achieve high cumulative reward on this task. The goal of this problem is to determine whether there is any benefit whatsoever to training a single neural network on multiple environments versus a single one, where we measure benefit via training speed.

We already know that the multitask learning on Atari has been difficult (see the relevant papers ). But will multitask learning work better on MuJoCo environments? The goal is to find out.

The most interesting experiment is to train a multitask net of this kind on all but one MuJoCo environment, and then see if the resulting net can be trained more rapidly on a task that it hasn’t been trained on. In other words, we hope that this kind of multitask learning can accelerate training of new tasks. If successful, the results can be significant.

### Notes

It is a reasonably risky project, since there is a chance that this kind of transfer will be as difficult as it has been for Atari.

Implement and test a natural version of Q-learning, and compare it to regular Q-learning.

Natural Gradient is a promising idea that has been explored in a significant number of papers and settings . Despite its appeal, modern approaches to natural gradient have not been applied to Q-learning with nonlinear function approximation.

In this project, your goal is to figure out how to meaningfully apply natural gradient to Q-learning, and to compare the results to a good implementation of Q-learning. Thus, the first step of this project is to implement Q-learning. We recommend either staying with discrete domains (such as Atari), or continuous domains, and to use methods similar to Normalized Advantage Function (NAF) . The continuous domains are easier to work with because they are of lower dimensionality and are therefore simpler, but NAF can be harder to implement than standard Q-learning.

It would be especially interesting if Natural Q-learning were capable of solving the RAM-Atari tasks.

### Notes

This project isn’t guaranteed to be solvable: it could be that Q-learning’s occasional instability and failure has little to do with whether it is natural or not.

In reinforcement learning, we often have several rewards that we care about. For example, in robotic locomotion, we want to maximize forward velocity but minimize joint torque and impact with the ground.

The standard practice is to use a reward function that is a weighted sum of these terms. However, it is often difficult to balance the factors to achieve satisfactory performance on all rewards.

Filter methods are algorithms from multi-objective optimization that seek to generate a sequence of points, so that each one is not strictly dominated by a previous one (see Nocedal & Wright , chapter 15.4).

Develop a filter method for RL that jointly optimizes a collection of reward functions, and test it on the GymMuJoCo environments. Most of these have summed rewards; you would need to inspect the code of the environments to find the individual components.

### Notes

Filter methods have not been applied to RL much, so there is a lot of uncertainty around the difficult of the problem.

10

## Better sample efficiency for TRPO

Trust Region Policy Optimization (TRPO) is a scalable implementation of second order policy gradient algorithm that is highly effective on both continuous and discrete control problems. One of the strengths of TRPO is that it is relatively easy to set its hyperparameters, as a hyperparameter setting that performs well on one task tends to perform well on many other tasks. But despite its significant advantages, the TRPO algorithm could be more data efficient.

The problem is to modify a good TRPO implementation so that it would converge on all of Gym’sMuJoCo environments using 3x less experience, without a degradation in final average reward. Ideally, the new code should use the same setting of the hyperparameters for every problem.

This will be an impressive achievement, and the result will likely be scientifically significant.

### Notes

This problem is very hard, as getting an improvement of this magnitude is likely to require new ideas.