
Data Efficient Reinforcement Learning with Off-policy and Simulated Data

Abstract

Learning from interaction with the environment -- trying untested actions, observing successes and failures, and tying effects back to causes -- is one of the first capabilities we think of when considering autonomous agents. Reinforcement learning (RL) is the area of artificial intelligence research that has the goal of allowing autonomous agents to learn in this way. Despite much recent success, many modern reinforcement learning algorithms still require large amounts of experience before useful skills are learned. Two possible approaches to improving data efficiency are to allow algorithms to make better use of past experience collected with past behaviors (known as off-policy data) and to allow algorithms to make better use of simulated data sources. This dissertation investigates the use of such auxiliary data by answering the question: "How can a reinforcement learning agent leverage off-policy and simulated data to evaluate and improve upon the expected performance of a policy?"

Towards the goal of learning from simulated experience, this dissertation introduces an algorithm -- the grounded action transformation algorithm -- that takes small amounts of real world data and modifies the simulator such that skills learned in simulation are more likely to carry over to the real world. Key to this approach is the idea of local simulator modification -- the simulator is automatically altered to better model the real world for actions the data collection policy would take in states the data collection policy would visit. Local modification necessitates an iterative approach: the simulator is modified, the policy improved, and then more data is collected for further modification.
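The grounding step can be sketched on a toy one-dimensional system. This is a minimal illustration, not the dissertation's implementation: the linear dynamics, the proportional policy, and all function names here are invented for the example, and the forward and inverse dynamics models collapse to a single learned gain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dynamics: the "real" system responds to actions with an
# unknown gain, while the simulator assumes a gain of 1.0.
REAL_GAIN = 0.8
def real_step(s, a): return s + REAL_GAIN * a
def sim_step(s, a):  return s + a

def policy(s):
    # Simple proportional policy toward the origin (illustrative).
    return -0.5 * s

# 1) Collect a small amount of real-world data under the current policy.
states = rng.uniform(-1, 1, size=50)
actions = policy(states)
next_states = real_step(states, actions)

# 2) Fit a forward model of the real dynamics (here, just the gain)
#    by least squares on the observed transitions.
gain_hat = np.sum((next_states - states) * actions) / np.sum(actions ** 2)

# 3) Ground the simulator: transform the policy's action so that taking
#    the transformed action in simulation reproduces the forward model's
#    predicted real-world next state. With these linear dynamics the
#    simulator's inverse model is trivial: a_sim = s_pred - s.
def grounded_step(s, a):
    s_pred = s + gain_hat * a    # forward model of the real world
    a_grounded = s_pred - s      # inverse model of the simulator
    return sim_step(s, a_grounded)
```

After grounding, the simulator matches the real dynamics on the state-action distribution of the data collection policy; in the full algorithm the policy would now be improved in the grounded simulator and fresh real data collected for the next iteration.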

Finally, in addition to examining each of these data sources independently, this dissertation considers combining simulated data with importance-sampled off-policy data. The two are combined via control variate techniques that use simulated data to lower the variance of off-policy value estimation. This combination yields two algorithms -- the weighted doubly robust bootstrap and the model-based bootstrap -- for the problem of lower-bounding the performance of an untested policy.
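The control variate idea can be illustrated on a one-step (bandit-style) toy problem. This sketch is not the dissertation's estimator: the reward model q_hat, both policies, and the percentile-bootstrap lower bound are simplified stand-ins, shown only to make the variance-reduction mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-action toy problem. True mean rewards (unknown to the estimator):
true_r = np.array([1.0, 0.0])
# Approximate model of the rewards (the control variate, e.g. from a
# simulator); deliberately imperfect.
q_hat = np.array([0.9, 0.1])

beta = np.array([0.5, 0.5])  # behavior (data collection) policy
pi = np.array([0.9, 0.1])    # evaluation policy; true value v(pi) = 0.9

# Collect off-policy data under the behavior policy.
n = 2000
a = rng.choice(2, size=n, p=beta)
r = true_r[a] + rng.normal(0.0, 0.5, size=n)
rho = pi[a] / beta[a]        # importance weights

# Ordinary importance sampling estimate of v(pi).
is_terms = rho * r
v_is = np.mean(is_terms)

# Doubly robust estimate: the model supplies a baseline, and importance
# sampling corrects only the model's error, shrinking the variance.
dr_terms = q_hat @ pi + rho * (r - q_hat[a])
v_dr = np.mean(dr_terms)

# Percentile-bootstrap lower bound on v(pi) from the per-sample DR
# terms -- the spirit of using the bootstrap to lower-bound the
# performance of an untested policy.
boot_means = [np.mean(rng.choice(dr_terms, size=n)) for _ in range(1000)]
lower_bound = np.percentile(boot_means, 5)  # 95% lower confidence bound
```

Because the model absorbs most of each return, the doubly robust terms vary far less than the plain importance-sampled terms, which tightens the resulting bootstrap lower bound.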

[pdf]