Consistent exploration improves convergence of reinforcement learning on POMDPs

Paul A. Crook and Gillian Hayes


Status

Published in the proceedings of the AAMAS 2007 Workshop on Adaptive and Learning Agents (ALAg-07), Honolulu, Hawaii, May 2007.
 

Abstract

This paper sets out the concept of consistent exploration of observation-action pairs. We present a new temporal difference algorithm, CEQ(lambda), based on this concept and demonstrate, using a randomly generated set of partially observable Markov decision processes (POMDPs), that it outperforms SARSA(lambda). This result should generalise to any POMDP where satisficing policies that map observations to actions exist. We also set out reasons for preferring CEQ(lambda) over an alternative Monte-Carlo style algorithm, MCESP, when working in the robotics domain.


Alternatively, you can request a copy by emailing me: