Published in the proceedings of the AAMAS 2007 Workshop on
Adaptive and Learning Agents (ALAg-07), Honolulu, Hawaii,
May 2007.
This paper sets out the concept of consistent exploration of observation-action pairs. We present a new temporal difference algorithm, CEQ(lambda), based on this concept and demonstrate, using a randomly generated set of partially observable Markov decision processes (POMDPs), that it outperforms SARSA(lambda). This result should generalise to any POMDP for which satisficing policies that map observations to actions exist. We also set out reasons for preferring CEQ(lambda) over an alternative Monte-Carlo style algorithm, MCESP, when working in the robotics domain.
Alternatively, you can request a copy by emailing me.