Reinforcement Learning - Java Test Platform (RL-JTP)

Downloading

The code is packaged into three bzipped (bzip2) tarballs, RL-JTP (core), RL-JTP (utree) and RL-JTP (webots), which can be downloaded from the Informatics Software Download Database.

PLEASE NOTE: the code in the download database has not been maintained and is now rather old. The core and utree packages work fine (though modern Java compilers will complain about the lack of generics/typing). The webots package will only work with versions of Webots prior to 5.8.0. An updated version that works with the Webots 6 API is available on request - please email me if you require it.

RL-JTP (core) provides nearly all of the reinforcement learning algorithms that have been implemented on this platform, with the exception of U-Tree (see Andrew McCallum's PhD thesis for details of this algorithm). The core tarball also provides a selection of grid worlds commonly used in the reinforcement learning (RL) literature, along with a selection of agents that have differing abilities and perceptions of the grid world they are used with. RL-JTP (core) is a stand-alone package and doesn't require any additional software other than a Java virtual machine.
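
Once unpacked, compiling and running the core package should be straightforward. For example, assuming the sources unpack into a single directory (the exact layout of your download may differ):

prompt> javac *.java
prompt> java GridWorld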

RL-JTP (utree) adds the U-Tree learning algorithm; however, as it requires the installation of Graphviz and the Grappa classes for Java, it has been bundled separately. Graphviz and Grappa can be obtained from http://www.research.att.com/~john/Grappa/. The U-Tree classes should be placed in the same directory as the "core" classes. With the Grappa directories added to the Java library path and classpath, the U-Tree classes should compile. U-Tree is selected using the same interface as the other reinforcement learning algorithms; see the examples below. When installed, the algorithm name 'utree' appears as one of the listed algorithms.
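
For example, assuming the Grappa classes are available as a jar alongside the RL-JTP sources (the jar path below is only a placeholder for wherever Grappa lives on your system), compilation would look something like:

prompt> javac -classpath .:/path/to/grappa.jar *.java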

RL-JTP (webots) is an extension that interfaces the "core" learning algorithms with simulated robots in the Webots simulator (produced by Cyberbotics). A licensed copy of Webots is required to make any use of this code. The B21CorridorsController and B21CorridorsSupervisor directories should be installed in a local 'controllers' directory of Webots. The VRML file which describes the corridor world should be installed in a local 'world' directory of Webots. See the Webots manual, which describes setting up the appropriate paths and directories for your own agents and worlds. The webots-alias script needs to be on your executable path. The B21CorridorsController, B21CorridorsSupervisor, WebotsAlias and 'core' grid-world directories all need to be added to your CLASSPATH. You will probably also need to edit the webots-alias script to remove the FC3 machine test. The code in the Informatics Software Download Database doesn't work with the latest version of Webots; the last known working version was Webots Pro 5.1.10. If you need an updated version which works with the Webots 6 API please email me.
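
As a rough illustration, the bash setup might look something like the following, where every path is a placeholder to be replaced with the location in your own installation:

prompt> export PATH=$PATH:/path/to/rl-jtp-webots
prompt> export CLASSPATH=$CLASSPATH:/path/to/controllers/B21CorridorsController:/path/to/controllers/B21CorridorsSupervisor:/path/to/WebotsAlias:/path/to/rl-jtp-core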


General notes

All the code was developed under Linux and will possibly not work well under Windows, especially due to the long file names that are used when saving out snapshots of the agent's state-action table. The code is distributed under the GPL license.

Looking for CEQ(lambda)? If you're looking for the Java class of the learning algorithm CEQ(lambda), as mentioned in my thesis, it is called "Tweaked Watkins' Q-lambda for POMDPs" in the code. It also refers to itself as "twqlps" in its usage statement (it should read "ceq-lambda") and saves data files using the name TWQLPS.




Examples of Usage

In lieu of documentation, here are a few examples of usage to get you going. Everything is invoked from the command line as this makes it much easier to run batches of experiments.


prompt> java GridWorld

Produces a list of the mandatory and [optional] arguments, i.e.

Usage: java GridWorld [-nographics] [-noanimation] [-printevaluation dir] [-textOutputFile filename] [-notext] [-silent] [-nosave] [-prefix datafile prefix] [-postfixStart int] [-repetitions int] [-stats filename] [-paths regexp] [-rand seed] [-limitTrainingEpisodes int] world agent algorithm iterations evaluations


prompt> java GridWorld any any any 1 1

If a problem is found with the name of the world, agent or algorithm then a list of valid names is provided. Each mandatory argument is parsed in turn and the list corresponds to the first with which there was a problem. In this case there isn't a world called 'any', so the response is:

Problem in setting up specified world, agent or learning algorithm.

WorldSetUpException: Available worlds are: test-world long-test-world whiteheads-aliased-world suttons-grid-world suttons-grid-world-prime suttons-grid-world-double-prime perkins-problem bistable-problem bistable-swapped-problem bistable-swapped-problem2 mccallums-maze woods100 parr-and-russell wilsons-woods-7 woods-7-penalty pauls-maze random-world


prompt> java GridWorld suttons-grid-world abs-pos sarsa-lambda 1 1

Some worlds, some agents and most learning algorithms require additional arguments, for example SARSA(lambda) above. If the additional arguments are badly formatted or missing, then all the required arguments are listed.

java.lang.Exception: Error in format of arguments.

Usage: sarsa-lambda[lambda=N, trace=accum/replace, truncate=no/INTEGER, alpha=N, gamma=N, conduct=e-greedy[N]/softmax[N]/fixed[policy=filename]/e-fixed[epsilon=N:policy=filename], initSAvs=zeros/uniform[min:max]/gaussian[mean:std], file=optional_file]


Now for something that should work...

prompt> java GridWorld -nosave suttons-grid-world abs-pos sarsa-lambda[lambda=0,trace=replace,truncate=no,alpha=0.1,gamma=0.9,conduct=e-greedy[0.1],initSAvs=zeros,file=] 50 5

This should display a grid world with an agent wandering around it, plus a second map indicating the policy (initially unformed). The policy will be evaluated every 5 steps and the policy display updated. The agent will run for a total of 50 steps before halting. I've only implemented SARSA(lambda), not SARSA, as they are equivalent for lambda=0 as used above. The trace type (replacement or accumulating) and whether to truncate it or not are still required arguments even when lambda=0. Conduct controls exploration; in this case epsilon-greedy exploration is selected with epsilon fixed at a value of 0.1. State-action values are all initially zero (initSAvs=zeros), and no previous state-action table is loaded (file=). NB there must be no spaces between SARSA's arguments inside the square brackets.
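
For reference, the sketch below shows the standard tabular SARSA(lambda) update with replacement traces. It illustrates the textbook rule in Java rather than the platform's own classes, and all of the names in it are made up.

// Illustrative tabular SARSA(lambda) update with replacement traces.
// A sketch of the textbook rule, not RL-JTP's own classes.
class SarsaLambdaSketch {
    double[][] q;  // state-action values, indexed [state][action]
    double[][] e;  // eligibility traces, same shape as q

    void update(int s, int a, double r, int sPrime, int aPrime,
                double alpha, double gamma, double lambda) {
        double delta = r + gamma * q[sPrime][aPrime] - q[s][a];
        e[s][a] = 1.0;  // replacement trace: set to 1 rather than incremented
        for (int i = 0; i < q.length; i++) {
            for (int j = 0; j < q[i].length; j++) {
                q[i][j] += alpha * delta * e[i][j];
                e[i][j] *= gamma * lambda;  // decay every trace
            }
        }
    }
}

With lambda=0 every trace other than e[s][a] is immediately decayed to zero, so the update collapses to the one-step SARSA rule, which is why SARSA(lambda) with lambda=0 stands in for plain SARSA above.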


prompt> java GridWorld -silent -prefix ./ suttons-grid-world abs-pos sarsa-lambda[lambda=0.9,trace=replace,truncate=no,alpha=0.1,gamma=0.9,conduct=e-greedy[0.2-actions/100000],initSAvs=zeros,file=] 40000 10000

This should silently learn the policy for Sutton's grid world using the abs-pos agent and SARSA(0.9) with replacement traces. Exploration is again epsilon-greedy, but the value of epsilon depends on the number of actions that have passed. Epsilon will initially be 0.2 but will reach zero after 20,000 steps. Once it is <=0, no random exploratory actions are selected. The code runs for a total of 40,000 steps and a copy of the state-action table is saved every 10,000. Files are saved in the current directory (-prefix ./). If '-prefix' is not given then files are saved in a subdirectory called 'data/' (an error will occur if no such directory exists).
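
The decaying epsilon schedule amounts to the following calculation (an illustration of the arithmetic only; the exact clamping behaviour in the platform's code may differ):

// Illustrative epsilon schedule for conduct=e-greedy[0.2-actions/100000].
double epsilon(long actions) {
    double eps = 0.2 - actions / 100000.0;
    return Math.max(eps, 0.0);  // once <= 0, no random exploratory actions are taken
}
// epsilon(0)     = 0.2
// epsilon(10000) = 0.1
// epsilon(20000) = 0.0 (and zero thereafter)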


prompt> java GridWorld -nosave suttons-grid-world abs-pos q-learning[alpha=0,gamma=0.9,conduct=e-greedy[0],initSAvs=zeros,file="SARSA(0.9) replace truncate(no) alpha0.1 absPos suttons gamma0.9 e-greedy[0.2-actions|100000] zeros runs3961 actions40000.1"] 1000 1000

This loads and displays the policy that had been learnt by the end of the above run. Provided the agent is the same, state-action value tables generated by SARSA(lambda) are interpretable by Q-learning and vice versa. In order to view the policy without risk of changing it, learning is inhibited by setting alpha=0. Exploration is also inhibited by using epsilon-greedy with epsilon=0. The filename needs to be quoted because of the spaces in it. The filename provides a record of the parameters used, the action steps that have passed (40,000) and the number of times the goal has been reached (runs 3,961). NB the number of runs will probably differ in your case, so amend the command line accordingly.
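
To see why alpha=0 freezes the loaded table, consider the standard one-step Q-learning update (again just a sketch of the textbook rule, not the platform's code):

// Sketch of the standard one-step Q-learning update.
// With alpha = 0 the increment is zero, so the loaded table is never modified;
// with epsilon = 0 the greedy action is always taken, so the policy is simply displayed.
double best = Double.NEGATIVE_INFINITY;
for (double v : q[sPrime]) best = Math.max(best, v);  // max over next-state actions
q[s][a] += alpha * (r + gamma * best - q[s][a]);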


To view the Q-values in the saved state-action table file use HashTableFileReader, i.e.

prompt> java HashTableFileReader "SARSA(0.9) replace truncate(no) alpha0.1 absPos suttons gamma0.9 e-greedy[0.2-actions|100000] zeros runs3961 actions40000.1"


To save out a record of the statistics which are generated each time the current policy is evaluated, use the -stats filename option, i.e.

prompt> java GridWorld -silent -prefix ./ -stats statsFile suttons-grid-world abs-pos sarsa-lambda[lambda=0.9,trace=replace,truncate=no,alpha=0.1,gamma=0.9,conduct=e-greedy[0.2-actions/100000],initSAvs=zeros,file=] 40000 10000

Unlike the state-action table files, which contain Java Objects, the statistics files are human readable and can be loaded into Matlab, etc.
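
Because everything is driven from the command line, batches of experiments are easy to script. For example, a bash loop over random seeds might look something like this (the seed values and statistics file names are purely illustrative):

prompt> for seed in 1 2 3 4 5; do
          java GridWorld -silent -nosave -rand $seed -stats stats.$seed suttons-grid-world abs-pos sarsa-lambda[lambda=0.9,trace=replace,truncate=no,alpha=0.1,gamma=0.9,conduct=e-greedy[0.1],initSAvs=zeros,file=] 40000 10000
        done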


Webots Example

If correctly installed, the usage of Webots with the reinforcement learning algorithms is similar to the above:

> webots-alias [-nographics] [-notext] [-silent] [-countRunsNotActions] [-nosave] [-prefix datafile_prefix] [-postfixStart int] [-stats filename] [-limitTrainingEpisodes int] [-startPosition float float float] world agent algorithm iterations evaluations

A working example is:

prompt> webots-alias -nosave b21_corridors.wbt b21-basic q-learning[alpha=0.1,gamma=0.9,conduct=e-greedy[0.1],initSAvs=gaussian[0:1],file=] 100 100

In my .bashrc file I have the line alias webots='webots-alias'. This aliases the command webots so that I just type 'webots' in place of 'webots-alias' above. The WebotsAlias code passes through any calls to webots not intended for it.

NB: because of the way that Webots works, state-action files are saved relative to the B21CorridorsController directory.