Detailed descriptions of each Tab in the Java Applet:
There are four tabs in the demo:
1. “State-Space View” Tab
This shows the trajectories being learned and provides a control panel for the learning algorithms. This is the key tab for selecting an algorithm and observing its progress graphically in real time. It's best to stay on this tab to observe the learning progress, as it shows an overview of all trajectories updating in real time and shows the entire value-function.
(i) In the first large box (marked “Control Panel”), check the ‘Run Algorithm’ box to start an algorithm learning. Use this in conjunction with the ‘Randomize Neural Network Weights’ button whenever you change algorithm to re-start the learning process. Whenever the “Run Algorithm” box is selected, the algorithm will be running and the “Number of Iterations” counter will be increasing.
Also in the Control Panel section, there are three algorithms to choose between in the drop-down menu:
(a) Value Gradient Learning (‘VGL’).
(b) Traditional Value Learning (‘VL’).
(c) Value Gradient Learning using an ‘Omega Matrix’ (‘VGL-Omega’). This is an enhanced version of the basic VGL algorithm which achieves greater stability.
(ii) In the main graph of the State-Space View, trajectories start at the small dark-blue squares and finish on the Velocity axis.
The grey-scale background shows the output of the value-function modulo a constant. This means the colours wrap around whenever they go off the end of the colour scale. So the sharp lines of contrast are not significant at all (confusingly). Maybe I could have used a fancy 3D plot to get round this, or shown contours? But both of those would have been difficult to implement.
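To see why the wrap-around produces sharp lines that carry no meaning, here is a minimal Python sketch. The function name `grey_level`, the `period` parameter and the 256-level scale are assumptions made for illustration; they are not taken from the applet.

```python
def grey_level(value, period=10.0, levels=256):
    """Map a value-function output to a grey level that wraps around.

    `period` (the range of values covered by one sweep of the colour scale)
    and `levels` are hypothetical parameters, not the applet's settings.
    """
    fraction = (value % period) / period          # position within the current sweep
    return min(levels - 1, int(fraction * levels))  # 0 = black, levels - 1 = white


# Two nearly equal values on either side of a wrap point:
just_below = grey_level(9.99)    # near-white
just_above = grey_level(10.01)   # near-black
```

Under these assumptions, two almost identical values can land at opposite ends of the scale, which is exactly the kind of contrast line that should be ignored.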
The actual target gradients (G), denoted by cyan-coloured lines, are the gradients of the value function with respect to the state vector, so the cyan-coloured lines should point in the direction of increasing whiteness of the grey-scale background. See diagram. Note, however, that if the “project gradients” box is checked, only the horizontal components of the gradients are displayed. This is because the algorithm VGL-Omega only learns this component, and that component is the only one the greedy policy actually uses (in the 1D Lunar Lander problem).
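As a rough sketch of what displaying a projected gradient amounts to, the following Python keeps one component of a gradient vector and zeroes the rest. The function name `project_gradient` and the choice of which index corresponds to the horizontal axis of the plot are assumptions for illustration only.

```python
def project_gradient(gradient, axis=0):
    """Keep only one component of a gradient vector, zeroing the others.

    `axis` is the index of the component to keep. Which index matches the
    horizontal axis of the plot is an assumption, not taken from the applet.
    """
    return [g if i == axis else 0.0 for i, g in enumerate(gradient)]


# A two-component gradient reduced to its component along axis 0
projected = project_gradient([0.3, -1.2], axis=0)
```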
In this view state-space is actually cut down from three dimensions (height, velocity and fuel remaining) to just two (height and velocity). This is for the purposes of producing a 2D view in this tab page. Also, the neural network only has two inputs for these two dimensions – the neural network is not told how much fuel is remaining.
(iii) The “Details at mouse pointer” box: This powerful tool allows you to probe state-space by moving the mouse pointer over the grey-scale graph, displaying V, V', G and G'. You can use this to check how closely the objectives of an algorithm have been met along a given trajectory (e.g. does V=V', or does G=G'?). For display purposes, only the first two components of G and G' are displayed, even though technically G' does have a component in the concealed “fuel” dimension.
(iv) The “Display Options” box: If you select the “Hi Res” box, the grey scale is shown in much higher resolution. Because this is so CPU-intensive, the animation stops, although learning continues in the background.
Note that the boxes ‘Show Actual Gradients’ and ‘Show Target Gradients’ are un-checked by default when using the VL-based algorithm as they have no significance to the learning process.
The “Plotted Gradients are Projected” box means, roughly, that only the horizontal components of the value-gradients are shown. This is because the algorithm “VGL-Omega” only learns this component.
2. “Simulation” Tab
This runs actual trajectories by simulation. There is a one-to-one correspondence between the trajectories you see here and those in State-Space View. However, in this tab you have to wait for the spacecraft to travel along each trajectory, which is inefficient for viewing learning algorithms in progress. The learning algorithm will usually have performed many iterations before a single flight is completed in this slow manner. Consequently it's best NOT to use this tab to observe learning. This tab is included only to show what the problem actually is! Once you are familiar with what is shown here, it is best to ignore this tab and simply use the State-Space View to observe the learning process in action.
The reward function is shown on this tab as “Total Reward”.
The “Flight Number” selector only makes available the number of trajectories currently being learned, so is disabled unless you are learning multiple trajectories.
3. “Network details” Tab
This is simply a real-time numerical printout of the Neural Network’s weight values.
4. “Advanced Options” Tab
Use this tab to select optional extensions to the learning algorithm selected and to observe various performance measures of the learning progress.
(i) The degree of ‘exploration’ can be set in the first large box on this tab (of particular relevance to traditional Value Learning). To select some exploration, increase the number of trajectories to be learnt in batch mode, add some stochastic contribution to the policy, or do both. A description of the exploration you have selected will appear below.
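One common way to add a stochastic contribution to a policy is to perturb the greedy action with random noise. The sketch below assumes Gaussian noise and a bounded scalar action; the function name `explore`, the noise model and the action bounds are all hypothetical, and the applet may do this differently.

```python
import random


def explore(greedy_action, noise_scale, rng, low=0.0, high=1.0):
    """Return the greedy action plus Gaussian noise, clipped to [low, high].

    The Gaussian noise model and the action bounds are assumptions made for
    illustration; they are not taken from the applet.
    """
    noisy = greedy_action + rng.gauss(0.0, noise_scale)
    return max(low, min(high, noisy))


rng = random.Random(0)                    # seeded so runs are repeatable
deterministic = explore(0.5, 0.0, rng)    # zero noise leaves the action unchanged
perturbed = explore(0.5, 0.1, rng)        # a noisy action, still within the bounds
```

With `noise_scale` set to zero the policy is purely greedy, so this single parameter controls how much exploration takes place.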
(ii) The second large box (“Error Measures”) shows in real-time the numerical values of the various error-measures during the learning process. The algorithms of this demo each try to minimise a different one of these, with differing levels of success. The algorithm VGL-Omega tries to maximise R as well as minimise its specific error measure.
(iii) The third large box allows some extra algorithm options to be selected, possibly of interest to Reinforcement Learning researchers:
It's best to leave the optimiser on “RPROP” all of the time, as this is a fast, robust and fully automatic optimiser that's far better than “small steps”. (This implementation of RPROP is actually only an approximate one.)
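For readers unfamiliar with RPROP, here is a minimal Python sketch of one textbook variant (often called iRPROP-), written for gradient descent. It is not the applet's approximate implementation; the function name `rprop_step` and the default constants are assumptions, chosen to match commonly quoted values.

```python
def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Apply one in-place RPROP update to `weights` (minimisation).

    Only the sign of each gradient is used. Each weight has its own step
    size, which grows while the gradient keeps its sign and shrinks when
    the sign flips. All default constants are assumed, not the applet's.
    """
    for i, g in enumerate(grads):
        agreement = g * prev_grads[i]
        if agreement > 0:                       # same sign as last time: speed up
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif agreement < 0:                     # sign flipped: we overshot, slow down
            steps[i] = max(steps[i] * eta_minus, step_min)
            g = 0.0                             # and skip the move on this iteration
        if g > 0:
            weights[i] -= steps[i]
        elif g < 0:
            weights[i] += steps[i]
        prev_grads[i] = g


# Usage: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, prev, step = [0.0], [0.0], [0.1]
for _ in range(200):
    rprop_step(w, [2.0 * (w[0] - 3.0)], prev, step)
```

Because only gradient signs are used, there is no learning rate to tune, which is what makes this kind of optimiser "fully automatic".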
Bootstrapping is defined in the paper and occurs when lambda<1. Bootstrapping is a commonly used idea in reinforcement learning, but it can lead to instability. If lambda<1, then a warning appears saying that convergence is unlikely.
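To illustrate what bootstrapping means, the sketch below computes target values along one trajectory using the standard λ-return recursion from the reinforcement learning literature. This may differ in detail from the definition in the paper; the function name `lambda_targets`, the indexing convention and the assumption that the terminal state has a target of zero are all choices made for illustration.

```python
def lambda_targets(rewards, values, lam, gamma=1.0):
    """Compute target values V' for each state of one trajectory.

    rewards[t] is the reward for the step from state t to state t+1, and
    values[t] is the current approximated value V of state t, so
    len(values) == len(rewards) + 1. The final state is assumed terminal,
    with a target of zero (an assumption, not taken from the paper).
    """
    targets = [0.0] * len(values)
    for t in reversed(range(len(rewards))):
        # Blend the next target with the next *approximated* value.
        # With lam == 1 the approximation is never used: no bootstrapping.
        blended = lam * targets[t + 1] + (1.0 - lam) * values[t + 1]
        targets[t] = rewards[t] + gamma * blended
    return targets


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
no_bootstrap = lambda_targets(rewards, values, lam=1.0)    # [3.0, 2.0, 1.0, 0.0]
full_bootstrap = lambda_targets(rewards, values, lam=0.0)  # [1.5, 1.5, 1.0, 0.0]
```

With lambda=1 the targets depend only on the rewards actually received, whereas with lambda<1 they also depend on the approximated values themselves, so the targets shift as learning proceeds.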