Comparing Neural-Network based Reinforcement Learning by Value-Gradients with traditional Reinforcement Learning by Value alone

An Interactive Demonstration by Michael Fairbank

Introduction

With this demonstration I aim to illustrate graphically the main ideas set out in my paper. It shows that Value-Gradients are a valuable tool for Reinforcement Learning. The demonstration differs from the others on this website in that it shows the learning algorithms running.

Traditional reinforcement learning systems attempt to assign an absolute value to each position along a particular trajectory. They do not, in general, evaluate every position in state-space. Consequently, it is usually necessary for the traditional system to incorporate an element of 'exploration' either side of the instant trajectory under analysis.

By focusing instead on the first derivative (the gradient) of the traditional value function along any particular trajectory, the learning process is greatly improved in very many circumstances.

It is graphically demonstrated that the Value-Gradient based learning algorithms solve a test problem virtually instantaneously without ‘exploration’ (typically within 5 to 10 seconds). Without ‘exploration’, the demonstration shows that certain popular traditional algorithms are incapable of solving the problem. With ‘exploration’, the situation for the traditional algorithms is less clear: the algorithms meander about, and are certainly highly unstable. However, they can, given time, come close to solving the problem, so in theory they might eventually solve it. Whether or not they can reach a final solution is really less significant than the fact that the performance of the value-gradient based learning is so vastly superior.

Problem used in demo: The 1D Lunar Lander

This is a problem I have created in which a spacecraft is just dropped from a height and it has to land gently and in a fuel efficient manner. Everything is in a straight vertical line – there are no rotations. The only control variable is thrust applied during the descent.

Each “flight”, or trajectory, is assigned a total reward: Reward = -k_1*(landing velocity)^2 - k_2*(fuel used), where k_1 and k_2 are fixed positive constants. The aim of learning is to achieve optimal trajectories by maximising the above reward function.
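The reward formula above can be written out as a short calculation. This is only a minimal sketch: the function name `trajectory_reward` and the default values k1 = k2 = 1.0 are assumptions made for illustration, not the constants used in the demo.

```python
def trajectory_reward(landing_velocity, fuel_used, k1=1.0, k2=1.0):
    """Total reward for one flight: -k1 * (landing velocity)^2 - k2 * (fuel used).

    k1 and k2 are fixed positive constants; the defaults here are
    illustrative assumptions only.
    """
    return -k1 * landing_velocity ** 2 - k2 * fuel_used


# A hard landing that burns no fuel, versus a gentler landing that burns some.
hard = trajectory_reward(landing_velocity=-4.0, fuel_used=0.0)
soft = trajectory_reward(landing_velocity=-1.0, fuel_used=3.0)
print(hard, soft)
```

With these assumed constants the gentler landing scores higher even though it uses fuel, which is the trade-off the reward function is meant to capture.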

This measure of success is a combination of a ‘safe’ landing and the total amount of fuel used. The concept of a ‘safe’ landing is itself somewhat idealised (although this does not detract in any way from the worth of the problem): the overall performance measure penalises both the terminal (landing) velocity and the fuel used. As I discuss further below, the aim is therefore not to land at exactly zero velocity. The time taken for the descent is not taken into consideration.

Brief overview of the Java Applet and Theory

In the Java applet itself, there are several tabs:

The “Simulation” tab allows you to view the spacecraft's flight as controlled by the current neural network. Trajectories can also be viewed more quickly in the “State-Space View” tab. In this view you can see whole trajectories without having to wait for the spacecraft to land slowly, which makes it far more efficient to observe the progress of learning. Trajectories are the dark-blue lines that start at a dark-blue square. Look at the axes’ titles on the “State-Space View”, and read the details at the mouse pointer to understand the connection between these two tabs. This takes a bit of time to understand (see example diagrams), but it is the best view to use to see the learning algorithms in action.

The “State-Space View” tab also shows a green line which is the theoretically optimal trajectory (derived in the paper). Hence the objective of learning is to make the actual trajectory (the dark-blue line) match the optimal trajectory (the green line). See diagrams and notes on optimal trajectories.

The grey-scale background of the “State-Space View” tab shows the output of the “value-function”. The value function is a fundamental paradigm of Reinforcement Learning. The value function aims to rate any point in state-space with a particular value of “goodness”. If you knew the value-function perfectly, then you could fly the spacecraft perfectly – at each moment of time simply consider what would happen by either thrusting or not thrusting, and simply choose the better of the two each time. This process is known as the “greedy policy”. If the value function is fully known then this will produce an optimal trajectory. Different algorithms aim to learn the value-function using different methods. The value-function is defined more precisely in the paper.
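The greedy policy described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the two-component state (height, velocity), the constants `GRAVITY`, `THRUST`, `DT` and `K2`, the simple Euler `step` model, and the toy value function. None of these are taken from the applet or the paper.

```python
# Hypothetical constants for a toy 1D lander; not the demo's actual values.
GRAVITY = 1.0   # downward acceleration
THRUST = 2.0    # upward acceleration while the engine is on
DT = 0.1        # length of one time step
K2 = 1.0        # fuel penalty per unit of fuel burnt


def step(state, action):
    """Advance the toy model by one time step.

    state is (height, velocity); action is 1 (thrust) or 0 (no thrust).
    Returns the next state and the immediate reward (minus the fuel cost).
    """
    height, velocity = state
    acceleration = action * THRUST - GRAVITY
    next_state = (height + velocity * DT, velocity + acceleration * DT)
    reward = -K2 * action * DT
    return next_state, reward


def greedy_action(state, value_fn):
    """Try both actions and pick the one with the better reward plus value."""
    def score(action):
        next_state, reward = step(state, action)
        return reward + value_fn(next_state)
    return max((0, 1), key=score)


def toy_value(state):
    """An assumed value function that simply penalises speed."""
    _, velocity = state
    return -velocity ** 2


print(greedy_action((10.0, -3.0), toy_value))  # falling fast
print(greedy_action((10.0, 0.0), toy_value))   # momentarily at rest
```

Under this toy value function the greedy policy thrusts when the craft is falling fast and saves fuel when it is at rest. The quality of the resulting flight depends entirely on how good the value function is, which is why learning it well matters.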

Various value-function learning algorithms are included in this demo: two value-gradient learning algorithms (“VGL” and “VGL-Omega”) and one value-learning algorithm (“VL”). You can choose between them in the Control Panel of the first tab. Note that both types of algorithm attempt to learn the same structured value-function, but go about it by different means. The VGL algorithms attempt to match the value-gradients, G, (shown as cyan lines in the State-Space View) to the target value-gradients, G’, (the magenta lines). If this is achieved then the trajectory will be optimal. See diagram. This is proven in the paper, which also proves that VGL-Omega converges every time (when lambda=1). The alternative approach, VL, tries to match the “values”, V, to the target values, V’, all along the trajectory. Basic definitions of the target values and target value-gradients are given here.
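The difference between the two objectives can be made concrete with a small sketch. It assumes that “matching” is measured by a plain sum of squared differences along one trajectory; the exact error measures, and the definitions of the targets V’ and G’, are those given in the paper and are not reproduced here. The function names and all the numbers are hypothetical.

```python
def vl_error(values, target_values):
    """Sum of squared differences between V and V' along one trajectory."""
    return sum((vt - v) ** 2 for v, vt in zip(values, target_values))


def vgl_error(gradients, target_gradients):
    """Sum of squared distances between G and G' along one trajectory.

    Each gradient is a vector with one component per state variable.
    """
    return sum(
        sum((gt_i - g_i) ** 2 for g_i, gt_i in zip(g, gt))
        for g, gt in zip(gradients, target_gradients)
    )


# Made-up numbers for a three-step trajectory with state (height, velocity).
V = [-5.0, -4.0, -3.0]
V_target = [-5.0, -4.0, -3.0]
G = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
G_target = [(0.0, 2.0), (0.0, 1.0), (0.0, 0.0)]

print(vl_error(V, V_target))
print(vgl_error(G, G_target))
```

In this made-up case the values match their targets exactly while the gradients do not, so a value-learning objective would see nothing left to fix even though a value-gradient objective would.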

For further information on using the demo, and the two other tabs not covered by this brief introduction, see a detailed description of each tab.

Instructions for Operating Demo, and What you Should See

Please experiment with the three algorithms, both with and without exploration. Exploration can be added by using the “Advanced Options” tab. Run the demo, and select the “Run Algorithm” button in the demo's Control Panel to start the learning algorithm running. Click “Randomize Neural Network Weights” if learning gets stuck and again each time you switch to a new algorithm.

The algorithms try to change a non-optimal trajectory into an optimal one, and you should be able to watch this happening. The two value-gradient based learning algorithms generally do this quickly and without the need for exploration, but the VL algorithm is difficult to get to work at all. This major problem with VL is described further below. The VL algorithm is equivalent to the well-known algorithms TD(lambda) and SARSA, as proven in the paper, so finding defects in it is significant. The difference between the algorithms is dramatic, and hopefully this is apparent in the demo.

While the algorithm is running, the full processing power of your computer will be in use, so don’t forget to deselect the “Run Algorithm” box when you've finished, or your computer might overheat and explode.

Have fun, and please email me with any questions/bugs. I want to make this explanation as clear as possible.

To start the demo click the start button that should appear here:

Detailed description of each tab used in demo

Conclusions

Are the following conclusions apparent?

The conclusions of this demo are, possibly controversially, that

Why does VL fail, and why use VGL?

Run the demo under VL with one trajectory and no exploration. Learning usually settles down on a suboptimal trajectory (i.e. in the State-Space View, the dark-blue line does not match the green line at all). See here for an example. Considering this failed situation enables us to understand the motivation and idea behind value-gradients.

In this failed VL case, after learning has settled down, move your mouse pointer over the trajectory (the blue line) and check in the “details at mouse pointer” section that V and V' are nearly equal to each other at every point. It is the objective of VL for these to be equal, and this has been achieved. But why should this objective make a trajectory optimal? At each instant the greedy policy considers whether to move to a neighbouring trajectory. But VL has learnt nothing of these neighbouring trajectories; hence the greedy policy does nothing, and so VL regularly fails.
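The argument above can be illustrated with a toy example that is unrelated to the lander itself. It assumes an abstract two-dimensional state (x, y), a trajectory running along y = 0, and two made-up value functions. Both agree with the target values at every point on the trajectory, yet they disagree about whether a neighbouring trajectory is better.

```python
# A trajectory along y = 0, with the target value V' at each point (made up).
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
targets = [0.0, -1.0, -2.0]


def value_a(x, y):
    """One candidate value function: neighbouring states with y > 0 look worse."""
    return -x - 5.0 * y


def value_b(x, y):
    """Another candidate: neighbouring states with y > 0 look better."""
    return -x + 5.0 * y


def fits_targets(value_fn):
    """True if V equals V' at every point on the trajectory."""
    return all(value_fn(x, y) == t for (x, y), t in zip(trajectory, targets))


def prefers_neighbour(value_fn, x, y, offset=0.1):
    """Would a greedy choice move off the trajectory towards larger y?"""
    return value_fn(x, y + offset) > value_fn(x, y)


print(fits_targets(value_a), fits_targets(value_b))
print(prefers_neighbour(value_a, 1.0, 0.0), prefers_neighbour(value_b, 1.0, 0.0))
```

Both functions satisfy the VL objective perfectly on this trajectory, so VL has no reason to prefer one over the other. The greedy decision depends only on the slope away from the trajectory, which is the quantity VL never looked at.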

Since at each instant the greedy policy needs to decide whether the neighbouring trajectory is better or not, a neat solution to this problem is to directly learn the gradients of the value-function (with respect to the state vector), i.e. to learn the “value-gradient”, instead. If this is achieved then the trajectory cannot be suboptimal (this fact is proven in the paper, and illustrated here).

The standard solution for VL is to use some form of exploration. However, these solutions are not very effective in this demo. One method, “Exploring Starts”, and its failings are discussed here. Another possible solution for VL is to use a stochastic policy, but as the demo shows, this is extremely unsuccessful on this problem.

Frequently Asked Questions

Download the Paper