First we define the target values, V' (pronounced “V dash”). When bootstrapping is omitted, V' at any given point in state-space is the actual total reward that would be accumulated by starting a trajectory at that point and following it to completion, with the spacecraft controlled by the current neural network.
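The definition above can be sketched as a simple rollout that adds up rewards until the trajectory ends. This is only an illustration: the one-dimensional state, the function names, and the toy policy, model and reward are all assumptions made for the example, not the demo's spacecraft physics or its neural network.

```python
from typing import Callable

State = float   # hypothetical 1-D state: height above the ground
Action = float  # hypothetical 1-D action: change in height per step


def target_value(
    state: State,
    policy: Callable[[State], Action],
    model: Callable[[State, Action], State],
    reward: Callable[[State, Action], float],
    is_terminal: Callable[[State], bool],
    max_steps: int = 1000,
) -> float:
    """Total reward from `state` until the trajectory ends (no bootstrapping)."""
    total = 0.0
    for _ in range(max_steps):
        if is_terminal(state):
            break
        action = policy(state)          # action chosen by the current controller
        total += reward(state, action)  # reward actually encountered on this step
        state = model(state, action)    # model function moves the state forward
    return total


# Toy stand-ins (assumptions, not the demo's physics or network).
def toy_policy(height: State) -> Action:
    return -1.0  # always descend one unit


def toy_model(height: State, action: Action) -> State:
    return height + action


def toy_reward(height: State, action: Action) -> float:
    return -1.0  # each step costs one unit


def toy_is_terminal(height: State) -> bool:
    return height <= 0.0


v_dash = target_value(5.0, toy_policy, toy_model, toy_reward, toy_is_terminal)
print(v_dash)  # -5.0
```

Calling `target_value` from a different starting point gives a different V', which is what lets V' be treated as a function over the whole of state-space.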
Since the above definition for V' allows us to calculate a different target value for every point in state-space, we could in theory plot another grey-scale image over state-space (like the demo does for V, the value-function). Although not actually plotted as a grey-scale in the demo, you can use the “Details at mouse pointer” panel of the demo to view V' at any point in state-space.
It is the objective of value-learning algorithms (VL) to make V match V' everywhere in state-space. Unfortunately, as soon as you change V a bit, V' will also change a bit (since V' depends on the trajectory chosen by the neural network, which itself depends upon the value-function V and the greedy policy). Hence trying to make V match V' is like trying to follow a moving target: it cannot be accomplished in one step, and is best attempted gradually.
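The moving-target idea can be illustrated with a small sketch in which the targets are recomputed from the current values before every small step. The tabular list of values, the function names and the toy target rule are all assumptions for illustration; the demo itself uses a neural network, and its targets come from trajectories rather than from a formula like this one.

```python
from typing import Callable, List, Sequence


def chase_moving_target(
    values: Sequence[float],
    recompute_targets: Callable[[Sequence[float]], List[float]],
    step_size: float,
    iterations: int,
) -> List[float]:
    """Move each value a fraction of the way towards its target, repeatedly."""
    current = list(values)
    for _ in range(iterations):
        targets = recompute_targets(current)  # V' shifts whenever V changes
        current = [v + step_size * (t - v) for v, t in zip(current, targets)]
    return current


def toy_targets(values: Sequence[float]) -> List[float]:
    # Hypothetical rule: each target depends partly on the current value.
    return [0.5 * v + 1.0 for v in values]


one_step = chase_moving_target([0.0, 4.0], toy_targets, step_size=0.5, iterations=1)
many_steps = chase_moving_target([0.0, 4.0], toy_targets, step_size=0.5, iterations=50)
print(one_step)  # [0.5, 3.5]
```

In this toy the dependence of the targets on the values is mild, so the repeated small steps settle near the point where value and target agree (2.0 here). Nothing in the sketch shows that the same happens in general.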
When bootstrapping is present (lambda<1), things are a bit more complicated: V' is calculated as above but is then blended mathematically with V, the value function. The exact blending formula is given in the paper as Equation 4. When you include bootstrapping, the dependency of V' on V becomes even greater, so the moving target becomes even harder to chase, which is why the learning algorithms tend to become unstable in this case.
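As a rough illustration of what such a blend can look like, here is the standard lambda-return recursion from the TD(lambda) literature, computed backwards along one trajectory. It is an assumption that this matches the form of Equation 4 in the paper; the function name, the `gamma` discount parameter and the convention that the target at the final state is zero are likewise choices made for this sketch.

```python
from typing import List, Sequence


def lambda_targets(
    rewards: Sequence[float],
    values: Sequence[float],
    lam: float,
    gamma: float = 1.0,
) -> List[float]:
    """Blend observed rewards with V along one trajectory (standard lambda-return).

    rewards[t] is the reward on the step from state t to state t+1.
    values[t] is V at state t, including the final state, so
    len(values) == len(rewards) + 1.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("need one value per visited state, including the final one")
    steps = len(rewards)
    targets = [0.0] * (steps + 1)  # assumption: target at the final state is 0
    for t in reversed(range(steps)):
        # lam = 1 uses only actual rewards; lam = 0 leans entirely on V.
        blended = lam * targets[t + 1] + (1.0 - lam) * values[t + 1]
        targets[t] = rewards[t] + gamma * blended
    return targets


rewards = [-1.0, -1.0, -1.0]
values = [0.5, 0.25, 0.125, 0.0]
print(lambda_targets(rewards, values, lam=1.0))  # [-3.0, -2.0, -1.0, 0.0]
print(lambda_targets(rewards, values, lam=0.0))  # [-0.75, -0.875, -1.0, 0.0]
```

With `lam=1.0` the targets are just the summed rewards, as in the case without bootstrapping; with smaller `lam` the entries of `values` leak into the targets, which is the extra dependency on V described above.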
This formulation of trying to make V gradually match V' along a trajectory is proven in the paper to be equivalent to the well-known algorithm TD(lambda).
Whether bootstrapping is present or not, if V=V' is satisfied everywhere in state-space then the Bellman Equation will be satisfied (as proven in the paper), which is a sufficient condition for an optimal trajectory. However, most VL algorithms only ensure that V=V' is satisfied along a trajectory, which is not a sufficient condition (as the demo shows).
The target value-gradients (G’) are displayed as magenta lines in the State-Space View, and are also shown in the “Details at mouse pointer” panel.
But what exactly are these target value-gradients? They are the gradient of V' (the target values) with respect to the state vector. Hence, just as the value-gradients, G, point in the directions of steepest increase in whiteness on the grey-scale image of the value-function V (see diagram), the target value-gradients, G', point in the directions of steepest increase in whiteness on the grey-scale image of the target values, V' (even though this grey-scale is not actually shown in the demo).
The above definition of G' is precise, but it is more efficient to calculate target value-gradients using Equation 6 of the paper.
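The definition of G' as a gradient can be checked numerically with central finite differences, as in the sketch below. This is the brute-force route, not Equation 6 of the paper, and the smooth two-dimensional `toy_target_value` function is a made-up stand-in for V' (in the demo, V' would come from running a trajectory).

```python
from typing import Callable, List, Sequence


def numerical_gradient(
    f: Callable[[Sequence[float]], float],
    state: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Central-difference estimate of the gradient of f with respect to the state."""
    grad = []
    for i in range(len(state)):
        up = list(state)
        down = list(state)
        up[i] += eps    # nudge one state component up
        down[i] -= eps  # and the same component down
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def toy_target_value(state: Sequence[float]) -> float:
    # Hypothetical smooth V' over (position, velocity); not the demo's V'.
    x, v = state
    return -(x * x + 2.0 * v * v)


g_dash = numerical_gradient(toy_target_value, [1.0, 0.5])
print(g_dash)  # approximately [-2.0, -2.0]
```

Each gradient component costs two extra evaluations of V', and in the demo's setting each evaluation would mean a full trajectory, which gives a sense of why a more direct calculation is preferable.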
Calculation of V’ and G’ therefore requires knowledge of the laws of physics that control the spacecraft (known as the model functions), and the reward function.
Some people define Reinforcement Learning as covering only problems where the model functions are not known; others do not impose this restriction. There are many interesting problems where the model functions are known, for example all the problems on this Neuropilot Project website. In these examples the model functions are just simple laws of physics.
Unlike matching V to V', which needs doing over the entire state-space, matching G to G' only needs doing along a trajectory to ensure that the trajectory is optimal (as illustrated here).