Copyright Mike Fairbank 2008
This is an alternative version of the Neuropilot 1 demo. It was trained using the algorithm VGL(Omega) with continuous actions (as described in the paper), which gave much more stable learning and a better-performing final result. It still uses a value function.
Discussion
Whereas value-function Reinforcement Learning algorithms generally produce extremely non-monotonic learning progress over time, the algorithm VGL(Omega) produces monotonic learning if it is combined with an appropriate smooth policy (such as the one proposed by Doya; see the paper for the reference and details). In my opinion, this combination of algorithm and policy solves many of the major problems of value-function learning, such as reliability and optimality. Surprisingly, though, in this case learning turns out to be equivalent to a major non-value-function method: specifically, it becomes equivalent to the “policy-gradient” learning method, which is the rival paradigm for implementing reinforcement learning. This is all proven in the paper.
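To give a rough feel for what a "smooth policy" means here, the sketch below squashes a greedy action through tanh so that the action changes continuously as the value gradient changes. It is only an illustration in the spirit of a sigmoidal policy, not the exact policy from the paper: the function name, the parameters a_max and c, and the assumption that the model is linear in the action (so that df/da is a known matrix and the value function is to be maximised) are all my own choices for this example.

```python
import math

def smooth_greedy_action(value_gradient, action_jacobian, a_max=1.0, c=1.0):
    """Hypothetical smooth policy sketch (not the paper's exact formula).

    value_gradient  : list of n floats, dV/dx at the predicted next state.
    action_jacobian : n rows by m columns, df/da (how each action moves the state).
    a_max           : assumed bound on each action component.
    c               : assumed smoothness constant; smaller c means sharper switching.
    Returns a list of m actions, each strictly inside (-a_max, a_max).
    """
    n_states = len(value_gradient)
    n_actions = len(action_jacobian[0])
    actions = []
    for j in range(n_actions):
        # j-th component of (df/da)^T (dV/dx): how strongly this action raises V.
        drive = sum(action_jacobian[i][j] * value_gradient[i] for i in range(n_states))
        # tanh keeps the action bounded and differentiable in the value gradient.
        actions.append(a_max * math.tanh(drive / c))
    return actions

# Example: two state variables, one action.
action = smooth_greedy_action([2.0, -1.0], [[1.0], [0.5]], a_max=1.0, c=1.0)
```

With a large c the action responds gently to the value gradient; as c shrinks towards zero the same function approaches a bang-bang (all-or-nothing) greedy policy, which is the non-smooth case.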
Hence it is my (controversial) opinion that if you repair all the difficulties of value-function learning algorithms (such as incorporating local exploration, fixing non-monotonic progress, ensuring convergence with a general function approximator, and aiming for optimality), then the end result is the rival paradigm of policy-gradient learning. I therefore conclude, again controversially, that the value function is much less important than the RL community initially thought.
This is one reason the later demos do not use a value-function. The other is that policy-gradient methods allow the use of recurrent feedback units; this allows the spacecraft to have a scanner and short-term memory.
Contact Me
If you have any feedback or questions about this demo, please e-mail me (see home page) or discuss it on the forum.