Why does Exploring Starts not work very well for VL?
The standard Reinforcement Learning technique of “exploring starts” amounts to using a very slow learning rate on a trajectory whose start point is chosen at random each time. This is equivalent to doing batch-update VL on a large number of trajectories with different start points simultaneously. The hope is that the value function will be learnt throughout the whole of state space, instead of along just one trajectory.
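The equivalence claimed above can be illustrated with a minimal sketch. This is a toy supervised regression, not VL itself: the start points in `STARTS`, the `target` function, the one-parameter value function `w * s` and all step sizes are assumptions made up for the illustration. It only shows that many small updates from randomly chosen starts move the parameter to roughly the same place as fewer, larger updates averaged over all starts, given the same total step budget.

```python
import random

STARTS = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical start points


def target(s):
    # Hypothetical target value for a trajectory starting at s.
    return 2.0 * s


def grad(w, s):
    # Gradient of 0.5 * (w*s - target(s))**2 with respect to w.
    return (w * s - target(s)) * s


def exploring_starts(steps=500, lr=0.01, seed=0):
    # One small update per step, each from a randomly chosen start point.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, rng.choice(STARTS))
    return w


def batch_update(steps=100, lr=0.05):
    # Each update averages the gradient over all start points at once.
    w = 0.0
    for _ in range(steps):
        w -= lr * sum(grad(w, s) for s in STARTS) / len(STARTS)
    return w


# Same total budget: 500 * 0.01 == 100 * 0.05.
w_online = exploring_starts()
w_batch = batch_update()
print(w_online, w_batch)
```

The two results should agree more closely as the online learning rate shrinks, which is why the random-start version needs such a slow learning rate.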
The following diagrams show how VL progresses when learning just two trajectories simultaneously.
[Figure, left: VL on two trajectories after a relatively short period of learning.]
[Figure, right: VL on two trajectories after a longer period of learning.]
Since these two trajectories receive different total rewards, there is a different value of V' at the bottom of each of them. These different target values set up an “implicit” target value-gradient pointing to the right. This takes a while to learn (in the left diagram learning has already been running for a while, yet the actual value-gradient still points the wrong way), but after enough time the value-gradient does eventually point right, and so the trajectory eventually bends inwards, as shown in the right figure.
But then the two trajectories often merge (as shown in the right figure), and any implicit target value-gradient disappears, which prevents the trajectories from becoming fully optimal. With no spatial separation left between the trajectories, VL cannot learn the value-gradient, so it never seems to produce truly optimal trajectories. This problem seems to occur even with a larger number of trajectories.
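The argument above can be sketched numerically. The following is a minimal illustration, not the method from the paper: it treats the implicit target value-gradient as the least-squares slope of target value against position across trajectories. The function name `implicit_value_gradient`, the one-dimensional positions and the target values are all hypothetical.

```python
def implicit_value_gradient(samples, tol=1e-12):
    """Least-squares slope of target value against position.

    samples: list of (x, target_value) pairs, one per trajectory.
    Returns None when all positions coincide, because the slope is
    then not determined by the data.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x < tol:
        return None
    cov = sum((x - mean_x) * (v - mean_v) for x, v in samples)
    return cov / var_x


# Hypothetical numbers: two trajectories at different positions with
# different total rewards; the right-hand one does better.
separated = [(0.0, -1.0), (0.5, -0.25)]

# After merging, both trajectories pass through the same position and
# therefore receive the same target value.
merged = [(0.5, -0.25), (0.5, -0.25)]

print(implicit_value_gradient(separated))  # positive slope: points right
print(implicit_value_gradient(merged))     # None: no gradient information
```

While the trajectories are separated, the differing targets pin down a slope. Once they coincide, the targets carry only a single value, and nothing in the data says which way the value function should tilt.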
Can you get VL, given *any* number of trajectories, to produce truly optimal trajectories? In the paper I experimented with 100 trajectories; that was not sufficient to reliably produce optimal trajectories, and it was also very slow.
I'm not an expert on “exploring starts”, since I moved away from VL techniques a long time ago. It might be possible to get it to work given a near-infinite number of trajectories (or equivalently a random trajectory with a near-infinitesimal learning rate), but it just doesn't seem reliable or efficient to me.