Can VL really fail? Does it happen often?
The following diagram shows a typical result of running the VL algorithm on a single trajectory without any exploration.

This trajectory is not optimal (i.e. the blue curve does not match the green one). However, the objective of VL has been achieved (i.e. V' = V all along the trajectory). The equality of V and V' is evidenced by the following images.
This screen-shot was taken with the mouse pointer over the start point of the trajectory, and it shows that V and V' are almost equal at this point. If the mouse pointer were moved along the trajectory, it would show that they are nearly equal everywhere. The next screen-shot shows evidence for this.
This screen-shot shows the VL error is extremely small, meaning V and V' are almost equal at every point along the trajectory.

This means the VL objective has been totally met, yet meeting it has produced no useful learning at all. The same situation arises again and again whenever you try it. I've never seen it not happen!
Here's the first screen-shot shown again, this time displaying the value-gradients (the cyan lines) and the target value-gradients (the magenta lines).

Achieving the VL objective has done nothing to make these gradients match, and hence done nothing to make the trajectory optimal. If these gradients were to match then the trajectory would be optimal (see diagram/explanation).
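
The gap between matching values and matching gradients can be seen in a toy calculation. The sketch below is only an illustration and is not the simulator used for the screen-shots: the functions v_target and v_learned, the 2D state (x, y) and the straight-line trajectory are all hypothetical choices made for this example. The two functions agree exactly along the trajectory, so the VL error is zero, yet their gradients disagree at every point on it.

```python
# Toy illustration only: v_target, v_learned and the trajectory are
# hypothetical, chosen so the two functions agree wherever y == 0.

def v_target(x, y):
    """Hypothetical target value V'(x, y)."""
    return -(x * x + y * y)

def v_learned(x, y):
    """Hypothetical learned value V(x, y); equals v_target when y == 0."""
    return -(x * x) + 3.0 * y

def gradient(f, x, y, h=1e-5):
    """Central-difference estimate of (df/dx, df/dy)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

# A single trajectory with no exploration: it never leaves the line y = 0.
trajectory = [(0.25 * t, 0.0) for t in range(5)]

# VL error: sum of squared differences between V' and V along the trajectory.
vl_error = sum((v_target(x, y) - v_learned(x, y)) ** 2 for x, y in trajectory)

# Distance between the value-gradient and the target value-gradient
# at each point of the trajectory.
gradient_gaps = []
for x, y in trajectory:
    gx, gy = gradient(v_learned, x, y)
    tx, ty = gradient(v_target, x, y)
    gradient_gaps.append(((gx - tx) ** 2 + (gy - ty) ** 2) ** 0.5)

print("VL error:", vl_error)
print("gradient gaps:", gradient_gaps)
```

Here the VL error comes out as zero while every gradient gap stays at roughly 3. Values sampled along one unexplored path say nothing about how V changes in directions off that path.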