A Recurrent Neural Network Demo
Copyright Mike Fairbank 2007
In this demo, a pre-trained recurrent neural network controls a spacecraft. The objective is to use two range-finding scanners to find and fly to a landing pad.
To run the demo, click the “Next Flight” button or make sure the “Auto” box is ticked. (Press F11 in Internet Explorer to use the full screen.)
Technical details:
Having two scanners makes this task easier for the neural network. Each scanner only returns a single number indicating range. It’s nice that the scanners can move independently in a manner not usually seen in nature. It’s also nice that the scanners seem to change their behaviour when they scan past a corner.
The landscapes with the narrowest tunnels and most extreme angles have not been properly mastered yet.
The spacecraft is treated as a point particle, so a collision is not detected until the centre of the spacecraft hits the landscape.
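A point-particle collision test of this kind can be sketched as follows. This is only an illustration, not the demo's actual code: it assumes the landscape is stored as a polyline of (x, y) vertices and that a collision means the particle's path over one time step crosses a wall segment. The names segments_cross and has_collided are hypothetical.

```python
def segments_cross(p, q, a, b):
    """Return True if segment p-q properly crosses segment a-b.

    Touching or collinear cases are ignored to keep the sketch short.
    """
    def cross(o, u, v):
        # z-component of (u - o) x (v - o)
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1 = cross(a, b, p)
    d2 = cross(a, b, q)
    d3 = cross(p, q, a)
    d4 = cross(p, q, b)
    # p and q lie on opposite sides of a-b, and a and b on opposite sides of p-q
    return d1 * d2 < 0 and d3 * d4 < 0


def has_collided(prev_pos, new_pos, landscape):
    """Check whether the centre point moved through any wall segment.

    landscape is an assumed representation: a list of (x, y) vertices
    joined in order to form the walls.
    """
    for a, b in zip(landscape, landscape[1:]):
        if segments_cross(prev_pos, new_pos, a, b):
            return True
    return False


# Example: an L-shaped wall (floor, then a vertical wall on the right)
walls = [(0, 0), (10, 0), (10, 10)]
print(has_collided((5, 1), (5, -1), walls))   # path passes through the floor
print(has_collided((5, 3), (5, 1), walls))    # path stays above the floor
```

Because only the centre point is tested, the drawn spacecraft can visibly overlap a wall before a crash is registered.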
Approximately 200 trajectories and landscapes were considered during learning. Learning took about 2 hours.
The recurrent neural network did not use a value function, and was structured the same way as in demo 2. The neural network is an ordinary MLP with complex-valued weights and activation functions, plus some extra recurrent units. No hidden layer was used. Eight complex-valued feedback units were used.
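One forward step of a network with this general shape might look like the sketch below. It is a guess at the structure, not the demo's implementation: it assumes a single weight layer that maps the inputs, the previous feedback values and a bias to the outputs and the new feedback values, it assumes complex tanh as the activation function, and it assumes the real part of each output unit is used as the control signal. The function names and the weight initialisation are hypothetical.

```python
import cmath
import random

N_FEEDBACK = 8  # from the text: eight complex-valued feedback units


def make_weights(n_inputs, n_outputs, seed=0):
    """Small random complex weights (initialisation scheme is an assumption)."""
    rng = random.Random(seed)
    n_in = n_inputs + N_FEEDBACK + 1      # inputs + feedback + bias
    n_out = n_outputs + N_FEEDBACK        # outputs + next feedback
    return [[complex(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(n_in)]
            for _ in range(n_out)]


def step(weights, inputs, feedback):
    """One time step: no hidden layer, just one complex-weighted layer."""
    x = [complex(v) for v in inputs] + list(feedback) + [1 + 0j]
    # Assumed activation: complex tanh applied to each weighted sum
    activations = [cmath.tanh(sum(w * xi for w, xi in zip(row, x)))
                   for row in weights]
    n_outputs = len(weights) - N_FEEDBACK
    outputs = [a.real for a in activations[:n_outputs]]   # assumed read-out
    new_feedback = activations[n_outputs:]                # fed back next step
    return outputs, new_feedback


# Example: two scanner ranges in, two control signals out
weights = make_weights(n_inputs=2, n_outputs=2)
feedback = [0j] * N_FEEDBACK
for ranges in [(0.5, 0.8), (0.4, 0.9)]:
    controls, feedback = step(weights, ranges, feedback)
    print(controls)
```

The feedback units are the network's only memory: their values from one time step are appended to the inputs of the next.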
It was hard to stop the spacecraft colliding with corners that jut out. The network only learns to avoid these corners when it hits them, so a certain proportion of all trajectories during learning must hit the corners; otherwise learning forgets the corners are there. I got around this to some extent by introducing a small stochastic element into the physics model during learning, so that the trajectories had to give the corners greater clearance to avoid them. I then took the stochastic element away for the final playback version, so that most trajectories successfully avoid the corners.
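The idea of a stochastic element that is switched on for learning and off for playback can be sketched like this. The physics here is a stand-in, not the demo's model: the Euler integration, the gravity value, the time step and the choice to add Gaussian noise to the acceleration are all assumptions, and physics_step is a hypothetical name.

```python
import random


def physics_step(pos, vel, thrust, dt=0.1, gravity=-1.0,
                 noise_std=0.0, rng=random):
    """Advance the point-particle spacecraft by one time step.

    noise_std > 0 adds a random disturbance (used during learning);
    noise_std = 0 gives deterministic physics (used for playback).
    """
    ax = thrust[0]
    ay = thrust[1] + gravity
    if noise_std > 0.0:
        # Assumed form of the stochastic element: Gaussian acceleration noise
        ax += rng.gauss(0.0, noise_std)
        ay += rng.gauss(0.0, noise_std)
    vx = vel[0] + ax * dt
    vy = vel[1] + ay * dt
    return (pos[0] + vx * dt, pos[1] + vy * dt), (vx, vy)


# Learning: noisy physics, so a flight path that skims a corner sometimes hits it
train_rng = random.Random(42)
pos, vel = physics_step((0.0, 5.0), (1.0, 0.0), (0.0, 1.0),
                        noise_std=0.05, rng=train_rng)

# Playback: same function with the noise switched off
pos, vel = physics_step((0.0, 5.0), (1.0, 0.0), (0.0, 1.0))
print(pos, vel)
```

With noise present, a policy that only just clears a corner will sometimes crash, so the learned trajectories tend to keep a safety margin that remains once the noise is removed.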
The reward function used was
-k_1 v^2 - k_2 d^2 - k_3 f + k_4 cos(θ)
where
v=landing speed,
d=distance (as an ant would walk) from the landing pad to the crash point,
f=fuel used,
θ=angle between spacecraft and a vector perpendicular to the crash surface,
k_1, k_2, k_3, k_4 are positive real constants.
The fuel reward term was given during flight, but all other reward terms were given only at the end of the flight. Giving the other rewards during the flight would have made the task easier. The cos(θ) term gives hints about how to avoid crashing; ideally I would not have needed it.
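The split between in-flight and end-of-flight reward can be written out as a short sketch. The formula follows the definition above, but the function names are hypothetical and the constants used in the example are placeholders, since the actual values of k_1 to k_4 are not stated.

```python
import math


def step_reward(fuel_used_this_step, k3):
    """Reward given during flight: only the fuel penalty."""
    return -k3 * fuel_used_this_step


def terminal_reward(v, d, theta, k1, k2, k4):
    """Reward given once, at the end of the flight.

    v: landing speed, d: walking distance from pad to crash point,
    theta: angle between spacecraft and the surface normal (radians).
    """
    return -k1 * v ** 2 - k2 * d ** 2 + k4 * math.cos(theta)


def total_reward(fuel_per_step, v, d, theta, k=(1.0, 1.0, 1.0, 1.0)):
    """Sum of all in-flight fuel penalties plus the terminal reward.

    The default constants are placeholders, not the demo's values.
    """
    k1, k2, k3, k4 = k
    in_flight = sum(step_reward(f, k3) for f in fuel_per_step)
    return in_flight + terminal_reward(v, d, theta, k1, k2, k4)


# Example: two steps of fuel use, landing upright (theta = 0) at speed 2,
# a walking distance of 3 from the pad
print(total_reward([0.5, 0.5], v=2.0, d=3.0, theta=0.0))
```

Summing the per-step fuel penalties gives the -k_3 f term, so the total matches the single formula; only the timing of when each part is delivered differs.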
More complicated neural network features (such as LSTM gates) might have made this task easier to learn, but none were used here. I may try them in a later version, to see whether it can learn to fly down more complicated tunnel layouts.