Behavior Cloning for Lunar Lander
Abstract
This project applies behavior cloning to the LunarLander-v2 environment by leveraging a Deep Q-Network (DQN) expert. The expert policy generates high-quality demonstrations that are distilled into a supervised multi-layer perceptron (MLP). The resulting student controller imitates expert actions directly from raw state observations, producing a lightweight policy suitable for rapid inference without additional reinforcement learning during deployment.
Behavior Cloning Pipeline Overview
The behavior cloning workflow converts reinforcement learning expertise into a supervised policy through the following components:
Expert Demonstration Capture
- Rolled out a converged DQN agent for 200+ evaluation episodes.
- Logged full state trajectories alongside expert thrust commands.
- Filtered out unstable landings to maintain a gold-standard dataset.
- Balanced exploration of hover, descent, and landing regimes.
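The capture loop above can be sketched as follows, assuming a Gymnasium-style `reset`/`step` interface. The `expert_policy` callable and the `min_return` filtering threshold are illustrative stand-ins for the project's DQN agent and its landing-quality filter:

```python
def collect_demonstrations(env, expert_policy, n_episodes=200, min_return=200.0):
    """Roll out an expert policy and keep state-action pairs from
    high-reward episodes only (filtering out unstable landings)."""
    dataset = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode, total_reward, done = [], 0.0, False
        while not done:
            action = expert_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            episode.append((obs, action))
            total_reward += reward
            obs = next_obs
            done = terminated or truncated
        if total_reward >= min_return:  # keep only gold-standard episodes
            dataset.extend(episode)
    return dataset
```

Because the collector only assumes the standard five-tuple `step` return, it works with any Gymnasium environment, including LunarLander-v2.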
Student Policy Training
- Two-hidden-layer MLP with ReLU activations and dropout regularization.
- Inputs are normalized Lunar Lander state vectors; outputs are discrete thrust selections.
- Optimized with cross-entropy loss and Adam optimizer for stable convergence.
- Validation monitoring prevents overfitting and preserves expert fidelity.
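A minimal sketch of the student network and a single supervised update, assuming PyTorch. The layer widths, dropout rate, and learning rate below are illustrative choices, not the project's exact hyperparameters:

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Two-hidden-layer MLP mapping normalized lander states to
    discrete thrust logits. Widths and dropout rate are illustrative."""
    def __init__(self, state_dim=8, n_actions=4, hidden=128, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # raw logits over thrust actions

def train_step(model, optimizer, states, actions):
    """One mini-batch update with cross-entropy against expert actions."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(states), actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the thrust selection is simply the argmax over the output logits, which keeps deployment-time compute to a single forward pass.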
Evaluation Infrastructure
- Gymnasium test harness for reproducible rollouts and score tracking.
- Side-by-side comparisons against the DQN expert on identical seeds.
- Landing quality metrics include touchdown velocity, leg contact timing, and reward.
- Automated plotting pipeline summarizes imitation accuracy per trajectory phase.
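The seed-matched comparison can be sketched with a generic evaluation helper; the policy callables and the seed list are supplied by the caller, so the same routine scores both the DQN expert and the cloned student:

```python
def evaluate(env, policy, seeds):
    """Run one rollout per seed and return per-episode rewards,
    so student and expert can be compared on identical seeds."""
    returns = []
    for seed in seeds:
        obs, _ = env.reset(seed=seed)
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns
```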
Learning Progression Analysis
The student MLP matures from random thrusting to expert-level maneuvers as supervised training advances:
Early Training Phase
- High variance in hover control with delayed reaction to descent drift.
- Frequent hard landings due to mismatched retro-thrust timing.
- Loss reductions are driven primarily by coarse alignment of action selection with the expert.
- Policy entropy remains large, signaling exploratory uncertainty.
Converged Student Behavior
- Thrust sequencing mirrors expert cadence with precise touchdown damping.
- Stable hover corrections mitigate lateral drift before final descent.
- Episode rewards tighten around the expert baseline, with less than 5% relative variance.
- Policy entropy collapses near the pad, reflecting confident thrust decisions.
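The entropy diagnostic referenced above can be computed directly from the student's action logits. A NumPy sketch, assuming discrete logits over the four thrust actions:

```python
import numpy as np

def policy_entropy(logits):
    """Shannon entropy (in nats) of the softmax action distribution.
    Large values indicate uncertain thrust choices; near zero means
    the policy is confident."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```

For four actions, entropy ranges from 0 (fully confident) up to ln(4) ≈ 1.386 (uniform thrusting), which gives the early-vs-converged comparison a concrete scale.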
Algorithmic Workflow
Demonstration Harvesting
1. Roll out the DQN expert under diverse seeds to generate state-action traces.
2. Persist trajectories with reward annotations for later stratified sampling.
3. Normalize features to stabilize supervised optimization.
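Step 3's feature normalization might look like the following NumPy sketch; returning the statistics lets the identical transform be reapplied at deployment time:

```python
import numpy as np

def normalize_features(states, eps=1e-8):
    """Standardize each state dimension to zero mean / unit variance.
    The returned (mean, std) must be reused on live observations."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps  # eps guards constant dimensions
    return (states - mean) / std, mean, std
```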
Supervised Optimization
4. Train the MLP using mini-batch updates with adaptive learning rate scheduling.
5. Evaluate imitation loss on held-out demonstrations each epoch.
6. Apply early stopping when reward parity with the expert is achieved.
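Steps 4–6 can be organized around a generic early-stopping loop. In this sketch the stopping signal is a validation-loss plateau standing in for the reward-parity check, and the `patience` parameter is an illustrative hyperparameter:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    """Run epochs until validation loss stops improving for `patience`
    consecutive epochs. `train_epoch` and `validate` are callables
    supplied by the caller (e.g. wrapping mini-batch MLP updates)."""
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement within the patience window
    return best_loss, best_epoch
```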
Deployment Validation
7. Execute assessment rollouts to confirm reward stability and landing smoothness.
8. Export the trained policy to ONNX or TorchScript for downstream consumers.
9. Monitor live runs with telemetry dashboards to capture edge cases.
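Step 8's TorchScript export can be sketched with `torch.jit.trace`; the model, example input, and file path below are illustrative:

```python
import torch
import torch.nn as nn

def export_torchscript(model, example_state, path="student_policy.pt"):
    """Trace the trained policy into TorchScript for dependency-light
    inference. ONNX export would follow the same pattern via
    torch.onnx.export."""
    model.eval()  # disable dropout before tracing
    scripted = torch.jit.trace(model, example_state)
    scripted.save(path)
    return scripted
```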
Future Directions
- Iteratively refine demonstrations via DAgger to mitigate covariate shift.
- Extend imitation learning to continuous action lander variants with policy distillation.
- Bootstrap reinforcement learning agents using the cloned policy as initialization.
- Deploy on physical testbeds with onboard sensing and thrust actuation.
Conclusion
The behavior cloning project showcases how supervised learning can faithfully reproduce complex control strategies obtained via reinforcement learning. By capturing expert demonstrations and training a compact MLP, the Lunar Lander achieves expert-level touchdowns with a fraction of the computational cost. The resulting controller is a practical template for transferring expertise to resource-constrained platforms or initializing more advanced learning pipelines.