Behavior Cloning for Lunar Lander
Abstract
This project applies behavior cloning to the LunarLander-v2 environment by leveraging a Deep Q-Network (DQN) expert. The expert policy generates high-quality demonstrations that are distilled into a supervised multi-layer perceptron (MLP). The resulting student controller imitates expert actions directly from raw state observations, producing a lightweight policy suitable for rapid inference without additional reinforcement learning during deployment.
Behavior Cloning Pipeline Overview
The behavior cloning workflow converts reinforcement learning expertise into a supervised policy through the following components:
Expert Demonstration Capture
- Rolled out a converged DQN agent for 200+ evaluation episodes.
- Logged full state trajectories alongside expert thrust commands.
- Filtered out unstable landings to maintain a gold-standard dataset.
- Balanced exploration of hover, descent, and landing regimes.
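The capture loop above can be sketched as follows, assuming a Gymnasium-style `reset`/`step` interface. The `expert_policy` callable and the `min_return` filtering threshold are illustrative stand-ins for the project's DQN agent and its landing-quality filter:

```python
def collect_demonstrations(env, expert_policy, n_episodes=200, min_return=200.0):
    """Roll out an expert policy and keep state-action pairs from
    high-reward episodes only (filtering out unstable landings)."""
    dataset = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        episode, total_reward, done = [], 0.0, False
        while not done:
            action = expert_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            episode.append((obs, action))
            total_reward += reward
            obs = next_obs
            done = terminated or truncated
        if total_reward >= min_return:  # keep only gold-standard episodes
            dataset.extend(episode)
    return dataset
```

Because the collector only assumes the standard five-tuple `step` return, it works with any Gymnasium environment, including LunarLander-v2.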
Student Policy Training
- Two-hidden-layer MLP with ReLU activations and dropout regularization.
- Inputs are normalized Lunar Lander state vectors; outputs are discrete thrust selections.
- Optimized with cross-entropy loss and Adam optimizer for stable convergence.
- Validation monitoring prevents overfitting and preserves expert fidelity.
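A minimal sketch of the student network and a single supervised update, assuming PyTorch. The layer widths, dropout rate, and learning rate below are illustrative choices, not the project's exact hyperparameters:

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Two-hidden-layer MLP mapping normalized lander states to
    discrete thrust logits. Widths and dropout rate are illustrative."""
    def __init__(self, state_dim=8, n_actions=4, hidden=128, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # raw logits over thrust actions

def train_step(model, optimizer, states, actions):
    """One mini-batch update with cross-entropy against expert actions."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(states), actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the thrust selection is simply the argmax over the output logits, which keeps deployment-time compute to a single forward pass.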
Evaluation Infrastructure
- Gymnasium test harness for reproducible rollouts and score tracking.
- Side-by-side comparisons against the DQN expert on identical seeds.
- Landing quality metrics include touchdown velocity, leg contact timing, and reward.
- Automated plotting pipeline summarizes imitation accuracy per trajectory phase.
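The seed-matched comparison can be sketched with a generic evaluation helper; the policy callables and the seed list are supplied by the caller, so the same routine scores both the DQN expert and the cloned student:

```python
def evaluate(env, policy, seeds):
    """Run one rollout per seed and return per-episode rewards,
    so student and expert can be compared on identical seeds."""
    returns = []
    for seed in seeds:
        obs, _ = env.reset(seed=seed)
        total, done = 0.0, False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns
```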
Learning Progression Analysis
The student MLP matures from random thrusting to expert-level maneuvers as supervised training advances:
Early Training Phase
- High variance in hover control with delayed reaction to descent drift.
- Frequent hard landings due to mismatched retro-thrust timing.
- Loss reductions are driven primarily by coarse alignment of action selection with the expert.
- Policy entropy remains large, signaling exploratory uncertainty.
Converged Student Behavior
- Thrust sequencing mirrors expert cadence with precise touchdown damping.
- Stable hover corrections mitigate lateral drift before final descent.
- Episode rewards tighten around the expert baseline, with less than 5% relative variance.
- Policy entropy collapses near the pad, reflecting confident thrust decisions.
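The entropy diagnostic referenced above can be computed directly from the student's action logits. A NumPy sketch, assuming discrete logits over the four thrust actions:

```python
import numpy as np

def policy_entropy(logits):
    """Shannon entropy (in nats) of the softmax action distribution.
    Large values indicate uncertain thrust choices; near zero means
    the policy is confident."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```

For four actions, entropy ranges from 0 (fully confident) up to ln(4) ≈ 1.386 (uniform thrusting), which gives the early-vs-converged comparison a concrete scale.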
Algorithmic Workflow
Demonstration Harvesting
1. Roll out the DQN expert under diverse seeds to generate state-action traces.
2. Persist trajectories with reward annotations for later stratified sampling.
3. Normalize features to stabilize supervised optimization.
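Step 3's feature normalization might look like the following NumPy sketch; returning the statistics lets the identical transform be reapplied at deployment time:

```python
import numpy as np

def normalize_features(states, eps=1e-8):
    """Standardize each state dimension to zero mean / unit variance.
    The returned (mean, std) must be reused on live observations."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps  # eps guards constant dimensions
    return (states - mean) / std, mean, std
```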
Supervised Optimization
4. Train the MLP using mini-batch updates with adaptive learning rate scheduling.
5. Evaluate imitation loss on held-out demonstrations each epoch.
6. Apply early stopping when reward parity with the expert is achieved.
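Steps 4–6 can be organized around a generic early-stopping loop. In this sketch the stopping signal is a validation-loss plateau standing in for the reward-parity check, and the `patience` parameter is an illustrative hyperparameter:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    """Run epochs until validation loss stops improving for `patience`
    consecutive epochs. `train_epoch` and `validate` are callables
    supplied by the caller (e.g. wrapping mini-batch MLP updates)."""
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement within the patience window
    return best_loss, best_epoch
```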
Deployment Validation
7. Execute assessment rollouts to confirm reward stability and landing smoothness.
8. Export the trained policy to ONNX or TorchScript for downstream consumers.
9. Monitor live runs with telemetry dashboards to capture edge cases.
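Step 8's TorchScript export can be sketched with `torch.jit.trace`; the model, example input, and file path below are illustrative:

```python
import torch
import torch.nn as nn

def export_torchscript(model, example_state, path="student_policy.pt"):
    """Trace the trained policy into TorchScript for dependency-light
    inference. ONNX export would follow the same pattern via
    torch.onnx.export."""
    model.eval()  # disable dropout before tracing
    scripted = torch.jit.trace(model, example_state)
    scripted.save(path)
    return scripted
```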
Future Directions
- Iteratively refine demonstrations via DAgger to mitigate covariate shift.
- Extend imitation learning to continuous action lander variants with policy distillation.
- Bootstrap reinforcement learning agents using the cloned policy as initialization.
- Deploy on physical testbeds with onboard sensing and thrust actuation.
Conclusion
The behavior cloning project showcases how supervised learning can faithfully reproduce complex control strategies obtained via reinforcement learning. By capturing expert demonstrations and training a compact MLP, the Lunar Lander achieves expert-level touchdowns with a fraction of the computational cost. The resulting controller is a practical template for transferring expertise to resource-constrained platforms or initializing more advanced learning pipelines.