Value Iteration on Frozen Lake

Abstract

This project demonstrates the Value Iteration algorithm applied to the classic Frozen Lake problem from Gymnasium. It provides a robust implementation for solving Markov Decision Processes (MDPs) in both deterministic and stochastic environments. The project includes scripts for headless computation and interactive Pygame-based rendering, allowing users to visualize how an optimal policy is derived and executed on custom map configurations.

Technical Implementation

Software Stack

  • Python 3.9+: Core programming language.
  • Gymnasium: Provides the Frozen Lake environment and toy-text rendering.
  • NumPy: Used for efficient matrix operations and value table management.
  • Pygame: Powers the graphical rendering for human-readable visualizations.

Core Components

  • value_iteration.py: Runs the algorithm on a custom map without GUI overhead.
  • value_iteration_pygame.py: Interactive script exposing the is_slippery flag for stochastic transitions.
  • CustomFrozenLake: Class containing hyperparameters like gamma (discount factor) and convergence thresholds.

Methodology & Configuration

Value Iteration Algorithm

  • Iteratively updates the value of each state based on the Bellman optimality equation.
  • Converges when the maximum change in value across all states falls below a strict tolerance (e.g., 1e-8).
  • Derives a deterministic greedy policy from the final optimal value function.
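The loop described above can be sketched in a self-contained form. This is an illustrative version, not the project's actual code: the map value, function names, and the deterministic (non-slippery) transition model are all assumptions.

```python
import numpy as np

# Illustrative 4x4 map in the S/F/H/G tile notation; a sketch,
# not the project's actual MY_MAP constant.
MY_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def build_model(desc):
    """Build P[s][a] = [(prob, next_state, reward, done)] for a
    deterministic (non-slippery) grid. Actions: 0=L, 1=D, 2=R, 3=U."""
    rows, cols = len(desc), len(desc[0])
    moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
    P = {}
    for r in range(rows):
        for c in range(cols):
            s = r * cols + c
            P[s] = {}
            for a, (dr, dc) in enumerate(moves):
                if desc[r][c] in "GH":               # terminal: self-loop
                    P[s][a] = [(1.0, s, 0.0, True)]
                    continue
                nr = min(max(r + dr, 0), rows - 1)   # clamp at the borders
                nc = min(max(c + dc, 0), cols - 1)
                tile = desc[nr][nc]
                P[s][a] = [(1.0, nr * cols + nc,
                            float(tile == "G"), tile in "GH")]
    return P

def value_iteration(P, gamma=0.99, tol=1e-8):
    """Bellman optimality backups until the largest change falls below tol."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(p * (rew + gamma * V[ns] * (not done))
                           for p, ns, rew, done in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # deterministic greedy policy from the optimal value function
    policy = {s: max(P[s],
                     key=lambda a: sum(p * (rew + gamma * V[ns] * (not done))
                                       for p, ns, rew, done in P[s][a]))
              for s in P}
    return V, policy

V, pi = value_iteration(build_model(MY_MAP))
```

On this deterministic map the converged values are exact powers of gamma along the shortest safe path, which makes the convergence behavior easy to inspect by hand.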

Map Customization

  • Supports custom rectangular grids defined by the MY_MAP constant.
  • S: Start state (one per map).
  • F: Frozen tiles (safe traversal).
  • H: Holes (terminal states, zero reward).
  • G: Goal state (terminal state, reward +1).
Figure 1: Example 4x4 Frozen Lake map configuration from the Value Iteration demo. 'S' is Start, 'G' is Goal, 'H' are Holes.

The agent must navigate from S to G while avoiding H. In slippery mode, movement is stochastic: the agent may slide perpendicular to its intended direction.
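A custom map can be sanity-checked before it is handed to the solver. The sketch below is illustrative: the `validate_map` helper and the `MY_MAP` value shown are assumptions, not the project's actual code.

```python
# Illustrative custom map using the S/F/H/G tile codes described above.
MY_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def validate_map(desc):
    """Check that a map is rectangular, uses only S/F/H/G tiles,
    and contains exactly one start and at least one goal."""
    assert len(desc) > 0 and all(len(row) == len(desc[0]) for row in desc), \
        "map must be a non-empty rectangle"
    flat = "".join(desc)
    assert set(flat) <= set("SFHG"), "unknown tile code"
    assert flat.count("S") == 1, "exactly one start state required"
    assert flat.count("G") >= 1, "at least one goal required"
    return True
```

Catching a malformed map here gives a clear error message instead of a confusing failure deep inside the solver.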

Key Features

Dual Execution Modes

Choose between a fast, headless script for quick convergence checks or a Pygame-based visualizer to watch the agent's behavior in real-time.

Dynamic Transitions

Toggle the is_slippery flag to switch between deterministic movement and stochastic dynamics, testing the robustness of the learned policy.
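In Gymnasium's slippery Frozen Lake, the intended move and its two perpendicular neighbours each occur with probability 1/3. A minimal sketch of that rule (the `slip_outcomes` helper is illustrative, not part of the project):

```python
# Sketch of the slippery transition rule used by Frozen Lake:
# the intended action and its two perpendicular neighbours each
# occur with probability 1/3 (actions: 0=L, 1=D, 2=R, 3=U).
def slip_outcomes(action, is_slippery):
    if not is_slippery:
        return [(1.0, action)]          # deterministic movement
    # (action - 1) % 4 and (action + 1) % 4 are the perpendicular moves
    return [(1.0 / 3.0, a % 4) for a in (action - 1, action, action + 1)]
```

Because the perpendicular slips are equally likely, a policy that hugs a row of holes can be far riskier than one that takes a longer detour, which is exactly what the visualizer makes easy to see.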

Tunable Hyperparameters

Direct access to tuning parameters like the discount factor (gamma) and convergence tolerance allows for experimentation with planning horizons and accuracy.
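As a toy illustration of the discount factor's effect, consider value iteration on a small deterministic chain. This is a standalone sketch (the chain MDP and `solve_chain` are assumptions for illustration, not the Frozen Lake code itself):

```python
import numpy as np

# Toy illustration: value iteration on a 6-state deterministic chain
# (state i steps to i+1; +1 reward on entering the terminal end state).
# Shows how gamma shrinks the start-state value with planning horizon.
def solve_chain(gamma, tol=1e-8, n=6):
    V = np.zeros(n)
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(n - 1):            # state n-1 is terminal
            reward = 1.0 if s == n - 2 else 0.0
            done = s == n - 2
            best = reward + gamma * V[s + 1] * (not done)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        if delta < tol:
            return sweeps, V[0]

for gamma in (0.9, 0.99):
    sweeps, v0 = solve_chain(gamma)
    print(f"gamma={gamma}: V(start)={v0:.4f} after {sweeps} sweeps")
```

A smaller gamma discounts the distant goal more heavily, lowering the start-state value; on cyclic or stochastic problems the tolerance additionally controls how many sweeps are needed, since the per-sweep change decays roughly geometrically with gamma.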

Results and Impact

The implementation successfully demonstrates the power of Dynamic Programming in solving finite MDPs. The agent reliably learns to navigate complex, hole-ridden maps, adapting its path based on the environmental dynamics.

Convergence

Achieves optimal policy convergence within seconds for standard grid sizes.

Educational Value

Provides a clear, modifiable sandbox for understanding Reinforcement Learning fundamentals.

Visual Verification

Visual tools allow immediate verification of policy safety and efficiency.

Conclusion

This project serves as a foundational implementation of Value Iteration, bridging the gap between theoretical MDP concepts and practical code. By offering both computational and visual interfaces, it enables a deeper understanding of how agents plan and adapt in uncertain environments. The modular design allows for easy extension to more complex grid worlds and algorithms.
