Part 7 — Self-Play: The Game Simulator
REINFORCE works. A single snake can learn from its own experience — eat food, get a reward, make that move more likely next time. But there’s a problem: the snake is playing against an environment that doesn’t fight back.
Real BattleSnake has opponents. Opponents that block your path, steal your food, and chase you into corners. Training against an empty board teaches you to eat. It doesn’t teach you to compete.
Self-play fixes this. The snake trains against itself — or more precisely, against past versions of itself. Every time it gets stronger, its opponent gets stronger too. The ceiling keeps rising.
Why self-play works
Here’s the intuition. Imagine you’re learning chess. Playing against a wall (someone who never moves) teaches you the rules. Playing against a weak opponent teaches you basic tactics. Playing against someone slightly better than you teaches you to see your own mistakes.
Self-play is that last one, automated. The key insight: the snake always trains against an opponent close to its own strength. When it discovers a new strategy, the opponent (a frozen, older copy of the same network) doesn’t know that strategy yet. The new strategy wins. The network updates. Once that stronger network is saved as a checkpoint, later training runs face an opponent that does know the strategy, so the snake has to discover a counter. And so on.
Self-play was central to how AlphaGo and AlphaStar trained, and the idea is the same here: you are your own curriculum. The difficulty adjusts automatically.
The training loop
The self-play loop has four steps, repeated many times:
- Sample an opponent. Pick a past checkpoint — the current network, or a saved snapshot from an earlier training run.
- Play episodes. Run games where the training snake faces the opponent snake. Record the experience: states, actions, rewards.
- Update the training snake. Run REINFORCE on the collected experience. The training snake’s weights change; the opponent’s weights stay frozen.
- Evaluate and checkpoint. Periodically pit the current snake against the previous best. If its win rate exceeds a threshold (55% here), save a checkpoint. That checkpoint becomes a new candidate opponent.
The loop looks like this:
┌───────────────────────────────────────────┐
│ 1. Sample opponent from checkpoint pool │
│ │ │
│ ▼ │
│ 2. Play N episodes (training vs opponent)│
│ │ │
│ ▼ │
│ 3. REINFORCE update (training snake only)│
│ │ │
│ ▼ │
│ 4. Evaluate: current vs best checkpoint │
│ If win rate > 55%: save checkpoint │
│ Go to 1 │
└───────────────────────────────────────────┘
The 55% win rate threshold is deliberate. We don’t require a supermajority — we want to save checkpoints that are slightly better, because slightly-better opponents create the steady pressure that drives improvement. If we waited for 80% wins, the network would stagnate between checkpoints.
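Steps 1 and 4 of the loop can be sketched in a few lines. This is a minimal illustration, not the pool that Part 8 builds: `CheckpointPool`, `maybe_promote`, and `sample` are hypothetical names, checkpoints are represented only by their file paths, draws are counted as non-wins, and the caller supplies the uniform random sample `u` so the sketch needs no external crates.

```rust
/// Hypothetical pool of saved checkpoints (paths only, for illustration).
pub struct CheckpointPool {
    paths: Vec<String>,
    threshold: f32, // e.g. 0.55 for the 55% rule
}

impl CheckpointPool {
    pub fn new(threshold: f32) -> Self {
        Self { paths: Vec::new(), threshold }
    }

    /// Fraction of evaluation games won. Draws count as non-wins
    /// (an assumption of this sketch).
    pub fn win_rate(wins: u32, games: u32) -> f32 {
        if games == 0 { 0.0 } else { wins as f32 / games as f32 }
    }

    /// Step 4: keep the checkpoint only if the win rate is strictly
    /// above the threshold. Returns whether it was added to the pool.
    pub fn maybe_promote(&mut self, path: &str, wins: u32, games: u32) -> bool {
        let promoted = Self::win_rate(wins, games) > self.threshold;
        if promoted {
            self.paths.push(path.to_string());
        }
        promoted
    }

    /// Step 1: pick an opponent uniformly, given a sample `u` in [0, 1).
    pub fn sample(&self, u: f32) -> Option<&str> {
        if self.paths.is_empty() {
            return None;
        }
        let idx = ((u * self.paths.len() as f32) as usize).min(self.paths.len() - 1);
        Some(self.paths[idx].as_str())
    }

    pub fn len(&self) -> usize {
        self.paths.len()
    }
}
```

With a threshold of 0.55, a candidate that wins 50 of 100 evaluation games is discarded, while one that wins 60 of 100 joins the pool and can be drawn as an opponent from then on.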
The game simulator
We need a game engine that runs locally — no HTTP, no web server, a function that advances the board state. This is the GameEnv that Part 6 referenced.
The simulator doesn’t need every BattleSnake rule. For training, we need: snakes move, food spawns, snakes die if they hit walls or bodies, health goes down, eating food restores health. That’s the core loop.
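The key idea: GameEnv::step() advances the board by one turn, returns one reward per player, and reports whether the game is over. step_with_action(player_idx, action) does the same, but lets the training loop supply one player’s move directly. reset() starts a fresh game. Here’s the implementation: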
#![allow(unused)]
fn main() {
use snake_ml::{Board, Point, Snake};
/// A player in the simulation — either a neural network or a heuristic
pub trait Player {
fn decide(&self, board: &Board, my_snake: &Snake) -> u32;
}
/// The game simulator
pub struct GameEnv {
board: Board,
players: Vec<Box<dyn Player>>,
done: bool,
turn: u32,
max_turns: u32,
}
#[derive(Debug, Clone)]
pub struct StepResult {
pub board: Board,
pub rewards: Vec<f32>, // one reward per player
pub done: bool,
pub winner: Option<usize>, // Some(index) or None for draw
}
impl GameEnv {
pub fn new(board: Board, players: Vec<Box<dyn Player>>) -> Self {
Self {
board,
players,
done: false,
turn: 0,
max_turns: 200,
}
}
/// Reset the environment with a fresh board and return the initial state
pub fn reset(&mut self) -> Board {
self.board = random_board(11, 11, self.players.len());
self.done = false;
self.turn = 0;
self.board.clone()
}
/// Advance the game one turn: each player decides, then we resolve
pub fn step(&mut self) -> StepResult {
// 1. Each player picks a direction
let directions: Vec<u32> = self.players.iter()
.enumerate()
.map(|(i, p)| p.decide(&self.board, &self.board.snakes[i]))
.collect();
self.resolve_step(directions)
}
/// Advance the game one turn, where player `player_idx` takes `action`
/// and all other players decide via their own `Player::decide()`. This
/// variant is needed for REINFORCE training: the training loop must
/// record the *actual* action the training player took (including
/// ε-greedy exploration), not re-derive it after the fact.
pub fn step_with_action(&mut self, player_idx: usize, action: u32) -> StepResult {
let mut directions = Vec::with_capacity(self.players.len());
for (i, player) in self.players.iter().enumerate() {
if i == player_idx {
directions.push(action);
} else {
directions.push(player.decide(&self.board, &self.board.snakes[i]));
}
}
self.resolve_step(directions)
}
/// Internal: apply the movement, collision, and reward logic for a set of
/// chosen directions. Called by both `step()` (all players decide) and
/// `step_with_action()` (one player's action is specified externally).
fn resolve_step(&mut self, directions: Vec<u32>) -> StepResult {
// 2. Move snakes
for (i, &dir) in directions.iter().enumerate() {
let (dx, dy) = match dir {
0 => (0, 1), // up
1 => (0, -1), // down
2 => (-1, 0), // left
3 => (1, 0), // right
_ => (0, 0),
};
let snake = &mut self.board.snakes[i];
let new_head = Point {
x: snake.head.x + dx,
y: snake.head.y + dy,
};
// Check for food first, while `new_head` is still ours to read:
// it is moved into `snake.head` below.
let on_food = self.board.food.iter()
.any(|f| f.x == new_head.x && f.y == new_head.y);
// Insert new head at the front of the body
snake.body.insert(0, new_head.clone());
snake.head = new_head;
// If the head isn't on food, remove the tail (snake doesn't grow)
if !on_food {
snake.body.pop();
}
// Decrement health every turn. If the snake ate food this
// turn, reset to 100 *after* the decrement. Order matters:
// if we set health=100 then decrement, the reward check
// (health == 100) below never fires — health would be 99.
snake.health = snake.health.saturating_sub(1);
if on_food {
snake.health = 100;
}
}
// 3. Check for deaths
let mut dead = vec![false; self.players.len()];
for (i, snake) in self.board.snakes.iter().enumerate() {
// Wall collision
if snake.head.x < 0 || snake.head.x >= self.board.width as i32
|| snake.head.y < 0 || snake.head.y >= self.board.height as i32
{
dead[i] = true;
continue;
}
// Self collision (head hit own body, starting from index 1)
for (j, seg) in snake.body.iter().enumerate() {
if j > 0 && seg.x == snake.head.x && seg.y == snake.head.y {
dead[i] = true;
break;
}
}
// Body collision with other snakes
if dead[i] { continue; }
for (j, other) in self.board.snakes.iter().enumerate() {
if i == j { continue; }
// Did I hit the other snake's body?
for seg in &other.body {
if seg.x == snake.head.x && seg.y == snake.head.y {
// Check for head-to-head: if both heads just moved here
if snake.head.x == other.head.x && snake.head.y == other.head.y {
// Shorter snake dies; if same length, both die
if snake.body.len() <= other.body.len() {
dead[i] = true;
}
} else {
dead[i] = true;
}
break;
}
}
}
}
// 4. Starvation check
for (i, snake) in self.board.snakes.iter().enumerate() {
if snake.health == 0 {
dead[i] = true;
}
}
// 5. Compute rewards
let mut rewards = vec![0.0_f32; self.players.len()];
let mut winner = None;
let alive_count = dead.iter().filter(|&&d| !d).count();
for (i, is_dead) in dead.iter().enumerate() {
if *is_dead {
rewards[i] = -1.0;
} else {
// Survived this turn
rewards[i] = 0.01;
// Ate food (health was reset to 100)
if self.board.snakes[i].health == 100 {
rewards[i] += 1.0;
}
}
}
// If only one snake is alive, it wins
if alive_count == 1 {
winner = dead.iter().position(|&d| !d);
if let Some(w) = winner {
rewards[w] += 5.0; // bonus for winning
}
self.done = true;
} else if alive_count == 0 {
self.done = true;
}
self.turn += 1;
if self.turn >= self.max_turns {
self.done = true;
}
// 6. Remove eaten food
self.board.food.retain(|f| {
!self.board.snakes.iter().any(|s| s.head.x == f.x && s.head.y == f.y)
});
// 7. Spawn new food (simplified: one new piece, and only once the board has none left)
if self.board.food.is_empty() {
if let Some(pos) = random_empty_cell(&self.board) {
self.board.food.push(pos);
}
}
StepResult {
board: self.board.clone(),
rewards,
done: self.done,
winner,
}
}
pub fn is_done(&self) -> bool {
self.done
}
}
fn random_board(width: u32, height: u32, num_snakes: usize) -> Board {
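// Deterministic despite the name: snakes start near opposite corners,
// with some food in the middle. Only two start positions are defined,
// so this supports at most two snakes.
let mut snakes = Vec::with_capacity(num_snakes);
let start_positions = [
(1, 1),
(width as i32 - 2, height as i32 - 2),
];
for i in 0..num_snakes.min(start_positions.len()) {
let (sx, sy) = start_positions[i];
// Lay the body out toward the board interior so no segment starts off-board
let dy = if sy < (height as i32) / 2 { 1 } else { -1 };
let body: Vec<Point> = (0..3).map(|j| Point { x: sx, y: sy + j * dy }).collect();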
snakes.push(Snake {
id: format!("snake-{i}"),
body: body.clone(),
head: body[0].clone(),
health: 100,
});
}
let mid_x = (width / 2) as i32;
let mid_y = (height / 2) as i32;
Board {
width,
height,
food: vec![
Point { x: mid_x, y: mid_y },
Point { x: mid_x - 2, y: mid_y },
Point { x: mid_x + 2, y: mid_y },
],
snakes,
hazards: vec![],
}
}
fn random_empty_cell(board: &Board) -> Option<Point> {
// Simplified: try a few random positions, return the first empty one
use rand::Rng;
let mut rng = rand::thread_rng();
for _ in 0..20 {
let x = rng.gen_range(0..board.width as i32);
let y = rng.gen_range(0..board.height as i32);
let occupied = board.snakes.iter()
.any(|s| s.body.iter().any(|b| b.x == x && b.y == y));
if !occupied {
return Some(Point { x, y });
}
}
None
}
}
This simulator handles the core mechanics. It’s not a perfect BattleSnake engine — for instance, it doesn’t handle simultaneous body-collision resolution the way the real engine does, and food spawning is simplified. But it’s good enough to generate training signal. The reward structure matches Part 6: eat food (+1), survive (+0.01), die (-1), win (+5).
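The per-turn bookkeeping is easier to check in isolation. The sketch below restates the reward rules and the health update as two small pure functions. `turn_reward` and `next_health` are hypothetical helper names, not part of `GameEnv`, and the constants are simply the values quoted above.

```rust
pub const SURVIVE_REWARD: f32 = 0.01;
pub const FOOD_REWARD: f32 = 1.0;
pub const DEATH_REWARD: f32 = -1.0;
pub const WIN_BONUS: f32 = 5.0;

/// Reward for one snake on one turn, following the simulator's rules:
/// a dead snake gets the death penalty and nothing else; a living snake
/// gets the survival reward, plus bonuses for eating and for winning.
pub fn turn_reward(died: bool, ate: bool, won: bool) -> f32 {
    if died {
        return DEATH_REWARD;
    }
    let mut reward = SURVIVE_REWARD;
    if ate {
        reward += FOOD_REWARD;
    }
    if won {
        reward += WIN_BONUS;
    }
    reward
}

/// Health after one turn: decrement first, then reset to 100 on food.
/// The order matters because the simulator detects "ate food this turn"
/// by checking for health == 100.
pub fn next_health(health: u32, ate: bool) -> u32 {
    let decayed = health.saturating_sub(1);
    if ate { 100 } else { decayed }
}
```

Because the food bonus is keyed on health being exactly 100, it is worth confirming that a snake starting a game at full health does not collect it on the first turn: `next_health(100, false)` is 99, so the check stays quiet until the snake actually eats.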
The network as a player
Here’s what we need before the code: the GameEnv simulator calls player.decide() every turn to get the next direction. It doesn’t care whether it’s talking to a neural network or a heuristic — it only needs a direction. That’s why we use a trait. The Player trait abstracts over both so the same GameEnv can run games against the A* heuristic, against a random network, or against a trained one.
This is where the trained weights come in — a player with trained weights plays differently than one with random weights.
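To make the trait idea concrete, here is a sketch of the simplest non-neural player that could sit on the other side of the board. `SafeMovePlayer` is a hypothetical baseline, not the A* heuristic from earlier parts, and the `Point`, `Snake`, and `Board` structs are cut-down stand-ins for the `snake_ml` types so that the example compiles on its own.

```rust
// Minimal stand-ins for the snake_ml types (assumptions for this sketch).
#[derive(Clone, Debug, PartialEq)]
pub struct Point { pub x: i32, pub y: i32 }

#[derive(Clone, Debug)]
pub struct Snake { pub head: Point, pub body: Vec<Point> }

pub struct Board { pub width: u32, pub height: u32, pub snakes: Vec<Snake> }

pub trait Player {
    fn decide(&self, board: &Board, my_snake: &Snake) -> u32;
}

/// Hypothetical baseline: take the first direction that stays on the
/// board and does not run into any snake's body.
pub struct SafeMovePlayer;

impl Player for SafeMovePlayer {
    fn decide(&self, board: &Board, my_snake: &Snake) -> u32 {
        // Same encoding as the simulator: 0 = up, 1 = down, 2 = left, 3 = right.
        const DELTAS: [(i32, i32); 4] = [(0, 1), (0, -1), (-1, 0), (1, 0)];
        for (dir, (dx, dy)) in DELTAS.iter().enumerate() {
            let x = my_snake.head.x + dx;
            let y = my_snake.head.y + dy;
            let in_bounds = x >= 0
                && x < board.width as i32
                && y >= 0
                && y < board.height as i32;
            let blocked = board
                .snakes
                .iter()
                .any(|s| s.body.iter().any(|b| b.x == x && b.y == y));
            if in_bounds && !blocked {
                return dir as u32;
            }
        }
        0 // no safe move exists; any direction loses
    }
}
```

Since the simulator only ever sees `Box<dyn Player>`, a player like this can be dropped into the players list next to a network player without changing any simulator code.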
What we’re building: a struct that holds a SnakeNet and epsilon. When decide() is called, it encodes the board state, runs it through the network, and returns the direction. Here’s the full implementation:
#![allow(unused)]
fn main() {
use candle_core::{Device, DType, Tensor};
use candle_nn::{VarBuilder, VarMap};
use rand::Rng; // brings gen_range into scope for the ε-greedy branch
use snake_ml::{SnakeNet, encode_board, Board, Snake};
/// A neural network player. Uses the trained (or random) weights
/// to pick a direction each turn.
pub struct NetworkPlayer {
net: SnakeNet,
device: Device,
epsilon: f32, // exploration rate for ε-greedy
}
impl NetworkPlayer {
pub fn from_var_map(var_map: &VarMap, epsilon: f32) -> Self {
let device = Device::Cpu;
let vs = VarBuilder::from_varmap(var_map, DType::F32, &device);
let net = SnakeNet::new(vs).expect("failed to build network");
Self { net, device, epsilon }
}
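pub fn from_checkpoint(path: &str, epsilon: f32) -> Self {
let mut var_map = VarMap::new();
// Build the network first: that registers its variables in the map.
// VarMap::load only overwrites variables that already exist, so loading
// into an empty map would leave the network with random weights.
let player = Self::from_var_map(&var_map, epsilon);
var_map.load(path).expect("failed to load checkpoint");
player
}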
}
impl Player for NetworkPlayer {
fn decide(&self, board: &Board, my_snake: &Snake) -> u32 {
// ε-greedy: explore sometimes
if rand::random::<f32>() < self.epsilon {
return rand::thread_rng().gen_range(0..4) as u32;
}
// Encode the board and pick the best direction
let tensor = encode_board(board, my_snake)
.expect("encoding failed");
let flat = tensor.flatten_all()
.expect("flatten failed")
.unsqueeze(0)
.expect("batch dim failed");
let logits = self.net.forward(&flat)
.expect("forward pass failed");
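// argmax over dim 1 leaves a tensor of shape [1] (the batch dim);
// squeeze it to rank 0 before reading it out as a scalar.
logits.argmax(1)
.expect("argmax failed")
.squeeze(0)
.expect("squeeze failed")
.to_scalar::<u32>()
.expect("scalar failed")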
}
}
}
The epsilon parameter controls exploration. During training, the training snake uses a higher epsilon (0.1 — try random moves 10% of the time). The opponent uses epsilon = 0 (always pick the best move it knows) so it plays at full strength.
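The selection rule itself is small enough to test without a network. In this sketch the two random draws are passed in as arguments (`explore_roll` and `action_roll`, both assumed to be uniform samples in [0, 1)), which keeps the function deterministic and free of external crates. `epsilon_greedy` and `argmax` are illustrative helpers, not functions from `snake_ml`.

```rust
/// Index of the largest logit (the greedy action).
/// Returns 0 for an empty slice.
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &value) in logits.iter().enumerate() {
        if value > logits[best] {
            best = i;
        }
    }
    best
}

/// ε-greedy selection. With probability `epsilon` (decided by
/// `explore_roll`), pick a uniformly random action using `action_roll`;
/// otherwise pick the greedy action.
pub fn epsilon_greedy(
    logits: &[f32],
    epsilon: f32,
    explore_roll: f32,
    action_roll: f32,
) -> usize {
    if explore_roll < epsilon {
        let n = logits.len();
        // Map [0, 1) onto 0..n, clamping in case action_roll is exactly 1.0.
        ((action_roll * n as f32) as usize).min(n.saturating_sub(1))
    } else {
        argmax(logits)
    }
}
```

With `epsilon = 0.0` the explore branch can never fire, which is the behaviour the frozen opponent relies on: it always plays the move its network rates highest.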
What comes next
We have the simulator and a way to plug the network in as a player. Part 8 puts the pieces together: the opponent pool, the full self-play training loop, and what it looks like when the training actually works.