Monte Carlo Methods in Game Theory: MCTS, ISMCTS, and CFR

Ziyi Zhu / March 29, 2025

Game theoretic algorithms have revolutionized AI's ability to master complex games ranging from chess and Go to poker. In this blog post, we'll examine three powerful approaches that leverage Monte Carlo sampling in different ways: Monte Carlo Tree Search (MCTS), Information Set Monte Carlo Tree Search (ISMCTS), and Counterfactual Regret Minimization (CFR).

Monte Carlo Tree Search (MCTS)

MCTS has proven exceptionally effective in perfect information games like chess and Go by iteratively constructing a search tree through strategic sampling.

Core Principles of MCTS

Unlike traditional minimax search with alpha-beta pruning, MCTS builds a game tree through repeated simulations, gradually focusing exploration on promising moves without requiring a handcrafted evaluation function. This makes it particularly effective for games with large branching factors where exhaustive search is infeasible.

The algorithm operates through four distinct phases that form a single iteration:

  1. Selection: Starting from the root, the algorithm traverses the existing tree using a selection policy (typically UCB1) until it reaches a node that still has unexpanded children or is terminal.
  2. Expansion: One or more child nodes are added to the tree, representing possible moves from the selected position.
  3. Simulation: From the newly added node(s), a playout is performed, typically following a random or lightweight heuristic policy, until a terminal state is reached.
  4. Backpropagation: The result of the simulation is propagated back up the tree, updating statistics for each traversed node.

The power of MCTS comes from repeating these iterations thousands or millions of times, gradually refining the search tree to concentrate on the most promising lines of play.
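
The loop below is a minimal sketch of one way to wire these four phases together, assuming a hypothetical `GameState` interface with `legal_moves()`, `play(move)`, `is_terminal()`, and `result()` methods (the last returning a reward from the root player's perspective); it uses the UCB1 selection rule described in the next section and omits practical details such as two-player sign handling.

```python
import math
import random

class Node:
    """A search-tree node holding visit and reward statistics."""
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = state.legal_moves()
        self.visits, self.reward = 0, 0.0

    def ucb1(self, c=math.sqrt(2)):
        return (self.reward / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, iterations=10_000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and has children.
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one child for an untried move.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout to a terminal state.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        reward = state.result()
        # 4. Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.reward += reward
            node = node.parent
    # Recommend the most-visited move from the root.
    return max(root.children, key=lambda n: n.visits).move
```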

The UCB1 Formula: Balancing Exploration and Exploitation

The effectiveness of MCTS hinges on its selection policy, which must balance two competing objectives:

  • Exploitation: Favoring moves that have performed well in previous simulations
  • Exploration: Giving attention to less-explored moves that might be undervalued

This classic exploration-exploitation dilemma is elegantly addressed using the Upper Confidence Bound 1 (UCB1) formula:

$$UCB1(s_i) = \frac{w_i}{n_i} + C \sqrt{\frac{\ln N}{n_i}}$$

Where:

  • $w_i$ = total reward accumulated from node $i$
  • $n_i$ = number of visits to node $i$
  • $N$ = total visits to the parent node
  • $C$ = exploration parameter (typically $\sqrt{2}$)

This formula creates a statistically sound balance between moves with high win rates (first term) and moves that haven't been thoroughly explored (second term). As a node is visited more frequently, its exploration term diminishes, naturally shifting focus to less-explored alternatives.
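
As a quick numerical illustration (the visit counts are made up), consider two children of a parent that has been visited 100 times: one with 60 wins in 80 visits, the other with 3 wins in 5 visits. The under-explored child wins the comparison despite its lower win rate, because its exploration bonus is much larger.

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: exploitation term plus exploration bonus."""
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Well-explored child: high win rate, small exploration bonus.
print(ucb1(wins=60, visits=80, parent_visits=100))  # ~1.09
# Under-explored child: lower win rate, large exploration bonus.
print(ucb1(wins=3, visits=5, parent_visits=100))    # ~1.96
```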

Advanced Sampling Strategies

The efficiency of MCTS depends significantly on the quality of its simulation policy:

  • Pure random policy: Selects moves uniformly at random—domain-independent but potentially inefficient.
  • Light playout policy: Incorporates simple, computationally inexpensive heuristics to guide random playouts.
  • Heavy playout policy: Utilizes sophisticated domain knowledge, potentially including neural networks, to guide simulations.

AlphaGo dramatically improved upon traditional MCTS by using neural networks to guide both the selection and simulation phases, demonstrating how the basic MCTS framework can be enhanced with domain-specific knowledge.
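
To make the distinction concrete, here is a minimal sketch of a pure random playout next to a light playout, reusing the hypothetical `GameState` interface from the MCTS sketch above and assuming a cheap, caller-supplied `score_move(state, move)` heuristic; a heavy playout would replace that heuristic with something like a policy network.

```python
import random

def random_playout(state):
    """Pure random policy: domain-independent baseline."""
    while not state.is_terminal():
        state = state.play(random.choice(state.legal_moves()))
    return state.result()

def light_playout(state, score_move, epsilon=0.1):
    """Light playout: mostly follow a cheap heuristic, with occasional
    random moves to preserve diversity in the simulations."""
    while not state.is_terminal():
        moves = state.legal_moves()
        if random.random() < epsilon:
            move = random.choice(moves)
        else:
            move = max(moves, key=lambda m: score_move(state, m))
        state = state.play(move)
    return state.result()
```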

Information Set Monte Carlo Tree Search (ISMCTS)

While MCTS excels in perfect information games, it falls short when players have incomplete information about the game state. ISMCTS extends the MCTS framework to handle such imperfect information scenarios.

The Challenge of Imperfect Information

In perfect information games like chess, all players observe the same game state. However, in games like poker, Battleship, or card games, players have private information hidden from opponents. This creates several fundamental challenges:

  1. Players must reason about multiple possible states consistent with their observations
  2. Strategies must remain consistent across states that are indistinguishable to a player
  3. Players must consider how their actions might reveal information about their private knowledge

ISMCTS addresses these challenges by introducing the concept of information sets and working with determinizations.

Key Concepts in ISMCTS

  • Information Set: A collection of game states that are indistinguishable to a particular player given their observations. For example, in a card game, all possible arrangements of your opponent's cards form an information set from your perspective.

  • Determinization: A technique that converts an imperfect information game into a perfect information instance by fixing all hidden information to specific values consistent with known constraints. For example, assigning specific cards to opponents in a card game.

  • Single-Observer ISMCTS (SO-ISMCTS): Builds a single search tree from the perspective of the active player, with nodes representing information sets rather than individual states.

  • Multiple-Observer ISMCTS (MO-ISMCTS): Maintains separate trees for each player's perspective, better capturing the asymmetric information available to different players, but at higher computational cost.

Modified UCB1 for Information Sets

Since not all actions are legal in every state within an information set, the UCB1 formula requires modification for ISMCTS:

$$UCB1(s_i) = \frac{w_i}{n_i} + C \sqrt{\frac{\ln \sum_{j \in siblings(i)} n_j}{n_i}}$$

This adaptation accounts for the fact that some actions may only be available in a subset of the possible states within an information set. By using the sum of visits to sibling nodes rather than parent visits, the formula ensures that actions available in fewer determinizations aren't unfairly disadvantaged.
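
Here is a minimal sketch of how this modified selection rule might look in code, reusing node objects like those in the MCTS sketch and assuming a `determinization` object that exposes `legal_moves()`; it follows the formula above by summing visits over the siblings that are currently available.

```python
import math

def ismcts_ucb1(node, available_children, c=math.sqrt(2)):
    """Modified UCB1 for ISMCTS: the exploration term uses the total visits
    of the sibling actions legal in the current determinization instead of
    the parent's visit count."""
    if node.visits == 0:
        return float("inf")  # always try an available-but-unvisited action first
    sibling_visits = sum(ch.visits for ch in available_children)
    return (node.reward / node.visits
            + c * math.sqrt(math.log(sibling_visits) / node.visits))

def select_child(children, determinization):
    # Only actions that are legal in the sampled determinization compete.
    available = [ch for ch in children
                 if ch.move in determinization.legal_moves()]
    return max(available, key=lambda ch: ismcts_ucb1(ch, available))
```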

Determinization Sampling Strategies

The effectiveness of ISMCTS depends critically on how determinizations are sampled:

  • Uniform Sampling: Each possible determinization consistent with observed information is equally likely—simple but potentially unrealistic.
  • Biased Sampling: Incorporates beliefs about opponent strategies to sample more plausible determinizations.
  • Importance Sampling: Weights determinizations based on their likelihood, giving more influence to common scenarios while still accounting for rare but significant ones.

By averaging across many determinizations, ISMCTS approximates the expected value across the entire information set, enabling decision-making under uncertainty.
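
As a concrete illustration, here is a minimal sketch of uniform determinization sampling for a card game; the argument names (`my_hand`, `seen_cards`, `full_deck`, `opponent_hand_size`) are hypothetical. Biased or importance-weighted sampling would replace the uniform shuffle with a draw from a belief distribution over opponent holdings.

```python
import random

def sample_determinization(my_hand, seen_cards, full_deck,
                           opponent_hand_size, rng=random):
    """Uniformly sample one assignment of hidden cards that is consistent
    with everything the player has observed so far."""
    unseen = [c for c in full_deck if c not in my_hand and c not in seen_cards]
    rng.shuffle(unseen)
    opponent_hand = unseen[:opponent_hand_size]
    stock = unseen[opponent_hand_size:]
    return opponent_hand, stock
```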

Counterfactual Regret Minimization (CFR)

While both MCTS and ISMCTS rely on tree-based search with Monte Carlo sampling, CFR takes a fundamentally different approach to solving imperfect information games. Rather than sampling individual determinizations, it directly operates on information sets and probability distributions over actions.

Regret-Based Strategy Learning

CFR is founded on the concept of regret—how much a player would have gained by playing differently in retrospect. Instead of directly searching for an equilibrium, CFR iteratively updates strategies to minimize counterfactual regret.

The algorithm maintains two key components for each information set and action:

  1. The cumulative regret for not taking each action
  2. The average strategy played across all iterations

Remarkably, as regret accumulates and strategies adapt to minimize it, the average strategy converges to a Nash equilibrium, providing theoretical guarantees that MCTS variants lack.
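
Below is a minimal sketch of the per-information-set state CFR maintains; the two-action default and the string keying convention are illustrative assumptions, not part of the algorithm. Note that it is the average strategy, not the latest one, that converges to equilibrium.

```python
from collections import defaultdict

class InfoSetNode:
    """Per-information-set bookkeeping maintained by CFR (a sketch)."""
    def __init__(self, num_actions):
        self.regret_sum = [0.0] * num_actions    # cumulative counterfactual regret per action
        self.strategy_sum = [0.0] * num_actions  # cumulative (reach-weighted) strategy per action

    def average_strategy(self):
        """Normalize the accumulated strategy; fall back to uniform if empty."""
        total = sum(self.strategy_sum)
        if total > 0:
            return [s / total for s in self.strategy_sum]
        return [1.0 / len(self.strategy_sum)] * len(self.strategy_sum)

# One node per information set; the string key is an assumed encoding of the
# player's observations (e.g. own cards plus the public action history).
nodes = defaultdict(lambda: InfoSetNode(num_actions=2))
```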

Mathematical Framework

The core of CFR revolves around computing counterfactual values and regrets:

Counterfactual Value:

$$v_i(I, a) = \sum_{h \in I} \sum_{z \in Z} \pi^{-i}(h) \cdot \pi(h, z, a) \cdot u_i(z)$$

where:

  • $I$ = information set
  • $a$ = action
  • $h$ = history (sequence of actions from game start)
  • $\pi^{-i}(h)$ = probability of reaching $h$ based on other players' strategies
  • $\pi(h, z, a)$ = probability of reaching terminal state $z$ from $h$ by taking action $a$
  • $u_i(z)$ = utility for player $i$ at terminal state $z$

The term "counterfactual" is appropriate because we're calculating the expected value under the hypothetical scenario where we reach information set II (which might be unlikely under current strategies) and then take action aa.

Immediate Counterfactual Regret:

$$r_i^T(I, a) = \sum_{t=1}^T \left( v_i(I, a, \sigma^t) - v_i(I, \sigma^t) \right)$$

This tracks the cumulative regret over $T$ iterations for not consistently playing action $a$ at information set $I$.

Regret Matching: The Strategy Update Mechanism

The elegance of CFR lies in how it converts accumulated regrets into a refined strategy through regret matching:

$$\sigma_i^{T+1}(I, a) = \begin{cases} \frac{r_i^{T,+}(I, a)}{\sum_{a' \in A(I)} r_i^{T,+}(I, a')} & \text{if } \sum_{a' \in A(I)} r_i^{T,+}(I, a') > 0 \\ \frac{1}{|A(I)|} & \text{otherwise} \end{cases}$$

where $r_i^{T,+}(I, a) = \max(r_i^T(I, a), 0)$ represents positive regret.

This formula encapsulates an intuitive principle: allocate probability to actions proportionally to how much you regret not having played them more. Actions that consistently perform well accumulate positive regret and receive higher probability in future iterations.
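
The rule translates almost directly into code; the small function below mirrors the two cases of the formula.

```python
def regret_matching(regret_sum):
    """Allocate probability proportionally to positive regret;
    fall back to a uniform strategy when no action has positive regret."""
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regret_sum)] * len(regret_sum)
```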

Example: CFR Probability Calculation in Rock-Paper-Scissors

To concretize this process, consider a scenario in Rock-Paper-Scissors where after several iterations, we've accumulated these regrets:

  • Rock: -5 (negative regret—we wish we had played Rock less)
  • Paper: +10 (positive regret—we wish we had played Paper more)
  • Scissors: +15 (positive regret—we wish we had played Scissors more)

Following the regret matching procedure:

  1. Calculate positive regrets:

    • Rock: max(-5, 0) = 0
    • Paper: max(10, 0) = 10
    • Scissors: max(15, 0) = 15
  2. Sum positive regrets: 0 + 10 + 15 = 25

  3. Calculate probabilities:

    • Rock: 0/25 = 0%
    • Paper: 10/25 = 40%
    • Scissors: 15/25 = 60%

In the next iteration, we would play Paper with 40% probability and Scissors with 60% probability, completely avoiding Rock. This strategy update directly responds to accumulated regrets, gradually converging toward an equilibrium.
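
To close the loop, here is a self-contained sketch of two regret-matching learners playing Rock-Paper-Scissors against each other (the `regret_matching` helper is repeated so the snippet runs on its own). Plugging in the regrets above reproduces the 0%/40%/60% strategy, and over many iterations the average strategy tends toward the uniform 1/3-1/3-1/3 equilibrium.

```python
import random

ACTIONS = ["Rock", "Paper", "Scissors"]
# Row player's payoff: UTILITY[my_action][opponent_action].
UTILITY = [[0, -1, 1],
           [1, 0, -1],
           [-1, 1, 0]]

def regret_matching(regret_sum):
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    return ([p / total for p in positive] if total > 0
            else [1.0 / len(regret_sum)] * len(regret_sum))

print(regret_matching([-5, 10, 15]))  # [0.0, 0.4, 0.6] as in the example above

def train(iterations=200_000, rng=random.Random(0)):
    my_regrets, opp_regrets = [0.0] * 3, [0.0] * 3
    strategy_sum = [0.0] * 3
    for _ in range(iterations):
        my_strategy = regret_matching(my_regrets)
        opp_strategy = regret_matching(opp_regrets)
        a = rng.choices(range(3), weights=my_strategy)[0]
        b = rng.choices(range(3), weights=opp_strategy)[0]
        # Regret for each alternative = its payoff minus the payoff actually obtained.
        for alt in range(3):
            my_regrets[alt] += UTILITY[alt][b] - UTILITY[a][b]
            opp_regrets[alt] += UTILITY[alt][a] - UTILITY[b][a]
        strategy_sum = [s + p for s, p in zip(strategy_sum, my_strategy)]
    total = sum(strategy_sum)
    return dict(zip(ACTIONS, (round(s / total, 3) for s in strategy_sum)))

print(train())  # average strategy drifts toward the uniform equilibrium (~1/3 each)
```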

Sampling Variants for Computational Efficiency

Traditional CFR becomes computationally intractable for large games because it requires traversing the entire game tree. Several sampling variants make CFR practical for complex games:

  • Chance Sampling (CS-CFR): Samples only one outcome of chance nodes (like card deals), reducing the branching factor of the game tree.

  • External Sampling (ES-CFR): Samples opponent actions and chance outcomes, but considers all of the player's own actions, reducing variance in strategy updates.

  • Outcome Sampling (OS-CFR): Samples a single path through the tree and updates only the information sets on that path—most computationally efficient per iteration, but may require more iterations to converge.

  • Monte Carlo CFR (MCCFR): A family of algorithms combining various sampling approaches, offering flexible trade-offs between computational cost per iteration and number of iterations required.

These sampling techniques dramatically reduce computational requirements while preserving convergence guarantees, making CFR viable for large-scale games like poker.
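
To illustrate the simplest of these ideas, here is a minimal sketch contrasting how a chance node might be handled in vanilla CFR versus chance-sampled CFR; `traverse` stands in for the recursive CFR walk and `chance_outcomes` is a hypothetical list of (outcome, probability) pairs.

```python
import random

def chance_node_value_full(chance_outcomes, traverse):
    """Vanilla CFR: the value of a chance node is the probability-weighted
    average over every possible outcome (e.g. every card deal)."""
    return sum(prob * traverse(outcome) for outcome, prob in chance_outcomes)

def chance_node_value_sampled(chance_outcomes, traverse, rng=random):
    """Chance sampling (CS-CFR): draw one outcome according to its probability
    and recurse only into that branch; the estimate stays unbiased because
    outcomes are sampled in proportion to their likelihood."""
    outcomes, probs = zip(*chance_outcomes)
    return traverse(rng.choices(outcomes, weights=probs)[0])
```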

Comparative Analysis: Strengths and Applications

| Algorithm | Information Type | Convergence Properties | Memory Requirements | Key Advantage | Primary Application Domain |
| --- | --- | --- | --- | --- | --- |
| MCTS | Perfect | No theoretical guarantee; empirically strong | Low to moderate | Effective heuristic search without domain knowledge | Chess, Go, general perfect information games |
| ISMCTS | Imperfect | No theoretical guarantee; quality depends on sampling | Moderate | Extends MCTS to hidden information through sampling | Card games, partially observable board games |
| CFR | Imperfect | Guaranteed convergence to Nash equilibrium | High | Provably optimal in two-player zero-sum games | Poker, strategic games with hidden information |

The Poker Revolution: Why CFR Prevails

CFR and its variants have decisively outperformed other approaches in poker AI for several reasons:

  1. Strategic Depth: CFR naturally develops balanced strategies that include sophisticated concepts like bluffing with appropriate frequency.

  2. Information Structure Awareness: CFR reasons directly about information asymmetry rather than through sampled determinizations, avoiding "strategy fusion" problems.

  3. Equilibrium Guarantees: In two-player zero-sum games like Heads-Up No-Limit Texas Hold'em, a Nash equilibrium strategy guarantees a minimum expected value regardless of opponent strategy.

  4. Scalability through Abstraction: Modern CFR variants can handle enormous game trees through clever abstraction techniques that group similar situations.

The breakthrough poker AI systems Libratus and Pluribus, which defeated top human professionals, both relied on advanced CFR variants, demonstrating its superiority for complex imperfect information games where strategic deception is fundamental.