SameGame can have up to 10 ^{182} reachable board positions .
Therefore it is extremely difficult to solve, even for computers.
A very promising approach to solving complex problems such as SameGame is the Nested Monte Carlo Search (NMCS). It is a very simple variation from the family of Monte Carlo Search algorithms, especially suited for single player games.
In this post we will see a generic implementation of the NMCS in Java that could easily be adapted to different problem domains.
But before we come to that, let’s examine how the NCMS works.
Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) algorithms are extremely relevant today for state of the art artificial intelligence implementations.
One recent famous example is Google’s AI AlphaGo that mastered the game of Go . AlphaGo utilizes a MCTS for one of its core components.
The MCTS algorithms produce very good solutions to problems where other traditional methods fail.
These problems usually tend to have a very large search space such as games like Go, Chess or SameGame.
What is a Monte Carlo Tree Search?
A MCTS is a heuristic algorithm which is based on running many random simulations.
In a simulation (sometimes also called playout, rollout or sample) random legal actions are applied to an initial state until a terminal state is reached.
If the number of simulations approaches infinity, the result converges to an optimal solution.
This is of course impractical.
Therefore the simulation strategy is combined with a tree search to guide the search to more promising branches of the search tree. This often produces very good results within a reasonable amount time.
Generic application
A MCTS doesn’t need any domain specific knowledge. It can be applied to any problem that can be described in terms of
- state
- list of legal actions
- score
- function that takes a state and an action and returns a new state
- simulation strategy
Nested Monte Carlo Search
In this post we will examine a very basic Monte Carlo algorithm called Nested Monte Carlo Search (NMCS).
The NMCS is especially suited for complex single player games such as SameGame .
Basic algorithm
A NMCS repeats the following steps until time runs out or until the search terminates.
Selection
In the beginning the initial state (root) is selected.
Otherwise the action leading to the best score found so far is selected.
Simulation
For the previously selected state all legal actions are determined.
For each of these next actions a simulation is played out.
Backpropagation
The best result found during the previous simulation step is stored in memory.
Search levels
The specialty of the NMCS is that during the simulation step the simulations themselves can be nested applications of the NMCS algorithm to all the children of the current state.
The depth of nesting can be described in terms of search levels .
Recursive definition
Level | Description |
---|---|
0 | One playout starting at the current node. |
1 | During simulation steps, conduct a level-0 search for all children. |
n > 2 | During simulation steps, conduct a recursive level-(n-1) search for all children. |
Example of a level-2 search step
The following diagram shows one iteration of a level-2 search:
Explanation:
- The node
1
is selected. - Next, a complete 2 level deep subtree of the search space with
1
as its root is traversed. This subtree has 5 leafs,5
–8
. - For each leaf a normal simulation is played out.
- The best result is found by the simulation from node
7
.The result is propagated back and the next node3
from the best result will be selected for the next iteration.
Playout policy
A level-0 search is a normal simulation and defined as one playout starting from a given node until a terminal state is found.
On this path every action is chosen by a playout policy.
Playout policies could either randomly choose actions or they could use domain specific strategies to improve the results.
Generic implementation
The NMCS algorithm can be applied to every problem, puzzle or game that can be described in terms of the generic interface INmcsState<TState, TAction>
.
Generic interface
The generic type TState
represents the state of the problem e.g. the board position.
The generic type TAction
represents an action that can be applied to a state. For SameGame that would be a point on the board that can be played to remove a group.
The implementation has to provide:
- a score
- a function to get all legal next moves
- a function that takes a state and an action and returns a new state
- an implementation of a playout policy for simulations
public interface INmcsState<TState, TAction> { double getScore(); boolean isTerminalPosition(); // for convenience (same as findAllLegalActions().size() == 0) ArrayList<TAction> findAllLegalActions(); INmcsState<TState, TAction> TakeAction(TAction action); Pair<Double, ArrayList<TAction>> simulation(); }
The implementation must be immutable. That means none of the operations is allowed to change the state of the object that the operation is called on.
If the state has to be updated, instead of updating internal state a new updated version has to be returned.
Of course the implementation could be a wrapper around an existing mutable class where a copy is created before mutating. This would be an application of the adapter design pattern .
Generic algorithm
The NMCS algorithm is defined by the function executeSearch
.
It takes three parameters:
- state
- of type
INmcsState<TState, TAction>
which represents the current state - level
- of type
int
which specifies the search level - isCanceled
- of type
Supplier<Boolean>
which allows to inject code for gracefully stopping the execution, e.g. a timer
The return value is a tuple of a score of type Double
and a list of actions of type ArrayList<TAction>
. The list of actions describes the path from the root to the terminal state.
public static <TState, TAction> Pair<Double, ArrayList<TAction>> executeSearch(INmcsState<TState, TAction> state, final int level, final Supplier<Boolean> isCanceled) { // terminating case if(level <= 0) return state.simulation(); Pair<Double, ArrayList<TAction>> globalBestResult = Pair.of(state.getScore(), new ArrayList<TAction>()); final ArrayList<TAction> previousAppliedActions = new ArrayList<TAction>(); while (!state.isTerminalPosition() && !isCanceled.get()) { Pair<Double, ArrayList<TAction>> currentBestResult = Pair.of(0.0, new ArrayList<TAction>()); TAction currentBestAction = null; for (TAction action : state.findAllLegalActions()) { final INmcsState<TState, TAction> currentState = state.TakeAction(action); // recursion final Pair<Double, ArrayList<TAction>> simulationResult = executeSearch(currentState, level - 1, isCanceled); if (simulationResult.item1 >= currentBestResult.item1) { currentBestAction = action; currentBestResult = simulationResult; } } previousAppliedActions.add(currentBestAction); if (currentBestResult.item1 > globalBestResult.item1) { globalBestResult = currentBestResult; globalBestResult.item2.addAll(0, previousAppliedActions); } state = state.TakeAction(currentBestAction); } return globalBestResult; }
The algorithm terminates either when time runs out or when a node is selected during the selection step that has no child node and therefore is a terminal state.
Here is an example of calling the executeSearch
function:
final int[][] board = BoardGenerator.generateBoard("1,1,1;2,2,2;3,3,3;"); final int level = 2; final long maxRunningTimeMs = 2 * 60 * 1000; final INmcsState<SGBoard, Point> state = new SGNmctsState(board); final long endTimeMs = System.currentTimeMillis() + maxRunningTimeMs; final Pair<Double, ArrayList<Point>> result = NestedMonteCarloSearch.executeSearch(state, level, () -> { return System.currentTimeMillis() > endTimeMs; });
The complete code is available on GitHub .
Enhancements and improvements
Here is a selection of potential improvements:
- Root parallelization
By running independent searches from the root, the risk of getting trapped in a local maximum is reduced.
- Guided playout policy
For some problem domains guided playouts may lead to better results than random sampling.
- More emphasis on exploration as opposed to exploitation
It could be profitable to also try unpromising actions because they still may lead to the total maximum.
- Parallelization and performance optimization
More efficient implementations will provide better results.
- Choose randomly between equally good results
The implementation shown above was intentionally kept simple. If multiple playouts yield the same score, the last playout will always be selected.