Detecting anomalous play in online chess is difficult. A move sequence that agrees closely with a chess engine’s top choices may be suspicious, but engine agreement alone is never sufficient evidence of cheating: strong human players can also find such moves, and ordinary tactical positions may inherently contain forcing, engine-like continuations. A useful alternative is therefore to compare a player’s observed moves not only to an engine, but also to a distribution of plausible human alternatives.
This project implements a counterfactual sampling framework for chess trajectory anomaly detection. Given a starting board position \(S_0\) and an observed sequence of \(k\) moves, we model a trajectory as \(X=(m_1,\ldots,m_k)\). Our goal is to construct a null collection of alternative trajectories that a human player of comparable skill could plausibly have played from \(S_0\). The observed trajectory is then compared against this sampled null distribution. If the observed line lies well within the counterfactual distribution, it is treated as consistent with fair human play; if it is unusually strong relative to the sampled alternatives, it may be flagged as anomalous.
Our framework is built around counterfactual trajectory sampling. We use Maia-2 as a skill-conditioned human move proposal model \(P_0(m\mid s,E_{\text{self}},E_{\text{oppo}})\), so proposed moves are sampled from a human move distribution conditioned on board position and Elo [TJM+24]. We then use Stockfish centipawn loss (CPL) to define a quality-aware target distribution over fixed-length trajectories [Sto26]. These two components are combined through a Metropolis-Hastings sampler: Maia-2 proposes human-plausible continuations, while the Stockfish/CPL target reweights those trajectories toward stronger play.
Because the sampled object is a full move trajectory rather than a scalar variable, convergence across independent chains is a central part of the analysis. We run four independent chains for each analysis window and compare the resulting null distributions. We report the Gelman-Rubin statistic for total CPL and log target density, and we also apply the Partition-based Approximation for Convergence Evaluation (PACE) diagnostic to the sampled move-sequence states to assess convergence in the joint trajectory space [VS17]. This distinction matters: a sampler may appear stable in scalar CPL while still exploring different regions of the discrete trajectory space.
The final prototype implements this sampling pipeline end to end: Lichess data preprocessing, Maia-2 proposal generation, deterministic Stockfish-based CPL scoring, vanilla and mixture Metropolis-Hastings sampling, empirical anomaly scoring, four-chain convergence diagnostics, and a deployable FastAPI/Docker demo. We regard the system as an experimental fair-play analysis framework, not a production-ready cheating detection tool. Its purpose is to explore the feasibility and challenges of counterfactual trajectory sampling as a more statistically grounded baseline for interpreting engine-aligned play.
2. Methods / Implementation
2.1 Data
The data used in this project are collected from the official Lichess open database [Lic25], which provides monthly archives of chess games played on the Lichess platform. We focus on the standard chess games from February 2025. The full Lichess database is extremely large, so using one recent monthly archive makes the data processing computationally feasible while still providing a sufficiently large sample for analysis.
The raw data are distributed in compressed PGN format. Each PGN game contains two main types of information: game-level metadata stored in header tags, and the complete move sequence of the game. During preprocessing, we parse the PGN file game by game and convert it into a structured tabular dataset, where each row represents one game.
For each game, we extract the following variables:
site: the Lichess game URL
utc_date and utc_time: the date and time of the game
white and black: player identifiers
white_elo and black_elo: player Elo ratings before the game
white_rating_diff and black_rating_diff: rating changes after the game
result: game result
event: type of game
time_control: time-control setting
termination: reason why the game ended
eco and opening: opening classification
n_plies: total number of half-moves in the game
moves_uci: the full move sequence in UCI format
We convert the move sequence into UCI format rather than SAN format because UCI provides a standardized coordinate-based representation of each move. This makes the moves easier to tokenize, compare, and use in downstream analysis or modeling.
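The header-extraction step can be illustrated with a small standard-library sketch. It only covers the tag pairs of a single game and maps a few of them to the column names listed above; the SAN-to-UCI conversion of the move text is not shown. The helper names `parse_pgn_headers` and `to_row` are hypothetical and are not taken from the project code.

```python
import re

# Matches one PGN tag pair such as: [WhiteElo "1500"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def parse_pgn_headers(pgn_text):
    """Collect the header tags of a single PGN game into a dict."""
    headers = {}
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line.strip())
        if match:
            headers[match.group(1)] = match.group(2)
    return headers


def to_int(value):
    """Parse values like '1500', '+12' or '-5'; return None for '?' or missing."""
    if value is None or not value.lstrip("+-").isdigit():
        return None
    return int(value)


def to_row(headers):
    """Map PGN tags to a subset of the tabular columns (hypothetical helper)."""
    return {
        "site": headers.get("Site"),
        "white": headers.get("White"),
        "black": headers.get("Black"),
        "white_elo": to_int(headers.get("WhiteElo")),
        "black_elo": to_int(headers.get("BlackElo")),
        "white_rating_diff": to_int(headers.get("WhiteRatingDiff")),
        "black_rating_diff": to_int(headers.get("BlackRatingDiff")),
        "result": headers.get("Result"),
        "time_control": headers.get("TimeControl"),
        "termination": headers.get("Termination"),
        "eco": headers.get("ECO"),
        "opening": headers.get("Opening"),
    }
```

Keeping the numeric parsing in one helper means that unknown ratings (written as `?` in the PGN) surface as missing values rather than raising during parsing.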
To improve data quality and focus on the target player population, we apply an Elo-based filter. A game is retained if at least one of the two players has an Elo rating between 1100 and 1900. This range is intended to capture low-to-intermediate and club-level games, where player decisions are more diverse than in elite-level games but less noisy than in beginner games. Games with missing Elo information or unparsable move sequences are removed.
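The retention rule can be written as a single predicate. The sketch below is illustrative only: the function name `keep_game` is hypothetical, and treating the bounds as inclusive and dropping a game when either rating is missing are assumptions, since the text does not state them.

```python
def keep_game(white_elo, black_elo, low=1100, high=1900):
    """Return True if at least one player has an Elo in [low, high].

    Assumptions (not confirmed by the text): bounds are inclusive, and a
    game is dropped as soon as either rating is missing.
    """
    if white_elo is None or black_elo is None:
        return False
    return low <= white_elo <= high or low <= black_elo <= high
```

Note that this "either player" rule keeps games in which the opponent is rated well outside the range, which is why ratings above 1900 still appear in the processed data.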
After preprocessing, the final dataset is stored in Parquet format (data/processed/either_1100_1900_uci_2025_02_full), with each row representing one game. This columnar storage format is more efficient for large-scale chess data than a plain text file, and it supports faster loading and downstream analysis of player behavior, opening choices, move patterns, and game outcomes. Because the processed dataset is large, the Parquet files are tracked and shared through Git Large File Storage (Git LFS) rather than standard Git version control.
2.2 EDA
This EDA constructs a compact suspected-candidate dataset based on rating dynamics. The workflow first converts the game-level data into player-game observations, then summarizes each player’s Elo volatility and game-by-game rating changes. Players are flagged if they fall into the upper tail of rating instability or single-game rating gain.
The final candidate games preserve the player identifier, game identifier, opponent information, rating movement, game outcome, opening information, and full UCI move sequence. These outputs are designed for the next stage of analysis, where move-level evidence such as engine agreement, centipawn loss, or empirical anomaly scores can be used to test whether the flagged games show suspicious move quality.
2.2.1 Setup
import polars as pl
import matplotlib.pyplot as plt

PARQUET_PATH = "data/processed/either_1100_1900_uci_2025_02_full/*.parquet"

lf = pl.scan_parquet(PARQUET_PATH)
lf.collect_schema()
This EDA constructs a compact candidate list of potentially suspicious players and games based on unusual rating dynamics. The goal is not to label players as confirmed cheaters, but to identify player-game records that are worth follow-up testing using move-level models.
The candidate tables preserve game_id, site, and moves_uci, so the suspicious candidate games can be passed directly into later engine-agreement or move-quality analysis.
Each game is converted into two player-game observations: one from White’s perspective and one from Black’s perspective. This format allows rating changes, outcomes, opponent strength, and move sequences to be analyzed at the player level.
The score variable is defined from the player’s perspective:
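A minimal sketch of the wide-to-long conversion and the player-perspective score is given below in plain Python. The function names are hypothetical, and the draw value of 0.5 is an assumption: the sample output in this report only shows 1.0 for wins and 0.0 for losses.

```python
def score_for(result, color):
    """Score from the player's perspective.

    Wins map to 1.0 and losses to 0.0; the 0.5 for draws is an assumption.
    Unfinished or unknown results return None.
    """
    if result == "1/2-1/2":
        return 0.5
    if result == "1-0":
        return 1.0 if color == "white" else 0.0
    if result == "0-1":
        return 1.0 if color == "black" else 0.0
    return None


def to_player_games(game):
    """Turn one game-level record into two player-game observations."""
    rows = []
    for color, other in (("white", "black"), ("black", "white")):
        rows.append({
            "player_id": game[color],
            "opponent_id": game[other],
            "color": color,
            "elo": game[f"{color}_elo"],
            "opponent_elo": game[f"{other}_elo"],
            "rating_diff": game[f"{color}_rating_diff"],
            "result": game["result"],
            "score": score_for(game["result"], color),
            "moves_uci": game["moves_uci"],
        })
    return rows
```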
This diagnostic checks the extent of missingness in post-game rating changes at the player-game level. Out of 31.10 million player-game observations, 147,954 have missing rating_diff values, corresponding to a missing rate of approximately 0.48%. Since the candidate-screening procedure relies directly on rating movements, subsequent anomaly calculations are restricted to observations with non-missing rating_diff values.
2.2.4 Player-Level Rating Dynamics Summary
player_summary = (
    players_long
    .group_by("player_id")
    .agg([
        pl.len().alias("n_games"),
        pl.col("rating_diff").is_not_null().sum().alias("n_rating_diff_obs"),
        # Elo level
        pl.col("elo").mean().alias("avg_elo"),
        pl.col("elo").median().alias("median_elo"),
        pl.col("elo").min().alias("min_elo"),
        pl.col("elo").max().alias("max_elo"),
        # Elo fluctuation
        (pl.col("elo").max() - pl.col("elo").min()).alias("elo_range"),
        pl.col("elo").std().alias("elo_std"),
        # Game-by-game rating changes
        pl.col("rating_diff").mean().alias("avg_rating_diff"),
        pl.col("rating_diff").abs().mean().alias("avg_abs_rating_diff"),
        pl.col("rating_diff").std().alias("std_rating_diff"),
        pl.col("rating_diff").sum().alias("total_rating_diff"),
        pl.col("rating_diff").min().alias("min_rating_diff"),
        pl.col("rating_diff").max().alias("max_rating_diff"),
        # Performance and opponent information
        pl.col("score").mean().alias("score_rate"),
        pl.col("opponent_elo").mean().alias("avg_opponent_elo"),
        pl.col("n_plies").mean().alias("avg_n_plies"),
    ])
    .collect()
)

player_summary.sort("n_games", descending=True).head(20)
shape: (20, 18)

| player_id | n_games | n_rating_diff_obs | avg_elo | median_elo | min_elo | max_elo | elo_range | elo_std | avg_rating_diff | avg_abs_rating_diff | std_rating_diff | total_rating_diff | min_rating_diff | max_rating_diff | score_rate | avg_opponent_elo | avg_n_plies |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "maia1" | 4601 | 4381 | 1481.896762 | 1484.0 | 1335 | 1665 | 330 | 71.317871 | -0.035608 | 4.24241 | 4.913044 | -156.0 | -11.0 | 12.0 | 0.569224 | 1418.364269 | 64.086938 |
| "Leeuw7" | 2191 | 2191 | 1324.109995 | 1322.0 | 1235 | 1425 | 190 | 31.932481 | -0.022821 | 5.248745 | 5.43088 | -50.0 | -9.0 | 10.0 | 0.490187 | 1328.423551 | 42.423551 |
| "maia5" | 2078 | 2018 | 1600.641001 | 1604.0 | 1475 | 1806 | 331 | 62.762831 | 0.037661 | 4.334985 | 4.954285 | 76.0 | -11.0 | 12.0 | 0.53922 | 1570.809432 | 69.771415 |
| "af2002" | 1956 | 1956 | 1746.955521 | 1743.0 | 1673 | 1844 | 171 | 37.811317 | 0.001022 | 4.878323 | 5.275238 | 2.0 | -11.0 | 11.0 | 0.523773 | 1719.163599 | 57.010736 |
| "imtwohigh" | 1748 | 1747 | 1367.304348 | 1321.5 | 1135 | 1599 | 464 | 127.634235 | 0.046938 | 3.327991 | 4.579585 | 82.0 | -15.0 | 12.0 | 0.247712 | 1655.949085 | 51.463959 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| "ronniegli" | 1456 | 1456 | 1523.640797 | 1522.0 | 1431 | 1611 | 180 | 35.320809 | 0.021291 | 5.528159 | 5.634963 | 31.0 | -8.0 | 9.0 | 0.489698 | 1530.926511 | 69.144918 |
| "bra64" | 1456 | 1456 | 1824.763049 | 1826.0 | 1688 | 1918 | 230 | 44.213581 | -0.115385 | 5.659341 | 5.823912 | -168.0 | -20.0 | 17.0 | 0.489698 | 1824.571429 | 65.619505 |
| "KelvinPatel" | 1448 | 1448 | 1253.853591 | 1251.0 | 1150 | 1389 | 239 | 48.983467 | 0.004144 | 4.167127 | 4.818042 | 6.0 | -7.0 | 11.0 | 0.302831 | 1432.86326 | 38.093923 |
| "SoshalDistanSingh" | 1410 | 1410 | 1690.539007 | 1655.0 | 1525 | 2129 | 604 | 134.747095 | -0.479433 | 5.24539 | 6.03855 | -676.0 | -12.0 | 11.0 | 0.478369 | 1666.786525 | 42.548936 |
| "esistdj" | 1409 | 1409 | 1285.696238 | 1295.0 | 1173 | 1450 | 277 | 43.167716 | -0.006388 | 5.654365 | 5.868888 | -9.0 | -24.0 | 34.0 | 0.491838 | 1287.342087 | 52.157559 |
This table summarizes each player’s observed rating dynamics in the February 2025 sample. The most important variables for candidate screening are:
elo_std: volatility of the player’s observed Elo level.
elo_range: difference between the player’s maximum and minimum observed Elo.
std_rating_diff: volatility of game-by-game rating changes.
max_rating_diff: largest single-game rating gain.
n_games: number of observed games.
n_rating_diff_obs: number of non-missing rating-diff observations.
These variables are used only as statistical anomaly signals.
The anomaly thresholds are defined as the 99th percentile of each player-level rating-dynamics indicator. For max_rating_diff, the threshold is calculated using only positive values, so this metric specifically captures unusually large single-game Elo gains rather than general rating changes.
In this sample, the 99th-percentile thresholds are approximately 264.34 for elo_std, 653 for elo_range, 222.16 for std_rating_diff, and 274 for max_rating_diff. Players whose values exceed at least one of these thresholds are treated as statistical anomaly candidates for follow-up analysis.
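The thresholding rule can be sketched with the standard library alone. The function names `p99` and `flag_candidates` are hypothetical, and the linear-interpolation percentile used here is an assumption, since the text does not say which interpolation rule the analysis uses. The positive-values restriction for max_rating_diff follows the description above.

```python
from statistics import quantiles


def p99(values):
    """99th percentile with linear interpolation (an assumed convention)."""
    return quantiles(values, n=100, method="inclusive")[98]


def flag_candidates(summary, indicators):
    """Return per-indicator thresholds and the players exceeding any of them.

    `summary` is a list of dicts with a "player_id" key plus one key per
    indicator; None marks a missing value.
    """
    thresholds = {}
    for name in indicators:
        values = [row[name] for row in summary if row[name] is not None]
        if name == "max_rating_diff":
            # Only positive gains enter the threshold for this indicator.
            values = [v for v in values if v > 0]
        thresholds[name] = p99(values)
    flagged = [
        row["player_id"]
        for row in summary
        if any(
            row[name] is not None and row[name] > thresholds[name]
            for name in indicators
        )
    ]
    return thresholds, flagged
```

Because a player is flagged when any single indicator exceeds its threshold, the candidate set is the union of the four upper tails and will generally contain more than 1% of players.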
The distribution of player-game rating changes is sharply concentrated around zero, indicating that most games produce only small Elo movements. The long but sparse tails suggest that very large rating gains or losses are relatively rare. This supports the use of tail-based anomaly thresholds when constructing the suspected candidate set.
elo_std_values = (
    player_summary
    .select(pl.col("elo_std").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=elo_std_values,
    title="Distribution of Player Elo Standard Deviation",
    xlabel="elo_std",
    bins=50,
    threshold=elo_std_threshold,
    threshold_label=f"99th percentile = {elo_std_threshold:.2f}"
)
The distribution of player-level Elo standard deviation is highly right-skewed. Most players have relatively stable Elo values within the February 2025 sample, while a small number of players exhibit much larger Elo fluctuations. The dashed vertical line marks the 99th percentile threshold at 264.34. Players to the right of this threshold are treated as statistical anomaly candidates based on unusually high Elo instability. This flag is used only for candidate screening and does not by itself indicate cheating.
elo_range_values = (
    player_summary
    .select(pl.col("elo_range").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=elo_range_values,
    title="Distribution of Player Elo Range",
    xlabel="elo_range",
    bins=50,
    threshold=elo_range_threshold,
    threshold_label=f"99th percentile = {elo_range_threshold:.2f}"
)
The distribution of player Elo range is highly right-skewed. Most players have relatively small Elo ranges, suggesting that their ratings remain fairly stable across the observed games. However, a small group of players shows very large rating ranges, with the 99th percentile around 653 Elo points. These extreme cases indicate unusually large rating fluctuations and are therefore useful for identifying players who may require further anomaly-based investigation.
std_rating_diff_values = (
    player_summary
    .select(pl.col("std_rating_diff").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=std_rating_diff_values,
    title="Distribution of Player Rating-Diff Standard Deviation",
    xlabel="std_rating_diff",
    bins=50,
    threshold=std_rating_diff_threshold,
    threshold_label=f"99th percentile = {std_rating_diff_threshold:.2f}"
)
The distribution of player-level rating-difference standard deviation is highly right-skewed. Most players have low standard deviations, indicating that their game-by-game Elo changes are relatively stable. However, a small group of players shows unusually high volatility, with the 99th percentile around 222 Elo points. These players may have inconsistent rating trajectories or extreme rating jumps across games, making this metric useful for identifying anomaly candidates for further move-level investigation.
max_rating_diff_values = (
    player_summary
    .filter(pl.col("max_rating_diff") > 0)
    .select(pl.col("max_rating_diff").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=max_rating_diff_values,
    title="Distribution of Largest Single-Game Rating Gain per Player",
    xlabel="max_rating_diff",
    bins=50,
    threshold=max_rating_diff_threshold,
    threshold_label=f"99th percentile = {max_rating_diff_threshold:.2f}"
)
The distribution of each player’s largest single-game rating gain is strongly right-skewed. Most players have only small maximum Elo gains, while a small number of players experience unusually large one-game increases. The 99th percentile is 274 Elo points, suggesting that gains above this level may represent unusual rating movements for further investigation.
This table contains suspected player candidates based on unusual rating dynamics. A player is included if they exceed the 99th-percentile threshold for at least one of the following indicators: elo_std, elo_range, std_rating_diff, or max_rating_diff.
This table links suspected players back to all of their observed games in the sample. It keeps the full UCI move sequence, so these games can be passed into a later model.
The final candidate game list keeps the most relevant games for each suspected player:
games with rating gains above the global 99th percentile threshold;
the player’s own largest rating-gain game;
the player’s own largest rating-loss game.
The largest-loss game is retained only as context for rating volatility. The records most relevant for cheating follow-up are usually the games above the global gain threshold and each player’s own largest-gain game.
This baseline table ignores player-level volatility and simply keeps the largest positive rating-gain games in the whole sample. It is useful as a simple comparison set for the later move-level model.
2.2.12 Inspect Normal-Looking Games for One Suspected Player
# Take one player from the large-rating-gain baseline table
target_player = large_rating_gain_games["player_id"][1]

target_player_games_all = (
    players_long
    .filter(pl.col("player_id") == target_player)
    .filter(pl.col("rating_diff").is_not_null())
    .select([
        "player_id", "game_id", "site", "opponent_id",
        "elo", "opponent_elo", "rating_diff", "result", "score", "color",
        "utc_date", "time_control", "termination", "n_plies",
        "eco", "opening", "moves_uci",
    ])
    .sort("utc_date")
    .collect()
)

target_player_games_all
shape: (6, 17)

| player_id | game_id | site | opponent_id | elo | opponent_elo | rating_diff | result | score | color | utc_date | time_control | termination | n_plies | eco | opening | moves_uci |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | i64 | i64 | f64 | str | f64 | str | str | str | str | i64 | str | str | str |
| "qdrainow" | "j4NBlhwt" | "https://lichess.org/j4NBlhwt" | "jiajia121127" | 1500 | 2173 | 700.0 | "1-0" | 1.0 | "white" | "2025.02.01" | "3600+30" | "Rules infraction" | 39 | "D11" | "Slav Defense: Modern Line" | "d2d4 d7d5 c2c4 c7c6 g1f3 g8f6 … |
| "qdrainow" | "9VK9TjmT" | "https://lichess.org/9VK9TjmT" | "paulogar" | 2179 | 1792 | 5.0 | "0-1" | 1.0 | "black" | "2025.02.01" | "60+0" | "Time forfeit" | 58 | "B10" | "Caro-Kann Defense" | "e2e4 c7c6 b1c3 d7d5 e4d5 c6d5 … |
| "qdrainow" | "w7vyZ2uL" | "https://lichess.org/w7vyZ2uL" | "DG73" | 2184 | 1809 | 4.0 | "0-1" | 1.0 | "black" | "2025.02.01" | "60+0" | "Time forfeit" | 74 | "D10" | "Slav Defense" | "d2d4 c7c6 c2c4 d7d5 b1c3 d5c4 … |
| "qdrainow" | "Rd53HyM3" | "https://lichess.org/Rd53HyM3" | "hechtinger" | 2173 | 1586 | 1.0 | "1-0" | 1.0 | "white" | "2025.02.02" | "180+0" | "Normal" | 29 | "E04" | "Catalan Opening: Open Defense" | "d2d4 d7d5 c2c4 e7e6 g1f3 d5c4 … |
| "qdrainow" | "jlCaJo3Z" | "https://lichess.org/jlCaJo3Z" | "stockrdb" | 2174 | 1616 | -11.0 | "0-1" | 0.0 | "white" | "2025.02.02" | "180+0" | "Abandoned" | 0 | "?" | "?" | "" |
| "qdrainow" | "1sCkdlzE" | "https://lichess.org/1sCkdlzE" | "Alfazm" | 2172 | 1779 | 1.0 | "0-1" | 1.0 | "black" | "2025.02.02" | "180+0" | "Normal" | 49 | "B20" | "Sicilian Defense: Bowdler Atta… | "e2e4 c7c5 f1c4 e7e6 g1f3 b8c6 … |
# Define "normal-looking" games for this player:
# 1. rating_diff is not extremely positive;
# 2. game is not the player's largest rating gain;
# 3. game is not the player's largest rating loss.
target_player_max_gain = (
    target_player_games_all
    .select(pl.col("rating_diff").max())
    .item()
)

target_player_max_loss = (
    target_player_games_all
    .select(pl.col("rating_diff").min())
    .item()
)

target_player_normal_looking_games = (
    target_player_games_all
    .filter(pl.col("elo") < 1900)
    .filter(pl.col("rating_diff") < max_rating_diff_threshold)
    .filter(pl.col("rating_diff") < target_player_max_gain)
    .filter(pl.col("rating_diff") > target_player_max_loss)
    .sort("rating_diff", descending=True)
)

target_player_normal_looking_games.head(20)
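shape: (0, 17)

| player_id | game_id | site | opponent_id | elo | opponent_elo | rating_diff | result | score | color | utc_date | time_control | termination | n_plies | eco | opening | moves_uci |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | i64 | i64 | f64 | str | f64 | str | str | str | str | i64 | str | str | str |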
2.3 Trajectory State Space and FEN Normalization
This section describes the state representation used in our sampling framework. The trajectory representation is implemented through TrajectoryState in sampler.py, while the FEN normalization step is implemented through normalize_fen_for_policy() in utils.py.
A chess trajectory is represented as a fixed-length sequence of UCI moves. For a chosen analysis depth \(k\), we define \[
X = (m_1, m_2, \ldots, m_k)\,,
\] where \(m_i\) denotes the move played at ply \(i\). In the implementation, TrajectoryState stores the starting FEN and the list of UCI moves. Replaying these moves reconstructs the sequence of board states \[
(s_0, s_1, \ldots, s_k)\,,
\] where \(s_0\) is the initial position and \(s_i\) is the position after the first \(i\) moves. Therefore, move \(m_i\) is evaluated from state \(s_{i-1}\). This distinction is important because both the move proposal distribution and the target scoring function depend on the board state before the move is played.
For policy lookup and cached engine evaluation, we use a normalized FEN key, implemented by normalize_fen_for_policy() in utils.py. A full FEN string contains six components: piece placement, side to move, castling rights, en passant target square, halfmove clock, and fullmove number. We keep only the first four.
The halfmove clock and fullmove number are dropped because they usually do not affect the strategic identity of the decision state. Without this normalization, the same position could be stored under several different FEN strings simply because these counters differ. With it, positions that share the same board configuration, side to move, castling rights, and en passant target map to the same key even when they occur at different move numbers, so the proposal policy and engine cache group strategically equivalent positions together more consistently.
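The normalization amounts to keeping the first four whitespace-separated FEN fields. The sketch below illustrates the idea under that reading; it is not the project’s normalize_fen_for_policy(), and the name `normalize_fen_sketch` is hypothetical.

```python
def normalize_fen_sketch(fen):
    """Keep piece placement, side to move, castling rights and en passant.

    The halfmove clock and fullmove number (fields 5 and 6) are dropped.
    """
    fields = fen.split()
    if len(fields) < 4:
        raise ValueError(f"FEN has fewer than four fields: {fen!r}")
    return " ".join(fields[:4])
```

For example, the starting position and the same position reached after 1. Nf3 Nf6 2. Ng1 Ng8 differ only in their counters (`0 1` versus `4 3`), so both map to the same key.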
Overall, the trajectory state space is defined in sampler.py, while the FEN normalization logic is defined in utils.py. Together, they provide the basic state representation used by the Metropolis-Hastings sampler.
2.4 Skill-conditioned Human Move Proposal
The proposal distribution is designed to generate human-plausible moves rather than engine-optimal moves. The goal of the sampler is not to produce the strongest possible chess continuations, but to construct counterfactual trajectories that a human player of comparable skill could plausibly have played.
We use Maia-2 as the main proposal backend [TJM+24]. Given a board position \(s\), the active player’s Elo \(E_{\text{self}}\), and the opponent’s Elo \(E_{\text{oppo}}\), Maia-2 returns probabilities over candidate moves. We restrict these probabilities to legal moves, assign a small positive floor probability to any missing legal move, and renormalize over the legal move set. Thus, for each legal move \(m \in \mathcal{L}(s)\), the proposal model defines \[P_0(m \mid s, E_{\text{self}}, E_{\text{oppo}})\,.\]
This normalized probability is used whenever the Metropolis-Hastings sampler needs to draw a move from a position.
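The restrict, floor, and renormalize steps can be sketched as follows. The function name, the floor value of `1e-6`, and the choice to treat non-positive probabilities like missing ones are all assumptions made for illustration; the raw probabilities stand in for whatever the Maia-2 backend returns.

```python
def legal_move_distribution(raw_probs, legal_moves, floor=1e-6):
    """Restrict raw move probabilities to legal moves and renormalize.

    raw_probs: dict mapping UCI move -> model probability (may omit legal
    moves or contain illegal ones). Legal moves that are missing, or whose
    probability is not positive, receive the floor (assumed value).
    """
    if not legal_moves:
        raise ValueError("no legal moves in this position")
    weights = {}
    for move in legal_moves:
        p = raw_probs.get(move, 0.0)
        weights[move] = p if p > 0.0 else floor
    total = sum(weights.values())
    return {move: w / total for move, w in weights.items()}
```

The floor keeps every legal move reachable, so log probabilities stay finite for any legal trajectory.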
The skill conditioning is important for the interpretation of the counterfactual baseline. A uniform proposal over legal moves would generate many unrealistically weak continuations, while a Stockfish-based proposal would generate unrealistically strong continuations. Maia-2 instead provides a human move distribution conditioned on player strength, allowing the sampler to explore alternatives that are plausible for the target rating level.
In the implementation, this proposal interface is modular: legal_moves_with_probs(fen) returns legal UCI moves and their probabilities, and the same sampler can use any backend with this interface. Our final experiments use Maia2ProposalBackend; an empirical move-frequency backend is also included for preliminary testing and debugging.
The key modeling separation is that Maia-2 controls human plausibility, while the later Stockfish/CPL target controls engine-measured move quality. Maia-2 determines which alternatives are likely to be proposed, and the Metropolis-Hastings acceptance step reweights the resulting trajectories according to the target distribution.
2.5 Preliminary Validation with an Empirical Policy Target
Before using the engine-based target scorer, we first validated the sampler with a simpler empirical policy target. This validation step is implemented in sampler.py through the EmpiricalTargetScorer class. The purpose of this stage is not to perform the final anomaly test, but to check whether the trajectory replay, proposal generation, and Metropolis-Hastings acceptance mechanism behave as expected.
For a trajectory \(X=(m_1,\ldots,m_k)\), the empirical policy target assigns a log score based on the proposal probability of each move along the trajectory: \[
\log \pi_{\text{emp}}(X)
=
\sum_{i=1}^{k}
\log P_0(m_i \mid s_{i-1})\,.
\]
Here, \(s_{i-1}\) is the board position before move \(m_i\) is played, and \(P_0(m_i \mid s_{i-1})\) is the proposal probability assigned to that move by the proposal backend. In the implementation, the scorer replays the trajectory to recover the sequence of FEN positions, calls legal_moves_with_probs(fen) at each position, looks up the probability of the played move, and sums the log probabilities across the trajectory.
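The sum of log proposal probabilities can be sketched without a chess library by passing in the pre-move states directly. This is an illustration rather than the EmpiricalTargetScorer itself: the function name is hypothetical, and the assumption that `legal_moves_with_probs` returns a move-to-probability mapping is ours, since the return format is not specified in the text.

```python
import math


def empirical_log_score(states, moves, legal_moves_with_probs):
    """Return sum_i log P0(m_i | s_{i-1}).

    states[i] is the position key before moves[i] is played. The backend
    callable is assumed to return a dict of move -> probability.
    """
    if len(states) != len(moves):
        raise ValueError("need exactly one pre-move state per move")
    total = 0.0
    for state, move in zip(states, moves):
        probs = dict(legal_moves_with_probs(state))
        if move not in probs:
            return -math.inf  # move is not legal in this state
        total += math.log(probs[move])
    return total
```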
This empirical target was used as a preliminary debugging device. Since it is defined directly from the proposal policy, it helps verify that the proposal backend returns valid legal moves, that the replayed board states match the move sequence, and that the sampler can compute proposal-based trajectory probabilities consistently. It also provides a simpler setting for checking the Metropolis-Hastings transition logic before introducing Stockfish-based centipawn loss calculations.
Importantly, this empirical policy target is not the final target distribution used for anomaly detection. It only measures how likely a trajectory is under the proposal model itself. A trajectory with high empirical policy probability is human-plausible under the proposal backend, but it is not necessarily strong or engine-aligned. Therefore, after validating the sampler mechanics, we replace this preliminary target with the engine-based centipawn loss target described in the next section.
Overall, this validation stage separates implementation checking from the final anomaly analysis. The empirical policy target confirms that the sampler can correctly replay trajectories, retrieve move probabilities, and run Metropolis-Hastings updates. The final anomaly test is then based on Stockfish-derived centipawn loss rather than on proposal likelihood alone.
2.6 Engine-based Centipawn Loss Target
After validating the sampler with the empirical policy target, we replace the preliminary target with an engine-based target scorer. This step is implemented mainly through the EngineTargetScorer class in sampler.py. The scorer is instantiated in run_sampler_new.py for the single-chain anomaly experiment and in run_four_chain_diagnostics.py and run_four_chain_diagnostics_mixed_kernel.py for the four-chain diagnostic experiments.
The purpose of this target is to evaluate the chess quality of each trajectory. The Maia-2 proposal model determines which moves are human-plausible, but it does not directly measure whether those moves are close to engine-optimal play. Therefore, we use Stockfish to compute centipawn loss (CPL) for each move in a trajectory. This separates the two roles in the framework: Maia-2 controls human-like proposal generation, while Stockfish provides the engine-based quality score.
Because the engine-based target requires access to a Stockfish executable, we first download and extract Stockfish 18. The following commands show the setup procedure, but they are not evaluated when rendering the QMD file.
#| label: stockfish-download
#| eval: false
# Download Stockfish 18
wget https://github.com/official-stockfish/Stockfish/releases/download/sf_18/stockfish-ubuntu-x86-64-avx2.tar
# Extract the tar archive
tar -xvf stockfish-ubuntu-x86-64-avx2.tar
# Find the absolute path to the executable file
readlink -f stockfish-ubuntu-x86-64-avx2/stockfish-ubuntu-x86-64-avx2
After obtaining the absolute path to the executable, we set the STOCKFISH_PATH environment variable before running the sampler.
This environment variable tells the Python scripts where to find the Stockfish binary. In run_sampler_new.py, run_four_chain_diagnostics.py, and run_four_chain_diagnostics_mixed_kernel.py, the helper function resolve_stockfish_path() first checks whether STOCKFISH_PATH has been set. If it is not set, the function falls back to searching for stockfish on the system path using which stockfish. If neither option succeeds, the script raises an error. This design avoids hard-coding the engine path directly inside the Python source files. On the shared machine used for this project, collaborators with read access to the directory above can reuse the same Stockfish executable. On a different machine, the same code can still be used by setting STOCKFISH_PATH to the local Stockfish path.
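The lookup order described above can be sketched as follows. This is not the project’s resolve_stockfish_path(): the function name and the optional `env` parameter are hypothetical, and `shutil.which` is used here as a stand-in for the `which stockfish` fallback.

```python
import os
import shutil


def resolve_stockfish_path_sketch(env=None):
    """Resolve the Stockfish binary: environment variable first, then PATH."""
    env = os.environ if env is None else env
    path = env.get("STOCKFISH_PATH")
    if path:
        return path
    found = shutil.which("stockfish")  # stand-in for `which stockfish`
    if found:
        return found
    raise FileNotFoundError(
        "Stockfish not found: set STOCKFISH_PATH or add stockfish to PATH"
    )
```

Passing the environment as an argument makes the lookup order easy to exercise in tests without modifying the real process environment.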
For a trajectory \[
X = (m_1, m_2, \ldots, m_k)\,,
\]
the engine scorer evaluates each move \(m_i\) from the board state \(s_{i-1}\), where \(s_{i-1}\) is the position before the move is played. For each position, Stockfish evaluates candidate moves and returns centipawn scores. The implementation converts the engine score into the side-to-move perspective, so that a larger score always means a better position for the player who is about to move.
The centipawn loss of move \(m_i\) is defined as \[
\mathrm{CPL}(m_i)
=
\max(0, \mathrm{score}_\text{best}(s_{i-1}) - \mathrm{score}_\text{played}(s_{i-1}, m_i))\,.
\]
Here, \(\mathrm{score}_\text{best}(s_{i-1})\) is the Stockfish score of the best candidate move from position \(s_{i-1}\), and \(\mathrm{score}_\text{played}(s_{i-1}, m_i)\) is the Stockfish score of the move actually taken in the trajectory. A lower \(\mathrm{CPL}\) means that the move is closer to the engine-preferred move. A \(\mathrm{CPL}\) of zero means that the move is at least as good as the best evaluated candidate under the engine search settings.
The total centipawn loss of a trajectory is then computed as the sum of move-level losses: \[
\mathrm{CPL}(X)
=
\sum_{i=1}^{k} \mathrm{CPL}(m_i)\,.
\]
The engine-based target log-density is defined as \[
\log \pi_{\text{engine}}(X)
=
-\beta \mathrm{CPL}(X)\,.
\]
The parameter \(\beta\) controls how strongly the target distribution favors low-CPL trajectories. Since the target uses a negative sign, trajectories with smaller total CPL receive higher target density. In other words, the sampler is encouraged to visit trajectories that are stronger according to Stockfish, while the proposal distribution still ensures that candidate moves are generated from a human-like Maia-2 policy.
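The three formulas above translate directly into code. In this sketch the engine scores are supplied as plain numbers in centipawns from the side-to-move perspective; the function names and the value of \(\beta\) used in the example are hypothetical.

```python
def move_cpl(best_score, played_score):
    """CPL(m_i) = max(0, score_best - score_played)."""
    return max(0, best_score - played_score)


def trajectory_cpl(score_pairs):
    """CPL(X): sum of move-level losses over (best, played) score pairs."""
    return sum(move_cpl(best, played) for best, played in score_pairs)


def log_target(total_cpl, beta):
    """log pi_engine(X) = -beta * CPL(X)."""
    return -beta * total_cpl
```

For instance, the score pairs (50, 50), (30, -20), and (10, 40) give losses of 0, 50, and 0, so the total CPL is 50; with an illustrative \(\beta = 0.01\) the log target is \(-0.5\). The third pair shows the clipping at zero: a played move that scores above the best evaluated candidate contributes no negative loss.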
The implementation also includes several details to make the scoring process more reliable and efficient. For each position, Stockfish first evaluates a set of top candidate moves using MultiPV. These candidate evaluations are cached using the normalized FEN key, so repeated visits to the same position do not always require a new engine call. If the move in the trajectory is not included among the MultiPV candidates, the scorer evaluates that move separately by forcing it as the root move. This ensures that CPL is computed for the actual trajectory move, not only for the moves that Stockfish initially selected as top candidates.
In fair-play analysis, we also score only the suspect player’s moves. For example, if the suspect is playing Black, CPL is accumulated only on Black’s plies. This makes the anomaly score reflect the investigated player’s decisions rather than both players’ combined move quality.
To make scoring reproducible, the implementation uses a deterministic cache. For each normalized FEN key, the scorer stores the best score from the initial MultiPV search separately from move-specific scores indexed by (position_key, move_uci). If a played or proposed move is not included in the MultiPV candidates, Stockfish evaluates it as a forced root move, but this only updates the move-level cache; it does not change the cached best score for that position. Without this precaution, a later forced move evaluation could be appended to the same position cache and accidentally change the “best” score used in future CPL calculations. With the deterministic scorer, the best evaluated candidate for a position is fixed once the position is first scored, so the total CPL of a fixed trajectory does not change after the sampler evaluates additional moves.
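The two-level cache can be sketched with stub evaluators in place of Stockfish. The class and function names are hypothetical, and the two callables stand in for the MultiPV search and the forced-root-move evaluation; only the caching logic is meant to mirror the description.

```python
class DeterministicCplCache:
    """Keep the per-position best score fixed once it has been computed."""

    def __init__(self, multipv_eval, forced_eval):
        self._multipv_eval = multipv_eval  # position_key -> {move: score}
        self._forced_eval = forced_eval    # (position_key, move) -> score
        self._best = {}                    # position_key -> best score
        self._move_scores = {}             # (position_key, move) -> score

    def _ensure_position(self, key):
        if key not in self._best:
            candidates = self._multipv_eval(key)
            self._best[key] = max(candidates.values())
            for move, score in candidates.items():
                self._move_scores[(key, move)] = score

    def cpl(self, key, move):
        self._ensure_position(key)
        if (key, move) not in self._move_scores:
            # A forced evaluation updates only the move-level cache.
            self._move_scores[(key, move)] = self._forced_eval(key, move)
        return max(0, self._best[key] - self._move_scores[(key, move)])


def suspect_total_cpl(cache, plies, suspect_color):
    """Sum CPL over the suspect's plies only.

    plies: iterable of (position_key, move, mover_color) triples.
    """
    return sum(
        cache.cpl(key, move)
        for key, move, color in plies
        if color == suspect_color
    )
```

Even if a forced evaluation returns a score above the cached best, the best score is left untouched, so the CPL of previously scored moves in that position cannot change.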
Thus, the implemented CPL calculation is \[
\mathrm{CPL}(m_i)
=
\max\{0,\; \mathrm{score}_{\mathrm{best}}(s_{i-1}) - \mathrm{score}_{\mathrm{played}}(s_{i-1},m_i)\}\,,
\] where both scores are measured from the side-to-move perspective, and where \(\mathrm{score}_{\mathrm{best}}(s_{i-1})\) is not changed by later forced evaluations of additional moves. In the diagnostic scripts, the observed trajectory’s total CPL is also computed once before sampling and reused when computing empirical p-values. This ensures that the observed baseline remains fixed throughout the experiment.
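As a small illustration of the clamped CPL formula combined with suspect-side accumulation, the sketch below takes per-ply best and played scores as plain lists. The function names and the `white_moves_first` flag are assumptions made for this example; in the real pipeline the scores would come from the Stockfish scorer.

```python
from typing import Sequence


def move_cpl(best_score: int, played_score: int) -> int:
    """Centipawn loss of one move, clamped at zero.

    Both scores are assumed to be from the side-to-move perspective.
    """
    return max(0, best_score - played_score)


def suspect_total_cpl(
    best_scores: Sequence[int],
    played_scores: Sequence[int],
    suspect_is_white: bool,
    white_moves_first: bool = True,
) -> int:
    """Sum CPL over the suspect's plies only."""
    total = 0
    for ply, (best, played) in enumerate(zip(best_scores, played_scores)):
        # Even plies belong to whichever side moves first in the window.
        white_to_move = (ply % 2 == 0) == white_moves_first
        if white_to_move == suspect_is_white:
            total += move_cpl(best, played)
    return total
```

For example, with best scores `[30, 10, 50, 0]` and played scores `[20, 10, 60, -40]`, the White total is 10 (the third ply is clamped to zero) and the Black total is 40.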
This engine-based target is the final target distribution used in the anomaly detection framework. The empirical policy target described earlier is only used to validate the sampler mechanics. In the final pipeline, Maia-2 proposes human-plausible counterfactual trajectories, Stockfish assigns each trajectory a total CPL, and the Metropolis-Hastings sampler combines these two components through the proposal probability and the engine-based target density.
2.7 Vanilla Prefix-preserving Kernel
After defining the human-like proposal distribution and the engine-based target density, we combine them using a Metropolis-Hastings sampler. This step is implemented through the MHSampler class in sampler.py. The sampler operates on fixed-length chess trajectories rather than isolated board positions. Given a current trajectory \[
X = (m_1, m_2, \ldots, m_k)\,,
\]
the sampler proposes a new trajectory \(Y\) of the same length and then accepts or rejects it according to the Metropolis-Hastings acceptance rule.
The proposal mechanism is prefix-preserving. At each iteration, the sampler first chooses a perturbation depth \(d\) uniformly from the trajectory positions. The moves before depth \(d\) are kept unchanged: \[
Y_{1:d-1} = X_{1:d-1}\,.
\]
At the perturbation depth, the sampler forces a move different from the original move \(m_d\). The alternative move is sampled from the proposal distribution \(P_0\), restricted to legal moves that are not equal to the original move. This restriction ensures that the proposed trajectory actually differs from the current trajectory at the selected depth.
After the new move at depth \(d\) is selected, the sampler replays the new prefix on a chess board and then resamples the remaining suffix autoregressively. For each later position, the sampler calls legal_moves_with_probs(fen) from the proposal backend and draws the next move using the returned proposal probabilities. Thus, the proposed trajectory is generated as \[
Y = (m_1, \ldots, m_{d-1}, m'_d, m'_{d+1}, \ldots, m'_k),
\]
where the prefix before \(d\) is preserved, the move at \(d\) is perturbed, and the suffix after \(d\) is newly sampled from the human-like proposal model.
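The proposal step can be sketched on a toy move space. Everything here is illustrative: `toy_backend` is a hypothetical stand-in for `legal_moves_with_probs`, keyed by the move history instead of a FEN, and the "legality" rule (a move may not repeat the previous one) only exists so that the suffix depends on the prefix. Depths are 0-based in the code.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

MoveProbs = Dict[str, float]
Backend = Callable[[Tuple[str, ...]], MoveProbs]


def toy_backend(history: Tuple[str, ...]) -> MoveProbs:
    """Hypothetical proposal model: a move may not repeat the previous move."""
    weights = {"a": 3.0, "b": 2.0, "c": 1.0}
    legal = [m for m in weights if not history or m != history[-1]]
    z = sum(weights[m] for m in legal)
    return {m: weights[m] / z for m in legal}


def propose_prefix_preserving(
    current: Sequence[str], backend: Backend, rng: random.Random
) -> Optional[List[str]]:
    """Keep the prefix, force a different move at depth d, resample the suffix."""
    k = len(current)
    d = rng.randrange(k)                      # uniform perturbation depth (0-based)
    prefix = tuple(current[:d])
    probs = backend(prefix)
    alternatives = {m: p for m, p in probs.items() if m != current[d]}
    if not alternatives:
        return None                           # no alternative move exists at this depth
    moves = list(alternatives)
    new_move = rng.choices(moves, weights=[alternatives[m] for m in moves])[0]
    proposal = list(prefix) + [new_move]
    while len(proposal) < k:                  # autoregressive suffix resampling
        probs = backend(tuple(proposal))
        if not probs:
            return None                       # line ended before reaching length k
        moves = list(probs)
        proposal.append(rng.choices(moves, weights=[probs[m] for m in moves])[0])
    return proposal
```

Returning `None` for dead ends mirrors the idea that a proposal which cannot reach the fixed length is not a valid sample.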
This proposal design is useful for chess trajectory sampling because a single move change can make the rest of the original game illegal or strategically inconsistent. Instead of changing one move and keeping the original suffix, the sampler rolls forward from the new position and regenerates all later moves. This guarantees that the proposed trajectory remains a valid legal chess sequence under the new prefix.
The sampler also computes the exact proposal probability for each transition. The forward proposal probability \(q(Y \mid X)\) includes three components: the probability of choosing the perturbation depth, the probability of choosing the alternative move at that depth, and the product of proposal probabilities for the resampled suffix. In log form, this can be written as \[
\log q(Y \mid X)
=
-\log k
+
\log P_0(m'_d \mid s_{d-1}, m'_d \ne m_d)
+
\sum_{i=d+1}^{k}
\log P_0(m'_i \mid s'_{i-1})\,.
\]
Here, \(s_{d-1}\) is the board position before the perturbed move, and \(s'_{i-1}\) denotes the board position along the newly proposed trajectory. The term \(-\log k\) comes from selecting the perturbation depth uniformly among \(k\) positions.
Because this proposal distribution is not symmetric, the sampler must also compute the reverse proposal probability \(q(X \mid Y)\). This is implemented in the log_q() method in sampler.py, which reconstructs the first differing depth between two trajectories and evaluates the exact probability of proposing one trajectory from the other. If the transition is impossible, for example because the trajectories have different lengths or an illegal move appears, the reverse probability is returned as negative infinity.
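A minimal sketch of this log-probability computation is shown below. It is not the project's `log_q()` method: the function name, the toy backend, and the assumption that the restricted move distribution is renormalized as \(P_0(m)/(1-P_0(m_d))\) are ours. Depths are 0-based in the code.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Backend = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_backend(history: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical proposal model with the same move probabilities everywhere."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}


def log_q_prefix(target: Sequence[str], source: Sequence[str], backend: Backend) -> float:
    """log q(target | source) under the prefix-preserving proposal."""
    k = len(source)
    if k == 0 or len(target) != k:
        return -math.inf
    # The perturbation depth is the first index where the trajectories differ.
    d = next((i for i in range(k) if source[i] != target[i]), None)
    if d is None:
        return -math.inf                      # the proposal always changes a move
    probs = backend(tuple(target[:d]))
    p_new = probs.get(target[d], 0.0)
    remaining = 1.0 - probs.get(source[d], 0.0)   # mass left after excluding the old move
    if p_new <= 0.0 or remaining <= 0.0:
        return -math.inf
    total = -math.log(k) + math.log(p_new / remaining)
    for i in range(d + 1, k):                 # resampled suffix
        p = backend(tuple(target[:i])).get(target[i], 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total
```

A useful sanity check for any implementation of this kind is that, for a fixed current trajectory, the proposal probabilities of all reachable trajectories sum to one.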
The Metropolis-Hastings acceptance probability is then computed using both the target density and the proposal correction: \[
\log \alpha
=
\min
\left[
0,\,
\log \pi(Y) - \log \pi(X)
+
\log q(X \mid Y) - \log q(Y \mid X)
\right]\,.
\]
The proposed trajectory is accepted with probability \(\alpha\). If it is accepted, the Markov chain moves to \(Y\); otherwise, it remains at \(X\). In the final engine-based version, \(\log \pi(\cdot)\) is given by the Stockfish CPL target. Therefore, the sampler tends to accept trajectories with lower total CPL, while still correcting for the fact that candidate moves are proposed from the Maia-2 human move distribution.
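The acceptance rule above can be written directly in log space. The sketch below is a generic Metropolis-Hastings accept step with a hypothetical function name and argument order; it is not taken from `sampler.py`.

```python
import math
import random
from typing import Tuple


def mh_accept(
    log_pi_x: float,
    log_pi_y: float,
    log_q_forward: float,   # log q(Y | X)
    log_q_reverse: float,   # log q(X | Y)
    rng: random.Random,
) -> Tuple[bool, float]:
    """Return (accepted, log_alpha) for a proposed move from X to Y."""
    # Reject outright when the proposal has zero target mass or cannot be reversed.
    if -math.inf in (log_pi_y, log_q_reverse, log_q_forward):
        return False, -math.inf
    log_alpha = min(0.0, log_pi_y - log_pi_x + log_q_reverse - log_q_forward)
    # exp() is safe here because log_alpha is never positive.
    accepted = rng.random() < math.exp(log_alpha)
    return accepted, log_alpha
```

For instance, with \(\log\pi(X)=-1\), \(\log\pi(Y)=-3\), \(\log q(Y\mid X)=-2\), and \(\log q(X\mid Y)=-1.5\), the log acceptance probability is \(-2+0.5=-1.5\).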
The implementation also handles invalid proposal paths. If resampling the suffix reaches a game-over position before the fixed trajectory length is completed, the proposal is rejected rather than treated as a valid sample. This preserves the fixed-depth state space used throughout the anomaly detection pipeline.
Overall, the prefix-preserving Metropolis-Hastings sampler provides the mechanism for generating counterfactual human-like trajectories. Maia-2 defines the proposal probabilities, Stockfish defines the target density through total CPL, and the Metropolis-Hastings acceptance rule combines both components to produce samples from the desired trajectory distribution.
2.8 Mixture Refresh Kernel
In addition to the vanilla prefix-preserving kernel, we also experimented with a mixture refresh kernel to improve exploration. The motivation is that the prefix-preserving proposal is local: it changes one move and then resamples the suffix, but it keeps the prefix before the perturbation depth fixed. In a large discrete trajectory space, this can cause chains to remain trapped in local trajectory neighborhoods.
The mixture kernel combines two proposal mechanisms. With probability \(1-\rho\), it uses the prefix-preserving proposal described above. With probability \(\rho\), it performs a full refresh by resampling an entire length-\(k\) trajectory from the Maia-2 proposal distribution starting at the same initial position \(S_0\). The full-refresh proposal density is \[
q_{\mathrm{refresh}}(Y)
=
\prod_{i=1}^{k}
P_0(y_i \mid y_{1:i-1}, S_0)\,,
\] where the board state before each move is obtained by replaying the previously sampled moves.
The resulting mixture proposal is \[
q_{\mathrm{mix}}(Y \mid X)
=
(1-\rho)q_{\mathrm{prefix}}(Y \mid X)
+
\rho q_{\mathrm{refresh}}(Y)\,.
\]
In the 80-20 version of this kernel, we set \(\rho=0.20\), so that 80% of proposals are local prefix-preserving proposals and 20% are full trajectory refreshes. The vanilla kernel is recovered as the special case \(\rho=0\).
Because the proposal is a mixture, the Metropolis-Hastings correction must use the full mixture density rather than only the proposal branch that happened to generate the candidate trajectory. Therefore, the acceptance probability is \[
\alpha(X \to Y)
=
\min\left\{
1,\,
\frac{\pi(Y)}{\pi(X)}
\frac{q_{\mathrm{mix}}(X\mid Y)}
{q_{\mathrm{mix}}(Y\mid X)}
\right\}\,.
\]
In the implementation, the forward and reverse mixture probabilities are computed using a log-sum-exp calculation for numerical stability. This preserves the correct Metropolis-Hastings correction while allowing occasional global moves in trajectory space.
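A log-sum-exp evaluation of the mixture density might look like the following sketch. The function name is hypothetical, and the two branch log-densities are assumed to be computed elsewhere.

```python
import math


def log_mixture(log_q_prefix: float, log_q_refresh: float, rho: float) -> float:
    """log[(1 - rho) * q_prefix + rho * q_refresh], computed stably in log space."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    terms = []
    if rho < 1.0 and log_q_prefix > -math.inf:
        terms.append(math.log1p(-rho) + log_q_prefix)
    if rho > 0.0 and log_q_refresh > -math.inf:
        terms.append(math.log(rho) + log_q_refresh)
    if not terms:
        return -math.inf                      # neither branch can produce this trajectory
    m = max(terms)
    # Subtracting the maximum keeps every exponent at or below zero.
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

The same function would be called twice per step, once for \(q_{\mathrm{mix}}(Y\mid X)\) and once for \(q_{\mathrm{mix}}(X\mid Y)\), regardless of which branch generated the candidate. Setting `rho=0.0` returns the prefix-preserving log-density unchanged.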
2.9 Counterfactual Baseline and Empirical Anomaly Score
After running the sampler, the retained trajectories form a counterfactual human baseline. Each sampled trajectory is a legal length-\(k\) sequence from the same starting position as the observed game. These are not observed games from the database; they are alternative continuations generated by perturbing the observed trajectory and rolling forward under the Maia-2 proposal model.
For each sampled trajectory \(X^{(j)}\), we compute total suspect-side CPL: \[
\mathrm{CPL}(X^{(j)})=\sum_{i=1}^{k}\mathrm{CPL}(m_i^{(j)})\,,
\] where the sum includes only the investigated player’s moves. The collection \[
\{\mathrm{CPL}(X^{(1)}),\ldots,\mathrm{CPL}(X^{(N)})\}
\] is the empirical null distribution for that analysis window.
The observed trajectory \(X^\text{obs}\) is then compared with this null distribution. Since lower CPL means stronger, more engine-aligned play, the empirical anomaly p-value is \[
p=
\frac{
1+\sum_{j=1}^{N}\mathbb{1}\{\mathrm{CPL}(X^{(j)})\leq \mathrm{CPL}(X^{\mathrm{obs}})\}
}{
1+N
}\,.
\] The add-one smoothing term prevents the \(p\)-value from being exactly zero when no sampled trajectory has CPL as low as the observed line.
A small p-value means that very few sampled human-plausible trajectories achieved CPL as low as the observed trajectory. We therefore interpret small p-values as evidence that the observed line is unusually engine-aligned relative to the sampled counterfactual baseline. However, this interpretation is only reliable if the MCMC chains mix well, which motivates the multi-chain diagnostics described below.
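The smoothed empirical \(p\)-value is a one-line computation. The sketch below uses a hypothetical function name and takes the sampled CPL totals as a plain list.

```python
from typing import Sequence


def empirical_p_value(null_cpl: Sequence[float], observed_cpl: float) -> float:
    """Add-one smoothed fraction of sampled CPL totals at or below the observed CPL."""
    hits = sum(1 for cpl in null_cpl if cpl <= observed_cpl)
    return (1 + hits) / (1 + len(null_cpl))
```

With four sampled totals `[50, 60, 70, 80]` and an observed CPL of 6, no sample is as low as the observed line, so the result is \(1/5 = 0.2\) rather than zero.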
2.10 Multi-chain Convergence Diagnostics
A single MCMC chain can produce a plausible-looking p-value even if it has not explored the relevant trajectory space. This is especially problematic here because the state is a full chess move sequence, not a one-dimensional score. We therefore run four independent chains through run_four_chain_diagnostics.py and run_four_chain_diagnostics_mixed_kernel.py for the same analysis window and compare both their scalar summaries and their sampled trajectory states.
For each chain, we record the acceptance rate, number of unique trajectories, mean and standard deviation of sampled CPL, and chain-specific empirical p-value. If these quantities vary substantially across chains, the pooled p-value may depend too much on initialization or random seed.
As a scalar diagnostic, we compute split-\(\hat{R}\) for total CPL and log target density [GR92]. Values close to 1 suggest that between-chain and within-chain variation are similar for those scalar summaries. However, scalar convergence is not enough: two chains may have similar CPL distributions while visiting different move-sequence regions.
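For reference, a textbook split-\(\hat{R}\) computation on a scalar summary is sketched below. It follows the standard formulation (split each chain in half, then compare between-half and within-half variance) and is not necessarily identical to the project's diagnostic scripts; the function name is ours.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def split_rhat(chains: Sequence[Sequence[float]]) -> float:
    """Split-R-hat for equal-length chains of one scalar quantity."""
    lengths = {len(c) for c in chains}
    if len(lengths) != 1:
        raise ValueError("all chains must have the same length")
    n_half = lengths.pop() // 2
    if n_half < 2:
        raise ValueError("each chain needs at least four draws")
    halves: List[Sequence[float]] = []
    for chain in chains:
        halves.append(chain[:n_half])
        halves.append(chain[-n_half:])        # the middle draw is dropped for odd lengths
    within = mean(variance(h) for h in halves)                 # W
    between_over_n = variance([mean(h) for h in halves])       # B / n
    if within == 0:
        return math.nan if between_over_n == 0 else math.inf
    var_plus = (n_half - 1) / n_half * within + between_over_n
    return math.sqrt(var_plus / within)
```

Applied to total CPL, the input would be one list of sampled CPL totals per chain. Chains that sit in regions with different typical CPL inflate the between-half term and push the statistic above 1.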
To assess convergence in the trajectory space itself, we use PACE [VS17]. PACE partitions the sampled state space and compares the probability mass that each chain assigns to each partition cell. We use two trajectory-level partitions.
The exact-state partition treats two trajectories as the same only if their full UCI move sequences are identical: \[
d_{\mathrm{exact}}(X,Y)=
\begin{cases}
0,&X=Y,\\
1,&X\neq Y\,.
\end{cases}
\]
This is a strict diagnostic: it asks whether chains visit the same exact trajectories with similar frequencies.
Because exact matching can be too strict in a sparse discrete space, we also use a trajectory-medoid partition. Frequent sampled trajectories are selected as medoids, and each sample is assigned to its nearest medoid using normalized Hamming distance: \[
d_{\mathrm{Ham}}(X,Y)
=
\frac{1}{k}\sum_{i=1}^{k}\mathbb{1}\{m_i\neq y_i\}\,,
\] where \(m_i\) and \(y_i\) denote the \(i\)-th moves of \(X\) and \(Y\), respectively.
This coarser diagnostic asks whether chains explore similar neighborhoods of trajectory space, even if they do not repeatedly sample the same exact move sequences.
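The medoid partition can be illustrated with a short sketch. The helper names are hypothetical, ties are broken by first occurrence (an assumption), and the code stops at per-cell probability mass; it does not compute the PACE statistic itself.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Trajectory = Tuple[str, ...]


def hamming(x: Sequence[str], y: Sequence[str]) -> float:
    """Normalized Hamming distance between two equal-length move sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def pick_medoids(samples: Sequence[Trajectory], n_medoids: int) -> List[Trajectory]:
    """Use the most frequently sampled trajectories as medoids."""
    counts = Counter(tuple(s) for s in samples)
    return [traj for traj, _ in counts.most_common(n_medoids)]


def assign_to_medoids(samples: Sequence[Trajectory], medoids: Sequence[Trajectory]) -> List[int]:
    """Label each sample with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda j: hamming(s, medoids[j]))
        for s in samples
    ]


def cell_masses(labels: Sequence[int], n_cells: int) -> List[float]:
    """Fraction of one chain's samples that fall in each partition cell."""
    counts = Counter(labels)
    return [counts.get(j, 0) / len(labels) for j in range(n_cells)]
```

In a multi-chain setting, the medoids would be chosen once from the pooled samples, and `cell_masses` would then be computed separately for each chain so that the per-cell masses can be compared across chains.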
Both PACE versions are applied directly to sampled move-sequence states, not to scalar observables such as total CPL or log target density.
3. API Design
To make the anomaly detection framework easier to run and reproduce, we wrap the sampling and diagnostic components in a FastAPI application. The API layer is implemented in main.py, which serves as the entry point for external requests. It defines a structured request schema containing the player identifier, player Elo, UCI move sequence, analysis depth, suspected player color, and an optional demo mode. If a demo mode is selected, the API loads one of the predefined example games from the demo data folder. Otherwise, it uses the move sequence and metadata supplied directly by the user. The API exposes two main endpoints: /api/detect_anomaly for the main anomaly detection task and /api/diagnostics for convergence and stability diagnostics.
The main anomaly detection logic is implemented in api_run_sampler.py. Given a candidate player and a supplied move sequence, the function initializes a Maia-2 proposal backend using the player’s Elo rating. This proposal model represents plausible human move choices for a player at a similar rating level. The function also initializes a Stockfish-based target scorer, which evaluates each trajectory using centipawn loss. Together, these two components define the Metropolis-Hastings sampler: Maia-2 proposes human-like counterfactual trajectories, while the Stockfish-based scorer assigns each trajectory a target score based on move quality.
The API evaluates one fixed-length move window at a time. The supplied moves_uci sequence is sliced to length k_depth, so that the observed trajectory and all MCMC-generated counterfactual trajectories have the same number of moves. If the input sequence corresponds to the beginning of a full game, the API evaluates the first k_depth moves. If the input sequence has already been selected by an upstream screening step as a suspicious segment, then the API evaluates that selected segment.
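The request fields and the window slicing can be mimicked with a standard-library dataclass. This is only a stand-in for the FastAPI schema in main.py: the class name, the default values, and the decision to raise an error when fewer than k_depth moves are supplied are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectionRequest:
    """Hypothetical mirror of the request payload fields."""
    player_id: str
    player_elo: int
    k_depth: int
    moves_uci: List[str] = field(default_factory=list)
    suspect_is_white: bool = True      # assumed default
    demo_mode: str = "none"            # assumed default


def select_window(request: DetectionRequest) -> List[str]:
    """Return the first k_depth moves as the fixed-length analysis window."""
    if request.k_depth <= 0:
        raise ValueError("k_depth must be positive")
    if len(request.moves_uci) < request.k_depth:
        raise ValueError("moves_uci is shorter than k_depth")
    return request.moves_uci[: request.k_depth]
```

Slicing before sampling ensures that the observed trajectory and every counterfactual trajectory are compared at the same length.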
The detection endpoint returns an anomaly detection summary, including the observed window’s total centipawn loss, the baseline mean centipawn loss from the sampled trajectories, the empirical p-value, and a final verdict. The empirical p-value is computed as the proportion of sampled trajectories with centipawn loss less than or equal to the observed trajectory, with a small smoothing adjustment. A very small p-value means that the observed window has lower centipawn loss than almost all simulated human-like trajectories, making it unusually strong relative to the Elo-conditioned human baseline. When this p-value falls below the chosen threshold, the endpoint returns an anomaly flag.
The diagnostic endpoint is implemented separately in api_run_diagnostics.py. While the main detection endpoint reports the anomaly result from one MCMC run, the diagnostic module evaluates whether the result is stable across multiple Markov chains. The API diagnostic module is adapted from the earlier mixed-kernel diagnostic script, preserving the same core workflow while refactoring it into a function that can be called by the API. For API execution, the sampling settings are reduced to N_STEPS = 200 and BURN_IN = 50 to lower computational cost, since the diagnostic endpoint runs four independent chains and each MCMC step requires Stockfish-based trajectory scoring. These settings make the endpoint more feasible for interactive use.
In addition to chain-level summaries, the diagnostic module reports convergence checks. It summarizes each chain using acceptance rate, number of unique sampled states, mean baseline centipawn loss, standard deviation of baseline centipawn loss, and chain-level empirical p-value. It also computes pooled anomaly statistics across chains. Split R-hat is computed for scalar quantities such as total centipawn loss and log target density, while PACE is applied to sampled trajectory states to assess whether different chains are exploring similar regions of the discrete trajectory space. These diagnostics help distinguish a genuinely unusual observed game window from an unstable result caused by poor chain mixing or chain-specific sampling behavior.
Overall, the API separates user-facing execution from the statistical computation. main.py handles request parsing, demo loading, and endpoint routing. api_run_sampler.py performs the main fixed-window anomaly test, and api_run_diagnostics.py provides multi-chain reliability checks. This modular design improves reproducibility and maintainability, because the interface layer, the anomaly detection procedure, and the diagnostic workflow can be developed and tested separately.
From an implementation perspective, the API was also adjusted to fit the resource constraints of the VM environment. Because our VM had limited storage, we disabled GPU execution and ran the Maia-2 proposal backend on CPU. To make the dependency stack installable within the VM constraints, we also downgraded the Python version specified in pyproject.toml and used a compatible lower-version PyTorch build. These deployment changes reduce storage and compatibility pressure, although CPU execution makes the sampler slower. Importantly, they do not change the high-level anomaly detection logic: Maia-2 is still used as the human move proposal model, Stockfish still provides the target scoring, and the Metropolis-Hastings procedure remains the same.
4. Results
4.1 End-to-end demo on selected examples
To demonstrate the full anomaly-detection pipeline, we ran the single-chain sampler on two manually selected examples: one suspected-cheating game selected from a rapid rating-gain list and one comparison game selected as an ordinary-looking fair-play example. These labels are heuristic rather than confirmed ground truth, so this experiment should be interpreted as a qualitative demo rather than a formal classifier benchmark.
For each game, we fixed the first \(k=10\) plies as the analysis window, conditioned Maia-2 on the suspect player’s Elo, and computed centipawn loss only on the suspect player’s moves. The sampled counterfactual trajectories form an empirical null distribution of suspect-side CPL values. The empirical p-value reports the fraction of sampled trajectories with CPL less than or equal to the observed trajectory; therefore, smaller p-values indicate that the observed line is unusually engine-aligned relative to the sampled human baseline.
Single-chain anomaly detection results on two selected games.

| Example | Suspect side | Actual CPL | Null mean CPL | Null median CPL | Null std CPL | Empirical \(p\)-value | Decision |
|---|---|---|---|---|---|---|---|
| Suspected example: MirTrap | White | 6 | 69.94 | 61 | 32.84 | 0.0020 | flagged |
| Comparison example: imdumbplsteach | Black | 44 | 108.80 | 104 | 40.70 | 0.0320 | not flagged under \(p<0.01\) |
The results of running single chains are recorded in the above table. In the suspected example, the observed trajectory had total suspect-side CPL 6, while the sampled baseline had mean CPL 69.94. The empirical \(p\)-value was 0.0020, so the trajectory was flagged as an outlier under the \(p<0.01\) rule. In the comparison example, the observed suspect-side CPL was 44, compared with a sampled baseline mean of 108.80. Its empirical \(p\)-value was 0.0320, which is low but not below our stricter \(p<0.01\) flagging threshold. This illustrates how the pipeline distinguishes between an extremely low-CPL line and a strong but less extreme line in the selected examples.
4.2 Multi-chain diagnostics
Because single-chain empirical \(p\)-values can be unstable, we also ran four-chain diagnostics on a representative lower-Elo player, ferlionrod. These diagnostics test whether independently initialized chains produce compatible null distributions. We report acceptance rates, unique sampled states, chain-level \(p\)-values, scalar split-\(\hat{R}\), and PACE on sampled trajectory states in the table below.
Multi-chain diagnostics for the vanilla prefix-preserving kernel and the mixture refresh kernel.

| Kernel | \(k\) | Actual CPL | Pooled null mean | Pooled \(p\)-value | Split-\(\hat{R}\) (total CPL) | Exact PACE \(\widehat{\delta}\) | Medoid PACE \(\widehat{\delta}\) |
|---|---|---|---|---|---|---|---|
| Vanilla prefix kernel | 10 | 206 | 220.75 | 0.510 | 1.229 | 0.866 | 0.857 |
| Mixture kernel | 10 | 206 | 234.14 | 0.513 | 1.206 | 0.866 | 0.804 |
Under the vanilla prefix-preserving kernel, the pooled empirical \(p\)-value was 0.510, suggesting that the observed line was not anomalous relative to the sampled null. However, the chain-specific \(p\)-values varied substantially, from approximately 0.245 to 0.764. The scalar split-\(\hat{R}\) for total CPL was 1.229, and PACE on sampled trajectory states remained high, with exact-state \(\widehat{\delta}=0.866\) and trajectory-medoid \(\widehat{\delta}=0.857\). These diagnostics suggest that the scalar anomaly score was not fully supported by state-space convergence.
We also tested the mixture refresh kernel, which augments the prefix-preserving proposal with occasional full-trajectory refreshes. In this representative \(k=10\) run, the mixture kernel improved convergence only slightly relative to the vanilla kernel. The pooled \(p\)-value remained non-anomalous at 0.513, split-\(\hat{R}\) decreased from 1.229 to 1.206, and trajectory-medoid PACE \(\widehat{\delta}\) decreased from 0.857 to 0.804. Thus, the mixture kernel showed slightly better scalar stability and slightly better agreement across coarse trajectory neighborhoods. Exact-state PACE remained high at 0.866 for both kernels, indicating that neither sampler assigned similar mass to the same exact move sequences.
4.3 API Demo
This section checks that the API works end to end by running anomaly detection on preloaded demo data. Setting demo_mode to "white_cheated" loads a predefined example game in which the White player exhibits engine-like behavior. This lets us quickly verify the full pipeline, from request handling to MCMC sampling, without entering moves manually.
```python
import requests
import json

url = "http://vcm-52666.vm.duke.edu:8080"

# Using our demo_mode to load the cheated game
suspect_payload = {
    "player_id": "AarBIGLOCO",
    "player_elo": 1500,
    "demo_mode": "white_cheated",
    "k_depth": 6,
}

print("Running MCMC Anomaly Detection (This may take 3-4 minutes)...")
r = requests.post(url + "/api/detect_anomaly", json=suspect_payload)
print(f"Status: {r.status_code}")

# Format the JSON nicely for the presentation
if r.status_code == 200:
    print(json.dumps(r.json(), indent=2))
else:
    print(r.text)
```
The next request runs diagnostic checks on the same demo game. It runs a four-chain MCMC procedure to evaluate the convergence and reliability of the sampled null distribution. The response includes key metrics such as the pooled p-value and split R-hat, which help users assess whether the chains have converged and whether the anomaly detection result is trustworthy.
```python
# Reuses `requests`, `json`, `url`, and `suspect_payload` from the previous request.
print("Running 4-Chain Diagnostics (This may take a few minutes)...")
r = requests.post(url + "/api/diagnostics", json=suspect_payload)
print(f"Status: {r.status_code}")

if r.status_code == 200:
    data = r.json()
    # Print the core summary metrics
    print(f"Pooled p-value: {data['data']['pooled_p_value']:.4f}")
    print(f"R-hat Convergence: {data['data']['split_rhat_total_cpl']:.4f}")
    print("\nFull Response:")
    print(json.dumps(data, indent=2))
else:
    print(r.text)
```
We provide four built-in demo scenarios that can be selected via the demo_mode field: "white_cheated", "white_fair", "black_cheated", and "black_fair". These options allow users to easily test the system under different controlled conditions.
Finally, users can submit their own game data. To do this, set demo_mode to "none" and provide the required fields manually, including the move list in UCI format (moves_uci), the player rating (player_elo), and which side is being evaluated (suspect_is_white). This enables use of the API beyond the predefined examples.
```python
# Reuses `requests`, `json`, and `url` from the first request.
suspect_payload = {
    "player_id": "MirTrap",
    "player_elo": 1500,
    "moves_uci": [
        "e2e4", "c7c5", "g1f3", "b8c6", "d2d4", "c5d4", "f3d4",
        "e7e5", "d4b5", "d7d6", "b1c3", "a7a6", "b5a3", "b7b5",
        "c3d5", "g8e7", "c2c4", "b5b4", "a3c2", "g7g6", "d5f6",
    ],
    "k_depth": 6,
    "suspect_is_white": True,
    "demo_mode": "none",
}

print("Running MCMC Anomaly Detection (This may take 3-4 minutes)...")
r = requests.post(url + "/api/detect_anomaly", json=suspect_payload)
print(f"Status: {r.status_code}")

# Format the JSON nicely for the presentation
if r.status_code == 200:
    print(json.dumps(r.json(), indent=2))
else:
    print(r.text)
```
5.1 Interpreting Anomaly Scores with Diagnostic Caution
Overall, these examples show that the prototype can produce interpretable anomaly scores and diagnostic summaries. The single-chain demo gives an intuitive fair-play report: an extremely low-CPL suspected line is flagged, while the comparison line is not flagged under the stricter \(p<0.01\) threshold. The four-chain experiments show why convergence diagnostics are an essential part of anomaly detection. Even when pooled \(p\)-values appear reasonable, chain-level disagreement and high PACE values can indicate incomplete exploration of the trajectory state space.
We therefore treat the empirical \(p\)-value as meaningful only when supported by multi-chain diagnostics. The results are best viewed as evidence that counterfactual trajectory sampling is a promising framework for fair-play analysis, while also showing that sampler convergence remains a central limitation. At this stage, the MCMC sampler struggles to explore the large discrete trajectory space, and the anomaly scores can be unstable when chains do not mix well. Given limited computational resources and time, we were not able to run longer chains or try more advanced sampling techniques that might improve mixing. The current results should therefore be interpreted as a proof of concept rather than a definitive classifier benchmark.
Comparing the vanilla kernel with the mixture refresh kernel, we found that the mixture kernel did not show a clear improvement in convergence diagnostics under the current settings. Further experiments with different \(k\), \(\beta\), or numbers of steps might differentiate the two kernels more clearly, but verifying this requires more computational resources. For now, the API demo uses the mixture refresh kernel.
5.2 Other Limitations
Our project is also limited by data scale and model availability. We parsed approximately eight days of Lichess game data, which already amounted to about 5 GB and over 15 million games after preprocessing. Although this subset provides a useful sample for proof-of-concept testing, it may not fully represent the broader population of games across different rating ranges, time periods, and playing conditions.
Computational constraints also limited the scope of the analysis. The large data volume caused crashes in the workbench environment, restricting our ability to process longer time windows or run more extensive experiments. With more computational resources, the project could be expanded to cover a larger sample of games and a wider range of candidate players.
At an early stage of the project, we considered focusing on elite-level games. However, the stronger Maia-based model needed for that setting, Maia4All [TJX+25], was not publicly available at the time of the project. We therefore shifted our analysis to broader Lichess game data and used the available Maia-2 proposal model instead.
6. Conclusion
This project is a starting point for developing a counterfactual trajectory sampling framework for chess fair-play analysis. Instead of judging a game only by direct engine agreement, we construct a sampled null distribution of human-plausible alternatives from the same starting position. Maia-2 provides a skill-conditioned proposal model, Stockfish centipawn loss provides a quality-aware target, and Metropolis-Hastings sampling combines the two into an empirical baseline for evaluating observed play. The prototype produces interpretable anomaly scores, supports suspect-side CPL scoring, and exposes both single-chain anomaly reports and multi-chain diagnostic summaries through a FastAPI/Docker deployment.
At the same time, the experiments show that convergence is a central challenge: scalar \(p\)-values can appear reasonable even when PACE indicates incomplete exploration of the discrete trajectory space. Therefore, the current system should be understood as an experimental fair-play analysis tool rather than a production cheating detector. Future work should focus on longer and more efficient MCMC runs, better proposal mechanisms, larger and better-labeled evaluation sets, and stronger player-specific human move models.
7. References
[GR92] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.
[Lic25] Lichess. Lichess.org open database. https://database.lichess.org/, accessed 2026.
[TJM+24] Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Maia-2: A unified model for human-AI alignment in chess. In Advances in Neural Information Processing Systems, 2024.
[TJX+25] Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess. arXiv preprint arXiv:2507.21488, 2025.
[VS17] Douglas VanDerwerken and Scott C. Schmidler. Monitoring joint convergence of MCMC samplers. Journal of Computational and Graphical Statistics, 26(3):558–568, 2017.