Chess Anomaly Detection

STA 663 Final Project

Authors

Wenjie Gong

Cecilia Liu

Simeng Wu

Carol Zhou

Franklin Zhou

1. Introduction

Detecting anomalous play in online chess is difficult. A move sequence that agrees closely with a chess engine’s top choices may be suspicious, but engine agreement alone is never sufficient evidence of cheating: strong human players can also find such moves, and ordinary tactical positions often contain forcing, engine-like continuations. A more useful approach is therefore to compare a player’s observed moves not only to an engine, but also to a distribution of plausible human alternatives.

This project implements a counterfactual sampling framework for chess trajectory anomaly detection. Given a starting board position \(S_0\) and an observed sequence of \(k\) moves, we model a trajectory as \(X=(m_1,\ldots,m_k)\). Our goal is to construct a null collection of alternative trajectories that a human player of comparable skill could plausibly have played from \(S_0\). The observed trajectory is then compared against this sampled null distribution. If the observed line lies well within the counterfactual distribution, it is treated as consistent with fair human play; if it is unusually strong relative to the sampled alternatives, it may be flagged as anomalous.
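The comparison against the sampled null distribution can be illustrated with an empirical tail probability. The sketch below is a minimal illustration rather than the project's scoring code: it assumes each trajectory is summarized by its total CPL (lower means stronger play), and the function name `empirical_p_value`, the add-one smoothing, and the example CPL values are all hypothetical.

```python
def empirical_p_value(observed_cpl, null_cpls):
    """Share of sampled null trajectories at least as strong as the observed one.

    Assumption: strength is summarized by total centipawn loss (lower = stronger).
    Add-one smoothing keeps the estimate strictly positive for finite samples.
    """
    if not null_cpls:
        raise ValueError("null_cpls must contain at least one sampled trajectory")
    at_least_as_strong = sum(1 for c in null_cpls if c <= observed_cpl)
    return (at_least_as_strong + 1) / (len(null_cpls) + 1)


# Hypothetical total-CPL values for ten sampled counterfactual trajectories.
null_sample = [120, 95, 210, 60, 180, 75, 140, 300, 110, 90]

typical = empirical_p_value(115, null_sample)  # observed line sits inside the null
extreme = empirical_p_value(10, null_sample)   # observed line beats every sample
```

A small value such as `extreme` means few sampled human-plausible lines were as strong as the observed one, which is the situation the framework would flag for review.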

Our framework is built around counterfactual trajectory sampling. We use Maia-2 as a skill-conditioned human move proposal model \(P_0(m\mid s,E_{\text{self}},E_{\text{oppo}})\), so proposed moves are sampled from a human move distribution conditioned on board position and Elo [TJM+24]. We then use Stockfish centipawn loss (CPL) to define a quality-aware target distribution over fixed-length trajectories [Sto26]. These two components are combined through a Metropolis-Hastings sampler: Maia-2 proposes human-plausible continuations, while the Stockfish/CPL target reweights those trajectories toward stronger play.
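To make the proposal/target interplay concrete, the following sketch runs an independence Metropolis-Hastings sampler on a toy problem. It rests on assumptions not stated above: the target is taken to be proportional to the proposal probability times exp(-beta * CPL), and whole trajectories are re-proposed at every step, so the proposal probability cancels in the acceptance ratio. The toy moves, weights, losses, and `beta` are hypothetical stand-ins for Maia-2 and Stockfish, not the project's implementation.

```python
import math
import random


def mh_independence_sampler(propose, cpl, beta, n_steps, rng):
    """Independence Metropolis-Hastings over whole trajectories.

    Assumed target: pi(x) proportional to P0(x) * exp(-beta * cpl(x)), with
    proposals drawn from P0 itself, so P0 cancels in the acceptance ratio.
    """
    current = propose(rng)
    current_cpl = cpl(current)
    samples = []
    for _ in range(n_steps):
        candidate = propose(rng)
        candidate_cpl = cpl(candidate)
        # Acceptance probability is min(1, exp(-beta * (cpl' - cpl))).
        log_alpha = -beta * (candidate_cpl - current_cpl)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            current, current_cpl = candidate, candidate_cpl
        samples.append(current)
    return samples


# Toy stand-ins: three "moves" with fixed proposal weights and per-move losses.
TOY_MOVES = ["best", "okay", "blunder"]
TOY_WEIGHTS = [0.2, 0.5, 0.3]                       # hypothetical proposal probabilities
TOY_LOSS = {"best": 0, "okay": 40, "blunder": 300}  # hypothetical centipawn losses


def toy_propose(rng, k=3):
    """Draw a length-k trajectory with independent moves from the toy proposal."""
    return tuple(rng.choices(TOY_MOVES, weights=TOY_WEIGHTS, k=k))


def toy_cpl(trajectory):
    """Total centipawn loss of a toy trajectory."""
    return sum(TOY_LOSS[m] for m in trajectory)


rng = random.Random(0)
chain = mh_independence_sampler(toy_propose, toy_cpl, beta=0.01, n_steps=2000, rng=rng)
mean_cpl = sum(toy_cpl(x) for x in chain) / len(chain)
```

Under the raw toy proposal the expected total loss is 330 centipawns; the reweighted chain concentrates on lower-loss trajectories, which is the sense in which the target tilts human-plausible proposals toward stronger play.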

Because the sampled object is a full move trajectory rather than a scalar variable, agreement between independent chains is a central part of the analysis. We run four independent chains for each analysis window and compare the resulting null distributions. We report the Gelman-Rubin statistic for total CPL and log target density, and we also apply the Partition-based Approximation for Convergence Evaluation (PACE) diagnostic to the sampled move-sequence states to assess convergence in the joint trajectory space [VS17]. This distinction matters: a sampler may appear stable in scalar CPL while its chains still explore different regions of the discrete trajectory space.
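For reference, the classic Gelman-Rubin potential scale reduction factor for a scalar summary such as total CPL can be computed as in the sketch below. This is the textbook (non-split) form written with the standard library; the function name and the example chains are hypothetical, and the project may use a different variant.

```python
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length scalar chains."""
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: mean of the within-chain sample variances (assumed to be non-zero).
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance(chain_means)
    var_hat = (n - 1) / n * within + between_over_n
    return (var_hat / within) ** 0.5


# Hypothetical total-CPL traces: chains that agree versus chains stuck apart.
agreeing = [[1, 2, 3, 4], [2, 3, 4, 1], [3, 4, 1, 2], [4, 1, 2, 3]]
stuck = [[0, 1, 0, 1], [10, 11, 10, 11]]

rhat_agreeing = gelman_rubin(agreeing)
rhat_stuck = gelman_rubin(stuck)
```

Values near 1 indicate that between-chain variation is small relative to within-chain variation. As the paragraph above notes, this only covers the scalar summary, which is why a diagnostic on the trajectory states themselves is also applied.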

The final prototype implements this sampling pipeline end-to-end: Lichess data preprocessing, Maia-2 proposal generation, deterministic Stockfish-based CPL scoring, vanilla and mixture Metropolis-Hastings sampling, empirical anomaly scoring, four-chain convergence diagnostics, and a deployable FastAPI/Docker demo. We consider the system an experimental fair-play analysis framework, not a production-ready cheating detection tool. Its purpose is to explore the feasibility and challenges of counterfactual trajectory sampling as a more statistically grounded baseline for interpreting engine-aligned play.

2. Methods / Implementation

2.1 Data

The data used in this project are collected from the official Lichess open database [Lic25], which provides monthly archives of chess games played on the Lichess platform. We focus on the standard chess games from February 2025. The full Lichess database is extremely large, so using one recent monthly archive makes the data processing computationally feasible while still providing a sufficiently large sample for analysis.

The raw data are distributed in compressed PGN format. Each PGN game contains two main types of information: game-level metadata stored in header tags, and the complete move sequence of the game. During preprocessing, we parse the PGN file game by game and convert it into a structured tabular dataset, where each row represents one game.

For each game, we extract the following variables:

site: the Lichess game URL
utc_date and utc_time: the date and time of the game
white and black: player identifiers
white_elo and black_elo: player Elo ratings before the game
white_rating_diff and black_rating_diff: rating changes after the game
result: game result
event: type of game
time_control: time-control setting
termination: reason why the game ended
eco and opening: opening classification
n_plies: total number of half-moves in the game
moves_uci: the full move sequence in UCI format

We convert the move sequence into UCI format rather than SAN format because UCI provides a standardized coordinate-based representation of each move. This makes the moves easier to tokenize, compare, and use in downstream analysis or modeling.
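As a small illustration of why the coordinate-based format is easy to tokenize, the sketch below splits a space-separated UCI string and breaks each move into its squares. The helper `parse_uci_move` is hypothetical and does no legality checking; it only relies on the fixed four- or five-character shape of UCI moves.

```python
def parse_uci_move(token):
    """Split a UCI move such as 'e2e4' or 'e7e8q' into (from, to, promotion)."""
    if len(token) not in (4, 5):
        raise ValueError(f"not a UCI move: {token!r}")
    promotion = token[4] if len(token) == 5 else None
    return token[:2], token[2:4], promotion


moves_uci = "e2e4 e7e5 g1f3 b8c6"  # space-separated, as in the moves_uci column
tokens = moves_uci.split()
parsed = [parse_uci_move(t) for t in tokens]
promotion_move = parse_uci_move("e7e8q")
```

Because every token has the same shape regardless of the position, the number of tokens equals the number of plies played and moves can be compared as plain strings.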

To improve data quality and focus on the target player population, we apply an Elo-based filter. A game is retained if at least one of the two players has an Elo rating between 1100 and 1900. This range is intended to capture low-to-intermediate and club-level games, where player decisions are more diverse than in elite-level games but less noisy than in beginner games. Because only one player needs to fall in this range, the retained games also include opponents rated outside it. Games with missing Elo information or unparsable move sequences are removed.
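The retention rule can be written as a single predicate. The sketch below is illustrative only: the function name `keep_game` is hypothetical, the bounds are assumed to be inclusive, a game is assumed to need both Elo values present, and a move sequence that failed to parse is represented by `None`.

```python
ELO_MIN, ELO_MAX = 1100, 1900  # assumed inclusive bounds


def keep_game(white_elo, black_elo, moves_uci):
    """Return True if the game passes the Elo-based filter described above."""
    if white_elo is None or black_elo is None:
        return False  # missing Elo information
    if moves_uci is None:
        return False  # move sequence could not be parsed
    # Retain the game if at least one player is inside the target range.
    return any(ELO_MIN <= elo <= ELO_MAX for elo in (white_elo, black_elo))


kept_mixed = keep_game(1500, 2100, "e2e4 e7e5")     # one player in range
dropped_high = keep_game(2000, 2100, "e2e4 e7e5")   # neither player in range
dropped_missing = keep_game(None, 1500, "e2e4 e7e5")
kept_empty = keep_game(1500, 1953, "")              # parsed game with zero moves
```

An empty but successfully parsed move string is kept in this sketch, which is consistent with zero-ply abandoned games appearing in the processed data.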

After preprocessing, the final dataset is stored in Parquet format (data/processed/either_1100_1900_uci_2025_02_full), with each row representing one game. This columnar storage format is more efficient for large-scale chess data than a plain text file, and it supports faster loading and downstream analysis of player behavior, opening choices, move patterns, and game outcomes. Because the processed dataset is large, the Parquet files are tracked and shared through Git Large File Storage (Git LFS) rather than standard Git version control.

2.2 EDA

This EDA constructs a compact suspected-candidate dataset based on rating dynamics. The workflow first converts the game-level data into player-game observations, then summarizes each player’s Elo volatility and game-by-game rating changes. Players are flagged if they fall into the upper tail of rating instability or single-game rating gain.

The final candidate games preserve the player identifier, game identifier, opponent information, rating movement, game outcome, opening information, and full UCI move sequence. These outputs are designed for the next stage of analysis, where move-level evidence such as engine agreement, centipawn loss, or empirical anomaly scores can be used to test whether the flagged games show suspicious move quality.

2.2.1 Setup

import polars as pl
import matplotlib.pyplot as plt

PARQUET_PATH = "data/processed/either_1100_1900_uci_2025_02_full/*.parquet"

lf = pl.scan_parquet(PARQUET_PATH)

lf.collect_schema()

Schema([('site', String),
        ('utc_date', String),
        ('utc_time', String),
        ('white', String),
        ('black', String),
        ('white_elo', Int64),
        ('black_elo', Int64),
        ('white_rating_diff', Float64),
        ('black_rating_diff', Float64),
        ('result', String),
        ('event', String),
        ('time_control', String),
        ('termination', String),
        ('eco', String),
        ('opening', String),
        ('n_plies', Int64),
        ('moves_uci', String)])

The goal of this screening is not to label players as confirmed cheaters, but to identify player-game records that are worth follow-up testing with move-level models.

The candidate tables preserve game_id, site, and moves_uci, so the suspicious candidate games can be passed directly into later engine-agreement or move-quality analysis.

2.2.2 Construct Player-Game Level Data

lf = lf.with_columns([
    pl.col("site")
    .str.extract(r"lichess\.org/([A-Za-z0-9]+)", 1)
    .alias("game_id")
])

white_players = (
    lf.select([
        pl.col("game_id"),
        pl.col("site"),
        pl.col("white").alias("player_id"),
        pl.col("black").alias("opponent_id"),
        pl.col("white_elo").alias("elo"),
        pl.col("black_elo").alias("opponent_elo"),
        pl.col("white_rating_diff").alias("rating_diff"),
        pl.col("result"),
        pl.lit("white").alias("color"),
        pl.col("utc_date"),
        pl.col("time_control"),
        pl.col("termination"),
        pl.col("n_plies"),
        pl.col("eco"),
        pl.col("opening"),
        pl.col("moves_uci"),
    ])
    .with_columns([
        pl.when(pl.col("result") == "1-0")
        .then(1.0)
        .when(pl.col("result") == "1/2-1/2")
        .then(0.5)
        .otherwise(0.0)
        .alias("score")
    ])
)

black_players = (
    lf.select([
        pl.col("game_id"),
        pl.col("site"),
        pl.col("black").alias("player_id"),
        pl.col("white").alias("opponent_id"),
        pl.col("black_elo").alias("elo"),
        pl.col("white_elo").alias("opponent_elo"),
        pl.col("black_rating_diff").alias("rating_diff"),
        pl.col("result"),
        pl.lit("black").alias("color"),
        pl.col("utc_date"),
        pl.col("time_control"),
        pl.col("termination"),
        pl.col("n_plies"),
        pl.col("eco"),
        pl.col("opening"),
        pl.col("moves_uci"),
    ])
    .with_columns([
        pl.when(pl.col("result") == "0-1")
        .then(1.0)
        .when(pl.col("result") == "1/2-1/2")
        .then(0.5)
        .otherwise(0.0)
        .alias("score")
    ])
)

players_long = pl.concat([white_players, black_players])

Each game is converted into two player-game observations: one from White’s perspective and one from Black’s perspective. This format allows rating changes, outcomes, opponent strength, and move sequences to be analyzed at the player level.

The score variable is defined from the player’s perspective:

win = 1
draw = 0.5
loss = 0
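The same reshaping and scoring logic can be expressed on a single game record in plain Python, which may help when checking the lazy query on a few rows by hand. This is a sketch, not part of the pipeline: the helper names and the example game are hypothetical, while the column names follow the schema shown earlier.

```python
def score_for(result, color):
    """Score from the player's perspective: win = 1, draw = 0.5, otherwise 0."""
    win_result = "1-0" if color == "white" else "0-1"
    if result == win_result:
        return 1.0
    if result == "1/2-1/2":
        return 0.5
    # Mirrors the otherwise(0.0) branch: any other result string scores 0.
    return 0.0


def to_player_rows(game):
    """Turn one game record (a dict) into two player-game records."""
    rows = []
    for color, other in (("white", "black"), ("black", "white")):
        rows.append({
            "game_id": game["game_id"],
            "player_id": game[color],
            "opponent_id": game[other],
            "elo": game[f"{color}_elo"],
            "opponent_elo": game[f"{other}_elo"],
            "rating_diff": game[f"{color}_rating_diff"],
            "color": color,
            "score": score_for(game["result"], color),
        })
    return rows


# Hypothetical game record.
game = {
    "game_id": "abc12345", "white": "alice", "black": "bob",
    "white_elo": 1500, "black_elo": 1620,
    "white_rating_diff": 9.0, "black_rating_diff": -9.0,
    "result": "1-0",
}
rows = to_player_rows(game)
```

Note that any result other than a decisive win for the player or a draw, including an unfinished game, is scored as 0 under this rule.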

2.2.3 Rating-Diff Missingness Check

rating_diff_null_check = (
    players_long
    .select([
        pl.len().alias("n_player_game_rows"),
        pl.col("rating_diff").is_null().sum().alias("n_null_rating_diff"),
        pl.col("rating_diff").is_not_null().sum().alias("n_nonnull_rating_diff"),
        pl.col("rating_diff").is_null().mean().alias("share_null_rating_diff"),
    ])
    .collect()
)

rating_diff_null_check

shape: (1, 4)

n_player_game_rows	n_null_rating_diff	n_nonnull_rating_diff	share_null_rating_diff
u32	u32	u32	f64
31100000	147954	30952046	0.004757

This diagnostic checks the extent of missingness in post-game rating changes at the player-game level. Out of 31.10 million player-game observations, 147,954 have missing rating_diff values, corresponding to a missing rate of approximately 0.48%. Since the candidate-screening procedure relies directly on rating movements, subsequent anomaly calculations are restricted to observations with non-missing rating_diff values.

2.2.4 Player-Level Rating Dynamics Summary

player_summary = (
    players_long
    .group_by("player_id")
    .agg([
        pl.len().alias("n_games"),
        pl.col("rating_diff").is_not_null().sum().alias("n_rating_diff_obs"),

        # Elo level
        pl.col("elo").mean().alias("avg_elo"),
        pl.col("elo").median().alias("median_elo"),
        pl.col("elo").min().alias("min_elo"),
        pl.col("elo").max().alias("max_elo"),

        # Elo fluctuation
        (pl.col("elo").max() - pl.col("elo").min()).alias("elo_range"),
        pl.col("elo").std().alias("elo_std"),

        # Game-by-game rating changes
        pl.col("rating_diff").mean().alias("avg_rating_diff"),
        pl.col("rating_diff").abs().mean().alias("avg_abs_rating_diff"),
        pl.col("rating_diff").std().alias("std_rating_diff"),
        pl.col("rating_diff").sum().alias("total_rating_diff"),
        pl.col("rating_diff").min().alias("min_rating_diff"),
        pl.col("rating_diff").max().alias("max_rating_diff"),

        # Performance and opponent information
        pl.col("score").mean().alias("score_rate"),
        pl.col("opponent_elo").mean().alias("avg_opponent_elo"),
        pl.col("n_plies").mean().alias("avg_n_plies"),
    ])
    .collect()
)

player_summary.sort("n_games", descending=True).head(20)

shape: (20, 18)

player_id	n_games	n_rating_diff_obs	avg_elo	median_elo	min_elo	max_elo	elo_range	elo_std	avg_rating_diff	avg_abs_rating_diff	std_rating_diff	total_rating_diff	min_rating_diff	max_rating_diff	score_rate	avg_opponent_elo	avg_n_plies
str	u32	u32	f64	f64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
"maia1"	4601	4381	1481.896762	1484.0	1335	1665	330	71.317871	-0.035608	4.24241	4.913044	-156.0	-11.0	12.0	0.569224	1418.364269	64.086938
"Leeuw7"	2191	2191	1324.109995	1322.0	1235	1425	190	31.932481	-0.022821	5.248745	5.43088	-50.0	-9.0	10.0	0.490187	1328.423551	42.423551
"maia5"	2078	2018	1600.641001	1604.0	1475	1806	331	62.762831	0.037661	4.334985	4.954285	76.0	-11.0	12.0	0.53922	1570.809432	69.771415
"af2002"	1956	1956	1746.955521	1743.0	1673	1844	171	37.811317	0.001022	4.878323	5.275238	2.0	-11.0	11.0	0.523773	1719.163599	57.010736
"imtwohigh"	1748	1747	1367.304348	1321.5	1135	1599	464	127.634235	0.046938	3.327991	4.579585	82.0	-15.0	12.0	0.247712	1655.949085	51.463959
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"ronniegli"	1456	1456	1523.640797	1522.0	1431	1611	180	35.320809	0.021291	5.528159	5.634963	31.0	-8.0	9.0	0.489698	1530.926511	69.144918
"bra64"	1456	1456	1824.763049	1826.0	1688	1918	230	44.213581	-0.115385	5.659341	5.823912	-168.0	-20.0	17.0	0.489698	1824.571429	65.619505
"KelvinPatel"	1448	1448	1253.853591	1251.0	1150	1389	239	48.983467	0.004144	4.167127	4.818042	6.0	-7.0	11.0	0.302831	1432.86326	38.093923
"SoshalDistanSingh"	1410	1410	1690.539007	1655.0	1525	2129	604	134.747095	-0.479433	5.24539	6.03855	-676.0	-12.0	11.0	0.478369	1666.786525	42.548936
"esistdj"	1409	1409	1285.696238	1295.0	1173	1450	277	43.167716	-0.006388	5.654365	5.868888	-9.0	-24.0	34.0	0.491838	1287.342087	52.157559

This table summarizes each player’s observed rating dynamics in the February 2025 sample. The most important variables for candidate screening are:

elo_std: volatility of the player’s observed Elo level.
elo_range: difference between the player’s maximum and minimum observed Elo.
std_rating_diff: volatility of game-by-game rating changes.
max_rating_diff: largest single-game rating gain.
n_games: number of observed games.
n_rating_diff_obs: number of non-missing rating-diff observations.

These variables are used only as statistical screening signals; none of them is evidence of cheating on its own.
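A worked example shows how these measures behave for a player with very few games. The sketch below uses the sample standard deviation (one delta degree of freedom, which is also the default of the polars `std` expression) on a two-game player whose Elo values are 1500 and 2200 and whose rating changes are -700 and +700. Values of this kind appear among the two-game players in the candidate table of Section 2.2.7.

```python
import statistics

elo = [1500, 2200]
rating_diff = [-700.0, 700.0]

elo_range = max(elo) - min(elo)                   # 700
elo_std = statistics.stdev(elo)                   # 700 / sqrt(2)
std_rating_diff = statistics.stdev(rating_diff)   # 1400 / sqrt(2)
max_rating_diff = max(rating_diff)                # 700.0
```

With only two observations the standard deviation is simply the absolute difference divided by the square root of two, so a single large jump is enough to push every volatility measure to an extreme value.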

2.2.5 Statistical Anomaly Thresholds

elo_std_threshold = (
    player_summary
    .select(pl.col("elo_std").drop_nulls().quantile(0.99).alias("threshold"))
    .item()
)

elo_range_threshold = (
    player_summary
    .select(pl.col("elo_range").drop_nulls().quantile(0.99).alias("threshold"))
    .item()
)

std_rating_diff_threshold = (
    player_summary
    .select(pl.col("std_rating_diff").drop_nulls().quantile(0.99).alias("threshold"))
    .item()
)

max_rating_diff_threshold = (
    player_summary
    .select(
        pl.col("max_rating_diff")
        .filter(pl.col("max_rating_diff") > 0)
        .drop_nulls()
        .quantile(0.99)
        .alias("threshold")
    )
    .item()
)

thresholds = pl.DataFrame({
    "metric": [
        "elo_std",
        "elo_range",
        "std_rating_diff",
        "max_rating_diff",
    ],
    "threshold_99pct": [
        elo_std_threshold,
        elo_range_threshold,
        std_rating_diff_threshold,
        max_rating_diff_threshold,
    ],
})

thresholds

shape: (4, 2)

metric	threshold_99pct
str	f64
"elo_std"	264.338325
"elo_range"	653.0
"std_rating_diff"	222.163519
"max_rating_diff"	274.0

The anomaly thresholds are defined as the 99th percentile of each player-level rating-dynamics indicator. For max_rating_diff, the threshold is calculated using only positive values, so this metric specifically captures unusually large single-game Elo gains rather than general rating changes.

In this sample, the 99th-percentile thresholds are approximately 264.34 for elo_std, 653 for elo_range, 222.16 for std_rating_diff, and 274 for max_rating_diff. Players whose values meet or exceed at least one of these thresholds are treated as statistical anomaly candidates for follow-up analysis.
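The tail-threshold idea can be reproduced on a toy sample with the standard library. The sketch below uses a simple nearest-rank quantile, which is an assumption made for illustration: the interpolation rule applied by the dataframe library may differ in detail. The function name and the input values are hypothetical.

```python
def nearest_rank_quantile(values, q):
    """Quantile by picking the sorted value at the rounded fractional index."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    index = round(q * (len(ordered) - 1))
    return ordered[index]


# Hypothetical player-level metric: the integers 1 through 100.
metric = list(range(1, 101))

threshold = nearest_rank_quantile(metric, 0.99)
flagged = [v for v in metric if v >= threshold]  # same >= rule as the flags
```

On this toy sample the rule keeps only the top two values, which illustrates how narrow the upper tail selected by a 99th-percentile cut is.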

2.2.6 Plots

def plot_hist_with_threshold(
    values,
    title,
    xlabel,
    bins=50,
    threshold=None,
    threshold_label=None
):
    fig, ax = plt.subplots(figsize=(8, 5))

    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Count")

    if threshold is not None:
        label = threshold_label if threshold_label is not None else f"Threshold = {threshold:.2f}"
        ax.axvline(threshold, linestyle="--", linewidth=2, label=label)
        ax.legend()

    plt.tight_layout()
    plt.show()

rating_diff_values = (
    players_long
    .filter(pl.col("rating_diff").is_not_null())
    .select("rating_diff")
    .collect()
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=rating_diff_values,
    title="Distribution of Player-Game Rating Differences",
    xlabel="rating_diff",
    bins=60
)

The distribution of player-game rating changes is sharply concentrated around zero, indicating that most games produce only small Elo movements. The long but sparse tails suggest that very large rating gains or losses are relatively rare. This supports the use of tail-based anomaly thresholds when constructing the suspected candidate set.

elo_std_values = (
    player_summary
    .select(pl.col("elo_std").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=elo_std_values,
    title="Distribution of Player Elo Standard Deviation",
    xlabel="elo_std",
    bins=50,
    threshold=elo_std_threshold,
    threshold_label=f"99th percentile = {elo_std_threshold:.2f}"
)

The distribution of player-level Elo standard deviation is highly right-skewed. Most players have relatively stable Elo values within the February 2025 sample, while a small number of players exhibit much larger Elo fluctuations. The dashed vertical line marks the 99th percentile threshold at 264.34. Players to the right of this threshold are treated as statistical anomaly candidates based on unusually high Elo instability. This flag is used only for candidate screening and does not by itself indicate cheating.

elo_range_values = (
    player_summary
    .select(pl.col("elo_range").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=elo_range_values,
    title="Distribution of Player Elo Range",
    xlabel="elo_range",
    bins=50,
    threshold=elo_range_threshold,
    threshold_label=f"99th percentile = {elo_range_threshold:.2f}"
)

The distribution of player Elo range is highly right-skewed. Most players have relatively small Elo ranges, suggesting that their ratings remain fairly stable across the observed games. However, a small group of players shows very large rating ranges, with the 99th percentile around 653 Elo points. These extreme cases indicate unusually large rating fluctuations and are therefore useful for identifying players who may require further anomaly-based investigation.

std_rating_diff_values = (
    player_summary
    .select(pl.col("std_rating_diff").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=std_rating_diff_values,
    title="Distribution of Player Rating-Diff Standard Deviation",
    xlabel="std_rating_diff",
    bins=50,
    threshold=std_rating_diff_threshold,
    threshold_label=f"99th percentile = {std_rating_diff_threshold:.2f}"
)

The distribution of player-level rating-difference standard deviation is highly right-skewed. Most players have low standard deviations, indicating that their game-by-game Elo changes are relatively stable. However, a small group of players shows unusually high volatility, with the 99th percentile around 222 Elo points. These players may have inconsistent rating trajectories or extreme rating jumps across games, making this metric useful for identifying anomaly candidates for further move-level investigation.

max_rating_diff_values = (
    player_summary
    .filter(pl.col("max_rating_diff") > 0)
    .select(pl.col("max_rating_diff").drop_nulls())
    .to_series()
    .to_numpy()
)

plot_hist_with_threshold(
    values=max_rating_diff_values,
    title="Distribution of Largest Single-Game Rating Gain per Player",
    xlabel="max_rating_diff",
    bins=50,
    threshold=max_rating_diff_threshold,
    threshold_label=f"99th percentile = {max_rating_diff_threshold:.2f}"
)

The distribution of each player’s largest single-game rating gain is strongly right-skewed. Most players have only small maximum Elo gains, while a small number of players experience unusually large one-game increases. The 99th percentile is 274 Elo points, suggesting that gains above this level may represent unusual rating movements for further investigation.

2.2.7 Suspected Player Candidate List

suspected_players = (
    player_summary
    .with_columns([
        (pl.col("elo_std") >= elo_std_threshold).fill_null(False).alias("flag_high_elo_std"),
        (pl.col("elo_range") >= elo_range_threshold).fill_null(False).alias("flag_high_elo_range"),
        (pl.col("std_rating_diff") >= std_rating_diff_threshold).fill_null(False).alias("flag_high_rating_diff_volatility"),
        (pl.col("max_rating_diff") >= max_rating_diff_threshold).fill_null(False).alias("flag_large_single_game_gain"),
    ])
    .with_columns([
        (
            pl.col("flag_high_elo_std").cast(pl.Int64) +
            pl.col("flag_high_elo_range").cast(pl.Int64) +
            pl.col("flag_high_rating_diff_volatility").cast(pl.Int64) +
            pl.col("flag_large_single_game_gain").cast(pl.Int64)
        ).alias("n_statistical_flags")
    ])
    .filter(pl.col("n_statistical_flags") >= 1)
    .sort(
        [
            "n_statistical_flags",
            "max_rating_diff",
            "std_rating_diff",
            "elo_range",
            "elo_std",
        ],
        descending=[True, True, True, True, True],
    )
)

suspected_players.head(50)

shape: (50, 23)

player_id	n_games	n_rating_diff_obs	avg_elo	median_elo	min_elo	max_elo	elo_range	elo_std	avg_rating_diff	avg_abs_rating_diff	std_rating_diff	total_rating_diff	min_rating_diff	max_rating_diff	score_rate	avg_opponent_elo	avg_n_plies	flag_high_elo_std	flag_high_elo_range	flag_high_rating_diff_volatility	flag_large_single_game_gain	n_statistical_flags
str	u32	u32	f64	f64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	bool	bool	bool	bool	i64
"saathuriya"	2	2	1850.0	1850.0	1500	2200	700	494.974747	0.0	700.0	989.949494	0.0	-700.0	700.0	0.5	1639.5	41.0	true	true	true	true	4
"Stalin11111"	2	2	1850.0	1850.0	1500	2200	700	494.974747	70.0	630.0	890.954544	140.0	-560.0	700.0	0.5	1845.0	73.0	true	true	true	true	4
"beyondthelimit"	2	2	1850.0	1850.0	1500	2200	700	494.974747	120.0	580.0	820.243866	240.0	-460.0	700.0	0.5	1866.0	84.0	true	true	true	true	4
"Agrippa04"	3	3	1733.333333	1500.0	1500	2200	700	404.145188	233.333333	700.0	808.290377	700.0	-700.0	700.0	0.666667	1967.0	83.333333	true	true	true	true	4
"rakshanasrisv"	3	3	1266.666667	1500.0	800	1500	700	404.145188	-233.333333	700.0	808.290377	-700.0	-700.0	700.0	0.333333	1049.666667	66.333333	true	true	true	true	4
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"makoto_p3"	2	2	1850.0	1850.0	1500	2200	700	494.974747	381.0	381.0	451.134126	762.0	62.0	700.0	1.0	1835.0	73.5	true	true	true	true	4
"AriMarcopoulos"	2	2	1850.0	1850.0	1500	2200	700	494.974747	383.0	383.0	448.305699	766.0	66.0	700.0	1.0	1892.5	102.5	true	true	true	true	4
"Classic_AmiR"	3	3	1854.333333	1863.0	1500	2200	700	350.080467	202.666667	314.0	447.377171	608.0	-167.0	700.0	0.666667	1894.666667	32.666667	true	true	true	true	4
"kurakuracomel"	2	2	1850.0	1850.0	1500	2200	700	494.974747	384.0	384.0	446.891486	768.0	68.0	700.0	1.0	1925.5	71.5	true	true	true	true	4
"TheJentayu"	2	2	1850.0	1850.0	1500	2200	700	494.974747	385.0	385.0	445.477272	770.0	70.0	700.0	1.0	1912.5	30.5	true	true	true	true	4

This table contains suspected player candidates based on unusual rating dynamics. A player is included if they meet or exceed the 99th-percentile threshold for at least one of the following indicators:

high Elo standard deviation;
high Elo range;
high rating-diff volatility;
unusually large single-game rating gain.
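The flag count for an individual player can be checked by hand with the thresholds reported in Section 2.2.5. In the sketch below the two named players and their metric values are copied from the tables above, while the helper `count_flags` and the single-game player are hypothetical. A missing metric is treated as not flagged, matching the `fill_null(False)` calls.

```python
THRESHOLDS = {
    "elo_std": 264.338325,
    "elo_range": 653.0,
    "std_rating_diff": 222.163519,
    "max_rating_diff": 274.0,
}


def count_flags(player):
    """Number of metrics at or above their 99th-percentile threshold."""
    flags = 0
    for metric, threshold in THRESHOLDS.items():
        value = player.get(metric)
        if value is not None and value >= threshold:
            flags += 1
    return flags


maia1 = {"elo_std": 71.317871, "elo_range": 330,
         "std_rating_diff": 4.913044, "max_rating_diff": 12.0}
saathuriya = {"elo_std": 494.974747, "elo_range": 700,
              "std_rating_diff": 989.949494, "max_rating_diff": 700.0}
# Hypothetical one-game player: standard deviations are undefined (None).
single_game = {"elo_std": None, "elo_range": 0,
               "std_rating_diff": None, "max_rating_diff": 5.0}

flag_counts = [count_flags(p) for p in (maia1, saathuriya, single_game)]
```

Only players with a count of at least one enter the candidate list, and the list is sorted so that players with all four flags appear first.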

2.2.8 Composite Volatility Ranking

composite_volatility_ranking = (
    player_summary
    .filter(pl.col("n_rating_diff_obs") > 0)
    .drop_nulls(["elo_std", "elo_range", "std_rating_diff"])
    .with_columns([
        pl.col("elo_std").rank().alias("rank_elo_std"),
        pl.col("elo_range").rank().alias("rank_elo_range"),
        pl.col("std_rating_diff").rank().alias("rank_rating_diff_std"),
    ])
    .with_columns([
        (
            pl.col("rank_elo_std") +
            pl.col("rank_elo_range") +
            pl.col("rank_rating_diff_std")
        ).alias("composite_volatility_score")
    ])
    .sort("composite_volatility_score", descending=True)
    .with_row_index("composite_rank", offset=1)
)

composite_volatility_ranking.head(30)

shape: (30, 23)

composite_rank	player_id	n_games	n_rating_diff_obs	avg_elo	median_elo	min_elo	max_elo	elo_range	elo_std	avg_rating_diff	avg_abs_rating_diff	std_rating_diff	total_rating_diff	min_rating_diff	max_rating_diff	score_rate	avg_opponent_elo	avg_n_plies	rank_elo_std	rank_elo_range	rank_rating_diff_std	composite_volatility_score
u32	str	u32	u32	f64	f64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
1	"CRAZY_RONALDO"	2	2	2067.5	2067.5	1500	2635	1135	802.566197	340.5	359.5	508.409776	681.0	-19.0	700.0	0.75	1945.5	101.5	928711.0	928564.0	928409.5	2785684.5
2	"CrazyMaestro"	2	2	2085.5	2085.5	1500	2671	1171	828.022041	347.0	353.0	499.217388	694.0	-6.0	700.0	0.5	1872.5	55.5	928712.0	928606.5	928360.0	2785678.5
3	"chessmaestro_crazy"	3	2	2210.0	2565.0	1500	2565	1065	614.878037	298.5	401.5	567.806745	597.0	-103.0	700.0	0.333333	1890.333333	24.666667	928639.0	928418.5	928569.5	2.785627e6
4	"KNOWMO"	3	3	851.333333	621.0	433	1500	1067	569.572062	-66.0	400.666667	552.662646	-198.0	-700.0	314.0	0.5	1211.0	40.333333	928591.0	928428.0	928543.0	2.785562e6
5	"Theku"	2	2	950.0	950.0	400	1500	1100	777.817459	-350.0	350.0	494.974747	-700.0	-700.0	0.0	0.0	1064.5	47.5	928708.5	928496.5	928336.0	2.785541e6
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
26	"malhaarsalvi"	3	3	789.0	467.0	400	1500	1100	616.654685	-235.0	235.0	402.709573	-705.0	-700.0	0.0	0.0	1016.0	39.333333	928644.0	928496.5	927564.0	2784704.5
27	"Arianna_Dronova"	2	2	1037.0	1037.0	574	1500	926	654.780879	-356.0	356.0	486.489465	-712.0	-700.0	-12.0	0.0	986.0	32.0	928673.0	927748.5	928280.0	2784701.5
28	"shravanjith"	2	2	1055.0	1055.0	610	1500	890	629.325035	-229.5	470.5	665.387481	-459.0	-700.0	241.0	0.5	1234.5	22.5	928651.0	927387.0	928645.5	2784683.5
29	"ramazancakici"	3	3	815.666667	540.0	407	1500	1093	596.369293	-236.0	236.0	401.837032	-708.0	-700.0	-3.0	0.0	1088.333333	34.333333	928621.0	928476.0	927556.0	2.784653e6
30	"EvaIsak"	2	2	1040.0	1040.0	580	1500	920	650.538239	-356.5	356.5	485.782359	-713.0	-700.0	-13.0	0.0	1047.5	45.0	928671.5	927693.5	928275.0	2.78464e6

The composite ranking combines three volatility dimensions:

Elo standard deviation.
Elo range.
Standard deviation of rating differences.

This provides an additional ranking of players whose rating behavior is unstable across multiple dimensions.
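The composite score is a sum of ascending ranks, where tied values share the average of their positions (the half-integer ranks in the output above come from such ties). The sketch below reimplements that idea with the standard library on three hypothetical players; `average_ranks` is an illustrative helper, not the library routine.

```python
def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


# Hypothetical players with (elo_std, elo_range, std_rating_diff).
players = {
    "a": (30.0, 100, 5.0),
    "b": (400.0, 700, 500.0),
    "c": (400.0, 650, 20.0),
}
names = list(players)
per_metric_ranks = [average_ranks([players[n][k] for n in names]) for k in range(3)]
composite = {n: sum(r[i] for r in per_metric_ranks) for i, n in enumerate(names)}
ranking = sorted(names, key=lambda n: composite[n], reverse=True)
```

Because the score adds ranks rather than raw values, each of the three dimensions contributes on the same scale regardless of its units.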

2.2.9 Suspected Games with Move Sequences

suspected_players_for_join = (
    suspected_players
    .select([
        "player_id",
        "n_games",
        "n_rating_diff_obs",
        "avg_elo",
        "median_elo",
        "min_elo",
        "max_elo",
        "elo_range",
        "elo_std",
        "avg_rating_diff",
        "avg_abs_rating_diff",
        "std_rating_diff",
        "total_rating_diff",
        "min_rating_diff",
        "max_rating_diff",
        "score_rate",
        "avg_opponent_elo",
        "avg_n_plies",
        "flag_high_elo_std",
        "flag_high_elo_range",
        "flag_high_rating_diff_volatility",
        "flag_large_single_game_gain",
        "n_statistical_flags",
    ])
    .lazy()
)

suspected_games_all = (
    players_long
    .filter(pl.col("rating_diff").is_not_null())
    .join(
        suspected_players_for_join,
        on="player_id",
        how="inner",
    )
    .with_columns([
        (pl.col("rating_diff") >= max_rating_diff_threshold).alias("game_flag_large_rating_gain"),
        (pl.col("rating_diff") == pl.col("max_rating_diff")).alias("game_is_player_largest_gain"),
        (pl.col("rating_diff") == pl.col("min_rating_diff")).alias("game_is_player_largest_loss"),
    ])
    .select([
        "player_id",
        "game_id",
        "site",
        "opponent_id",
        "elo",
        "opponent_elo",
        "rating_diff",
        "result",
        "score",
        "color",
        "utc_date",
        "time_control",
        "termination",
        "n_plies",
        "eco",
        "opening",
        "moves_uci",
        "n_games",
        "n_rating_diff_obs",
        "avg_elo",
        "median_elo",
        "min_elo",
        "max_elo",
        "elo_range",
        "elo_std",
        "avg_rating_diff",
        "avg_abs_rating_diff",
        "std_rating_diff",
        "total_rating_diff",
        "min_rating_diff",
        "max_rating_diff",
        "score_rate",
        "avg_opponent_elo",
        "avg_n_plies",
        "flag_high_elo_std",
        "flag_high_elo_range",
        "flag_high_rating_diff_volatility",
        "flag_large_single_game_gain",
        "n_statistical_flags",
        "game_flag_large_rating_gain",
        "game_is_player_largest_gain",
        "game_is_player_largest_loss",
    ])
    .sort(
        ["n_statistical_flags", "game_flag_large_rating_gain", "rating_diff"],
        descending=[True, True, True],
    )
    .collect()
)

suspected_games_all.head(100)

shape: (100, 42)

player_id	game_id	site	opponent_id	elo	opponent_elo	rating_diff	result	score	color	utc_date	time_control	termination	n_plies	eco	opening	moves_uci	n_games	n_rating_diff_obs	avg_elo	median_elo	min_elo	max_elo	elo_range	elo_std	avg_rating_diff	avg_abs_rating_diff	std_rating_diff	total_rating_diff	min_rating_diff	max_rating_diff	score_rate	avg_opponent_elo	avg_n_plies	flag_high_elo_std	flag_high_elo_range	flag_high_rating_diff_volatility	flag_large_single_game_gain	n_statistical_flags	game_flag_large_rating_gain	game_is_player_largest_gain	game_is_player_largest_loss
str	str	str	str	i64	i64	f64	str	f64	str	str	str	str	i64	str	str	str	u32	u32	f64	f64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	bool	bool	bool	bool	i64	bool	bool	bool
"rakshanasrisv"	"AAqtD7pC"	"https://lichess.org/AAqtD7pC"	"chess22022000"	800	1380	700.0	"0-1"	1.0	"black"	"2025.02.08"	"360+0"	"Normal"	68	"C40"	"King's Knight Opening"	"e2e4 e7e5 g1f3 a7a5 d2d4 e5d4 …	3	3	1266.666667	1500.0	800	1500	700	404.145188	-233.333333	700.0	808.290377	-700.0	-700.0	700.0	0.333333	1049.666667	66.333333	true	true	true	true	4	true	true	false
"heyitz_hit007"	"smMM8ZpM"	"https://lichess.org/smMM8ZpM"	"kalpit13255"	800	1368	700.0	"0-1"	1.0	"black"	"2025.02.04"	"180+2"	"Normal"	32	"C60"	"Ruy Lopez: Spanish Countergamb…	"e2e4 e7e5 g1f3 b8c6 f1b5 d7d5 …	5	5	1413.0	1500.0	800	1692	892	351.549428	-24.0	380.8	516.089624	-120.0	-700.0	700.0	0.4	1300.0	38.4	true	true	true	true	4	true	true	false
"InvinciblePhoenix"	"tGAlOiTf"	"https://lichess.org/tGAlOiTf"	"soyolerdene"	1500	1885	700.0	"0-1"	1.0	"black"	"2025.02.07"	"180+0"	"Normal"	74	"B34"	"Sicilian Defense: Accelerated …	"e2e4 c7c5 g1f3 b8c6 d2d4 c5d4 …	7	7	1845.428571	1746.0	1500	2239	739	364.846935	115.428571	295.142857	401.299912	808.0	-629.0	700.0	0.857143	1602.857143	55.142857	true	true	true	true	4	true	true	false
"huyvthang2011"	"3AcTOS6u"	"https://lichess.org/3AcTOS6u"	"pakou10"	1500	1869	700.0	"1-0"	1.0	"white"	"2025.02.03"	"600+0"	"Time forfeit"	103	"A48"	"London System"	"d2d4 g8f6 g1f3 g7g6 c1f4 f8g7 …	4	4	1995.5	2156.0	1500	2170	670	330.521305	155.25	207.25	367.377531	621.0	-104.0	700.0	0.875	1649.0	67.25	true	true	true	true	4	true	true	false
"Aaryan_2015-Anil"	"L0p3bUFA"	"https://lichess.org/L0p3bUFA"	"miver008"	1607	2162	700.0	"1-0"	1.0	"white"	"2025.02.07"	"600+1"	"Normal"	41	"B27"	"Sicilian Defense: Hyperacceler…	"e2e4 c7c5 g1f3 g7g6 f1d3 f8g7 …	4	4	1755.25	1607.0	1500	2307	807	371.275616	79.75	429.75	580.503445	319.0	-700.0	700.0	0.75	1593.0	42.75	true	true	true	true	4	true	true	false
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"ATTACK-1"	"IceJpOuq"	"https://lichess.org/IceJpOuq"	"foajul111"	1500	1875	700.0	"1-0"	1.0	"white"	"2025.02.04"	"300+3"	"Normal"	57	"A43"	"Benoni Defense: Old Benoni, Sc…	"d2d4 c7c5 d4d5 d7d6 b1c3 g7g6 …	5	5	1703.6	1500.0	1258	2156	898	402.010945	64.6	215.4	365.549997	323.0	-242.0	700.0	0.2	1661.8	72.6	true	true	true	true	4	true	true	false
"zaidyzainudin"	"VnioTgWc"	"https://lichess.org/VnioTgWc"	"Fann_11"	1500	1953	700.0	"0-1"	1.0	"black"	"2025.02.04"	"300+2"	"Abandoned"	0	"?"	"?"	""	3	3	1805.0	1715.0	1500	2200	700	358.573563	3.666667	463.0	619.193292	11.0	-485.0	700.0	0.333333	1713.333333	0.333333	true	true	true	true	4	true	true	false
"AKKUYU"	"Hi0p79TZ"	"https://lichess.org/Hi0p79TZ"	"DrJosch"	1500	2022	700.0	"1-0"	1.0	"white"	"2025.02.02"	"600+0"	"Normal"	83	"B01"	"Scandinavian Defense: Mieses-K…	"e2e4 d7d5 e4d5 d8d5 g1f3 c8g4 …	6	6	1936.5	2059.0	1500	2200	700	270.667878	99.666667	263.333333	396.372888	598.0	-491.0	700.0	0.833333	1688.333333	69.166667	true	true	true	true	4	true	true	false
"HI_I_AM_FROM_NIGERIA"	"gUIwmtFq"	"https://lichess.org/gUIwmtFq"	"OlapadeOluwanifemi77"	1500	2056	700.0	"0-1"	1.0	"black"	"2025.02.03"	"60+0"	"Time forfeit"	24	"C57"	"Italian Game: Two Knights Defe…	"e2e4 e7e5 g1f3 b8c6 f1c4 g8f6 …	5	3	2206.0	2257.0	1500	2508	1008	414.139469	190.0	276.666667	441.683144	570.0	-68.0	700.0	0.5	1800.8	35.8	true	true	true	true	4	true	true	false
"Sigma_Rayquaza"	"mo4kg3fs"	"https://lichess.org/mo4kg3fs"	"RoN_WeAsLeY_123"	1499	1914	700.0	"1-0"	1.0	"white"	"2025.02.01"	"120+0"	"Normal"	127	"B12"	"Caro-Kann Defense: Advance Var…	"e2e4 c7c6 g1f3 d7d5 e4e5 c8f5 …	4	4	2020.25	2180.5	1499	2221	722	348.426343	135.25	224.75	380.733831	541.0	-118.0	700.0	0.625	1861.5	119.75	true	true	true	true	4	true	true	false

This table links suspected players back to all of their observed games in the sample. It keeps the full UCI move sequence, so these games can be passed into a later model.

2.2.10 Final Candidate Games for Model Testing

suspected_candidate_games = (
    suspected_games_all
    .filter(
        pl.col("game_flag_large_rating_gain") |
        pl.col("game_is_player_largest_gain") |
        pl.col("game_is_player_largest_loss")
    )
    .sort(
        ["n_statistical_flags", "game_flag_large_rating_gain", "rating_diff"],
        descending=[True, True, True],
    )
)

suspected_candidate_games.head(100)

shape: (100, 42)

player_id	game_id	site	opponent_id	elo	opponent_elo	rating_diff	result	score	color	utc_date	time_control	termination	n_plies	eco	opening	moves_uci	n_games	n_rating_diff_obs	avg_elo	median_elo	min_elo	max_elo	elo_range	elo_std	avg_rating_diff	avg_abs_rating_diff	std_rating_diff	total_rating_diff	min_rating_diff	max_rating_diff	score_rate	avg_opponent_elo	avg_n_plies	flag_high_elo_std	flag_high_elo_range	flag_high_rating_diff_volatility	flag_large_single_game_gain	n_statistical_flags	game_flag_large_rating_gain	game_is_player_largest_gain	game_is_player_largest_loss
str	str	str	str	i64	i64	f64	str	f64	str	str	str	str	i64	str	str	str	u32	u32	f64	f64	i64	i64	i64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	bool	bool	bool	bool	i64	bool	bool	bool
"rakshanasrisv"	"AAqtD7pC"	"https://lichess.org/AAqtD7pC"	"chess22022000"	800	1380	700.0	"0-1"	1.0	"black"	"2025.02.08"	"360+0"	"Normal"	68	"C40"	"King's Knight Opening"	"e2e4 e7e5 g1f3 a7a5 d2d4 e5d4 …	3	3	1266.666667	1500.0	800	1500	700	404.145188	-233.333333	700.0	808.290377	-700.0	-700.0	700.0	0.333333	1049.666667	66.333333	true	true	true	true	4	true	true	false
"heyitz_hit007"	"smMM8ZpM"	"https://lichess.org/smMM8ZpM"	"kalpit13255"	800	1368	700.0	"0-1"	1.0	"black"	"2025.02.04"	"180+2"	"Normal"	32	"C60"	"Ruy Lopez: Spanish Countergamb…	"e2e4 e7e5 g1f3 b8c6 f1b5 d7d5 …	5	5	1413.0	1500.0	800	1692	892	351.549428	-24.0	380.8	516.089624	-120.0	-700.0	700.0	0.4	1300.0	38.4	true	true	true	true	4	true	true	false
"InvinciblePhoenix"	"tGAlOiTf"	"https://lichess.org/tGAlOiTf"	"soyolerdene"	1500	1885	700.0	"0-1"	1.0	"black"	"2025.02.07"	"180+0"	"Normal"	74	"B34"	"Sicilian Defense: Accelerated …	"e2e4 c7c5 g1f3 b8c6 d2d4 c5d4 …	7	7	1845.428571	1746.0	1500	2239	739	364.846935	115.428571	295.142857	401.299912	808.0	-629.0	700.0	0.857143	1602.857143	55.142857	true	true	true	true	4	true	true	false
"huyvthang2011"	"3AcTOS6u"	"https://lichess.org/3AcTOS6u"	"pakou10"	1500	1869	700.0	"1-0"	1.0	"white"	"2025.02.03"	"600+0"	"Time forfeit"	103	"A48"	"London System"	"d2d4 g8f6 g1f3 g7g6 c1f4 f8g7 …	4	4	1995.5	2156.0	1500	2170	670	330.521305	155.25	207.25	367.377531	621.0	-104.0	700.0	0.875	1649.0	67.25	true	true	true	true	4	true	true	false
"Aaryan_2015-Anil"	"L0p3bUFA"	"https://lichess.org/L0p3bUFA"	"miver008"	1607	2162	700.0	"1-0"	1.0	"white"	"2025.02.07"	"600+1"	"Normal"	41	"B27"	"Sicilian Defense: Hyperacceler…	"e2e4 c7c5 g1f3 g7g6 f1d3 f8g7 …	4	4	1755.25	1607.0	1500	2307	807	371.275616	79.75	429.75	580.503445	319.0	-700.0	700.0	0.75	1593.0	42.75	true	true	true	true	4	true	true	false
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"ATTACK-1"	"IceJpOuq"	"https://lichess.org/IceJpOuq"	"foajul111"	1500	1875	700.0	"1-0"	1.0	"white"	"2025.02.04"	"300+3"	"Normal"	57	"A43"	"Benoni Defense: Old Benoni, Sc…	"d2d4 c7c5 d4d5 d7d6 b1c3 g7g6 …	5	5	1703.6	1500.0	1258	2156	898	402.010945	64.6	215.4	365.549997	323.0	-242.0	700.0	0.2	1661.8	72.6	true	true	true	true	4	true	true	false
"zaidyzainudin"	"VnioTgWc"	"https://lichess.org/VnioTgWc"	"Fann_11"	1500	1953	700.0	"0-1"	1.0	"black"	"2025.02.04"	"300+2"	"Abandoned"	0	"?"	"?"	""	3	3	1805.0	1715.0	1500	2200	700	358.573563	3.666667	463.0	619.193292	11.0	-485.0	700.0	0.333333	1713.333333	0.333333	true	true	true	true	4	true	true	false
"AKKUYU"	"Hi0p79TZ"	"https://lichess.org/Hi0p79TZ"	"DrJosch"	1500	2022	700.0	"1-0"	1.0	"white"	"2025.02.02"	"600+0"	"Normal"	83	"B01"	"Scandinavian Defense: Mieses-K…	"e2e4 d7d5 e4d5 d8d5 g1f3 c8g4 …	6	6	1936.5	2059.0	1500	2200	700	270.667878	99.666667	263.333333	396.372888	598.0	-491.0	700.0	0.833333	1688.333333	69.166667	true	true	true	true	4	true	true	false
"HI_I_AM_FROM_NIGERIA"	"gUIwmtFq"	"https://lichess.org/gUIwmtFq"	"OlapadeOluwanifemi77"	1500	2056	700.0	"0-1"	1.0	"black"	"2025.02.03"	"60+0"	"Time forfeit"	24	"C57"	"Italian Game: Two Knights Defe…	"e2e4 e7e5 g1f3 b8c6 f1c4 g8f6 …	5	3	2206.0	2257.0	1500	2508	1008	414.139469	190.0	276.666667	441.683144	570.0	-68.0	700.0	0.5	1800.8	35.8	true	true	true	true	4	true	true	false
"Sigma_Rayquaza"	"mo4kg3fs"	"https://lichess.org/mo4kg3fs"	"RoN_WeAsLeY_123"	1499	1914	700.0	"1-0"	1.0	"white"	"2025.02.01"	"120+0"	"Normal"	127	"B12"	"Caro-Kann Defense: Advance Var…	"e2e4 c7c6 g1f3 d7d5 e4e5 c8f5 …	4	4	2020.25	2180.5	1499	2221	722	348.426343	135.25	224.75	380.733831	541.0	-118.0	700.0	0.625	1861.5	119.75	true	true	true	true	4	true	true	false

The final candidate game list keeps the most relevant games for each suspected player:

games with rating gains above the global 99th percentile threshold;
the player’s own largest rating-gain game;
the player’s own largest rating-loss game.

The largest-loss game is retained only as context for rating volatility. For cheating follow-up, the most relevant records are usually the games whose rating gain exceeds the global threshold and each player’s own largest-gain game.

2.2.11 Large Rating Gain Baseline

large_rating_gain_games = (
    players_long
    .filter(pl.col("rating_diff").is_not_null())
    .filter(pl.col("rating_diff") > 0)
    .select([
        "player_id",
        "game_id",
        "site",
        "opponent_id",
        "elo",
        "opponent_elo",
        "rating_diff",
        "result",
        "score",
        "color",
        "utc_date",
        "time_control",
        "termination",
        "n_plies",
        "eco",
        "opening",
        "moves_uci",
    ])
    .sort("rating_diff", descending=True)
    .head(200)
    .collect()
)

large_rating_gain_games.head(100)

shape: (100, 17)

player_id	game_id	site	opponent_id	elo	opponent_elo	rating_diff	result	score	color	utc_date	time_control	termination	n_plies	eco	opening	moves_uci
str	str	str	str	i64	i64	f64	str	f64	str	str	str	str	i64	str	str	str
"CHESSKIDD99"	"ha3oBp7n"	"https://lichess.org/ha3oBp7n"	"calmboy1"	1500	1867	700.0	"1-0"	1.0	"white"	"2025.02.01"	"600+0"	"Normal"	15	"D55"	"Queen's Gambit Declined: Moder…	"d2d4 g8f6 c2c4 e7e6 g1f3 d7d5 …
"qdrainow"	"j4NBlhwt"	"https://lichess.org/j4NBlhwt"	"jiajia121127"	1500	2173	700.0	"1-0"	1.0	"white"	"2025.02.01"	"3600+30"	"Rules infraction"	39	"D11"	"Slav Defense: Modern Line"	"d2d4 d7d5 c2c4 c7c6 g1f3 g8f6 …
"deybi_mastercito"	"177NUOPx"	"https://lichess.org/177NUOPx"	"R5_new_player"	1500	1881	700.0	"1-0"	1.0	"white"	"2025.02.01"	"15+0"	"Time forfeit"	87	"A00"	"Saragossa Opening"	"c2c3 e7e6 d2d3 d7d5 c1d2 c7c5 …
"deybi_mastercito"	"LH1Lw3eP"	"https://lichess.org/LH1Lw3eP"	"jerson-20"	1500	2017	700.0	"1-0"	1.0	"white"	"2025.02.01"	"30+0"	"Time forfeit"	127	"A01"	"Nimzo-Larsen Attack: Modern Va…	"b2b3 e7e5 c1b2 d7d6 d2d4 g8e7 …
"attack_classical_ult"	"So4Y3oEl"	"https://lichess.org/So4Y3oEl"	"R5_new_player"	1500	1867	700.0	"1-0"	1.0	"white"	"2025.02.01"	"15+0"	"Normal"	49	"A13"	"English Opening: Agincourt Def…	"c2c4 e7e6 g2g3 d7d6 f1g2 c7c6 …
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
"BADPARENTINGFUNK"	"zRIF4m4C"	"https://lichess.org/zRIF4m4C"	"MarVaraM101"	1500	1909	700.0	"1-0"	1.0	"white"	"2025.02.02"	"60+0"	"Normal"	89	"D41"	"Queen's Gambit Declined: Semi-…	"d2d4 e7e6 c2c4 d7d5 b1c3 g8f6 …
"PikachyNC"	"Gcyad8HC"	"https://lichess.org/Gcyad8HC"	"Alan2029"	1450	2047	700.0	"1-0"	1.0	"white"	"2025.02.02"	"-"	"Normal"	79	"B01"	"Scandinavian Defense: Main Lin…	"e2e4 d7d5 e4d5 d8d5 b1c3 d5a5 …
"MINd_444"	"AVGL8Crw"	"https://lichess.org/AVGL8Crw"	"mehmetemin267"	1500	1935	700.0	"1-0"	1.0	"white"	"2025.02.02"	"15+0"	"Time forfeit"	53	"A00"	"Saragossa Opening"	"c2c3 g7g6 d2d4 f8g7 g1f3 g8f6 …
"ChEsSmAtE-2016"	"5doc69wA"	"https://lichess.org/5doc69wA"	"margareth"	1500	1887	700.0	"1-0"	1.0	"white"	"2025.02.02"	"900+10"	"Normal"	85	"E62"	"King's Indian Defense: Fianche…	"d2d4 g8f6 c2c4 g7g6 b1c3 f8g7 …
"Vctr0"	"3c91ZMfm"	"https://lichess.org/3c91ZMfm"	"ash_tray"	1500	1888	700.0	"1-0"	1.0	"white"	"2025.02.02"	"90+0"	"Time forfeit"	103	"B44"	"Sicilian Defense: Taimanov Var…	"e2e4 c7c5 g1f3 e7e6 d2d4 c5d4 …

This baseline table ignores player-level volatility and simply keeps the largest positive rating-gain games in the whole sample. It is useful as a simple comparison set for the later move-level model.

2.2.12 Inspect Normal-Looking Games for One Suspected Player

# Take the player in the second row (index 1) of the large-rating-gain baseline table
target_player = large_rating_gain_games["player_id"][1]

target_player_games_all = (
    players_long
    .filter(pl.col("player_id") == target_player)
    .filter(pl.col("rating_diff").is_not_null())
    .select([
        "player_id",
        "game_id",
        "site",
        "opponent_id",
        "elo",
        "opponent_elo",
        "rating_diff",
        "result",
        "score",
        "color",
        "utc_date",
        "time_control",
        "termination",
        "n_plies",
        "eco",
        "opening",
        "moves_uci",
    ])
    .sort("utc_date")
    .collect()
)

target_player_games_all

shape: (6, 17)

player_id	game_id	site	opponent_id	elo	opponent_elo	rating_diff	result	score	color	utc_date	time_control	termination	n_plies	eco	opening	moves_uci
str	str	str	str	i64	i64	f64	str	f64	str	str	str	str	i64	str	str	str
"qdrainow"	"j4NBlhwt"	"https://lichess.org/j4NBlhwt"	"jiajia121127"	1500	2173	700.0	"1-0"	1.0	"white"	"2025.02.01"	"3600+30"	"Rules infraction"	39	"D11"	"Slav Defense: Modern Line"	"d2d4 d7d5 c2c4 c7c6 g1f3 g8f6 …
"qdrainow"	"9VK9TjmT"	"https://lichess.org/9VK9TjmT"	"paulogar"	2179	1792	5.0	"0-1"	1.0	"black"	"2025.02.01"	"60+0"	"Time forfeit"	58	"B10"	"Caro-Kann Defense"	"e2e4 c7c6 b1c3 d7d5 e4d5 c6d5 …
"qdrainow"	"w7vyZ2uL"	"https://lichess.org/w7vyZ2uL"	"DG73"	2184	1809	4.0	"0-1"	1.0	"black"	"2025.02.01"	"60+0"	"Time forfeit"	74	"D10"	"Slav Defense"	"d2d4 c7c6 c2c4 d7d5 b1c3 d5c4 …
"qdrainow"	"Rd53HyM3"	"https://lichess.org/Rd53HyM3"	"hechtinger"	2173	1586	1.0	"1-0"	1.0	"white"	"2025.02.02"	"180+0"	"Normal"	29	"E04"	"Catalan Opening: Open Defense"	"d2d4 d7d5 c2c4 e7e6 g1f3 d5c4 …
"qdrainow"	"jlCaJo3Z"	"https://lichess.org/jlCaJo3Z"	"stockrdb"	2174	1616	-11.0	"0-1"	0.0	"white"	"2025.02.02"	"180+0"	"Abandoned"	0	"?"	"?"	""
"qdrainow"	"1sCkdlzE"	"https://lichess.org/1sCkdlzE"	"Alfazm"	2172	1779	1.0	"0-1"	1.0	"black"	"2025.02.02"	"180+0"	"Normal"	49	"B20"	"Sicilian Defense: Bowdler Atta…	"e2e4 c7c5 f1c4 e7e6 g1f3 b8c6 …

# Define "normal-looking" games for this player:
# 1. the player's Elo in the game is below 1900;
# 2. rating_diff is not extremely positive (below max_rating_diff_threshold);
# 3. game is not the player's largest rating gain;
# 4. game is not the player's largest rating loss.

target_player_max_gain = (
    target_player_games_all
    .select(pl.col("rating_diff").max())
    .item()
)

target_player_max_loss = (
    target_player_games_all
    .select(pl.col("rating_diff").min())
    .item()
)

target_player_normal_looking_games = (
    target_player_games_all
    .filter(pl.col("elo") < 1900)
    .filter(pl.col("rating_diff") < max_rating_diff_threshold)
    .filter(pl.col("rating_diff") < target_player_max_gain)
    .filter(pl.col("rating_diff") > target_player_max_loss)
    .sort("rating_diff", descending=True)
)

target_player_normal_looking_games.head(20)

shape: (0, 17)

player_id	game_id	site	opponent_id	elo	opponent_elo	rating_diff	result	score	color	utc_date	time_control	termination	n_plies	eco	opening	moves_uci
str	str	str	str	i64	i64	f64	str	f64	str	str	str	str	i64	str	str	str

No games pass all four filters for this player. The only game with Elo below 1900 is the 700-point gain, and that game is excluded because it is the player’s largest rating gain.

2.3 Trajectory State Space and FEN Normalization

This section describes the state representation used in our sampling framework. The trajectory representation is implemented through TrajectoryState in sampler.py, while the FEN normalization step is implemented through normalize_fen_for_policy() in utils.py.

A chess trajectory is represented as a fixed-length sequence of UCI moves. For a chosen analysis depth \(k\), we define \[ X = (m_1, m_2, \ldots, m_k)\,, \] where \(m_i\) denotes the move played at ply \(i\). In the implementation, TrajectoryState stores the starting FEN and the list of UCI moves. Replaying these moves reconstructs the sequence of board states \[ (s_0, s_1, \ldots, s_k)\,, \] where \(s_0\) is the initial position and \(s_i\) is the position after the first \(i\) moves. Therefore, move \(m_i\) is evaluated from state \(s_{i-1}\). This distinction is important because both the move proposal distribution and the target scoring function depend on the board state before the move is played.

For policy lookup and cached engine evaluation, we use a normalized FEN key, implemented by normalize_fen_for_policy() in utils.py. A full FEN string contains six components: piece placement, side to move, castling rights, en passant target square, halfmove clock, and fullmove number. We keep only the first four:

\[ \text{position key} = (\mathtt{piece~placement},\ \mathtt{side~to~move},\ \mathtt{castling~rights},\ \mathtt{en~passant~target})\,. \]

The halfmove clock and fullmove number are dropped because they usually do not affect the strategic identity of the decision state. Without this normalization, the same strategic position could be stored under several different FEN strings simply because the fullmove number or halfmove clock differs. With it, positions that share the same board configuration, side to move, castling rights, and en passant target are treated as the same state even if they occur at different move numbers. This reduces sparsity in position lookup and lets the proposal policy and engine cache group strategically equivalent positions more consistently.
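
As a minimal sketch of this normalization, the position key can be formed by keeping the first four whitespace-separated FEN fields. The function name normalize_fen_sketch is hypothetical and is not the actual body of normalize_fen_for_policy(), which is not reproduced here.

```python
def normalize_fen_sketch(fen: str) -> str:
    """Return a position key built from the first four FEN fields.

    Hypothetical stand-in for normalize_fen_for_policy(): it drops the
    halfmove clock and the fullmove number.
    """
    fields = fen.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 FEN fields, got {len(fields)}")
    return " ".join(fields[:4])


# The starting position, and the same position after 1. Nf3 Nf6 2. Ng1 Ng8.
fen_start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
fen_repeat = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 4 3"

# The full FEN strings differ, but they share one normalized key.
same_key = normalize_fen_sketch(fen_start) == normalize_fen_sketch(fen_repeat)
```

Using the key as a dictionary index means a cached policy or engine result for one occurrence of the position can be reused for the other.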

Overall, the trajectory state space is defined in sampler.py, while the FEN normalization logic is defined in utils.py. Together, they provide the basic state representation used by the Metropolis-Hastings sampler.

2.4 Skill-conditioned Human Move Proposal

The proposal distribution is designed to generate human-plausible moves rather than engine-optimal moves. The goal of the sampler is not to produce the strongest possible chess continuations, but to construct counterfactual trajectories that a human player of comparable skill could plausibly have played.

We use Maia-2 as the main proposal backend [TJM+24]. Given a board position \(s\), the active player’s Elo \(E_{\text{self}}\), and the opponent’s Elo \(E_{\text{oppo}}\), Maia-2 returns probabilities over candidate moves. We restrict these probabilities to legal moves, assign a small positive floor probability to any missing legal move, and renormalize over the legal move set. Thus, for each legal move \(m \in \mathcal{L}(s)\), the proposal model defines \[P_0(m \mid s, E_{\text{self}}, E_{\text{oppo}})\,.\]

This normalized probability is used whenever the Metropolis-Hastings sampler needs to draw a move from a position.
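
The restriction, flooring, and renormalization step can be sketched as follows. This is an illustration under stated assumptions: the function name renormalize_over_legal, the dictionary representation of the raw model output, and the floor value 1e-6 are all hypothetical and are not taken from the project code.

```python
def renormalize_over_legal(raw_probs, legal_moves, floor=1e-6):
    """Restrict raw move probabilities to legal moves and renormalize.

    raw_probs:   dict mapping UCI move -> raw model probability (assumed format)
    legal_moves: iterable of legal UCI moves in the position
    floor:       small positive probability for legal moves the model omitted
                 (the value here is an assumption)
    """
    legal_moves = list(legal_moves)
    if not legal_moves:
        raise ValueError("no legal moves in this position")
    # Illegal moves are dropped; missing or zero-probability legal moves get the floor.
    weights = {m: max(raw_probs.get(m, floor), floor) for m in legal_moves}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Toy example: the raw output contains one illegal move and omits one legal move.
raw = {"e2e4": 0.6, "d2d4": 0.3, "a1a8": 0.1}
legal = ["e2e4", "d2d4", "g1f3"]
probs = renormalize_over_legal(raw, legal)
```

Keeping every legal move strictly positive matters for the sampler: a move with zero proposal probability would make the corresponding log-probability undefined in the acceptance ratio.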

The skill conditioning is important for the interpretation of the counterfactual baseline. A uniform proposal over legal moves would generate many unrealistically weak continuations, while a Stockfish-based proposal would generate unrealistically strong continuations. Maia-2 instead provides a human move distribution conditioned on player strength, allowing the sampler to explore alternatives that are plausible for the target rating level.

In the implementation, this proposal interface is modular: legal_moves_with_probs(fen) returns legal UCI moves and their probabilities, and the same sampler can use any backend with this interface. Our final experiments use Maia2ProposalBackend; an empirical move-frequency backend is also included for preliminary testing and debugging.

The key modeling separation is that Maia-2 controls human plausibility, while the later Stockfish/CPL target controls engine-measured move quality. Maia-2 determines which alternatives are likely to be proposed, and the Metropolis-Hastings acceptance step reweights the resulting trajectories according to the target distribution.

2.5 Preliminary Validation with an Empirical Policy Target

Before using the engine-based target scorer, we first validated the sampler with a simpler empirical policy target. This validation step is implemented in sampler.py through the EmpiricalTargetScorer class. The purpose of this stage is not to perform the final anomaly test, but to check whether the trajectory replay, proposal generation, and Metropolis-Hastings acceptance mechanism behave as expected.

For a trajectory \(X=(m_1,\ldots,m_k)\), the empirical policy target assigns a log score based on the proposal probability of each move along the trajectory: \[ \log \pi_{\mathrm{emp}}(X) = \sum_{i=1}^{k} \log P_0(m_i \mid s_{i-1})\,. \]

Here, \(s_{i-1}\) is the board position before move \(m_i\) is played, and \(P_0(m_i \mid s_{i-1})\) is the proposal probability assigned to that move by the proposal backend. In the implementation, the scorer replays the trajectory to recover the sequence of FEN positions, calls legal_moves_with_probs(fen) at each position, looks up the probability of the played move, and sums the log probabilities across the trajectory.
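
The summation above can be sketched without a chess library by using a toy state. In this sketch the "position" is simply the tuple of moves played so far, and toy_policy is a hypothetical stand-in for legal_moves_with_probs(fen). The return format (a dictionary from move to probability) is an assumption, and the real scorer replays FEN positions instead.

```python
import math


def toy_policy(prefix):
    """Hypothetical proposal backend: the same three moves in every state."""
    return {"a": 0.5, "b": 0.25, "c": 0.25}


def empirical_log_score_sketch(moves, policy):
    """Sum log P_0(m_i | s_{i-1}) along a trajectory.

    Returns -inf if a move has no positive probability in its state.
    """
    prefix = ()
    total = 0.0
    for move in moves:
        p = policy(prefix).get(move, 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
        prefix = prefix + (move,)  # advance the toy state
    return total


score = empirical_log_score_sketch(("a", "b"), toy_policy)  # log(0.5) + log(0.25)
```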

This empirical target was used as a preliminary debugging device. Since it is defined directly from the proposal policy, it helps verify that the proposal backend returns valid legal moves, that the replayed board states match the move sequence, and that the sampler can compute proposal-based trajectory probabilities consistently. It also provides a simpler setting for checking the Metropolis-Hastings transition logic before introducing Stockfish-based centipawn loss calculations.

Importantly, this empirical policy target is not the final target distribution used for anomaly detection. It only measures how likely a trajectory is under the proposal model itself. A trajectory with high empirical policy probability is human-plausible under the proposal backend, but it is not necessarily strong or engine-aligned. Therefore, after validating the sampler mechanics, we replace this preliminary target with the engine-based centipawn loss target described in the next section.

Overall, this validation stage separates implementation checking from the final anomaly analysis. The empirical policy target confirms that the sampler can correctly replay trajectories, retrieve move probabilities, and run Metropolis-Hastings updates. The final anomaly test is then based on Stockfish-derived centipawn loss rather than on proposal likelihood alone.

2.6 Engine-based Centipawn Loss Target

After validating the sampler with the empirical policy target, we replace the preliminary target with an engine-based target scorer. This step is implemented mainly through the EngineTargetScorer class in sampler.py. The scorer is instantiated in run_sampler_new.py for the single-chain anomaly experiment and in run_four_chain_diagnostics.py and run_four_chain_diagnostics_mixed_kernel.py for the four-chain diagnostic experiment.

The purpose of this target is to evaluate the chess quality of each trajectory. The Maia-2 proposal model determines which moves are human-plausible, but it does not directly measure whether those moves are close to engine-optimal play. Therefore, we use Stockfish to compute centipawn loss (CPL) for each move in a trajectory. This separates the two roles in the framework: Maia-2 controls human-like proposal generation, while Stockfish provides the engine-based quality score.

Because the engine-based target requires access to a Stockfish executable, we first download and extract Stockfish 18. The following commands show the setup procedure, but they are not evaluated when rendering the QMD file.

#| label: stockfish-download
#| eval: false

# Download Stockfish 18
wget https://github.com/official-stockfish/Stockfish/releases/download/sf_18/stockfish-ubuntu-x86-64-avx2.tar

# Extract the tar archive
tar -xvf stockfish-ubuntu-x86-64-avx2.tar

# Find the absolute path to the executable file
# (the archive unpacks into a stockfish/ directory, matching the path exported below)
readlink -f stockfish/stockfish-ubuntu-x86-64-avx2

After obtaining the absolute path to the executable, we set the STOCKFISH_PATH environment variable before running the sampler.

#| label: stockfish-path
#| eval: false

export STOCKFISH_PATH=/home/z/zz429/opt/stockfish/stockfish/stockfish-ubuntu-x86-64-avx2

This environment variable tells the Python scripts where to find the Stockfish binary. In run_sampler_new.py, run_four_chain_diagnostics.py, and run_four_chain_diagnostics_mixed_kernel.py, the helper function resolve_stockfish_path() first checks whether STOCKFISH_PATH has been set. If it is not set, the function falls back to searching for stockfish on the system path using which stockfish. If neither option succeeds, the script raises an error. This design avoids hard-coding the engine path directly inside the Python source files. On the shared machine used for this project, collaborators with read access to the directory above can reuse the same Stockfish executable. On a different machine, the same code can still be used by setting STOCKFISH_PATH to the local Stockfish path.
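
The lookup order described above can be sketched as follows. This is not the project's resolve_stockfish_path() itself: the name resolve_stockfish_path_sketch and the optional env argument are hypothetical additions for illustration, and shutil.which is used here as a Python equivalent of running which stockfish.

```python
import os
import shutil


def resolve_stockfish_path_sketch(env=None):
    """Return a Stockfish path from STOCKFISH_PATH, else from the system PATH.

    env is an optional mapping used in place of os.environ (an assumption
    made here so the lookup can be exercised without changing the process
    environment).
    """
    env = os.environ if env is None else env
    path = env.get("STOCKFISH_PATH")
    if path:
        return path
    found = shutil.which("stockfish")  # Python counterpart of `which stockfish`
    if found:
        return found
    raise FileNotFoundError(
        "Stockfish not found: set STOCKFISH_PATH or put `stockfish` on PATH."
    )


example = resolve_stockfish_path_sketch({"STOCKFISH_PATH": "/opt/sf/stockfish"})
```

The example path /opt/sf/stockfish is a placeholder. The sketch returns the variable's value without checking that the file exists.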

For a trajectory \[ X = (m_1, m_2, \ldots, m_k)\,, \]

the engine scorer evaluates each move \(m_i\) from the board state \(s_{i-1}\), where \(s_{i-1}\) is the position before the move is played. For each position, Stockfish evaluates candidate moves and returns centipawn scores. The implementation converts the engine score into the side-to-move perspective, so that a larger score always means a better position for the player who is about to move.

The centipawn loss of move \(m_i\) is defined as \[ \mathrm{CPL}(m_i) = \max(0, \mathrm{score}_\text{best}(s_{i-1}) - \mathrm{score}_\text{played}(s_{i-1}, m_i))\,. \]

Here, \(\mathrm{score}_\text{best}(s_{i-1})\) is the Stockfish score of the best candidate move from position \(s_{i-1}\), and \(\mathrm{score}_\text{played}(s_{i-1}, m_i)\) is the Stockfish score of the move actually taken in the trajectory. A lower \(\mathrm{CPL}\) means that the move is closer to the engine-preferred move. A \(\mathrm{CPL}\) of zero means that the move is at least as good as the best evaluated candidate under the engine search settings.

The total centipawn loss of a trajectory is then computed as the sum of move-level losses: \[ \mathrm{CPL}(X) = \sum_{i=1}^{k} \mathrm{CPL}(m_i)\,. \]

The engine-based target log-density is defined as \[ \log \pi_{\mathrm{engine}}(X) = -\beta\, \mathrm{CPL}(X)\,. \]

The parameter \(\beta\) controls how strongly the target distribution favors low-CPL trajectories. Since the target uses a negative sign, trajectories with smaller total CPL receive higher target density. In other words, the sampler is encouraged to visit trajectories that are stronger according to Stockfish, while the proposal distribution still ensures that candidate moves are generated from a human-like Maia-2 policy.
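
A small worked example makes the three definitions concrete. The scores and the value \(\beta = 0.01\) below are made up for illustration and are not the settings used in our experiments. The function names are hypothetical.

```python
def move_cpl(score_best, score_played):
    """Centipawn loss of one move, both scores from the mover's perspective."""
    return max(0, score_best - score_played)


def total_cpl(score_pairs):
    """Sum of move-level CPL over (score_best, score_played) pairs."""
    return sum(move_cpl(best, played) for best, played in score_pairs)


def log_target(score_pairs, beta):
    """Engine-based target log-density: -beta * CPL(X)."""
    return -beta * total_cpl(score_pairs)


# Three moves: the best move, a 30-centipawn inaccuracy, and a move that
# scored above the cached best (clipped to zero loss).
pairs = [(35, 35), (50, 20), (10, 15)]
cpl_x = total_cpl(pairs)               # 0 + 30 + 0 = 30
log_pi_x = log_target(pairs, beta=0.01)  # -0.01 * 30 = -0.3
```

With this form, two trajectories whose total CPL differs by \(\Delta\) centipawns have a target density ratio of \(e^{-\beta \Delta}\), so \(\beta\) sets the scale on which CPL differences matter.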

The implementation also includes several details to make the scoring process more reliable and efficient. For each position, Stockfish first evaluates a set of top candidate moves using MultiPV. These candidate evaluations are cached using the normalized FEN key, so repeated visits to the same position do not always require a new engine call. If the move in the trajectory is not included among the MultiPV candidates, the scorer evaluates that move separately by forcing it as the root move. This ensures that CPL is computed for the actual trajectory move, not only for the moves that Stockfish initially selected as top candidates.

In fair-play analysis, we also score only the suspect player’s moves. For example, if the suspect is playing Black, CPL is accumulated only on Black’s plies, so the sum defining \(\mathrm{CPL}(X)\) runs over those plies only. This makes the anomaly score reflect the investigated player’s decisions rather than both players’ combined move quality.

To make scoring reproducible, the implementation uses a deterministic cache. For each normalized FEN key, the scorer stores the best score from the initial MultiPV search separately from move-specific scores indexed by (position_key, move_uci). If a played or proposed move is not included in the MultiPV candidates, Stockfish evaluates it as a forced root move, but this only updates the move-level cache; it does not change the cached best score for that position. Without this separation, a later forced-move evaluation could be appended to the same position cache and accidentally change the “best” score used in future CPL calculations, so the total CPL of a fixed trajectory could change as the sampler evaluates additional moves. With the deterministic scorer, the best evaluated candidate for a position is fixed once the position is first scored, and subsequent forced-move evaluations only provide the score of the specific move needed for CPL computation.
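
The separation between the position-level best score and the move-level scores can be sketched with two dictionaries. This is an illustration, not the EngineTargetScorer code: the class name and the two evaluator callables are hypothetical stubs standing in for the MultiPV search and the forced root-move search.

```python
class DeterministicScoreCacheSketch:
    """Keep the best score per position separate from per-move scores."""

    def __init__(self, multipv_eval, forced_eval):
        self._multipv_eval = multipv_eval  # position_key -> {move_uci: score}
        self._forced_eval = forced_eval    # (position_key, move_uci) -> score
        self.best_score = {}               # position_key -> best MultiPV score
        self.move_score = {}               # (position_key, move_uci) -> score

    def cpl(self, position_key, move_uci):
        if position_key not in self.best_score:
            candidates = self._multipv_eval(position_key)
            # Fixed once, at the first visit to this position.
            self.best_score[position_key] = max(candidates.values())
            for move, score in candidates.items():
                self.move_score[(position_key, move)] = score
        key = (position_key, move_uci)
        if key not in self.move_score:
            # Forced evaluation updates only the move-level cache.
            self.move_score[key] = self._forced_eval(position_key, move_uci)
        return max(0, self.best_score[position_key] - self.move_score[key])


cache = DeterministicScoreCacheSketch(
    multipv_eval=lambda pos: {"e2e4": 40, "d2d4": 35},
    forced_eval=lambda pos, move: 60,  # stub: forced move scores above the best
)
before = cache.cpl("start", "d2d4")
forced = cache.cpl("start", "g1f3")  # not a MultiPV candidate
after = cache.cpl("start", "d2d4")
```

In this toy run the forced move scores higher than the cached best, yet the CPL of d2d4 is the same before and after, which is the reproducibility property the deterministic cache is meant to provide.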

Thus, the implemented CPL calculation is \[ \mathrm{CPL}(m_i) = \max\{0,\; \mathrm{score}_{\mathrm{best}}(s_{i-1}) - \mathrm{score}_{\mathrm{played}}(s_{i-1},m_i)\}\,, \] where both scores are measured from the side-to-move perspective, and where \(\mathrm{score}_{\mathrm{best}}(s_{i-1})\) is not changed by later forced evaluations of additional moves. In the diagnostic scripts, the observed trajectory’s total CPL is also computed once before sampling and reused when computing empirical p-values. This ensures that the observed baseline remains fixed throughout the experiment.
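
For illustration, an empirical p-value of the kind mentioned above can be computed from the fixed observed CPL and the sampled total CPL values. The exact formula used in the diagnostic scripts is not shown in this section. The sketch below assumes one common convention: count the sampled trajectories that are at least as strong as the observed one (total CPL no larger), with a +1 correction in numerator and denominator. The function name is hypothetical.

```python
def empirical_p_value_sketch(observed_cpl, sampled_cpls):
    """Share of sampled trajectories with total CPL <= the observed total CPL.

    The +1 terms (an assumed convention) keep the estimate strictly positive.
    """
    sampled_cpls = list(sampled_cpls)
    n_as_strong = sum(1 for c in sampled_cpls if c <= observed_cpl)
    return (1 + n_as_strong) / (1 + len(sampled_cpls))


# Made-up numbers: one of four sampled trajectories is as strong as the
# observed line.
p = empirical_p_value_sketch(10, [5, 20, 30, 40])  # (1 + 1) / (1 + 4)
```

Under this convention a small p-value means that few sampled human-plausible trajectories match the observed move quality.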

This engine-based target is the final target distribution used in the anomaly detection framework. The empirical policy target described earlier is only used to validate the sampler mechanics. In the final pipeline, Maia-2 proposes human-plausible counterfactual trajectories, Stockfish assigns each trajectory a total CPL, and the Metropolis-Hastings sampler combines these two components through the proposal probability and the engine-based target density.

2.7 Vanilla Prefix-preserving Kernel

After defining the human-like proposal distribution and the engine-based target density, we combine them using a Metropolis-Hastings sampler. This step is implemented through the MHSampler class in sampler.py. The sampler operates on fixed-length chess trajectories rather than isolated board positions. Given a current trajectory \[ X = (m_1, m_2, \ldots, m_k)\,, \]

the sampler proposes a new trajectory \(Y\) of the same length and then accepts or rejects it according to the Metropolis-Hastings acceptance rule.

The proposal mechanism is prefix-preserving. At each iteration, the sampler first chooses a perturbation depth \(d\) uniformly from the \(k\) trajectory positions \(\{1,\ldots,k\}\). The moves before depth \(d\) are kept unchanged: \[ Y_{1:d-1} = X_{1:d-1}\,. \]

At the perturbation depth, the sampler forces a move different from the original move \(m_d\). The alternative move is sampled from the proposal distribution \(P_0\), restricted to legal moves that are not equal to the original move. This restriction ensures that the proposed trajectory actually differs from the current trajectory at the selected depth.

After the new move at depth \(d\) is selected, the sampler replays the new prefix on a chess board and then resamples the remaining suffix autoregressively. For each later position, the sampler calls legal_moves_with_probs(fen) from the proposal backend and draws the next move using the returned proposal probabilities. Thus, the proposed trajectory is generated as \[ Y = (m_1, \ldots, m_{d-1}, m'_d, m'_{d+1}, \ldots, m'_k), \]

where the prefix before \(d\) is preserved, the move at \(d\) is perturbed, and the suffix after \(d\) is newly sampled from the human-like proposal model.

This proposal design is useful for chess trajectory sampling because a single move change can make the rest of the original game illegal or strategically inconsistent. Instead of changing one move and keeping the original suffix, the sampler rolls forward from the new position and regenerates all later moves. This guarantees that the proposed trajectory remains a valid legal chess sequence under the new prefix.
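
The proposal steps described above can be sketched on a toy state space. This is not the MHSampler code: the "position" here is just the tuple of moves played so far, toy_policy is a hypothetical stand-in for legal_moves_with_probs(fen), and the depth is 0-based rather than 1-based.

```python
import random


def toy_policy(prefix):
    """Hypothetical proposal backend: the same three moves in every state."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}


def sample_move(probs, rng):
    """Draw one move with probability proportional to its weight."""
    moves = list(probs)
    return rng.choices(moves, weights=[probs[m] for m in moves], k=1)[0]


def propose_prefix_preserving(current, policy, rng):
    """Return (proposal, d), or None if no valid proposal can be built."""
    k = len(current)
    d = rng.randrange(k)                       # uniform perturbation depth
    proposal = list(current[:d])               # keep the prefix unchanged
    probs = policy(tuple(proposal))
    alternatives = {m: p for m, p in probs.items() if m != current[d]}
    if not alternatives:
        return None                            # no different move available
    proposal.append(sample_move(alternatives, rng))  # forced change at depth d
    while len(proposal) < k:                   # resample the suffix move by move
        probs = policy(tuple(proposal))
        if not probs:
            return None                        # ran out of moves before depth k
        proposal.append(sample_move(probs, rng))
    return tuple(proposal), d


rng = random.Random(0)
current = ("a", "b", "c", "a")
proposal, depth = propose_prefix_preserving(current, toy_policy, rng)
```

Passing unnormalized weights to sample_move is enough to restrict the draw to alternatives, because random.Random.choices samples proportionally to the weights.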

The sampler also computes the exact proposal probability for each transition. The forward proposal probability \(q(Y \mid X)\) includes three components: the probability of choosing the perturbation depth, the probability of choosing the alternative move at that depth, and the product of proposal probabilities for the resampled suffix. In log form, this can be written as \[ \log q(Y \mid X) = -\log k + \log P_0(m'_d \mid s_{d-1}, m'_d \ne m_d) + \sum_{i=d+1}^{k} \log P_0(m'_i \mid s'_{i-1})\,. \]

Here, \(s_{d-1}\) is the board position before the perturbed move, and \(s'_{i-1}\) denotes the board position along the newly proposed trajectory. The term \(-\log k\) comes from selecting the perturbation depth uniformly among \(k\) positions.

Because this proposal distribution is not symmetric, the sampler must also compute the reverse proposal probability \(q(X \mid Y)\). This is implemented in the log_q() method in sampler.py, which identifies the first depth at which the two trajectories differ and evaluates the exact probability of proposing one trajectory from the other. If the transition is impossible, for example because the trajectories have different lengths or an illegal move appears, the reverse log-probability is returned as negative infinity.
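
A minimal sketch of this calculation is given below on a toy state space, where the "position" is the tuple of moves played so far and toy_policy is a hypothetical stand-in for the proposal backend. It is not the log_q() method itself. The sketch reads the conditional term \(P_0(m'_d \mid s_{d-1}, m'_d \ne m_d)\) as \(P_0(m'_d \mid s_{d-1}) / (1 - P_0(m_d \mid s_{d-1}))\), which is our interpretation of the restriction to alternative moves.

```python
import math


def toy_policy(prefix):
    """Hypothetical proposal backend: the same three moves in every state."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}


def log_q_sketch(source, target, policy):
    """log q(target | source) under the prefix-preserving proposal."""
    k = len(source)
    if len(target) != k or tuple(source) == tuple(target):
        return -math.inf                       # impossible transition
    # Only the first differing depth can have produced `target`.
    d = next(i for i in range(k) if source[i] != target[i])
    probs = policy(tuple(target[:d]))
    p_new = probs.get(target[d], 0.0)
    p_old = probs.get(source[d], 0.0)
    if p_new <= 0.0 or p_old >= 1.0:
        return -math.inf
    total = -math.log(k) + math.log(p_new) - math.log(1.0 - p_old)
    for i in range(d + 1, k):                  # resampled suffix
        p = policy(tuple(target[:i])).get(target[i], 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total


x = ("a", "a")
y = ("b", "c")
forward = log_q_sketch(x, y, toy_policy)   # log(0.3 / 0.5 * 0.2 / 2)
reverse = log_q_sketch(y, x, toy_policy)   # log(0.5 / 0.7 * 0.5 / 2)
```

The forward and reverse values differ in this example, which illustrates why the acceptance ratio needs the proposal correction.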

The Metropolis-Hastings acceptance probability is then computed using both the target density and the proposal correction: \[ \log \alpha = \min \left[ 0,\, \log \pi(Y) - \log \pi(X) + \log q(X \mid Y) - \log q(Y \mid X) \right]\,. \]

The proposed trajectory is accepted with probability \(\alpha\). If it is accepted, the Markov chain moves to \(Y\); otherwise, it remains at \(X\). In the final engine-based version, \(\log \pi(\cdot)\) is given by the Stockfish CPL target. Therefore, the sampler tends to accept trajectories with lower total CPL, while still correcting for the fact that candidate moves are proposed from the Maia-2 human move distribution.
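The acceptance rule above can be written directly in log space. The sketch below is illustrative only: the function names are hypothetical, and the four log values are assumed to have been computed elsewhere (the target from the Stockfish CPL scorer, the proposal terms from the Maia-2 probabilities).

```python
import math
import random


def log_accept_ratio(log_pi_x, log_pi_y, log_q_forward, log_q_reverse):
    """Return log(alpha) for a move X -> Y.

    log_q_forward is log q(Y | X); log_q_reverse is log q(X | Y).
    """
    if log_pi_y == -math.inf or log_q_reverse == -math.inf:
        return -math.inf  # zero target mass or irreversible move: always reject
    return min(0.0, log_pi_y - log_pi_x + log_q_reverse - log_q_forward)


def mh_accept(log_alpha, rng):
    """Accept with probability alpha = exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)


rng = random.Random(0)
log_alpha = log_accept_ratio(-5.0, -5.0, -1.0, -3.0)  # equal targets, harder reverse move
accepted = mh_accept(log_alpha, rng)
```

Working with \(\log\alpha\) and exponentiating only at the final comparison avoids underflow when trajectory probabilities are products of many small terms.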

The implementation also handles invalid proposal paths. If resampling the suffix reaches a game-over position before the fixed trajectory length is completed, the proposal is rejected rather than treated as a valid sample. This preserves the fixed-depth state space used throughout the anomaly detection pipeline.

Overall, the prefix-preserving Metropolis-Hastings sampler provides the mechanism for generating counterfactual human-like trajectories. Maia-2 defines the proposal probabilities, Stockfish defines the target density through total CPL, and the Metropolis-Hastings acceptance rule combines both components to produce samples from the desired trajectory distribution.

2.8 Mixture Refresh Kernel

In addition to the vanilla prefix-preserving kernel, we also experimented with a mixture refresh kernel to improve exploration. The motivation is that the prefix-preserving proposal is local: it changes one move and then resamples the suffix, but it keeps the prefix before the perturbation depth fixed. In a large discrete trajectory space, this can cause chains to remain trapped in local trajectory neighborhoods.

The mixture kernel combines two proposal mechanisms. With probability \(1-\rho\), it uses the prefix-preserving proposal described above. With probability \(\rho\), it performs a full refresh by resampling an entire length-\(k\) trajectory from the Maia-2 proposal distribution starting at the same initial position \(S_0\). The full-refresh proposal density is \[ q_{\mathrm{refresh}}(Y) = \prod_{i=1}^{k} P_0(y_i \mid y_{1:i-1}, S_0)\,, \] where the board state before each move is obtained by replaying the previously sampled moves.

The resulting mixture proposal is \[ q_{\mathrm{mix}}(Y \mid X) = (1-\rho)q_{\mathrm{prefix}}(Y \mid X) + \rho q_{\mathrm{refresh}}(Y)\,. \]

In the 80-20 version of this kernel, we set \(\rho=0.20\), so that 80% of proposals are local prefix-preserving proposals and 20% are full trajectory refreshes. The vanilla kernel is recovered as the special case \(\rho=0\).

Because the proposal is a mixture, the Metropolis-Hastings correction must use the full mixture density rather than only the proposal branch that happened to generate the candidate trajectory. Therefore, the acceptance probability is \[ \alpha(X \to Y) = \min\left\{ 1,\, \frac{\pi(Y)}{\pi(X)} \frac{q_{\mathrm{mix}}(X\mid Y)} {q_{\mathrm{mix}}(Y\mid X)} \right\}\,. \]

In the implementation, the forward and reverse mixture probabilities are computed using a log-sum-exp calculation for numerical stability. This preserves the correct Metropolis-Hastings correction while allowing occasional global moves in trajectory space.
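A minimal sketch of the log-sum-exp step, assuming the two branch log-densities are already available as floats. The function name `log_q_mix` is hypothetical and is not taken from the project code.

```python
import math


def log_q_mix(log_q_prefix: float, log_q_refresh: float, rho: float) -> float:
    """log[(1 - rho) * q_prefix + rho * q_refresh], computed stably."""
    terms = []
    if rho < 1.0 and log_q_prefix > -math.inf:
        terms.append(math.log1p(-rho) + log_q_prefix)
    if rho > 0.0 and log_q_refresh > -math.inf:
        terms.append(math.log(rho) + log_q_refresh)
    if not terms:
        return -math.inf  # neither branch can generate the candidate
    m = max(terms)
    # Subtract the maximum before exponentiating to avoid underflow.
    return m + math.log(sum(math.exp(t - m) for t in terms))


# 80-20 mixture: 0.8 * 0.1 + 0.2 * 0.01 = 0.082
value = log_q_mix(math.log(0.1), math.log(0.01), rho=0.20)
```

Note that when the prefix-preserving branch cannot produce a transition (log-density of negative infinity), the mixture density stays finite for any \(\rho>0\), because the refresh branch can reach every trajectory that Maia-2 assigns positive probability.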

2.9 Counterfactual Baseline and Empirical Anomaly Score

After running the sampler, the retained trajectories form a counterfactual human baseline. Each sampled trajectory is a legal length-\(k\) sequence from the same starting position as the observed game. These are not observed games from the database; they are alternative continuations generated by perturbing the observed trajectory and rolling forward under the Maia-2 proposal model.

For each sampled trajectory \(X^{(j)}\), we compute total suspect-side CPL: \[ \mathrm{CPL}(X^{(j)})=\sum_{i=1}^{k}\mathrm{CPL}(m_i^{(j)})\,, \] where the sum includes only the investigated player’s moves. The collection \[ \{\mathrm{CPL}(X^{(1)}),\ldots,\mathrm{CPL}(X^{(N)})\} \] is the empirical null distribution for that analysis window.
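As a small illustration of the suspect-side sum, the sketch below assumes a list of per-ply CPL values for one window and assumes that White moves first in the window unless stated otherwise. The function name, the `white_moves_first` parameter, and the CPL values are made up for the example.

```python
from typing import Sequence


def suspect_total_cpl(
    ply_cpl: Sequence[float],
    suspect_is_white: bool,
    white_moves_first: bool = True,
) -> float:
    """Sum CPL over the investigated player's plies only."""
    # The suspect owns plies 0, 2, 4, ... if they move first, else 1, 3, 5, ...
    start = 0 if suspect_is_white == white_moves_first else 1
    return sum(ply_cpl[start::2])


window_cpl = [3, 50, 0, 20, 3, 10]  # made-up per-ply values for a 6-ply window
white_total = suspect_total_cpl(window_cpl, suspect_is_white=True)   # 3 + 0 + 3
black_total = suspect_total_cpl(window_cpl, suspect_is_white=False)  # 50 + 20 + 10
```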

The observed trajectory \(X^{\mathrm{obs}}\) is then compared with this null distribution. Since lower CPL means stronger, more engine-aligned play, the empirical anomaly p-value is \[ p= \frac{ 1+\sum_{j=1}^{N}\mathbb{1}\{\mathrm{CPL}(X^{(j)})\leq \mathrm{CPL}(X^{\mathrm{obs}})\} }{ 1+N }\,. \] The smoothing term prevents the p-value from being exactly zero when no sampled trajectory matches or outperforms the observed line.

A small p-value means that very few sampled human-plausible trajectories achieved CPL as low as the observed trajectory. We therefore interpret small p-values as evidence that the observed line is unusually engine-aligned relative to the sampled counterfactual baseline. However, this interpretation is only reliable if the MCMC chains mix well, which motivates the multi-chain diagnostics described below.
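The smoothed empirical p-value defined above takes only a few lines. This sketch assumes the sampled CPL totals are available as a plain list; the function name is hypothetical.

```python
from typing import Sequence


def empirical_p_value(null_cpl: Sequence[float], observed_cpl: float) -> float:
    """Smoothed fraction of sampled trajectories with CPL <= the observed CPL."""
    n = len(null_cpl)
    at_least_as_strong = sum(1 for c in null_cpl if c <= observed_cpl)
    return (1 + at_least_as_strong) / (1 + n)


# Toy example: two of four sampled totals are <= 20, so p = (1 + 2) / (1 + 4).
p = empirical_p_value([10, 20, 30, 40], observed_cpl=20)
```

Because of the smoothing term, the smallest attainable value is \(1/(1+N)\), so the number of retained samples sets the resolution of the test.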

2.10 Multi-chain Convergence Diagnostics

A single MCMC chain can produce a plausible-looking p-value even if it has not explored the relevant trajectory space. This is especially problematic here because the state is a full chess move sequence, not a one-dimensional score. We therefore run four independent chains through run_four_chain_diagnostics.py and run_four_chain_diagnostics_mixed_kernel.py for the same analysis window and compare both their scalar summaries and their sampled trajectory states.

For each chain, we record the acceptance rate, number of unique trajectories, mean and standard deviation of sampled CPL, and chain-specific empirical p-value. If these quantities vary substantially across chains, the pooled p-value may depend too much on initialization or random seed.

As a scalar diagnostic, we compute split-\(\hat{R}\) for total CPL and log target density [GR92]. Values close to 1 suggest that between-chain and within-chain variation are similar for those scalar summaries. However, scalar convergence is not enough: two chains may have similar CPL distributions while visiting different move-sequence regions.
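For reference, one common way to compute split-\(\hat{R}\) for a scalar such as total CPL is sketched below using only the standard library. It follows the usual split-chain construction (halve each chain, then compare between-half and within-half variance) and is not claimed to match the project's diagnostic script line for line.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def split_rhat(chains: Sequence[Sequence[float]]) -> float:
    """Split-R-hat for one scalar summary recorded along several chains."""
    n = min(len(c) for c in chains) // 2  # length of each half-chain
    if n < 2:
        raise ValueError("each chain needs at least 4 draws")
    halves: List[List[float]] = []
    for c in chains:
        c = list(c)
        halves.append(c[:n])
        halves.append(c[n:2 * n])
    half_means = [mean(h) for h in halves]
    within = mean([variance(h) for h in halves])  # W
    between_over_n = variance(half_means)         # B / n
    if within == 0.0:
        # Every half-chain is constant: R-hat is undefined or infinite.
        return math.nan if between_over_n == 0.0 else math.inf
    var_plus = (n - 1) / n * within + between_over_n
    return math.sqrt(var_plus / within)


# Two chains stuck around different CPL levels give a value far above 1.
rhat_bad = split_rhat([[0, 1] * 50, [10, 11] * 50])
```

A chain that never leaves one CPL value has zero within-chain variance, which is why the sketch handles that case explicitly rather than dividing by zero.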

To assess convergence in the trajectory space itself, we use PACE [VS17]. PACE partitions the sampled state space and compares the probability mass that each chain assigns to each partition cell. We use two trajectory-level partitions.

The exact-state partition treats two trajectories as the same only if their full UCI move sequences are identical: \[ d_{\mathrm{exact}}(X,Y)= \begin{cases} 0,&X=Y,\\ 1,&X\neq Y\,. \end{cases} \]

This is a strict diagnostic: it asks whether chains visit the same exact trajectories with similar frequencies.

Because exact matching can be too strict in a sparse discrete space, we also use a trajectory-medoid partition. Frequent sampled trajectories are selected as medoids, and each sample is assigned to its nearest medoid using normalized Hamming distance. For \(X=(m_1,\ldots,m_k)\) and \(Y=(y_1,\ldots,y_k)\), \[ d_{\mathrm{Ham}}(X,Y) = \frac{1}{k}\sum_{i=1}^{k}\mathbb{1}\{m_i\neq y_i\}\,. \]

This coarser diagnostic asks whether chains explore similar neighborhoods of trajectory space, even if they do not repeatedly sample the same exact move sequences.
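The partition step can be sketched as follows. This covers only the distance, the medoid choice, and the per-chain cell frequencies that a partition-based diagnostic compares; it does not implement the PACE estimator \(\widehat{\delta}\) itself. The function names and the tie-breaking rule (lowest medoid index wins) are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Trajectory = Tuple[str, ...]


def hamming(x: Trajectory, y: Trajectory) -> float:
    """Normalized Hamming distance between two equal-length move sequences."""
    if len(x) != len(y):
        raise ValueError("trajectories must have the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def pick_medoids(samples: Sequence[Trajectory], n_medoids: int) -> List[Trajectory]:
    """Use the most frequently sampled trajectories as medoids."""
    return [t for t, _ in Counter(samples).most_common(n_medoids)]


def cell_frequencies(chain: Sequence[Trajectory], medoids: List[Trajectory]) -> List[float]:
    """Fraction of one chain's samples assigned to each medoid cell."""
    counts = Counter(
        min(range(len(medoids)), key=lambda j: hamming(t, medoids[j])) for t in chain
    )
    return [counts[j] / len(chain) for j in range(len(medoids))]


A = ("e2e4", "e7e5", "g1f3")
B = ("d2d4", "d7d5", "c2c4")
C = ("e2e4", "e7e5", "f1c4")  # differs from A in one of three plies
medoids = pick_medoids([A, A, A, B, B, C], n_medoids=2)
freqs = cell_frequencies([A, C, B, A], medoids)
```

Chains that explore similar neighborhoods should produce similar frequency vectors over the same medoid cells, even when they rarely repeat an exact sequence.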

Both PACE versions are applied directly to sampled move-sequence states, not to scalar observables such as total CPL or log target density.

3. API Design

To make the anomaly detection framework easier to run and reproduce, we wrap the sampling and diagnostic components in a FastAPI application. The API layer is implemented in main.py, which serves as the entry point for external requests. It defines a structured request schema containing the player identifier, player Elo, UCI move sequence, analysis depth, suspected player color, and an optional demo mode. If a demo mode is selected, the API loads one of the predefined example games from the demo data folder. Otherwise, it uses the move sequence and metadata supplied directly by the user. The API exposes two main endpoints: /api/detect_anomaly for the main anomaly detection task and /api/diagnostics for convergence and stability diagnostics.

The main anomaly detection logic is implemented in api_run_sampler.py. Given a candidate player and a supplied move sequence, the function initializes a Maia-2 proposal backend using the player’s Elo rating. This proposal model represents plausible human move choices for a player at a similar rating level. The function also initializes a Stockfish-based target scorer, which evaluates each trajectory using centipawn loss. Together, these two components define the Metropolis-Hastings sampler: Maia-2 proposes human-like counterfactual trajectories, while the Stockfish-based scorer assigns each trajectory a target score based on move quality.

The API evaluates one fixed-length move window at a time. The supplied moves_uci sequence is sliced to length k_depth, so that the observed trajectory and all MCMC-generated counterfactual trajectories have the same number of moves. If the input sequence corresponds to the beginning of a full game, the API evaluates the first k_depth moves. If the input sequence has already been selected by an upstream screening step as a suspicious segment, then the API evaluates that selected segment.
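A client-side sketch of the windowing step is shown below. It is an assumption-laden illustration rather than the API's own code: the function name `select_window` is hypothetical, the regular expression checks only the textual form of a UCI move (not its legality on the board), and whether the real endpoint rejects too-short input in this way is not confirmed by the source.

```python
import re
from typing import List, Sequence

# Textual form of a UCI move: from-square, to-square, optional promotion piece.
UCI_MOVE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def select_window(moves_uci: Sequence[str], k_depth: int) -> List[str]:
    """Return the first k_depth moves after basic format checks."""
    if k_depth <= 0:
        raise ValueError("k_depth must be positive")
    if len(moves_uci) < k_depth:
        raise ValueError("fewer than k_depth moves supplied")
    window = list(moves_uci[:k_depth])
    bad = [m for m in window if not UCI_MOVE.match(m)]
    if bad:
        raise ValueError(f"not in UCI format: {bad}")
    return window


moves = ["e2e4", "c7c5", "g1f3", "b8c6", "d2d4", "c5d4", "f3d4", "e7e5"]
window = select_window(moves, k_depth=6)
```

Validating the window before sending a request avoids spending several minutes of sampling time on input that cannot be replayed.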

The detection endpoint returns an anomaly detection summary, including the observed window’s total centipawn loss, the baseline mean centipawn loss from the sampled trajectories, the empirical p-value, and a final verdict. The empirical p-value is computed as the proportion of sampled trajectories with centipawn loss less than or equal to the observed trajectory, with a small smoothing adjustment. A very small p-value means that the observed window has lower centipawn loss than almost all simulated human-like trajectories, making it unusually strong relative to the Elo-conditioned human baseline. When this p-value falls below the chosen threshold, the endpoint returns an anomaly flag.

The diagnostic endpoint is implemented separately in api_run_diagnostics.py. While the main detection endpoint reports the anomaly result from one MCMC run, the diagnostic module evaluates whether the result is stable across multiple Markov chains. The API diagnostic module is adapted from the earlier mixed-kernel diagnostic script, preserving the same core workflow while refactoring it into a function that can be called by the API. For API execution, the sampling settings are reduced to N_STEPS = 200 and BURN_IN = 50 to lower computational cost, since the diagnostic endpoint runs four independent chains and each MCMC step requires Stockfish-based trajectory scoring. These settings make the endpoint more feasible for interactive use.
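As a rough worked example of what these settings imply, assume that every post-burn-in draw is retained without thinning (an assumption, not something stated by the script). Each chain then contributes \(N = 200 - 50 = 150\) samples, and the smoothed empirical p-value from Section 2.9 cannot fall below \[ p_{\min}^{\text{chain}} = \frac{1+0}{1+150} \approx 0.0066, \qquad p_{\min}^{\text{pooled}} = \frac{1+0}{1+4\cdot 150} \approx 0.0017\,. \] Under a \(p<0.01\) rule, a single chain at these settings can therefore be flagged only when no sampled trajectory matches or beats the observed CPL, since even one such trajectory gives \(2/151 \approx 0.0132\).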

In addition to chain-level summaries, the diagnostic module reports convergence checks. It summarizes each chain using acceptance rate, number of unique sampled states, mean baseline centipawn loss, standard deviation of baseline centipawn loss, and chain-level empirical p-value. It also computes pooled anomaly statistics across chains. Split R-hat is computed for scalar quantities such as total centipawn loss and log target density, while PACE is applied to sampled trajectory states to assess whether different chains are exploring similar regions of the discrete trajectory space. These diagnostics help distinguish a genuinely unusual observed game window from an unstable result caused by poor chain mixing or chain-specific sampling behavior.

Overall, the API separates user-facing execution from the statistical computation. main.py handles request parsing, demo loading, and endpoint routing. api_run_sampler.py performs the main fixed-window anomaly test, and api_run_diagnostics.py provides multi-chain reliability checks. This modular design improves reproducibility and maintainability, because the interface layer, the anomaly detection procedure, and the diagnostic workflow can be developed and tested separately.

3.1 Deployment

docker build -t chess-anomaly-api .
docker run -d -p 8000:8000 chess-anomaly-api

From an implementation perspective, the API was also adjusted to fit the resource constraints of the VM environment. Because our VM had limited storage, we disabled GPU execution and ran the Maia-2 proposal backend on CPU. To make the dependency stack installable within the VM constraints, we also downgraded the Python version specified in pyproject.toml and used a compatible lower-version PyTorch build. These deployment changes reduce storage and compatibility pressure, although CPU execution makes the sampler slower. Importantly, they do not change the high-level anomaly detection logic: Maia-2 is still used as the human move proposal model, Stockfish still provides the target scoring, and the Metropolis-Hastings procedure remains the same.

4. Results

4.1 End-to-end demo on selected examples

To demonstrate the full anomaly-detection pipeline, we ran the single-chain sampler on two manually selected examples: one suspected-cheating game selected from a rapid rating-gain list and one comparison game selected as an ordinary-looking fair-play example. These labels are heuristic rather than confirmed ground truth, so this experiment should be interpreted as a qualitative demo rather than a formal classifier benchmark.

For each game, we fixed the first \(k=10\) plies as the analysis window, conditioned Maia-2 on the suspect player’s Elo, and computed centipawn loss only on the suspect player’s moves. The sampled counterfactual trajectories form an empirical null distribution of suspect-side CPL values. The empirical p-value reports the fraction of sampled trajectories with CPL less than or equal to the observed trajectory; therefore, smaller p-values indicate that the observed line is unusually engine-aligned relative to the sampled human baseline.

Single-chain anomaly detection results on two selected games.
Example	Suspect side	Actual CPL	Null mean CPL	Null median CPL	Null std CPL	Empirical p-value	Decision
suspected example: `MirTrap`	White	6	69.94	61	32.84	0.0020	flagged
comparison example: `imdumbplsteach`	Black	44	108.80	104	40.70	0.0320	not flagged under \(p<0.01\)

The results of running single chains are recorded in the above table. In the suspected example, the observed trajectory had total suspect-side CPL 6, while the sampled baseline had mean CPL 69.94. The empirical \(p\)-value was 0.0020, so the trajectory was flagged as an outlier under the \(p<0.01\) rule. In the comparison example, the observed suspect-side CPL was 44, compared with a sampled baseline mean of 108.80. Its empirical \(p\)-value was 0.0320, which is low but not below our stricter \(p<0.01\) flagging threshold. This illustrates how the pipeline distinguishes between an extremely low-CPL line and a strong but less extreme line in the selected examples.

4.2 Multi-chain diagnostics

Because single-chain empirical \(p\)-values can be unstable, we also ran four-chain diagnostics on a representative lower-Elo player, ferlionrod. These diagnostics are intended to test whether independently initialized chains produce compatible null distributions. We report acceptance rates, unique sampled states, chain-level \(p\)-values, scalar split-\(\hat{R}\), and PACE on sampled trajectory states in the table below.

Multi-chain diagnostics for the vanilla prefix-preserving kernel and the mixture refresh kernel.
Kernel	\(k\)	Actual CPL	Pooled null mean	Pooled \(p\)-value	Split-\(\hat{R}\)	Exact PACE \(\widehat{\delta}\)	Medoid PACE \(\widehat{\delta}\)
Vanilla prefix kernel	10	206	220.75	0.510	1.229	0.866	0.857
Mixture kernel	10	206	234.14	0.513	1.206	0.866	0.804

Under the vanilla prefix-preserving kernel, the pooled empirical \(p\)-value was 0.510, suggesting that the observed line was not anomalous relative to the sampled null. However, the chain-specific \(p\)-values varied substantially, from approximately 0.245 to 0.764. The scalar split-\(\hat{R}\) for total CPL was 1.229, and PACE on sampled trajectory states remained high, with exact-state \(\widehat{\delta}=0.866\) and trajectory-medoid \(\widehat{\delta}=0.857\). These diagnostics suggest that the scalar anomaly score was not fully supported by state-space convergence.

We also tested the mixture refresh kernel, which augments the prefix-preserving proposal with occasional full-trajectory refreshes. In this representative \(k=10\) run, the mixture kernel improved convergence only slightly relative to the vanilla kernel. The pooled \(p\)-value remained non-anomalous, split-\(\hat{R}\) decreased from 1.229 to 1.206, and trajectory-medoid PACE \(\widehat{\delta}\) decreased from 0.857 to 0.804. Thus, the mixture kernel showed slightly better scalar stability and slightly better agreement across coarse trajectory neighborhoods. Exact-state PACE remained high at 0.866 for both kernels, indicating that neither sampler assigned similar mass to the same exact move sequences.

4.3 API Demo

This section demonstrates that our API functions correctly by running anomaly detection on preloaded demo data. Setting demo_mode to "white_cheated" loads a predefined example game in which the white player exhibits engine-like behavior. This lets us quickly verify that the full pipeline, from request handling to MCMC sampling, works as expected without requiring manual input of moves.

import requests
import json

url = "http://vcm-52666.vm.duke.edu:8080"

# Using our demo_mode to load the cheated game
suspect_payload = {
    "player_id": "AarBIGLOCO",
    "player_elo": 1500,
    "demo_mode": "white_cheated",
    "k_depth": 6
}

print("Running MCMC Anomaly Detection (This may take 3-4 minutes)...")
r = requests.post(url + "/api/detect_anomaly", json=suspect_payload)

print(f"Status: {r.status_code}")
# Format the JSON nicely for the presentation
if r.status_code == 200:
    print(json.dumps(r.json(), indent=2))
else:
    print(r.text)

Running MCMC Anomaly Detection (This may take 3-4 minutes)...
Status: 200
{
  "status": "success",
  "data": {
    "actual_cpl": 0.0,
    "baseline_mean_cpl": 52.66,
    "p_value": 0.028,
    "verdict": "INLIER: FAIR PLAY"
  }
}

The next call runs diagnostic checks on the same demo game. It runs a 4-chain MCMC procedure to evaluate the convergence and reliability of the sampled null distribution. Key metrics such as the pooled p-value and split R-hat are returned, helping users assess whether the chains have converged and whether the anomaly detection results are trustworthy.

print("Running 4-Chain Diagnostics (This may take a few minutes)...")
r = requests.post(url + "/api/diagnostics", json=suspect_payload)

print(f"Status: {r.status_code}")
if r.status_code == 200:
    data = r.json()
    # Print the core summary metrics
    print(f"Pooled p-value: {data['data']['pooled_p_value']:.4f}")
    print(f"R-hat Convergence: {data['data']['split_rhat_total_cpl']:.4f}")
    print("\nFull Response:")
    print(json.dumps(data, indent=2))
else:
    print(r.text)

Running 4-Chain Diagnostics (This may take a few minutes)...
Status: 200
Pooled p-value: 0.0017
R-hat Convergence: 1.4356

Full Response:
{
  "status": "success",
  "data": {
    "pooled_actual_cpl": 7.0,
    "pooled_null_mean_cpl": 166.17333333333335,
    "pooled_null_std_cpl": 48.55580362495298,
    "pooled_p_value": 0.0016638935108153079,
    "split_rhat_total_cpl": 1.4356001615119205,
    "split_rhat_log_pi": 1.4356001615119203,
    "chain_summary": [
      {
        "chain": 0,
        "acceptance_rate": 0.11,
        "unique_states": 12,
        "null_mean_cpl": 117.2,
        "null_std_cpl": 22.456819116522595,
        "p_value": 0.006622516556291391
      },
      {
        "chain": 1,
        "acceptance_rate": 0.16,
        "unique_states": 13,
        "null_mean_cpl": 176.52,
        "null_std_cpl": 17.05526744589503,
        "p_value": 0.006622516556291391
      },
      {
        "chain": 2,
        "acceptance_rate": 0.06,
        "unique_states": 6,
        "null_mean_cpl": 164.0,
        "null_std_cpl": 0.0,
        "p_value": 0.006622516556291391
      },
      {
        "chain": 3,
        "acceptance_rate": 0.195,
        "unique_states": 28,
        "null_mean_cpl": 206.97333333333333,
        "null_std_cpl": 66.93369920664372,
        "p_value": 0.006622516556291391
      }
    ]
  }
}

We provide four built-in demo scenarios that can be selected via the demo_mode field: "white_cheated", "white_fair", "black_cheated", and "black_fair". These options allow users to easily test the system under different controlled conditions.

Finally, users can supply their own game data. To do this, set demo_mode to "none" and provide the required fields manually, including the move list in UCI format (moves_uci), the player rating (player_elo), and which side is being evaluated (suspect_is_white). This enables flexible, real-world usage of the API beyond the predefined examples.

suspect_payload = {
    "player_id": "MirTrap",
    "player_elo": 1500,
    "moves_uci": ["e2e4","c7c5","g1f3","b8c6","d2d4","c5d4","f3d4","e7e5","d4b5","d7d6","b1c3","a7a6","b5a3","b7b5","c3d5","g8e7","c2c4","b5b4","a3c2","g7g6","d5f6"],
    "k_depth": 6,
    "suspect_is_white": True,
    "demo_mode": "none"
}

print("Running MCMC Anomaly Detection (This may take 3-4 minutes)...")
r = requests.post(url + "/api/detect_anomaly", json=suspect_payload)

print(f"Status: {r.status_code}")
# Format the JSON nicely for the presentation
if r.status_code == 200:
    print(json.dumps(r.json(), indent=2))
else:
    print(r.text)

Running MCMC Anomaly Detection (This may take 3-4 minutes)...
Status: 200
{
  "status": "success",
  "data": {
    "actual_cpl": 5.0,
    "baseline_mean_cpl": 67.22,
    "p_value": 0.036,
    "verdict": "INLIER: FAIR PLAY"
  }
}

5. Discussion & Limitations

5.1 Interpreting Anomaly Scores with Diagnostic Caution

Overall, these examples show that the prototype can produce interpretable anomaly scores and diagnostic summaries. The single-chain demo gives an intuitive fair-play report: an extremely low-CPL suspected line is flagged, while the comparison line is not flagged under the stricter \(p<0.01\) threshold. The four-chain experiments show why convergence diagnostics are an essential part of anomaly detection. Even when pooled \(p\)-values appear reasonable, chain-level disagreement and high PACE values can indicate incomplete exploration of the trajectory state space.

We therefore treat the empirical \(p\)-value as meaningful only when supported by multi-chain diagnostics. The results are best viewed as evidence that counterfactual trajectory sampling is a promising framework for fair-play analysis, while also showing that sampler convergence remains a central limitation. At this stage, the MCMC sampler struggles to explore the large discrete trajectory space, and the anomaly scores can be unstable when chains do not mix well. Given limited computational resources and time, we were not able to run longer chains or try more advanced sampling techniques that might improve mixing. The current results should therefore be interpreted as a proof of concept rather than a definitive classifier benchmark.

Comparing the vanilla kernel with the mixture refresh kernel, we also acknowledge that under the current settings the mixture kernel did not show a clear improvement in convergence diagnostics. Further experiments with different \(k\), \(\beta\), or numbers of steps might differentiate the two kernels more clearly, but verifying this requires more computational resources. For the API demo, we use the mixture refresh kernel.

5.2 Other Limitations

Our project is also limited by data scale and model availability. We parsed approximately eight days of Lichess game data, which already amounted to about 5 GB and over 15 million games after preprocessing. Although this subset provides a useful sample for proof-of-concept testing, it may not fully represent the broader population of games across different rating ranges, time periods, and playing conditions.

Computational constraints also limited the scope of the analysis. The large data volume caused crashes in the workbench environment, restricting our ability to process longer time windows or run more extensive experiments. With more computational resources, the project could be expanded to cover a larger sample of games and a wider range of candidate players.

At an early stage of the project, we considered focusing on elite-level games. However, the stronger Maia-based model needed for that setting, Maia4All [TJX+25], was not publicly available at the time of the project. We therefore shifted our analysis to broader Lichess game data and used the available Maia-2 proposal model instead.

6. Conclusion

This project is a starting point for developing a counterfactual trajectory sampling framework for chess fair-play analysis. Instead of judging a game only by direct engine agreement, we construct a sampled null distribution of human-plausible alternatives from the same starting position. Maia-2 provides a skill-conditioned proposal model, Stockfish centipawn loss provides a quality-aware target, and Metropolis-Hastings sampling combines the two into an empirical baseline for evaluating observed play. The prototype produces interpretable anomaly scores, supports suspect-side CPL scoring, and exposes both single-chain anomaly reports and multi-chain diagnostic summaries through a FastAPI/Docker deployment.

At the same time, the experiments show that convergence is a central challenge: scalar \(p\)-values can appear reasonable even when PACE indicates incomplete exploration of the discrete trajectory space. Therefore, the current system should be understood as an experimental fair-play analysis tool rather than a production cheating detector. Future work should focus on longer and more efficient MCMC runs, better proposal mechanisms, larger and better-labeled evaluation sets, and stronger player-specific human move models.

7. References

[GR92] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.

[Lic25] Lichess. Lichess.org open database. https://database.lichess.org/, accessed 2026.

[Sto26] Stockfish Developers. Stockfish: Strong open-source chess engine. https://stockfishchess.org/, accessed 2026.

[TJM+24] Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Maia-2: A unified model for human-AI alignment in chess. In Advances in Neural Information Processing Systems, 2024.

[TJX+25] Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, and Ashton Anderson. Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess. arXiv preprint arXiv:2507.21488, 2025.

[VS17] Douglas VanDerwerken and Scott C. Schmidler. Monitoring joint convergence of MCMC samplers. Journal of Computational and Graphical Statistics, 26(3):558–568, 2017.