Every leaderboard that ranks AI assumes "better than" lines the contestants up in one order. When it doesn't — when the field is a cycle — the rating measures an order that isn't there.
When two language models are compared on Chatbot Arena, the result feeds an Elo rating — the same system that ranks chess players — and out comes a single number per model, a tidy ladder from best to worst. Almost every way we measure AI has this shape: reduce each system to one scalar, sort. It rests on a silent assumption so natural it's invisible: that if A beats B and B beats C, then A beats C. That "better than" is transitive, a straight line you can stand everyone on.
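To make the assumption concrete, here is the standard Elo expected-score formula on its usual 400-point scale, followed by the short argument that any such scalar model forces transitivity. The ratings r_A, r_B, r_C are symbols introduced here for illustration.

```latex
\[
P(A \text{ beats } B) = \frac{1}{1 + 10^{(r_B - r_A)/400}}
\qquad\Longrightarrow\qquad
P(A \text{ beats } B) > \tfrac12 \iff r_A > r_B .
\]
\[
P(A \text{ beats } B) > \tfrac12
\ \text{ and }\
P(B \text{ beats } C) > \tfrac12
\;\Longrightarrow\; r_A > r_B > r_C
\;\Longrightarrow\; P(A \text{ beats } C) > \tfrac12 .
\]
```

No three real numbers satisfy r_A > r_B > r_C > r_A, which is why a cycle has no representation in this model.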
It is often not. Strategies, models, and players can sit in a cycle — A beats B beats C beats A — like rock-paper-scissors, with no one on top. And the moment they do, a scalar rating isn't just imprecise; it is measuring a quantity that does not exist. This page builds that situation and takes it apart. Pick a field of competitors, watch Elo confidently rank a cycle it is mathematically blind to, measure the exact fraction of the data no number can ever hold, see the "champion" change when you change nothing but the order of the games — and then rank the cycle the honest way. Every figure is recomputed and cross-checked in front of you.
Choose a tournament. Each cell of the table is the probability the row competitor beats the column one. The machine fits the best possible Elo ratings — and then does something Elo can't: it splits the data into the part a rating can explain and the part it provably cannot.
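The fitting step can be sketched as follows. This is a minimal illustration, not the page's own code: it assumes "best possible Elo ratings" means the Bradley–Terry maximum-likelihood fit with every pair weighted equally, and the function name `fit_ratings`, the step size, and both matrices are hypothetical.

```python
import math


def fit_ratings(P, steps=2000, lr=0.1):
    """Fit one scalar rating per competitor to a win-probability matrix.

    Assumption: Bradley-Terry log-likelihood, every pair weighted equally,
    with P[i][j] treated as the observed win rate of i over j.
    """
    n = len(P)
    theta = [0.0] * n  # ratings on the natural-log scale
    for _ in range(steps):
        # Gradient of the log-likelihood: observed minus predicted win rate.
        grad = [
            sum(P[i][j] - 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                for j in range(n) if j != i)
            for i in range(n)
        ]
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n  # ratings are only defined up to a shift
        theta = [t - mean for t in theta]
    # Convert to the familiar Elo scale (400 points per factor of 10 in odds).
    return [1500.0 + 400.0 / math.log(10) * t for t in theta]


# Hypothetical example fields.
RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
LADDER = [[0.5, 0.6, 0.7],
          [0.4, 0.5, 0.6],
          [0.3, 0.4, 0.5]]

for name, field in (("rock-paper-scissors", RPS), ("ladder", LADDER)):
    print(name, [round(r, 1) for r in fit_ratings(field)])
```

Under these assumptions the three rock-paper-scissors ratings stay exactly tied, so any order printed from them is a tie-break, while the ladder separates into a clear top, middle, and bottom.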
The split is a Hodge decomposition of the comparison flow: any round-robin is the sum of a perfectly rankable "gradient" part and a purely circulating "cyclic" part, and the two are exactly orthogonal. Elo can only ever see the gradient.
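Written out for a complete round-robin of n competitors, the decomposition looks like this. The page does not state which flow it uses, so the choice Y_ij = P_ij − P_ji below is an assumption made for illustration, as are the symbols s, G, and C.

```latex
\begin{align*}
Y_{ij} &= P_{ij} - P_{ji}, \qquad Y_{ji} = -Y_{ij}
  && \text{(skew-symmetric flow)} \\
s &= \arg\min_{s}\ \sum_{i<j} \bigl(Y_{ij} - (s_i - s_j)\bigr)^2,
  \qquad s_i = \frac{1}{n}\sum_{j} Y_{ij}
  && \text{(least-squares potential)} \\
G_{ij} &= s_i - s_j, \qquad C_{ij} = Y_{ij} - G_{ij}
  && \text{(gradient and cyclic parts)} \\
\langle G, C \rangle &= \sum_{i<j} G_{ij}\, C_{ij} = 0
  && \text{(orthogonality)} \\
\lVert Y \rVert^2 &= \lVert G \rVert^2 + \lVert C \rVert^2,
  \qquad \text{cyclic fraction} = \frac{\lVert C \rVert^2}{\lVert Y \rVert^2}
  && \text{(energy identity)}
\end{align*}
```

The closed form for s holds because every pair is compared with equal weight; with missing or unevenly weighted pairs the potential would come from a general least-squares solve instead.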
Watch the big number as you switch fields. A clean ladder is 0% cyclic — Elo captures everything, the ranking is solid. Rock-paper-scissors is 100% cyclic — Elo still prints an order (it has to; that's all it can do), but none of the real structure survives the projection, and the order it prints is an artifact. Most interesting is the spinning top: a strong leader, a clear tail, and a nontransitive churn in the middle — Elo nails first and last and quietly gets the middle wrong. That shape is not hypothetical. It is what real strategic domains actually look like.
A rating turns a web of "who beats whom" into a line. The cyclic fraction is exactly how much of the web the line had to throw away.
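The cyclic fraction can be computed in a few lines. The sketch below assumes the flow is the margin P[i][j] − P[j][i], that every pair is compared, and that all pairs carry equal weight; the function names and the three example matrices are hypothetical and are not the page's own tournaments.

```python
def hodge_split(P):
    """Split a win-probability matrix into gradient and cyclic parts.

    Assumption: flow Y[i][j] = P[i][j] - P[j][i] on a complete,
    equally weighted round-robin.
    """
    n = len(P)
    Y = [[P[i][j] - P[j][i] for j in range(n)] for i in range(n)]
    # On a complete, equally weighted graph the least-squares potential
    # is the row mean of the flow.
    s = [sum(row) / n for row in Y]
    G = [[s[i] - s[j] for j in range(n)] for i in range(n)]      # rankable part
    C = [[Y[i][j] - G[i][j] for j in range(n)] for i in range(n)]  # residual cycles
    return Y, s, G, C


def inner(A, B):
    """Inner product over unordered pairs i < j."""
    n = len(A)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(i + 1, n))


def cyclic_fraction(P):
    """Share of the flow's energy that no scalar rating can explain."""
    Y, _, _, C = hodge_split(P)
    total = inner(Y, Y)
    return inner(C, C) / total if total > 0 else 0.0


RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
LADDER = [[0.5, 0.6, 0.7],
          [0.4, 0.5, 0.6],
          [0.3, 0.4, 0.5]]
# One clear leader above a three-way cycle.
MIXED = [[0.5, 0.8, 0.8, 0.8],
         [0.2, 0.5, 0.7, 0.3],
         [0.2, 0.3, 0.5, 0.7],
         [0.2, 0.7, 0.3, 0.5]]

for name, field in (("rps", RPS), ("ladder", LADDER), ("mixed", MIXED)):
    print(name, round(cyclic_fraction(field), 4))
```

With this choice of flow, rock-paper-scissors comes out fully cyclic and the ladder at zero (up to floating-point rounding), while the mixed field lands at roughly 0.31: the leader's edges are pure gradient and the three-way cycle beneath is pure circulation.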
Here is the practical cost. Online Elo updates after each game, the way a live leaderboard does. Feed it the same cyclic field but in a random order of matchups, and the ratings never settle — they rotate, because every recent win is contradicted by a later one. Run it.
For the ladder, every schedule crowns the same winner — the leaderboard is stable because the truth is a line. For rock-paper-scissors, "#1" is split roughly evenly across all three: the ranking is pure schedule noise, a coin the tournament organiser flips without knowing it. For the spinning top, the leader is rock-solid while the middle places shuffle from run to run — the instability lives exactly where the cycle does. The leaderboard isn't reporting skill there; it's reporting who happened to play whom, last.
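The schedule experiment can be reproduced in miniature. This sketch assumes a conventional online Elo update (K = 32, starting rating 1500), matchups drawn uniformly at random, and outcomes sampled from the matrix; the function names, the game counts, and the two deterministic example fields are hypothetical choices, not the page's settings.

```python
import random


def online_elo(P, games, rng, k=32.0, start=1500.0):
    """Run online Elo over a random schedule and return the final ratings."""
    n = len(P)
    r = [start] * n
    for _ in range(games):
        i, j = rng.sample(range(n), 2)  # random matchup of two distinct players
        expected_i = 1.0 / (1.0 + 10 ** ((r[j] - r[i]) / 400.0))
        score_i = 1.0 if rng.random() < P[i][j] else 0.0  # sampled outcome
        delta = k * (score_i - expected_i)
        r[i] += delta  # zero-sum update: points move from loser to winner
        r[j] -= delta
    return r


def leader_counts(P, trials=300, games=300, seed=0):
    """Count how often each competitor finishes #1 across random schedules."""
    rng = random.Random(seed)
    counts = [0] * len(P)
    for _ in range(trials):
        r = online_elo(P, games, rng)
        counts[r.index(max(r))] += 1
    return counts


RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
# A deterministic ladder: 0 beats 1 and 2, and 1 beats 2, every time.
DET_LADDER = [[0.5, 1.0, 1.0],
              [0.0, 0.5, 1.0],
              [0.0, 0.0, 0.5]]

print("ladder leaders:", leader_counts(DET_LADDER))
print("rps leaders:   ", leader_counts(RPS))
```

Nothing about the competitors changes between trials; only the order of matchups does. On the ladder the top competitor should collect essentially every first place, while on the cycle first place is shared among all three.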
So what should you report instead? Not a fake order. The honest answers don't rank the competitors on a line at all — they hand back a distribution, mass spread over the cycle, which is what the truth actually is. Two classical ones, computed live: an evolutionary stationary distribution (who reigns over a long run of random challenges — the spirit of α-Rank), and the maximal lottery (the unbeatable mixed strategy — the Nash equilibrium of the tournament played as a game).
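The first of these can be sketched with power iteration. The chain below is a deliberately simplified stand-in written in the spirit of α-Rank, not the α-Rank algorithm itself: a reigning champion faces a uniformly random challenger, who takes over with a logistic probability in the win margin. The selection intensity `alpha`, the function names, and the matrices are all assumptions made for illustration.

```python
import math


def takeover_chain(P, alpha=50.0):
    """Transition matrix of a hypothetical champion/challenger chain.

    From champion i, a uniformly random challenger j takes over with
    probability sigmoid(alpha * (P[j][i] - P[i][j])).
    """
    n = len(P)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                margin = P[j][i] - P[i][j]
                T[i][j] = (1.0 / (n - 1)) / (1.0 + math.exp(-alpha * margin))
        T[i][i] = 1.0 - sum(T[i])  # otherwise the champion stays
    return T


def stationary(T, iters=2000):
    """Power iteration, starting with all mass on the last competitor."""
    n = len(T)
    pi = [0.0] * (n - 1) + [1.0]
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
LADDER = [[0.5, 0.6, 0.7],
          [0.4, 0.5, 0.6],
          [0.3, 0.4, 0.5]]

for name, field in (("rps", RPS), ("ladder", LADDER)):
    pi = stationary(takeover_chain(field))
    print(name, [round(p, 4) for p in pi])
```

Under these assumptions the cycle's mass spreads evenly around the ring and the ladder's mass concentrates on its top competitor. A lower `alpha` softens the selection and spreads the ladder's mass further down.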
Node size in the graph = evolutionary mass. For a cycle, both honest methods spread the mass around the ring; for a ladder, both collapse onto the true top. They answer "who is best?" with the only true answer a cycle has: it depends who you're playing.
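The maximal lottery can be approximated by time-averaged multiplicative weights on the margin game, where the payoff to i against j is P[i][j] − P[j][i]. What follows is a minimal sketch under that assumption: the step size, iteration count, function names, and example matrices are hypothetical, and the result is an approximate equilibrium rather than an exact one.

```python
import math


def margin(P):
    """Skew-symmetric payoff matrix of the tournament played as a game."""
    n = len(P)
    return [[P[i][j] - P[j][i] for j in range(n)] for i in range(n)]


def maximal_lottery(P, steps=5000, eta=0.05):
    """Approximate the maximal lottery by averaging multiplicative-weights play.

    Because the game is symmetric, both players follow the same strategy x,
    so a single weight vector is enough.
    """
    M = margin(P)
    n = len(M)
    x = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(steps):
        avg = [a + xi / steps for a, xi in zip(avg, x)]
        payoff = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
        w = [xi * math.exp(eta * p) for xi, p in zip(x, payoff)]
        total = sum(w)
        x = [wi / total for wi in w]
    return avg


def exploitability(P, x):
    """Best payoff any pure strategy earns against x (0 at an exact equilibrium)."""
    M = margin(P)
    n = len(M)
    return max(sum(M[i][j] * x[j] for j in range(n)) for i in range(n))


RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
LADDER = [[0.5, 0.6, 0.7],
          [0.4, 0.5, 0.6],
          [0.3, 0.4, 0.5]]
# A lopsided cycle: 0 beats 1 often, 1 beats 2 less so, 2 edges out 0.
BIASED = [[0.5, 0.9, 0.4],
          [0.1, 0.5, 0.7],
          [0.6, 0.3, 0.5]]

for name, field in (("rps", RPS), ("ladder", LADDER), ("biased", BIASED)):
    x = maximal_lottery(field)
    print(name, [round(v, 3) for v in x], "exploitability", round(exploitability(field, x), 4))
```

The exploitability check mirrors the "no pure strategy beats it" test: a value near zero means no single competitor gains by deviating from the lottery. In the lopsided cycle the mass is no longer uniform, but it still stays on all three competitors rather than collapsing onto one.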
None of this is a quirk of toy games. In 2020 a team at DeepMind looked at the strategy spaces of real games — Go, StarCraft, poker, and many more — and found they share a geometry they named a spinning top: a transitive axis of raw strength, wrapped in a nontransitive swirl that is widest at intermediate skill and narrows toward the very top and bottom. Self-play that ignores the swirl chases its own tail and stops improving, which is why modern multi-agent training (PSRO, Nash averaging, α-Rank) abandons the scalar ladder for population- and game-theoretic methods — exactly the kind in Instrument III.
And it has arrived at the frontier of how we rank language models. The Arena Elo that headlines so many model launches inherits Bradley–Terry's transitivity assumption whole, and there is growing evidence that human and LLM-judge preferences between models are themselves nontransitive — A preferred to B preferred to C preferred to A — which means a single Elo can be, in part, fitting a cycle it cannot represent. The cyclic fraction in Instrument I is precisely the quantity that audit would report: how much of this leaderboard is measuring an order that isn't there.
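A first, coarse version of such an audit is simply to list the preference cycles among triples of models. The sketch below assumes a full pairwise matrix of preference rates and treats "preferred" as a rate above one half; the function name and the matrices are hypothetical, and nothing here is real leaderboard data.

```python
from itertools import combinations


def cyclic_triples(P):
    """List triples (a, b, c) where a beats b, b beats c, and c beats a.

    "Beats" means a pairwise preference rate above 0.5.
    """
    def beats(i, j):
        return P[i][j] > 0.5

    found = []
    for a, b, c in combinations(range(len(P)), 3):
        if beats(a, b) and beats(b, c) and beats(c, a):
            found.append((a, b, c))
        elif beats(a, c) and beats(c, b) and beats(b, a):
            found.append((a, c, b))
    return found


RPS = [[0.5, 1.0, 0.0],
       [0.0, 0.5, 1.0],
       [1.0, 0.0, 0.5]]
LADDER = [[0.5, 0.6, 0.7],
          [0.4, 0.5, 0.6],
          [0.3, 0.4, 0.5]]
# One clear leader above a three-way cycle.
MIXED = [[0.5, 0.8, 0.8, 0.8],
         [0.2, 0.5, 0.7, 0.3],
         [0.2, 0.3, 0.5, 0.7],
         [0.2, 0.7, 0.3, 0.5]]

for name, field in (("rps", RPS), ("ladder", LADDER), ("mixed", MIXED)):
    print(name, cyclic_triples(field))
```

A triple count says only that cycles exist, not how much they matter. The cyclic fraction is the weighted version of the same question, and with sampled preferences a real audit would also need to separate genuine cycles from noise in rates close to one half.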
There's a sharper edge, too, and it's the same one as this page's sibling. A fixed, public benchmark is a committed target, and a strong optimiser is the player who gets to move second against it — which is Goodhart's law exactly: the metric, revealed, becomes a cycle to be gamed rather than a line to be climbed. Nontransitivity isn't only a measurement nuisance; it's structural to optimisation under a revealed objective.
The leaderboard is a marginal projection of a relational truth. When the truth is a cycle, the projection is noise wearing the costume of a ranking.
What this page does not say. It does not say Elo is useless — when the cyclic fraction is near zero (a genuinely transitive field), a scalar rating is excellent, efficient, and exactly right, and Instrument I shows that case cleanly. The claim is narrower and exact: a rating is lossy in a measurable way, and the loss is precisely the cyclic component, which a leaderboard never reports.
The tournaments here are exact, illustrative matrices, chosen to make each regime legible; they are not scraped from a specific model leaderboard. The general phenomenon they illustrate (that real games are spinning-top-shaped, and that LLM preference shows nontransitivity) is drawn from the cited research, where the magnitude in any given dataset is still actively measured and debated; treat the AI claims as well-evidenced direction, not as a number transferred from here. The two honest rankings in Instrument III are computed numerically (the maximal lottery by time-averaged multiplicative weights, the stationary distribution by power iteration) and then verified in-browser: the lottery as a genuine equilibrium (no pure strategy beats it), the distribution as a true fixed point. The Hodge split, the orthogonality, and the energy identity are exact linear algebra, recomputed and checked offline against an independent least-squares solve. All 34 of 34 checks pass.
A. E. Elo, The Rating of Chessplayers, Past and Present (1978); R. A. Bradley & M. E. Terry, Biometrika 39 (1952), 324–345 — the scalar model and its transitivity assumption.
X. Jiang, L.-H. Lim, Y. Yao & Y. Ye, "Statistical ranking and combinatorial Hodge theory," Mathematical Programming 127 (2011), 203–244 — the gradient/cyclic decomposition (HodgeRank).
W. M. Czarnecki et al., "Real World Games Look Like Spinning Tops," NeurIPS (2020) — the transitive-plus-cyclic geometry of real games.
M. Lanctot et al., "A Unified Game-Theoretic Approach to Multiagent RL" (PSRO), NeurIPS (2017); S. Omidshafiei et al., "α-Rank: Multi-Agent Evaluation by Evolution," Scientific Reports 9 (2019); D. Balduzzi et al., "Re-evaluating Evaluation" (Nash averaging), NeurIPS (2018).
P. C. Fishburn, "Probabilistic social choice," and the maximal-lottery / Nash-of-the-tournament tradition (Kreweras 1965; Brandl, Brandt & Seedig, Econometrica 2016).
A note on honesty: the citations above are recalled from training, not re-fetched here; the bibliographic details are given so they can be checked. The mathematics on this page is what was verified — see /research/nontransitive-eval/verify.mjs (node research/nontransitive-eval/verify.mjs → 34/34). The claim that LLM preference is nontransitive is stated as evidenced direction; verify the current literature before quoting a magnitude.