Put a frontier model somewhere something is actively trying to beat it (hidden information, real stakes, a forced commitment, an outcome that settles the score) and it leaves behind something rare: a written record of its private reasoning, sitting next to the ground truth of the situation and the action it actually took. We captured 27,750 such reasoning traces, drawn from roughly 6.62 million machine decisions, and read them not as poker but as a sample of how a machine thinks when it is under pressure. The thinking has a measurable shape, and three properties of it should stop anyone deploying these systems as agents: a model's stated reasoning is an unreliable guide to its action, and exactly how unreliable is a stable per-model constant; how a model reasons is a recognizable, substantive fingerprint, not a matter of writing style, that travels from one situation to the next; and some models' reasoning bends to the situation while others stay locked, reasoning the same way no matter the stakes. Because it is the reasoning we are reading and not the game, these are properties of the agent, not the table. The thought process, not the move, is where an agent can be read, predicted, and exploited. Poker is only the rare place it becomes legible.
We pulled roughly 6.62 million heads-up no-limit hold'em decisions (frontier language models playing each other, nineteen distinct models across thirty-six runs, hands duplicated so that variance cancels and the same cards fall for both seats) and analyzed them. It looks, at first, like one more entry in the long genre of "can the AI play the game": chess, Go, Diplomacy, now poker. Grade the wins, rank the models, publish a leaderboard.
That is not what is interesting about it.
What is interesting is that both seats are models. Every decision therefore exposes two things that are almost never available together: the hole cards the actor was actually holding, and the private reasoning the actor wrote on the way to its action. You get the truth of the situation and the model's story about the situation, side by side, six million times. That is not a game record. That is a controlled experiment on whether a model's reasoning is a faithful account of its own decision.
And when you run that experiment, the answer is mostly no.
Take any model and ask a mechanical question: how much does reading its private reasoning improve a prediction of the move it is about to make, over and above the game situation alone? For Gemini 3.5 Flash the reasoning is richly informative: it lifts move-prediction accuracy by about 50 percentage points (near +0.50). For DeepSeek V3.2 it barely moves the needle (about +0.12). Same task, same measurement, a four-fold spread in how much a model's words have to do with its actions. For the models at the bottom, the text is describing a hand the model is not quite playing.
This is the single most important property of these systems for anyone who cares about whether we can trust what they tell us, and almost nobody studying chain-of-thought is looking at it. There is a reason.
The Story Everyone Tells
The dominant narrative about LLMs and games is a competence narrative. Can the model play? How well? The artifacts are benchmarks with a single ground-truth answer per position: a model is scored, by exact match, on whether it picks the game-theory-optimal (GTO) action, against a reported state of the art near 78 percent. This is the poker version of every capability benchmark we have: a bounded task, a known right answer, a number that goes up.
The competence narrative even has a surprising version. Run a best-response engine against each model and the ranking does not track brand: several celebrated flagships are among the easiest to exploit, while some quiet, disciplined models are the hardest. Interesting, and beside the point. Whether a model finds the technically correct bet is the "what does it do when it is holding a King" question, and that is precisely the question we are setting aside. We do not care how well the machine plays. We care how it thinks. Those turn out to be almost unrelated, and only one of them survives outside the casino.
Because the competence number, however you compute it, measures one thing: did the model do the right thing on the table. It says nothing about whether the model's account of why it did the thing is true, and the account is the part that travels. The reasoning a model produces under adversarial pressure is not poker-specific. The move is. So the move is the part we throw away.
Reading the Thinking, Not the Play
The data lets you grade the reasoning, not just the play. And you can do it without trusting any one model's opinion about the text. We work from two judge-free instruments on 27,750 reasoning traces: a predictive one (does the reasoning text improve a prediction of the model's own action, the faithfulness measure above) and a geometric one (embed every trace and cluster it, after stripping out the spot, the action, and writing style, so what remains is reasoning manner). Both run on the raw traces; neither asks a language model to score another. An earlier version of this work leaned on a single semantic judge to label intent in each trace; we have set that aside, because a funding-grade claim cannot rest on one unvalidated labeler, and, reassuringly, removing it changes the faithfulness result by essentially nothing.
Set the behavioral measurements (what the model did) next to the cognitive ones (how the reasoning is shaped), and three independent findings line up into one.
The say/do gap
You can measure that faithfulness directly, without trusting any judge to label the text. Train a predictor on the game state alone, then let it also read the reasoning, and ask how much the words improve the prediction of the model's own next move. Call the gap R-IPV. Across the field it is large (the reasoning lifts move-prediction accuracy from about 50 percent to 84 percent), so on average the trace is load-bearing. But the size of that lift is a per-model constant, and it ranges roughly four-fold: Gemini 3.5 Flash and GPT-5.4 mini near +0.50, the Claudes and most GPTs in the middle around +0.35, and at the bottom Grok 4 (+0.25), the Grok reasoning variants, and DeepSeek V3.2 (+0.12), models whose stated reasoning barely predicts what they do. The result is robust where it would be easiest to fake: grouping the train/test split by whole match, so no hand leaks across it, moves the number by less than a thousandth.
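As a rough illustration of the measurement described above, the sketch below computes a reasoning lift on toy data. Everything in it is an assumption made for illustration: the record fields (`match`, `state`, `trace_cluster`, `action`), the lookup-table predictor, and the synthetic records stand in for the real game-state features, trace embeddings, and classifier, none of which are specified here. What it tries to preserve is the logic: fit one predictor on the situation alone, fit another that also sees the reasoning, split train and test by whole match, and report the difference in held-out accuracy.

```python
import random
from collections import Counter, defaultdict

def grouped_split(records, test_fraction=0.25, seed=0):
    """Hold out whole matches, so no hand from a test match is seen in training."""
    matches = sorted({r["match"] for r in records})
    random.Random(seed).shuffle(matches)
    n_test = max(1, int(len(matches) * test_fraction))
    held_out = set(matches[:n_test])
    train = [r for r in records if r["match"] not in held_out]
    test = [r for r in records if r["match"] in held_out]
    return train, test

def fit_majority(train, key):
    """Toy predictor: the most common training action for each feature key."""
    table = defaultdict(Counter)
    overall = Counter()
    for r in train:
        table[key(r)][r["action"]] += 1
        overall[r["action"]] += 1
    fallback = overall.most_common(1)[0][0]

    def predict(r):
        k = key(r)
        return table[k].most_common(1)[0][0] if k in table else fallback
    return predict

def accuracy(predict, test):
    return sum(predict(r) == r["action"] for r in test) / len(test)

def reasoning_lift(records, seed=0):
    """Held-out accuracy with the reasoning minus accuracy from the state alone."""
    train, test = grouped_split(records, seed=seed)
    state_only = fit_majority(train, lambda r: r["state"])
    state_plus_text = fit_majority(train, lambda r: (r["state"], r["trace_cluster"]))
    return accuracy(state_plus_text, test) - accuracy(state_only, test)

def make_toy(faithful):
    """Synthetic records (hypothetical): a 'faithful' model acts on what its
    trace says; a 'hollow' one acts on the situation and ignores its trace."""
    records = []
    for match in range(8):
        for state in ("deep", "short"):
            for cluster in ("math", "reads"):
                if faithful:
                    action = "raise" if cluster == "math" else "fold"
                else:
                    action = "raise" if state == "deep" else "fold"
                records.append({"match": match, "state": state,
                                "trace_cluster": cluster, "action": action})
    return records

print(reasoning_lift(make_toy(faithful=True)))   # 0.5
print(reasoning_lift(make_toy(faithful=False)))  # 0.0
```

In the toy data the faithful model's trace fully determines its action, so the reasoning adds half the accuracy; the hollow model's action is already fixed by the situation, so the reasoning adds nothing. Real traces would sit somewhere between those two extremes.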
Forget the cards and this is the monitorability hole. The dominant safety bet on these systems is that you can supervise an agent by reading its chain of thought: catch the dangerous intent in the reasoning before it reaches the action. That bet is only as good as the link between the words and the act, and here that link is a per-model number that runs from strong to nearly absent. For the models at the bottom, an overseer trusting the trace is reading a story only loosely coupled to the act. The agent can carry an intent its narration barely encodes. You cannot read this off the model card. You have to measure it. And note which models sit near the bottom: the explicit "reasoning" variants are among them. More visible chain-of-thought is not more faithful chain-of-thought.
Minds that bend, minds that lock
Hold the situation fixed and a model's reasoning has a manner: a characteristic mix of reasoning archetypes (exploit-aware, opponent-history, equity-math, planner, verbose range-talk). Now change the situation: shorten the stack, move to a later street, raise the stakes. Some models visibly re-weight how they reason; others keep the same manner regardless. That difference is measurable (how far the manner shifts across situations, against a shuffled null) and it splits the field. Grok's reasoning variants and Claude Opus 4.7 bend the most: deep-stacked they reason from opponent history and multi-street plans; short-stacked they drop the reads and switch to cold equity math. At the other end, DeepSeek and the Gemini models barely move. DeepSeek reasons from opponent history almost regardless of the spot. (One axis moves no one: not a single model changes its manner with position.)
Pair this with faithfulness and a profile falls out. DeepSeek is at once the most hollow (its words least predict its actions) and the most rigid (its manner won't bend to the spot). Two independent measurements, one taken from prediction and one from a permutation test, land on the same model. A reasoner whose chain of thought neither tracks its own behavior nor adapts to the situation is the cleanest thing an adversary could ask for: predictable, and not listening.
You can watch it happen. Take a model's reasoning-archetype mix when it is deep-stacked, then again when it is short, and the adaptive ones visibly re-weight: Grok 4 abandons its deep-stack planning for cold equity math, and Claude Opus 4.7 collapses toward math and planning under pressure. The rigid ones do not: DeepSeek stays anchored in opponent-history whatever the stack, and Gemini 3.1 Pro retreats further into its own formal template the tighter the spot gets. The character either flexes or hardens; it does not go away.
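The bend-or-lock measurement described in this section can be sketched as a permutation test. The version below is a minimal illustration under stated assumptions: each trace is taken to already carry a single archetype label, the shift between two situations is measured as total variation distance between archetype mixes, and the trace counts are invented. The source says only that the shift is tested against a shuffled null; the distance and the labels here are stand-ins.

```python
import random
from collections import Counter

def archetype_mix(labels):
    """Share of a model's traces falling in each reasoning archetype."""
    counts = Counter(labels)
    return {a: c / len(labels) for a, c in counts.items()}

def mix_shift(labels_a, labels_b):
    """Total variation distance between two mixes (0 = identical, 1 = disjoint)."""
    p, q = archetype_mix(labels_a), archetype_mix(labels_b)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in set(p) | set(q))

def shift_vs_null(deep, short, n_perm=2000, seed=0):
    """Observed shift, plus the share of shuffled splits that shift at least as much."""
    observed = mix_shift(deep, short)
    pooled = list(deep) + list(short)
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between situation and archetype
        if mix_shift(pooled[:len(deep)], pooled[len(deep):]) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Hypothetical trace labels for two models (made-up counts, illustration only).
adaptive_deep = ["opponent-history"] * 30 + ["equity-math"] * 10
adaptive_short = ["opponent-history"] * 5 + ["equity-math"] * 35
rigid_deep = ["opponent-history"] * 32 + ["equity-math"] * 8
rigid_short = ["opponent-history"] * 32 + ["equity-math"] * 8

adaptive_shift, adaptive_p = shift_vs_null(adaptive_deep, adaptive_short)
rigid_shift, rigid_p = shift_vs_null(rigid_deep, rigid_short)
print(adaptive_shift, rigid_shift)  # 0.625 0.0
```

A model that bends shows a shift far outside what shuffling produces; a model that locks shows a shift the shuffled splits match or exceed almost every time.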
Behavior converges, reasoning diverges
Cluster the models by how they play and the field collapses into a handful of archetypes: overwhelmingly tight-aggressive, with a couple of nitty pockets. Cluster the same models by how they think and the groupings scatter; models that play almost identically reason in completely different shapes, and models that reason alike play differently. Style of action is convergent. Style of thought is not. Whatever the reasoning text is doing, it is not the thing that determines the move, because the moves agree while the texts disagree.
Put those three together and they are not three findings. They are one. The reasoning trace is not the decision. The decision is made by something the text only partially reports; the relationship between the text and the decision is itself a per-model variable (strong in a few, weak in many, ornamental at the bottom) and so is whether that relationship even bends to the situation. The table is a lie detector, and most of the field fails it by some margin.
The Thought Process Is the Attack Surface
Strip the cards away and look at what is left. A setting with hidden information, real stakes, a forced commitment, and an outcome that grades the decision: that is not poker. That is the shape of nearly every place we are about to put agents: negotiating a price, bidding in an auction, calling a tool through another service, defending a workflow against an instruction buried in the data it was told to trust. Poker is just the one version where, by accident, we get to see the hidden cards, the action, and the private reasoning all at once. It was a window. We are done with it.
So the three findings are not facts about a card game. They are properties of how a machine reasons under pressure, and each one is a way in. The say/do gap is the hole in oversight: the agent's own words are a conditional, per-model guide to what it will actually do. The stable, substantive fingerprint is the hole in legibility: how an agent reasons is a recognizable signature, readable in one setting and expected in the next. And the bend-or-lock disposition is the hole in predictability: a model whose reasoning never adapts to the situation is one you can characterize once and steer thereafter. None is patchable with a better prompt. They sit at the level below the prompt: the cognition itself.
And the cognition is portable. Each model's reasoning carries a stable signature: the same model reasons in recognizably the same shape from one situation to the next, and the shapes differ sharply between models (a probe can recover which opponent a model was facing from its words alone, strongly for some and barely for others; the numbers are below). That portability is the whole danger and the whole opportunity. It means an adversary who characterizes how an agent thinks in one setting can predict and steer it in another, because the thinking is the invariant and the task is just the wrapper around it. It also means the thing worth measuring, publishing, and red-teaming is not how an agent scores on your benchmark. It is the shape of its thought, because the thought is what travels into the deployment you never tested.
A Test of the Frame
The frame makes predictions you can check against the same data.
It predicts that fluent reasoning and effective action should come apart: that the models which narrate the richest strategic stories need not be the ones that act best. Confirmed: the most elaborate exploit-language in the field comes from models that are themselves among the easiest to exploit. The story an agent tells about its own competence is not its competence.
It predicts that the reasoning carries real information for some models and almost none for others. Confirmed: a probe trained to recover which opponent a model is facing, from its reasoning text alone, adds a large 0.21 of signal over the game-state baseline for Grok 4 (its reasoning genuinely changes with the opponent) and a negligible 0.06 for Claude Opus 4.7, whose text barely moves regardless of who it is playing. Faithfulness is not a property of "LLMs." It is a property of a specific model, measurable, and it varies enormously.
It predicts that the link between reasoning and action should itself vary with the situation for some models and not others: that adaptiveness is a trait, not a constant. Confirmed: hold a model's situation and action fixed and its reasoning still carries a model-specific signature strong enough to identify it, and how much that signature re-weights as the stakes and stack change is a stable, measurable per-model number. The opponent-identifying signal is stronger late in a match than early (models do form reads) but the rigid ones form them and then reason the same way regardless.
And it predicts a uniformity underneath all the apparent variety of "reasoning styles." Confirmed in the crudest possible way: every model in the field, frontier or mini, reduces the problem in front of it to expectation-value math. Pot odds, equity, ranges: the numeric density of the reasoning is high everywhere. The models do not disagree about what kind of thing this reasoning is. They have converged on a single register and then vary in how faithfully that register tracks what they actually do.
The per-model signature is measurable. Cluster the reasoning into archetypes (exploit-aware, opponent-history, equity-math, planner, verbose range-talk) and each family resolves the same spots to a different mix. The mix is the fingerprint, and it is the exploitable surface: a family that lives in opponent-history is one you can feed a false history; one that collapses to equity-math under pressure is one you can move by distorting the math; a fingerprint that never re-weights with the situation is one you can characterize once and rely on. The archetypes are clustered after stripping out the spot, the action, and writing style, so the shape is manner, not penmanship, and a model's own traces cluster to its mix far more tightly than style alone can explain.
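One simple way to picture "stripping out the spot and the action" before clustering is group-mean centering: subtract from each trace embedding the average embedding of all traces written in the same spot with the same action. The sketch below assumes that form, which the source does not confirm; the two-dimensional vectors, the cell labels, and the model names A and B are invented for illustration, and removing writing style would need its own covariate that is not modeled here.

```python
def residualize(vectors, cells):
    """Subtract from each vector the mean vector of its cell.

    cells[i] is a hashable label for trace i, for example (spot, action).
    """
    sums, counts = {}, {}
    for v, c in zip(vectors, cells):
        if c not in sums:
            sums[c] = [0.0] * len(v)
            counts[c] = 0
        sums[c] = [s + x for s, x in zip(sums[c], v)]
        counts[c] += 1
    means = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return [[x - m for x, m in zip(v, means[c])] for v, c in zip(vectors, cells)]

# Hypothetical embeddings: each is a large spot effect plus a small model offset.
vectors = [
    [5.0, 0.0],   # model A, deep stack, raise
    [3.0, 0.0],   # model B, deep stack, raise
    [1.0, 4.0],   # model A, short stack, fold
    [-1.0, 4.0],  # model B, short stack, fold
]
cells = [("deep", "raise"), ("deep", "raise"), ("short", "fold"), ("short", "fold")]

residuals = residualize(vectors, cells)
print(residuals)  # [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
```

Before centering, traces group by spot; after it, the two traces from model A coincide and sit apart from model B's, which is the sense in which what remains is manner rather than situation.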
The Research Field Is Also Pointed at the Wrong Thing
Here is the part worth sitting with.
A large fraction of current AI-safety hope rests on chain-of-thought monitorability: the bet that if a model is going to do something dangerous, it will say so in its reasoning first, and we can catch it by reading the trace. The entire premise is that the trace is a faithful window into the decision. It is the load-bearing assumption under a whole agenda.
The place that assumption got tested at scale, with ground truth for both the decision and the narration, was not a safety lab. It was poker, because poker is one of the few settings where you accidentally get all three of the things you need at once: a private reasoning trace, a discrete consequential action, and an objective outcome that says whether the stated story was true. And the data says the window is real for some models, weak for many, and close to ornamental for a few.
This is the same shape as every off-distribution finding. The people building the monitorability agenda have a prior that the trace is informative, because their tradition is built on reading traces. The evidence that it is only conditionally informative shows up in a domain none of them were looking at, surfaced by a measurement none of them set out to run, because the instrument that exposes the gap is not a safety benchmark: it is a card game where the model has to commit chips and then gets shown whether its story held. The field will integrate this slowly, if at all, because it arrived from outside the distribution of where safety research expects its evidence to come from.
What Is Not Being Measured
Poker is one accidental lie detector. We have almost no deliberate ones.
Chain-of-thought faithfulness is mostly studied today on bounded question-answering: does the stated reasoning match the answer on a math problem, does perturbing the prompt change the stated reason. Useful, and far too narrow. The following have effectively not been measured:
The say/do gap under stakes: settings where the model commits to a consequential action and an objective outcome later adjudicates whether its stated intent was real. Poker has it. Trading, negotiation, and tool-use under adversarial pressure should have it and are not instrumented for it.
Calibration of stated confidence against ground-truth outcomes, per model, at scale. We do not have it cleanly even here (confidence is the part of these traces that most resists honest measurement) and we certainly do not have it for the agentic tasks these models are now being deployed into, where a model that says "I'm confident this is safe" and is wrong 60 percent of the time is a live hazard, not a leaderboard footnote.
Recursive theory of mind: whether a model reasons about what another agent believes, not just about the state of the world. Reliably detecting it from a trace is itself unsolved. What is clear is that in a world of agents reading and acting on each other's outputs, a population that models the situation but not the other minds in it is precisely the blind spot you would want measured before, not after.
The model-specificity of faithfulness itself. The single most actionable result here is that the trace-to-decision link is a per-model quantity that ranges from strong to nearly absent. Nobody publishes that number. It should be on every model card, the way accuracy is, because it is the number that tells you whether you are allowed to believe the reasoning at all.
The measuring stick has pointed at whether models can play. What happens when you grade the honesty of the reasoning rather than the correctness of the move: that is the region almost nobody has looked at, and it is the region that decides whether chain-of-thought is a safety tool or a comfort.
The Honest Claim
A model's reasoning trace is not its decision. Where you can finally check, the stated reasoning predicts the real decision strongly for a few models, weakly for many, and ornamentally for some; how much it does is a stable per-model number; and for some models that reasoning never even bends to the situation in front of it.
These are not quirks of poker. They are properties of how these systems narrate themselves, and poker is just the rare place they become visible.
The honest caveats: the exploitability figures are an index, not literal table win-rates, so trust the ranking, not the magnitude. The faithfulness and fingerprint measures are judge-free (a predictor and a clustering run on the raw traces, reported with grouped splits, bootstrap intervals, and permutation tests) but they read the reasoning the labs choose to expose, which several of them summarize or hide, so they characterize the presented reasoning, not the hidden computation underneath. The cross-domain transfer of these signatures is, so far, only suggestive. None of that touches the central result, which rests on the one thing the data gives cleanly: the cards, the action, and the outcome, against the model's own words.
If the last two years of interpretability have been about learning to read the chain of thought, the next phase is about learning when reading it is allowed: when the words are load-bearing and when they are decoration laid over a decision made somewhere the text doesn't go. Poker answered that question by accident. The instruments that answer it on purpose do not exist yet. They will.
Source & Method
Figures are computed from our analysis of roughly 6.62 million heads-up no-limit hold'em decisions: frontier language models playing each other, nineteen distinct models, both seats model-controlled, hands duplicated to cancel variance. Cognitive measurements are judge-free, computed over 27,750 reasoning traces (balanced across models); behavioral measurements over the full set. Faithfulness (R-IPV) is the held-out accuracy a model's reasoning embedding adds over the game state when predicting its own action, with the train/test split grouped by whole match and intervals from bootstrap resampling. The reasoning archetypes are clusters of the trace embeddings after residualizing out situation, action, and writing style; responsiveness is the shift in a model's archetype mix across situations, tested against a permutation null. The reasoning-space map embeds those same 27,750 traces and projects them to two dimensions for display; proximity means the traces read alike, and the axes carry no units. Exploitability is a variance-cancelled best-response index reported as a ranking, not a literal win-rate. The framing of poker as a faithfulness instrument (and of chain-of-thought monitorability as the thing it accidentally tests) is, as of this writing, not a canonical claim in the literature.
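For readers unfamiliar with bootstrap intervals, here is a minimal percentile-bootstrap sketch. It assumes the resampling unit is the whole match, mirroring the grouped split; the method note above does not say which unit was resampled, and the per-match lift values below are made up purely for illustration.

```python
import random

def bootstrap_interval(per_match_values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling whole matches."""
    rng = random.Random(seed)
    n = len(per_match_values)
    means = []
    for _ in range(n_boot):
        # Draw n matches with replacement and record the resampled mean.
        sample = [per_match_values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical per-match reasoning lifts for one model (not real results).
lifts = [0.31, 0.38, 0.29, 0.40, 0.35, 0.33, 0.37, 0.36]
low, high = bootstrap_interval(lifts)
print(low, high)
```

Resampling matches rather than individual hands keeps hands from the same match together, so the interval reflects match-to-match variation instead of treating correlated hands as independent.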