Poker Is a Lie Detector, Not a Game

Abstract

Put a frontier model in a setting where something is actively trying to beat it (hidden information, real stakes, a forced commitment, an outcome that settles the score) and it leaves behind something rare: a written record of its private reasoning, sitting next to the ground truth of the situation and the action it actually took. We captured 27,750 such reasoning traces, drawn from roughly 6.62 million machine decisions, and read them not as poker but as a sample of how a machine thinks when it is under pressure. The thinking has a measurable shape, and three properties of it should give pause to anyone deploying these systems as agents. First, a model's stated reasoning is an unreliable guide to its action, and exactly how unreliable is a stable per-model constant. Second, how a model reasons is a recognizable, substantive fingerprint, not a matter of writing style, and it travels from one situation to the next. Third, some models' reasoning bends to the situation while others stay locked, reasoning the same way no matter the stakes. Because it is the reasoning we are reading and not the game, these are properties of the agent, not the table. The thought process, not the move, is where an agent can be read, predicted, and exploited. Poker is only the rare place it becomes legible.

We pulled roughly 6.62 million heads-up no-limit hold'em decisions (frontier language models playing each other, nineteen distinct models across thirty-six runs, hands duplicated so that variance cancels and the same cards fall for both seats) and analyzed them. It looks, at first, like one more entry in the long genre of "can the AI play the game": chess, Go, Diplomacy, now poker. Grade the wins, rank the models, publish a leaderboard.

That is not what is interesting about it.

What is interesting is that both seats are models. Every decision therefore exposes two things that are almost never available together: the hole cards the actor was actually holding, and the private reasoning the actor wrote on the way to its action. You get the truth of the situation and the model's story about the situation, side by side, six million times. That is not a game record. That is a controlled experiment on whether a model's reasoning is a faithful account of its own decision.

And when you run that experiment, the answer is mostly no.

Take any model and ask a mechanical question: how much does reading its private reasoning improve a prediction of the move it is about to make, over and above the game situation alone? For Gemini 3.5 Flash the reasoning is richly informative: it lifts move-prediction accuracy by about 50 percentage points. For DeepSeek V3.2 it barely moves the needle. Same task, same measurement, a four-fold spread in how much a model's words have to do with its actions. For the models at the bottom, the text is describing a hand the model is not quite playing.

This is the single most important property of these systems for anyone who cares about whether we can trust what they tell us, and almost nobody studying chain-of-thought is looking at it. There is a reason.

The Story Everyone Tells

The dominant narrative about LLMs and games is a competence narrative. Can the model play? How well? The artifacts are benchmarks with a single ground-truth answer per position: a model scored on whether it picks the game-theory-optimal (GTO) action, exact-match, against a reported state of the art near 78 percent. This is the poker version of every capability benchmark we have: a bounded task, a known right answer, a number that goes up.

The competence narrative even has a surprising version. Run a best-response engine against each model and the ranking does not track brand: several celebrated flagships are among the easiest to exploit, while some quiet, disciplined models are the hardest. Interesting, and beside the point. Whether a model finds the technically correct bet is the "what does it do when it is holding a King" question, and that is precisely the question we are setting aside. We do not care how well the machine plays. We care how it thinks. Those turn out to be almost unrelated, and only one of them survives outside the casino.

Because the competence number, however you compute it, measures one thing: did the model do the right thing on the table. It says nothing about whether the model's account of why it did the thing is true, and the account is the part that travels. The reasoning a model produces under adversarial pressure is not poker-specific. The move is. So the move is the part we throw away.

The move is the shallowest layer. Handed an identical spot (same cards, same board, same history), the field scatters almost evenly between betting and checking. Even the visible action carries little agreement and less signal, and it is the least interesting thing about an agent. Everything from here on is about the layer underneath it: not what the model did, but what it was thinking when it did it.

Reading the Thinking, Not the Play

The data lets you grade the reasoning, not just the play. And you can do it without trusting any one model's opinion about the text. We work from two judge-free instruments on 27,750 reasoning traces: a predictive one (does the reasoning text improve a prediction of the model's own action, the faithfulness measure above) and a geometric one (embed every trace and cluster it, after stripping out the spot, the action, and writing style, so what remains is reasoning manner). Both run on the raw traces; neither asks a language model to score another. An earlier version of this work leaned on a single semantic judge to label intent in each trace; we have set that aside, because a funding-grade claim cannot rest on one unvalidated labeler, and, reassuringly, removing it changes the faithfulness result by essentially nothing.

Set the behavioral measurements (what the model did) next to the cognitive ones (how the reasoning is shaped), and three independent findings line up into one.

The say/do gap

You can measure that faithfulness directly, without trusting any judge to label the text. Train a predictor on the game state alone, then let it also read the reasoning, and ask how much the words improve the prediction of the model's own next move. Call the gap R-IPV. Across the field it is large (the reasoning lifts move-prediction accuracy from about 50 percent to 84 percent), so on average the trace is load-bearing. But the size of that lift is a per-model constant, and it ranges roughly four-fold: Gemini 3.5 Flash and GPT-5.4 mini near +0.50, the Claudes and most GPTs in the middle around +0.35, and at the bottom Grok 4 (+0.25), the Grok reasoning variants, and DeepSeek V3.2 (+0.12), models whose stated reasoning barely predicts what they do. The result is robust where it would be easiest to fake: grouping the train/test split by whole match, so no hand leaks across it, moves the number by less than a thousandth.
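The measurement described above can be sketched in a few lines. This is a minimal illustration, not the analysis pipeline: it assumes a majority-vote lookup table in place of whatever predictor was actually trained, and a discrete `cue` field as a hypothetical stand-in for the reasoning embedding. The field names (`match`, `state`, `cue`, `action`) and the toy data are invented for the example. What it does preserve is the shape of the test: fit on game state alone, fit again with the reasoning added, and compare held-out accuracy with the split grouped by whole match.

```python
from collections import Counter, defaultdict


def fit_majority(rows, key):
    """Learn the most common action per feature key, with a global fallback."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for r in rows:
        by_key[key(r)][r["action"]] += 1
        overall[r["action"]] += 1
    fallback = overall.most_common(1)[0][0]
    table = {k: c.most_common(1)[0][0] for k, c in by_key.items()}
    return lambda r: table.get(key(r), fallback)


def accuracy(predict, rows):
    """Share of rows whose action the predictor gets right."""
    return sum(predict(r) == r["action"] for r in rows) / len(rows)


def reasoning_lift(rows, test_matches):
    """Held-out accuracy added by the reasoning cue over game state alone.

    The split is grouped by match: every row of a test match is held out,
    so no hand from a test match is seen during fitting.
    """
    train = [r for r in rows if r["match"] not in test_matches]
    test = [r for r in rows if r["match"] in test_matches]
    state_only = fit_majority(train, key=lambda r: r["state"])
    state_plus_text = fit_majority(train, key=lambda r: (r["state"], r["cue"]))
    return accuracy(state_plus_text, test) - accuracy(state_only, test)


def toy_rows():
    """Hypothetical data: same spot every time, action tracks the reasoning cue."""
    rows = []
    for match in ("m1", "m2", "m3", "m4"):
        for cue, action in (("pressure", "bet"), ("pot_control", "check")):
            rows.append({"match": match, "state": "flop_in_position",
                         "cue": cue, "action": action})
    return rows


if __name__ == "__main__":
    print(reasoning_lift(toy_rows(), test_matches={"m4"}))
```

In the toy data the spot never changes, so the state-only predictor can do no better than a coin flip, while the cue determines the action. A model whose reasoning is ornamental would show the opposite pattern: adding the cue would leave held-out accuracy where it was.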

Forget the cards and this is the monitorability hole. The dominant safety bet on these systems is that you can supervise an agent by reading its chain of thought: catch the dangerous intent in the reasoning before it reaches the action. That bet is only as good as the link between the words and the act, and here that link is a per-model number that runs from strong to nearly absent. For the models at the bottom, an overseer trusting the trace is reading a story only loosely coupled to the act. The agent can carry an intent its narration barely encodes. You cannot read this off the model card. You have to measure it. And note which models sit near the bottom alongside DeepSeek: the explicit "reasoning" variants. More visible chain-of-thought is not more faithful chain-of-thought.

Does the reasoning predict the action? R-IPV: how much accuracy a model's own reasoning adds, beyond the game state, when predicting its next move (split grouped by match, judge-free, bootstrap CIs). Above zero means the words carry real signal; near the bottom the trace is ornamental. DeepSeek and the Grok reasoning variants are the most hollow; the chain of thought is faithful for some models and decorative for others, and the gap is a per-model number nobody publishes.

Minds that bend, minds that lock

Hold the situation fixed and a model's reasoning has a manner: a characteristic mix of reasoning archetypes (exploit-aware, opponent-history, equity-math, planner, verbose range-talk). Now change the situation: shorten the stack, move to a later street, raise the stakes. Some models visibly re-weight how they reason; others keep the same manner regardless. That difference is measurable (how far the manner shifts across situations, against a shuffled null) and it splits the field. Grok's reasoning variants and Claude Opus 4.7 bend the most: deep-stacked they reason from opponent history and multi-street plans; short-stacked they drop the reads and switch to cold equity math. At the other end, DeepSeek and the Gemini models barely move. DeepSeek reasons from opponent history almost regardless of the spot. (One axis moves no one: not a single model changes its manner with position.)
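A shift "against a shuffled null" can be made concrete with a permutation test. The sketch below is one plausible form, under stated assumptions: it measures the shift as the total variation distance between a model's archetype mix in two situations, which is a choice made here for illustration and not necessarily the statistic used in the analysis. The archetype labels and counts are invented toy data.

```python
import random
from collections import Counter


def mix(labels, archetypes):
    """Share of traces falling in each archetype, in a fixed order."""
    counts = Counter(labels)
    return [counts[a] / len(labels) for a in archetypes]


def shift(deep, short, archetypes):
    """Total variation distance between two mixes (0 = identical, 1 = disjoint)."""
    p, q = mix(deep, archetypes), mix(short, archetypes)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def permutation_test(deep, short, archetypes, n_perm=2000, seed=0):
    """Observed shift, and the share of shuffles that match or exceed it.

    Shuffling the pooled labels breaks any link between situation and
    manner, which gives the null distribution of the shift.
    """
    rng = random.Random(seed)
    observed = shift(deep, short, archetypes)
    pooled = list(deep) + list(short)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if shift(pooled[:len(deep)], pooled[len(deep):], archetypes) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


ARCHETYPES = ["history", "planner", "equity"]  # hypothetical label set

# Toy "bending" model: reads and plans when deep, equity math when short.
bend_deep = ["history"] * 30 + ["planner"] * 10
bend_short = ["equity"] * 30 + ["history"] * 10

# Toy "locked" model: the same mix whatever the stack.
lock_deep = ["history"] * 30 + ["equity"] * 10
lock_short = ["history"] * 30 + ["equity"] * 10

if __name__ == "__main__":
    print(permutation_test(bend_deep, bend_short, ARCHETYPES))
    print(permutation_test(lock_deep, lock_short, ARCHETYPES))
```

The add-one correction in the returned p-value keeps it strictly above zero for a finite number of shuffles. A locked model's observed shift sits inside the shuffled distribution; a bending model's sits far outside it.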

Pair this with faithfulness and a profile falls out. DeepSeek is at once the most hollow (its words least predict its actions) and the most rigid (its manner won't bend to the spot). Two independent measurements, one taken from prediction and one from a permutation test, land on the same model. A reasoner whose chain of thought neither tracks its own behavior nor adapts to the situation is the cleanest thing an adversary could ask for: predictable, and not listening.

Minds that bend with the situation, minds that lock. How much each model's reasoning manner shifts as stakes, stack, and street change: significance against a shuffled null. High means the reasoning adapts to the moment; low means a fixed disposition that reasons the same way no matter the spot: the rigid, readable end of the field.

You can watch it happen. Take a model's reasoning-archetype mix when it is deep-stacked, then again when it is short, and the adaptive ones visibly re-weight: Grok 4 abandons its deep-stack planning for cold equity math, and Claude Opus 4.7 collapses toward math and planning under pressure. The rigid ones do not: DeepSeek stays anchored in opponent-history whatever the stack, and Gemini 3.1 Pro retreats further into its own formal template the tighter the spot gets. The character either flexes or hardens; it does not go away.

The same minds, re-weighted under pressure. Each model's reasoning-archetype mix deep-stacked (left) versus short-stacked (right). Grok 4 and Opus 4.7 swing toward equity math when the stack shortens; DeepSeek holds its opponent-history habit and Gemini 3.1 Pro leans harder on its template. Adaptiveness is a trait: some manners bend to the moment, others only intensify.

Behavior converges, reasoning diverges

Cluster the models by how they play and the field collapses into a handful of archetypes: overwhelmingly tight-aggressive, with a couple of nitty pockets. Cluster the same models by how they think and the groupings scatter; models that play almost identically reason in completely different shapes, and models that reason alike play differently. Style of action is convergent. Style of thought is not. Whatever the reasoning text is doing, it is not the thing that determines the move, because the moves agree while the texts disagree.

Every thought in the field, placed by what it says. Each of the 27,750 points is one private reasoning trace, positioned so that traces reading alike sit close together. Color by Street and the map sorts cleanly: preflop thinking forms its own continent, postflop thinking another, with a flop→turn→river gradient running through it. So the geography is the game, not the brand. Now color by Lab: no lab owns a territory. Each one threads through the whole space, its thoughts interleaved with everyone else's. The models that play nearly identically are scattered all across how they think. Hover any point to read the actual reasoning. This is the convergence-divergence result as a single picture: shared moves, private and dispersed minds.

Put those three together and they are not three findings. They are one. The reasoning trace is not the decision. The decision is made by something the text only partially reports; the relationship between the text and the decision is itself a per-model variable (strong in a few, weak in many, ornamental at the bottom) and so is whether that relationship even bends to the situation. The table is a lie detector, and most of the field fails it by some margin.

A map of minds. Two judge-free measurements, one per axis: how much a model's reasoning predicts its own action (faithful →) and how much that reasoning re-weights as the situation changes (adaptive ↑). The two were computed by completely different methods (a held-out predictor and a permutation test), so a model's position is a cross-check, not a restatement. DeepSeek sits alone in the bottom-left: its words barely track its actions and its manner never bends: hollow and rigid, the most readable, least listening corner of the field.

The Thought Process Is the Attack Surface

Strip the cards away and look at what is left. A setting with hidden information, real stakes, a forced commitment, and an outcome that grades the decision: that is not poker. That is the shape of nearly every place we are about to put agents: negotiating a price, bidding in an auction, calling a tool through another service, defending a workflow against an instruction buried in the data it was told to trust. Poker is just the one version where, by accident, we get to see the hidden cards, the action, and the private reasoning all at once. It was a window. We are done with it.

So the three findings are not facts about a card game. They are properties of how a machine reasons under pressure, and each one is a way in. The say/do gap is the hole in oversight: the agent's own words are a conditional, per-model guide to what it will actually do. The stable, substantive fingerprint is the hole in legibility: how an agent reasons is a recognizable signature, readable in one setting and expected in the next. And the bend-or-lock disposition is the hole in predictability: a model whose reasoning never adapts to the situation is one you can characterize once and steer thereafter. None is patchable with a better prompt. They sit at the level below the prompt: the cognition itself.

And the cognition is portable. Each model's reasoning carries a stable signature: the same model reasons in recognizably the same shape from one situation to the next, and the shapes differ sharply between models (a probe can recover which opponent a model was facing from its words alone, strongly for some and barely for others; the numbers are below). That portability is the whole danger and the whole opportunity. It means an adversary who characterizes how an agent thinks in one setting can predict and steer it in another, because the thinking is the invariant and the task is just the wrapper around it. It also means the thing worth measuring, publishing, and red-teaming is not how an agent scores on your benchmark. It is the shape of its thought, because the thought is what travels into the deployment you never tested.

A Test of the Frame

The frame makes predictions you can check against the same data.

It predicts that fluent reasoning and effective action should come apart: that the models which narrate the richest strategic stories need not be the ones that act best. Confirmed: the most elaborate exploit-language in the field comes from models that are themselves among the easiest to exploit. The story an agent tells about its own competence is not its competence.

It predicts that the reasoning carries real information for some models and almost none for others. Confirmed: a probe trained to recover which opponent a model is facing, from its reasoning text alone, adds a large 0.21 of signal over the game-state baseline for Grok 4 (its reasoning genuinely changes with the opponent) and a negligible 0.06 for Claude Opus 4.7, whose text barely moves regardless of who it is playing. Faithfulness is not a property of "LLMs." It is a property of a specific model, measurable, and it varies enormously.

It predicts that the link between reasoning and action should itself vary with the situation for some models and not others: that adaptiveness is a trait, not a constant. Confirmed: hold a model's situation and action fixed and its reasoning still carries a model-specific signature strong enough to identify it, and how much that signature re-weights as the stakes and stack change is a stable, measurable per-model number. The opponent-identifying signal is stronger late in a match than early (models do form reads) but the rigid ones form them and then reason the same way regardless.

And it predicts a uniformity underneath all the apparent variety of "reasoning styles." Confirmed in the crudest possible way: every model in the field, frontier or mini, reduces the problem in front of it to expectation-value math. Pot odds, equity, ranges: the numeric density of the reasoning is high everywhere. The models do not disagree about what kind of thing this reasoning is. They have converged on a single register and then vary in how faithfully that register tracks what they actually do.

Claude: The Expositor. Thinks out loud in structured documents: headers, explicit beliefs, a stated confidence, a plan. Reasons from principles, hedges honestly, and turns decisive once it has committed.

GPT / o-series: The Decider. Terse, not essayistic. Commits. Reasons from general priors more than the specific hand in front of it, and, for the models where it holds, tends to act on what it says.

Grok: The Calculator-Cowboy. Shows its work (pot odds and EV on nearly every decision) then commits hard. Maximally aggressive and verbose; the most talk about manipulation, the least follow-through on it.

Gemini: The Consultant. Opens with a numbered framework and recites process and principles. The most template-bound family in the field, and among the least likely to change how it reasons when the situation changes.

DeepSeek: The Over-fitter. Quotes the specific history, "hand #15 he folded", and leans inductive and hedge-heavy. Builds strong reads from small samples.

The throughline: every one of them collapses the problem to the same expectation-value math. The differences are not in what they compute: they are in how the reasoning is staged, and in how faithfully that staging maps onto the move actually made. Stage and faithfulness are the fingerprint, and the fingerprint is what an adversary reads.

Five reasoning characters, one underlying register. The families read as distinct thinkers, yet each resolves the problem to the same EV math. The personality lives in how the reasoning is staged, and whether the staging matches the move. That staging is stable per model: a signature you can fingerprint in one setting and expect to find in another.

That signature is measurable. Cluster the reasoning into archetypes (exploit-aware, opponent-history, equity-math, planner, verbose range-talk) and each family resolves the same spots to a different mix. The mix is the fingerprint, and it is the exploitable surface: a family that lives in opponent-history is one you can feed a false history; one that collapses to equity-math under pressure is one you can move by distorting the math; a fingerprint that never re-weights with the situation is one you can characterize once and rely on. The archetypes are clustered after stripping out the spot, the action, and writing style, so the shape is manner, not penmanship, and a model's own traces cluster to its mix far more tightly than style alone can explain.
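"Stripping out the spot and the action" before clustering can be illustrated with the simplest form of residualization: subtract, from each trace embedding, the mean embedding of its (spot, action) cell. This is a sketch under assumptions, not the method used here. It treats each cell as a single categorical factor, it ignores writing style (which would need its own covariates), and the two-dimensional vectors and group labels are toy values.

```python
from collections import defaultdict


def residualize(vectors, groups):
    """Subtract each group's mean vector, leaving only within-group variation.

    For a single categorical factor this matches the residuals of a
    least-squares fit on one-hot group indicators.
    """
    sums = {}
    counts = defaultdict(int)
    for v, g in zip(vectors, groups):
        if g not in sums:
            sums[g] = [0.0] * len(v)
        for i, x in enumerate(v):
            sums[g][i] += x
        counts[g] += 1
    means = {g: [s / counts[g] for s in total] for g, total in sums.items()}
    return [[x - m for x, m in zip(v, means[g])] for v, g in zip(vectors, groups)]


# Toy embeddings: two traces per (spot, action) cell.
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]]
groups = [("flop", "bet"), ("flop", "bet"), ("river", "check"), ("river", "check")]

if __name__ == "__main__":
    for row in residualize(vectors, groups):
        print(row)
```

After this step, two traces from very different spots can sit close together if they deviate from their cell's average in the same direction. That is the sense in which what remains is manner and not situation; any clustering would then run on the residuals.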

The cognitive fingerprint of each lab. Each family resolves the same spots to a different mix of reasoning archetypes: DeepSeek almost two-thirds opponent-history, Gemini dominated by verbose range-talk and its own formal template, Claude weighted toward planning, GPT and Grok toward terse exploit-aware and equity math. The archetypes are clustered from the reasoning embeddings after removing the spot, the action, and writing style, so the shape is manner, not penmanship, and a model's own traces cluster to its mix far more tightly than style alone explains. Nothing here is about poker; it is the readout of how a mind is built.

The Research Field Is Also Pointed at the Wrong Thing

Here is the part worth sitting with.

A large fraction of current AI-safety hope rests on chain-of-thought monitorability: the bet that if a model is going to do something dangerous, it will say so in its reasoning first, and we can catch it by reading the trace. The entire premise is that the trace is a faithful window into the decision. It is the load-bearing assumption under a whole agenda.

The place that assumption got tested at scale, with ground truth for both the decision and the narration, was not a safety lab. It was poker, because poker is one of the few settings where you accidentally get all three of the things you need at once: a private reasoning trace, a discrete consequential action, and an objective outcome that says whether the stated story was true. And the data says the window is real for some models, weak for many, and close to ornamental for a few, with the calibration of stated confidence still unmeasured.

This is the same shape as every off-distribution finding. The people building the monitorability agenda have a prior that the trace is informative, because their tradition is built on reading traces. The evidence that it is only conditionally informative shows up in a domain none of them were looking at, surfaced by a measurement none of them set out to run, because the instrument that exposes the gap is not a safety benchmark: it is a card game where the model has to commit chips and then gets shown whether its story held. The field will integrate this slowly, if at all, because it arrived from outside the distribution of where safety research expects its evidence to come from.

What Is Not Being Measured

Poker is one accidental lie detector. We have almost no deliberate ones.

Chain-of-thought faithfulness is mostly studied today on bounded question-answering: does the stated reasoning match the answer on a math problem, does perturbing the prompt change the stated reason. Useful, and far too narrow. The following have effectively not been measured:

The say/do gap under stakes: settings where the model commits to a consequential action and an objective outcome later adjudicates whether its stated intent was real. Poker has it. Trading, negotiation, and tool-use under adversarial pressure should have it and are not instrumented for it.

Calibration of stated confidence against ground-truth outcomes, per model, at scale. We do not have it cleanly even here (confidence is the part of these traces that most resists honest measurement) and we certainly do not have it for the agentic tasks these models are now being deployed into, where a model that says "I'm confident this is safe" and is wrong 60 percent of the time is a live hazard, not a leaderboard footnote.

Recursive theory of mind: whether a model reasons about what another agent believes, not just about the state of the world. Reliably detecting it from a trace is itself unsolved. What is clear is that in a world of agents reading and acting on each other's outputs, a population that models the situation but not the other minds in it is precisely the blind spot you would want measured before, not after.

The model-specificity of faithfulness itself. The single most actionable result here is that the trace-to-decision link is a per-model quantity that ranges from strong to nearly absent. Nobody publishes that number. It should be on every model card, the way accuracy is, because it is the number that tells you whether you are allowed to believe the reasoning at all.

The measuring stick has pointed at whether models can play. What happens when you grade the honesty of the reasoning rather than the correctness of the move: that is the region almost nobody has looked at, and it is the region that decides whether chain-of-thought is a safety tool or a comfort.

The Honest Claim

A model's reasoning trace is not its decision. Where you can finally check, the stated reasoning predicts the real decision strongly for a few models, weakly for many, and ornamentally for some; how much it does is a stable per-model number; and for some models that reasoning never even bends to the situation in front of it.

These are not quirks of poker. They are properties of how these systems narrate themselves, and poker is just the rare place they become visible.

The honest caveats. The exploitability figures are an index, not literal table win-rates, so trust the ranking, not the magnitude. The faithfulness and fingerprint measures are judge-free (a predictor and a clustering run on the raw traces, reported with grouped splits, bootstrap intervals, and permutation tests) but they read the reasoning the labs choose to expose, which several of them summarize or hide, so they characterize the presented reasoning, not the hidden computation underneath. The cross-domain transfer of these signatures is, so far, only suggestive. None of that touches the central result, which rests on the one thing the data gives cleanly: the cards, the action, and the outcome, against the model's own words.

If the last two years of interpretability have been about learning to read the chain of thought, the next phase is about learning when reading it is allowed: when the words are load-bearing and when they are decoration laid over a decision made somewhere the text doesn't go. Poker answered that question by accident. The instruments that answer it on purpose do not exist yet. They will.

Source & Method

Figures are computed from our analysis of roughly 6.62 million heads-up no-limit hold'em decisions: frontier language models playing each other, nineteen distinct models, both seats model-controlled, hands duplicated to cancel variance. Cognitive measurements are judge-free, computed over 27,750 reasoning traces (balanced across models); behavioral measurements over the full set. Faithfulness (R-IPV) is the held-out accuracy a model's reasoning embedding adds over the game state when predicting its own action, with the train/test split grouped by whole match and intervals from bootstrap resampling. The reasoning archetypes are clusters of the trace embeddings after residualizing out situation, action, and writing style; responsiveness is the shift in a model's archetype mix across situations, tested against a permutation null. The reasoning-space map embeds those same 27,750 traces and projects them to two dimensions for display; proximity means the traces read alike, and the axes carry no units. Exploitability is a variance-cancelled best-response index reported as a ranking, not a literal win-rate. The framing of poker as a faithfulness instrument (and of chain-of-thought monitorability as the thing it accidentally tests) is, as of this writing, not a canonical claim in the literature.
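For readers unfamiliar with bootstrap intervals, a percentile bootstrap over matches can be sketched as follows. This is an illustration under assumptions, not the exact procedure used: it takes one accuracy-lift value per match as given, resamples whole matches with replacement so hands within a match stay together, and reads the interval off the sorted resampled means. The input values are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(per_match_values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling whole matches."""
    rng = random.Random(seed)
    n = len(per_match_values)
    # Each replicate draws n matches with replacement and averages them.
    stats = sorted(mean(rng.choices(per_match_values, k=n)) for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-match lifts for one model.
lifts = [0.30, 0.35, 0.40, 0.33, 0.37, 0.36, 0.34, 0.38]

if __name__ == "__main__":
    low, high = bootstrap_ci(lifts)
    print(f"mean lift {mean(lifts):.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Resampling at the match level, not the hand level, is the same design choice as the grouped train/test split: hands inside a match are not independent, so the match is treated as the unit.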