The dominant narrative about generative AI at work is that it "compresses" skill differences: novices gain more than experts, the gap between good and great shrinks. This narrative is built on roughly six empirical studies, all on bounded single-session tasks with 2023-era models. A parallel literature on creativity and homogenization finds the opposite: AI raises individual quality while collapsing variance across users. A 2025 frontier-model study on real software engineering found experienced developers were nineteen percent slower with AI, with a forty-point gap between what they felt and what the clock recorded. These findings are not in conflict. They are three faces of the same mechanism: generative AI is a target-level attractor that pulls output distributions toward an internal target set by training and reinforcement. Below the user's skill floor, the attractor drags. Above the user's ceiling, it lifts. Across users, it homogenizes. The productivity literature measured the lift on bounded tasks. It has not measured, and with current instruments cannot measure, what happens when skilled operators work on open-ended, long-horizon, cross-domain, agentic problems. That is where the real frontier of AI-assisted work lives, and it is the region nobody has looked at.
The finding nobody is talking about
In February through June of 2025, a group of researchers at METR ran what is still the most carefully designed study of generative AI on real software engineering. Sixteen experienced open-source developers, each with roughly five years of prior work in their assigned codebase. Mature repositories, twenty-two thousand stars on average, over a million lines of code. Two hundred and forty-six real issues, randomly assigned to allow or disallow AI assistance. The tool was Cursor Pro with Claude 3.5 and 3.7 Sonnet, which were state-of-the-art at the time.
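The developers predicted AI would make them twenty-four percent faster. Economists asked to forecast the result predicted thirty-nine percent faster. Machine learning experts predicted thirty-eight percent faster. The clock said otherwise: with AI allowed, the developers took nineteen percent longer to complete their issues.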
The stranger finding came after the experiment. Having done the work themselves, the developers still believed AI had sped them up, by twenty percent. That is a gap of roughly forty points between what they felt and what the clock recorded. The authors described this gap as "striking."
Two caveats matter, and the rest of this essay depends on holding them. The sample is small (sixteen developers), and the models, Claude 3.5 and 3.7 Sonnet, have since been superseded by systems materially better at exactly this kind of work. A 2025 result on 2025 tools is not a permanent law; rerun on current frontier models, the slowdown could shrink or flip. So treat the nineteen percent as one carefully run data point about a specific regime, not a settled fact about AI and expertise. What survives better models is not the number but the mechanism: when the tool's output target sits below a skilled user's own level, every interaction is drag. That mechanism is what the rest of this essay defends. The slowdown figure is just its first, dated sighting.
This is the single most important study of AI-assisted work that has been published, and almost nobody in the productivity discourse cites it. There is a reason.
The compression story
The dominant narrative about AI and work goes like this: generative AI raises the floor. Novices gain more than experts. Skill differences compress. The gap between good and great narrows, and everyone ends up roughly equally productive on the task at hand.
This narrative has real empirical backing. Brynjolfsson, Li, and Raymond studied 5,179 customer-support agents and found that novice agents gained thirty-four percent in productivity while experienced agents gained roughly nothing. Noy and Zhang had 444 college-educated writers complete professional writing tasks with or without ChatGPT; quality rose across the board and the gap between good and bad writers shrank. Dell'Acqua and colleagues had 758 Boston Consulting Group consultants work on realistic consulting tasks with GPT-4; within the AI's capability frontier, lower-skill consultants improved most and the gap compressed dramatically. Cui and colleagues studied 4,867 developers at Microsoft, Accenture, and a Fortune 100 firm using GitHub Copilot; less-experienced developers adopted faster and gained more. Cruces and colleagues studied 1,174 adults in Argentina on an incentivized business-writing task; AI closed seventy-five percent of the baseline education gap.
Six studies, roughly sixteen thousand subjects, pointing the same direction. This is a real pattern. It is not a mirage.
But look at what the six studies share. Every one of them is a bounded task with a clear ceiling: a ticket to resolve, an email to draft, an MBE question to answer, a coding problem to solve. Every one is a single session, usually under two hours. Every one uses a 2023-era model: GPT-3.5, GPT-4, or Copilot's autocomplete layer. Every one measures execution of tasks the researcher pre-selected. None of them let the subject decide what was worth working on, change scope mid-stream, revisit the problem over weeks, or use AI the way a skilled operator actually uses AI.
The compression finding is real for what it measures. What it measures is a small, specific region of possible AI use: bounded, scripted, short-horizon, shallow-tool, obsolete-model. And it has been treated, in public discussion, as if it generalizes to how AI will work in the economy. It does not.
The other face of the same mechanism
While the productivity literature was reporting compression, a parallel literature was reporting what looks like the opposite, using the same class of tools. When researchers measured not speed or task completion but diversity of output across users, they found something startling: AI made individual work better while making collective work more similar.
Doshi and Hauser ran 293 writers through short-story tasks with and without ChatGPT assistance. Individual stories rated higher on novelty and usefulness. Across writers, the stories converged. They used similar phrasings, similar plot structures, similar frames. Individual quality up, collective diversity down. The paper is in Science Advances; the finding is not subtle.
Anderson, Shah, and Kreminski found the same pattern in divergent ideation. ChatGPT-assisted ideators produced more polished ideas and more elaborated ones. They also produced ideas that were measurably more similar to each other's. At the individual level, no difference. At the group level, homogenization.
Moon, Green, and Kushlev studied 2,200 college admissions essays. Each additional human essay added more new ideas to the pool than each additional GPT-4 essay. The diversity gap widened with scale. The more AI was in the pool, the less variety got added per essay.
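The "new ideas added per essay" measure can be made concrete with a small sketch. This is not Moon, Green, and Kushlev's actual pipeline; it assumes each essay has already been reduced to a set of idea labels, and the function name `marginal_new_ideas` and the toy data are hypothetical.

```python
def marginal_new_ideas(essays):
    """For each essay, in order, count the ideas not already in the pool."""
    pool = set()
    added = []
    for ideas in essays:
        new = set(ideas) - pool
        added.append(len(new))
        pool |= new
    return added

# Hypothetical toy data, invented for illustration only.
human_essays = [
    {"family", "loss"},
    {"music", "failure"},
    {"family", "migration"},
    {"robotics", "doubt"},
]
ai_essays = [
    {"growth", "resilience"},
    {"growth", "community"},
    {"resilience", "community"},
    {"growth", "resilience"},
]

human_added = marginal_new_ideas(human_essays)
ai_added = marginal_new_ideas(ai_essays)
print(human_added, sum(human_added))  # [2, 2, 1, 2] 7
print(ai_added, sum(ai_added))        # [2, 1, 0, 0] 3
```

In the toy data the AI pool saturates after two essays while the human pool keeps growing, which is the shape of the result described above: the gap in accumulated ideas widens as the pool scales.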
Agarwal, Naaman, and Vashistha found that Indian writers using AI writing suggestions had their prose move measurably toward Western, American norms. Cultural nuance reduced. Style converged.
Inoshita and colleagues did the cleanest version of this work in 2026. They took 6,875 essays and ran them through five conditions: human only, AI only, and three variants of human plus AI. They measured structural dimensions of the writing: cohesion architecture, argument depth, originality. The variance collapse was extreme. On cohesion architecture, the standard deviation dropped from 0.47 in the human-only condition to 0.22 with AI assistance, roughly a fifty to sixty percent reduction in spread. Low-skill essays rose from 2.09 to 4.07 on the cohesion metric. High-skill essays rose from 3.49 to 4.12. The original spread of 1.40 collapsed to 0.05. Everyone converged on the same structural template.
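The arithmetic behind these figures can be checked directly. The short sketch below uses only the numbers quoted above for the cohesion dimension; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted above for the cohesion-architecture dimension.
sd_human_only = 0.47
sd_with_ai = 0.22

low_before, low_after = 2.09, 4.07
high_before, high_after = 3.49, 4.12

# Relative reduction in standard deviation.
sd_reduction = (sd_human_only - sd_with_ai) / sd_human_only

# Gap between high- and low-skill essays, before and after AI assistance.
gap_before = high_before - low_before
gap_after = high_after - low_after

# Gain for each group.
low_gain = low_after - low_before
high_gain = high_after - high_before

print(f"SD reduction: {sd_reduction:.0%}")
print(f"Skill gap: {gap_before:.2f} -> {gap_after:.2f}")
print(f"Gains: low {low_gain:.2f}, high {high_gain:.2f}")
```

The low-skill gain (1.98 points) is about three times the high-skill gain (0.63 points), and both groups land within 0.05 of each other. Unequal gains that end at nearly the same level are what a shared target would produce.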
Inoshita's framing is worth quoting carefully. They explicitly reject regression-to-the-mean as the mechanism. Instead they propose what they call a target-level attractor:
AI possesses internal target levels for each structural dimension and pulls all essays toward these targets regardless of starting point. This attractor model explains why quality improvement and homogenization are two sides of the same mechanism.
This is the frame that unifies everything.
Target-level attractor
Generative AI has been trained on human text and then post-trained by RLHF to prefer safe, high-quality, consensus-pleasing outputs. Kirk and colleagues, in an ICLR 2024 paper, showed directly what this does at the token level: RLHF reduces output diversity compared to base models. The model does not merely approximate the median of its training data. It actively concentrates around a narrower set of responses that human raters preferred. Mode collapse, measurable across inputs, technical-side evidence of the same phenomenon everyone else is seeing at the behavioral level.
This is the mechanism. The model has an internal target. That target is high quality relative to the bottom of human performance and mediocre relative to the top. When a user interacts with the model, their work gets pulled toward the target. Not because of statistical regression to the mean, but because the model actively generates outputs near its target.
Now the compression and homogenization findings are the same finding, seen from two angles.
Where the target sits above the user
On a bounded task where the target sits above most users' skill floor (customer support, boilerplate code, business writing), the attractor pulls the low end up more than it pulls the high end down. The result looks like compression. Novices gain thirty-four percent; experts gain nothing. The gap shrinks. Everyone converges on the target.
Where the target sits within the user's range
On an open-ended task, or a task where skilled users can produce work above the target, the attractor pulls the top end down as much as it pulls the bottom up. The result looks like homogenization. Individual quality may rise, the floor moves, but everyone converges on the same structural template. The ceiling drops. Variance collapses. Good writers start sounding like everyone else.
Where the target sits below the user
On a task where the skilled user's work is already far above the target, the attractor produces slowdown. Every suggestion is below the user's own level. Every acceptance is a concession. Every rejection takes time. The AI is not a multiplier. It is an interruption. METR measured this regime directly. Experienced developers on mature codebases with five years of cached context had their unassisted performance already above what a 2025-vintage frontier model could generate for their specific codebase. AI dragged them toward its target, which for them meant downward in quality or at best sideways in speed. Hence: nineteen percent slower.
The compression literature, the homogenization literature, and the METR finding are not three separate stories. They are one story seen from three angles. Whether the attractor helps or hurts depends entirely on where the target sits relative to the user's unassisted performance. Below the user: drag. Above the user: lift. And since the target is roughly fixed (it is the model's post-training equilibrium), the effect varies by task and by user in predictable ways.
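The three regimes can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not a model fitted to any of the studies: it assumes assisted output moves a fixed fraction `pull` of the way from the user's baseline toward a single `TARGET`, and the skill distributions, the 0.8 pull strength, and the 4.0 target are all hypothetical choices.

```python
import random
import statistics

def assisted(baseline, target, pull=0.8):
    """Move a user's output a fraction `pull` of the way toward the target."""
    return baseline + pull * (target - baseline)

def summarize(label, baselines, target, pull=0.8):
    """Report the mean shift and the spread before and after assistance."""
    outputs = [assisted(b, target, pull) for b in baselines]
    mean_change = statistics.mean(outputs) - statistics.mean(baselines)
    spread_before = statistics.pstdev(baselines)
    spread_after = statistics.pstdev(outputs)
    print(f"{label}: mean change {mean_change:+.2f}, "
          f"spread {spread_before:.2f} -> {spread_after:.2f}")
    return mean_change, spread_before, spread_after

random.seed(0)
TARGET = 4.0  # hypothetical target level on an arbitrary quality scale

# Hypothetical populations, defined by where the target sits relative to them.
novices = [random.gauss(2.5, 0.5) for _ in range(1000)]  # target above users
mixed = [random.gauss(4.0, 1.0) for _ in range(1000)]    # target within range
experts = [random.gauss(5.5, 0.5) for _ in range(1000)]  # target below users

lift = summarize("target above users ", novices, TARGET)
homog = summarize("target within range", mixed, TARGET)
drag = summarize("target below users ", experts, TARGET)
```

In every case the spread shrinks by the same factor (one minus `pull`), while the sign of the mean change depends only on where the target sits: positive when it is above the population, near zero when it is inside the range, negative when it is below. The sketch treats drag as a loss on a single output scale; in the METR regime the cost showed up as time rather than quality, which this toy model does not capture.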
A test of the frame
The frame makes predictions that can be checked against existing data.
It predicts that bounded tasks with low ceilings show compression. Call centers: confirmed. Customer-support scripts have a ceiling only slightly above novice performance. Compression found.
It predicts that bounded tasks with higher ceilings show compression only on the lower portion of the skill distribution, with flatlining or mild negative effects at the top. Dell'Acqua consulting: confirmed. Within the frontier, lower-skill gained most; top consultants gained little. Noy and Zhang writing: confirmed. Low-skill writers compressed upward; top writers sometimes slightly worsened.
It predicts that open-ended tasks show homogenization rather than compression, because the ceiling is unbounded and the attractor pulls everyone toward its target regardless of starting position. Doshi and Hauser, Anderson, Moon, Agarwal, Inoshita: confirmed across five studies.
It predicts that skilled users on high-correctness tasks with familiar codebases show slowdown, because the attractor's target is below their unassisted level and every interaction is drag. METR saw exactly that, at least once: a nineteen percent slowdown, on a small sample and now-dated models, but in the direction the mechanism requires.
It predicts that open-ended, long-horizon, self-directed use should show expansion rather than compression, because skilled users can route AI to below-target work (where it helps) while doing above-target work themselves (where it would slow them down). Otis on Kenyan entrepreneurs over five months: confirmed. High performers gained fifteen to twenty percent, low performers lost about ten percent. Expansion, not compression.
One study in the literature appears to contradict the frame: Ashkinaze and colleagues, in a 2025 CHI paper with 800 participants across 40 countries, found that passive AI exposure, being shown AI-generated examples before brainstorming, increased collective diversity rather than reducing it. This is worth engaging honestly. The distinction matters: when AI is a seed rather than a collaborator, it appears to expand the space of human ideas rather than narrow it. When AI is in the loop producing outputs the human accepts or edits, the attractor pulls. When AI is outside the loop providing stimulus the human uses as a launching pad, the attractor does not operate. The mechanism is not "AI reduces diversity." It is "AI in the generation loop pulls toward the attractor." The boundary matters.
The research field is also an attractor
Here is the strangest thing about this literature.
The studies reporting compression were run almost entirely by economists whose careers were built on the productivity paradox, the decades-long puzzle of why computers did not show up in aggregate productivity statistics. For them, generative AI was the arrival of the long-predicted boom. They had a prior that AI would produce measurable productivity gains, ran studies designed around that question, and found what they expected to find.
The studies reporting slowdown and homogenization were run by AI-capability researchers, HCI labs, and field economists working outside the productivity-paradox tradition. METR is an AI safety organization that needs AI to be capable in order to justify its mission; a finding that AI slows experienced developers is institutionally inconvenient. Otis was doing development economics in Kenya, not consulting-firm RCTs in Boston. The homogenization researchers were creativity and HCI scholars with no stake in the productivity narrative. They found what they found.
This is not an accusation of dishonesty. It is the same mechanism playing out one level up. The training data of an economist is their discipline's questions, methods, and priors. Run that economist on a new problem and they produce output near the attractor of their field's consensus. The economist who finds a surprising result is disproportionately likely to be the one standing outside the dominant training distribution, seeing something the median researcher in the field is structurally unable to see.
If the attractor frame is right for AI, it is also right for AI research. The field is currently converging on the compression narrative because the loudest voices are the ones trained on that narrative. The counter-evidence exists, but it is buried, cited less, and not integrated into the public story. The narrative will change only when enough work from outside the distribution accumulates to shift the median of what is considered canonical.
What is not being measured
The compression literature is built on a specific regime. Bounded tasks. Single sessions. Obsolete models. Shallow tools. Pre-selected work. None of these match how AI is actually used by people who have spent years learning to use it well.
The following have effectively not been measured.
Open-ended, multi-week, self-directed AI use with frontier models and hard outcome measures. Otis is the only clean study in the set, with 640 subjects over five months. METR is adjacent on the coding side but still structured around fixed tasks. There is no equivalent for writing, design, research, or business building.
Cross-domain AI use. Every productivity study looks at one task in one domain. Nobody has measured what happens when a single user deploys AI across multiple unrelated problems: porting expertise from one field to another, building systems that span legal work and computer vision and voice interfaces and security research in the same week.
Agentic AI use. Every productivity study measures either autocomplete (Copilot) or chat (ChatGPT). None measure tool use, multi-step planning, parallel sub-agents, memory systems, custom skills, orchestration across models. The entire infrastructure that makes modern AI leverageable is absent from the literature.
Scope expansion. All productivity studies assume fixed scope: here is the task, did you complete it faster with AI? No study measures the counterfactual of whether the task would have been attempted at all without AI. When Anthropic engineers describe "barely typing," what they are likely describing is not "same work faster" but "more and different work than would otherwise have happened." This cannot be measured with a stopwatch. It can only be measured by comparing what workers with AI access produce over a year to what comparable workers without access produce over the same year, on self-directed goals. No such study has been run.
Highly skilled operators using AI well. The literature samples call-center agents, BCG consultants, developers at Microsoft, Kenyan SMEs, Argentine adults. It has not sampled anyone in the top one percent of AI usage: the people who have built custom memory systems, voice layers, and multi-agent dispatch, and who have fifteen or more distinct AI systems under their belt. These people exist, and their productivity is not measurable by any instrument currently in use.
The literature, in other words, has measured the median. The attractor of the research field has kept it at the median. What happens far from the median (either far below, like a worker who uses AI as autocomplete and gets slower, or far above, like a worker who uses AI as an orchestrated substrate for cross-domain building) has not been measured at all.
What this means
The question "does AI make people more productive" has no single answer and cannot have one. It is the wrong question.
The right question is: where does the attractor sit relative to the user's unassisted performance, on this specific kind of work, with this specific tool stack, over this specific horizon?
Below the user's floor, on any task: drag. The user spends time reading, evaluating, and rejecting suggestions they would have outperformed. Outcome: slowdown. This is METR's regime, and it is the regime of most skilled work on familiar problems.
Near the user's level, on bounded tasks: marginal help, often masked by a large perception gain that does not match clock time. The user feels faster because reading is lighter than generating. This is the regime where most workplace AI use currently sits.
Above the user's floor but below their ceiling, on bounded tasks: compression. The attractor lifts the low end more than the high end. The gap between novice and expert shrinks on this specific task. This is the productivity literature's headline finding, and it is real for the regime it describes.
Above the user's floor, on open-ended or high-ceiling work: homogenization. The user's individual quality rises, but their work becomes measurably more similar to other AI-assisted work. Diversity collapses across users.
Below the user's level on the work they keep and above it on the work they delegate, with the user orchestrating AI around work they could not do alone: expansion. The user takes on scope they could not have attempted without the tool, produces work that would not otherwise exist, and the output distribution widens rather than narrows. This is the regime of skilled operators and the regime the literature has never measured.
"Does AI make people faster" is a question shaped by the attractor of the research field. "Where does the attractor sit, and what does it do to different users in different regimes" is a question the evidence actually supports. The second question predicts every finding in the literature, including the counter-findings. The first question predicts only the subset that the economists came looking for.
The honest claim
Generative AI is a target-level attractor, not a multiplier. It pulls output distributions toward an internal target set by training and reinforcement. On tasks where the target sits above most users, this appears as a productivity gain and a compression of skill differences. On tasks where the target sits within the range of skilled users, it appears as homogenization. On tasks where it sits below them, it appears as slowdown and the perception-stopwatch gap that METR documented. The mechanism is one. The appearances are many.
The productivity literature has measured the attractor's lift on the low end of bounded tasks. It has not measured, and cannot measure with its current instruments, what happens when skilled operators work above the target on open-ended, long-horizon, cross-domain, agentic problems. That is where the actual frontier of AI-assisted work is, and it is the region nobody has looked at.
If the past two years of research have been about discovering the attractor's lift, the next phase will be about discovering its drag. More importantly, it will be about finding the regimes where skilled operators can route work around the attractor rather than through it. Those regimes exist. They are not yet named in the literature. They will be.
Sources and method. This essay cites Becker et al. (METR, 2025), Brynjolfsson, Li, and Raymond (2023/2025), Dell'Acqua et al. (2023/2026), Noy and Zhang (2023), Cui et al. (2025), Cruces et al. (2026), Otis et al. (2024), Doshi and Hauser (2024), Anderson, Shah, and Kreminski (2024), Moon, Green, and Kushlev (2025), Agarwal, Naaman, and Vashistha (2025), Inoshita et al. (2026), Kirk et al. (ICLR 2024), and Ashkinaze et al. (CHI 2025). The attractor framing is adapted from Inoshita et al.'s "target-level attractor" language. The unification of compression and homogenization under one mechanism is not, as of this writing, a canonical claim in the literature.