The Confidence Inversion

What you’ll learn

The 1994 Pentium bug and three peer-reviewed studies reveal a quiet collapse: fluency—once a reliable proxy for competence—no longer signals expertise because AI makes polished output free.
The core takeaway is that organizations now face a dangerous inversion where the least knowledgeable sound the most confident, while true experts hedge and get talked over, because the visible gap between their work and everyone else's has vanished.

A 1994 Pentium bug, three peer-reviewed studies, and the quiet collapse of the proxy we used to rank competence — and what it means for any organization that lets fluent-sounding output carry weight without checking what is behind it.

On a June afternoon in 1994, a mathematics professor in Virginia caught the most valuable company in computing telling a lie. It took him four months to prove it, and for most of that time he was certain the liar was himself.

His name was Thomas Nicely, and he was doing something gloriously impractical: summing the reciprocals of the twin primes, chasing more digits of an obscure quantity called Brun's constant. To grind through the arithmetic he'd wired together a little farm of PCs, and that summer he added a shiny new box built around Intel's brand-new Pentium — 3.1 million transistors of state-of-the-art silicon, the chip about to make Intel a household name.

Almost immediately, his numbers stopped reconciling. Not by much. A tiny disagreement, buried deep in the decimals, between the new Pentium machine and every other computer in his farm.

Here is the part worth slowing down for, because it is the whole point of this essay compressed into one man's behavior. Nicely did not announce that Intel's flagship processor was broken. He assumed he was. He suspected his own code first. Then his compiler. Then the motherboard chipset. Then some loose interaction between math libraries. For four months he interrogated everything under his own control, clearing himself as a suspect one possibility at a time. Only on October 19, after every other culprit had a watertight alibi, did he let himself reach the conclusion he'd been resisting: the silicon itself could not divide correctly.

The failing case was almost insultingly small. Ask an early Pentium to compute 4195835 ÷ 3145727 and it answered about 1.33373 — wrong in the fifth significant figure; the true value sits closer to 1.33382. The root cause, uncovered later, was a handful of missing entries in a lookup table buried in the division circuitry. Five blank cells in a grid. That was all it took to make a three-million-transistor marvel quietly lie to you about arithmetic.
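For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes only the two operands and the rounded flawed answer quoted above (1.33373 is the essay's rounded figure, not the chip's full output), and the helper name first_differing_sig_digit is made up for this illustration.

```python
from fractions import Fraction

# Operands quoted in the essay.
NUMERATOR = 4195835
DENOMINATOR = 3145727

# Flawed quotient at the essay's precision ("about 1.33373").
# This is a rounded figure, not the chip's full output.
FLAWED_APPROX = Fraction("1.33373")


def first_differing_sig_digit(a: str, b: str) -> int:
    """Return the 1-based position of the first significant digit where a and b differ."""
    digits_a = a.replace(".", "").lstrip("0")
    digits_b = b.replace(".", "").lstrip("0")
    for position, (da, db) in enumerate(zip(digits_a, digits_b), start=1):
        if da != db:
            return position
    return 0  # no difference within the shared digits


# Exact rational quotient, so no floating-point hardware is trusted here.
exact = Fraction(NUMERATOR, DENOMINATOR)
relative_error = abs(exact - FLAWED_APPROX) / exact

true_text = f"{float(exact):.5f}"
flawed_text = f"{float(FLAWED_APPROX):.5f}"

print("true quotient  :", true_text)
print("flawed quotient:", flawed_text)
print("first differing significant digit:", first_differing_sig_digit(true_text, flawed_text))
print(f"relative error : {float(relative_error):.1e}")
```

The error is a few parts in a hundred thousand: invisible in a spreadsheet total, fatal in a computation that depends on the deep decimals.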

On October 24, Nicely reported it to Intel. And here the confident half of the world steps to the microphone.

Intel's public posture was a masterpiece of calm. The flaw, they explained, was statistically negligible: an average user might stumble into it perhaps once every 27,000 years. Fluent. Authoritative. Backed by a multibillion-dollar brand and a building full of PhDs. It sounded like the end of the conversation, and for a few weeks it very nearly was.

The trouble is that "rare for an average user" is not the same sentence as "safe" — and the people running the workloads that actually hammered the divider, the Nicelys of the world, were not average users. IBM ran its own numbers, concluded the bug could surface for real workloads as often as every 24 days, and did the unthinkable to its biggest supplier: it halted shipments of Pentium PCs.
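The distance between "27,000 years" and "24 days" is less mysterious than it looks. A rough sketch: if each division fails independently with some small probability, the expected wait between errors is one over (probability times divisions per day). The inputs below are hypothetical, picked only so the outputs land in the same neighborhood as the two figures above. They are not Intel's or IBM's published assumptions.

```python
def years_between_errors(p_error_per_divide: float, divides_per_day: float) -> float:
    """Expected years between wrong results, assuming independent divisions."""
    errors_per_day = p_error_per_divide * divides_per_day
    return 1.0 / errors_per_day / 365.25


# Hypothetical inputs, chosen only to show sensitivity to workload.
# They are NOT the assumptions Intel or IBM actually used.
light_user = years_between_errors(p_error_per_divide=1e-10, divides_per_day=1_000)
heavy_user = years_between_errors(p_error_per_divide=1e-8, divides_per_day=4_000_000)

print(f"light user: one error every {light_user:,.0f} years")
print(f"heavy user: one error every {heavy_user * 365.25:,.0f} days")
```

Same chip, same defect. Change who is doing the dividing and how often, and "never in your lifetime" turns into "before the end of the month."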

The confident framing collapsed in real time. On December 20, Intel offered to replace every affected chip. In January it swallowed a pre-tax charge of $475 million — north of $890 million in today's money — for what became the first full recall of a computer chip in history.

A man counting gaps between prime numbers, with no authority and no platform, was right. The most confident company on Earth was wrong. And it nearly closed the case on the strength of a number that merely sounded final.

Why I'm telling you a story from 1994

Because the dynamic in it is older than the machines — and we just built an engine that mass-produces the dangerous half.

Look at the two characters again. One is an expert and therefore doubts himself first, methodically, before he'll accuse anyone else. The other sounds like an expert — calm, fluent, credentialed — and is confidently, expensively wrong. That split has always existed. What's changed in the last two years is that we handed everyone a tool that manufactures the second character's voice for free, in nine seconds, on any topic.

You've felt the result. The colleague who couldn't define a vector embedding eight months ago now holds court on "the agentic stack" with the serenity of a man who has never once been wrong. Meanwhile the actual systems engineer in the room — the one who has shipped this stuff, broken it, and stayed up until 3 a.m. fixing it — says "well, it depends," and gets talked over.

This isn't a vibe. It's a measurable inversion: the people who understand the least have never been more sure of themselves, and the people who understand the most have started to hedge. Once you see the mechanism, you can't unsee it.

The signal didn't get noisier. The signal died.

Almost everyone frames this as a confidence problem — as if some people simply got cockier. That's downstream. The real event is the collapse of a signal we'd all been quietly relying on for our entire professional lives.

For as long as knowledge work has existed, we used a cheap proxy to rank competence: fluency. Vocabulary, structure, command of jargon, the confident sentence that lands its point. These were never proof of expertise — but they were decent correlates, because producing expert-sounding output used to require expertise. You couldn't write a crisp paragraph on power-supply derating or a clean timing diagram unless you'd done the work. The cost of sounding smart was being smart.

That cost just went to zero.

When anyone can generate a fluent, structured, jargon-perfect artifact in seconds, fluency stops carrying information. The skimmer's output and the expert's output now look identical on the page. And because we never consciously knew we were using fluency as a proxy, nobody noticed the proxy break. We just kept trusting the surface — except the surface is now pure noise.

So two things happen at once:

The person who always wanted to sound like an expert finally does. The tool gave them the costume, and the validation feels earned, because the output is genuinely polished.
The person who is an expert watches their one legible advantage — the visible gap between their work and everyone else's — vanish. The thing that used to separate them stopped working. So they start wondering if the gap was ever real.

It was real. It just stopped being visible. Those are very different things, and confusing them is the entire trap.

Fig 1 — Before vs. now: when the cost of sounding expert fell to zero, the link between fluent output and real expertise broke. The output looks identical on the page; the path that produced it does not.

Now for the numbers, because feelings aren't evidence

If this were just a barstool theory, you'd be right to ignore it. It isn't. The research lands almost uncomfortably on the nose.

Exhibit A — the experts who got slower and never noticed.

In 2025, the research nonprofit METR ran what's still the most rigorous study on this question: a randomized controlled trial — the gold-standard methodology — on 16 seasoned open-source developers working across 246 real tasks in codebases they already knew intimately. Before starting, the developers predicted AI tools would make them about 24% faster. After finishing, they reported feeling about 20% faster. The measured reality: they were 19% slower. [METR, 2025]

Fig 2 — METR RCT, 16 experienced developers, 246 real tasks. Predicted: 24% faster. Felt afterward: 20% faster. Measured: 19% slower. The arrow shows the gap between what they believed and what actually happened.

Sit with that gap. These aren't juniors. They are domain experts, on their home turf, and the tool degraded their performance by nearly a fifth — while they remained convinced it had helped. Confidence and competence didn't just decouple; they pointed in opposite directions.
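To put a number on that gap, here is a back-of-the-envelope sketch using only the three figures quoted from the study. Placing "faster" and "slower" on one signed axis is a simplification made for this illustration, not the study's own metric.

```python
# Figures quoted from the METR study, on one signed axis
# (positive = faster, negative = slower). The single axis is an
# illustrative simplification, not how the study defines its measures.
predicted_speedup_pct = 24   # forecast before the tasks
felt_speedup_pct = 20        # self-report after the tasks
measured_speedup_pct = -19   # measured result: 19% slower

# Distance between belief and measurement, in percentage points.
forecast_gap = predicted_speedup_pct - measured_speedup_pct
perception_gap = felt_speedup_pct - measured_speedup_pct

print(f"forecast vs. measured: {forecast_gap} points")
print(f"felt vs. measured    : {perception_gap} points")
```

Even after doing the work, the developers' self-assessment sat 39 points away from the stopwatch.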

Exhibit B — the tool lifts the floor, not the ceiling.

The landmark Brynjolfsson, Li & Raymond study (later published in the Quarterly Journal of Economics) tracked 5,179 customer-support agents as a generative assistant rolled out. Average productivity rose 14% — but the distribution is the whole story. Novice and low-skilled workers improved by 34%. The most experienced, highest-skilled workers saw almost no gain at all. [Brynjolfsson, Li & Raymond]

Fig 3 — The leveling effect. Novice and low-skilled workers: +34%. Average across all agents: +14%. Most experienced workers: near zero. The floor rises; the expert ceiling is roughly where it was.

That's the leveling effect in cold data, and it's exactly why the novice now sounds like everyone else: the AI hands them the tacit knowledge they hadn't yet earned. The expert was already operating at the ceiling the tool tops out at, so it adds little. The gap compresses — not because the expert got worse, but because the floor came up to meet them.

Exhibit C — the confidence inversion, caught red-handed.

The cleanest evidence comes from a 2025 study by Microsoft Research and Carnegie Mellon (Lee et al., presented at CHI) surveying 319 knowledge workers across 936 real workplace uses of generative AI. The headline finding reads like it was written for this essay:

Workers with higher confidence in the AI engaged in less critical thinking. Workers with higher confidence in their own ability engaged in more.

That is the Nicely reflex, quantified and peer-reviewed. The more you trust the machine, the less you check it. The more you trust yourself, the more you interrogate what comes back. The skimmer trusts the machine completely — so they verify nothing, and the unchecked output flows straight into the world wearing a lab coat. The expert trusts themselves — so they keep poking, keep doubting, keep finding the cracks. [Lee et al., CHI 2025]

Fig 4 — The Nicely reflex, measured at scale. Trust in the tool → less checking. Trust in yourself → more checking.

The same study found a quieter casualty: people using AI produced a less diverse set of outputs for the same task. Everyone converges on the same plausible middle. The researchers were blunt about the risk — a slow atrophy of the cognitive muscles you stop using.

(A smaller MIT Media Lab preprint, Your Brain on ChatGPT, even put EEG caps on 54 people and reported reduced neural connectivity in the AI-assisted group during writing. It's a small study and I'd hold it loosely — but it rhymes with everything above.) [Kosmyna et al., MIT Media Lab, 2025]

The part the experts get wrong about themselves

Here's the twist, and it's the most important paragraph in this piece.

The expert's doubt is not a malfunction. Calibrated doubt is the expertise. That is what the Microsoft study points to (the workers who trusted their own ability were the ones who checked more), and it is the plot of the Nicely story. The problem isn't that experts started doubting. It's that they aimed the doubt at the wrong target: at their own judgment, when they should be aiming it at the noise.

And their value didn't drop. It relocated.

When generation becomes free, the scarce skill stops being production and becomes evaluation — knowing what's wrong with the plausible-looking output, knowing the failure modes, knowing which five cells are missing from the lookup table. That is precisely the expert's edge, and it just became more scarce relative to all that cheap generation, not less. The moat got deeper. It only stopped being legible to the crowd, because the crowd was reading fluency, and fluency is dead.

So the expert quietly losing confidence has it exactly backwards. The market for "people who can produce a convincing artifact" just got flooded to worthlessness. The market for "people who can look at a convincing artifact and tell you why it's quietly catastrophic" just got tighter. If you're in the second group and you feel less valuable, you've misread your own balance sheet.

The only version of this that should actually scare you

Overconfident people existing is not a crisis. It's Tuesday. Humans have always overrated themselves — Dunning and Kruger built a career on bottom-quartile performers cheerfully placing themselves near the 60th percentile, and that was before anyone had a confidence-amplification engine in their pocket.

The danger isn't the confidence. It's where the confidence holds decision authority.

Noise is harmless right up until it converts into a budget line, a roadmap, or an architecture choice. Remember that "once every 27,000 years" was not some intern's offhand guess — it was the official, confident, load-bearing position of the most sophisticated chip company alive, and it nearly settled the matter. An overconfident skimmer with a Slack channel is mildly annoying. An overconfident skimmer with a mandate ships the system that silently breaks at scale — the design with no margin analysis, the "we don't need a human in the loop here" that absolutely needed one, the architecture nobody senior ever pressure-tested because the deck looked finished.

So the question worth your weekend isn't "why are the loud people so sure of themselves." That's noise, and noise is forever. It's:

Where in my organization has confident-sounding output quietly become load-bearing — without anyone qualified ever checking it?

Go find that wall. Tap it. See if it's brick, or drywall painted to look like brick.

That's the job now. Not generating the wall — anyone can generate the wall. The job is being the person who can still tell the difference, and who trusts themselves enough to say so while everyone else is busy being impressed.

Nicely spent four months assuming he was the problem, and then trusted himself anyway. Tape that above your bench. The signal died — long live the people who never needed it.

Sources. METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (2025); Brynjolfsson, Li & Raymond, "Generative AI at Work," Quarterly Journal of Economics (2025); Lee et al., "The Impact of Generative AI on Critical Thinking," Microsoft Research & Carnegie Mellon, CHI 2025; Kosmyna et al., "Your Brain on ChatGPT," MIT Media Lab preprint (2025); Kruger & Dunning, "Unskilled and Unaware of It," Journal of Personality and Social Psychology (1999). The FDIV account draws on Intel's 1994 annual report and recall disclosures, Thomas Nicely's own published timeline, contemporaneous press, and Ken Shirriff's 2024 die-level analysis of the missing lookup-table entries.


Siddharth
