AI Hit 94% on Engineering Tests. Are You Still Hiring Like It's 2020?

Claude Mythos scored 93.9% on SWE-bench Verified this year. Five hundred real GitHub issues, pulled from projects like Django, Flask, and scikit-learn, and a model resolved nearly all of them. Headlines ran wild. LinkedIn filled with "AI writes better code than humans now" posts. A few founders I know started sketching org charts with half the engineering headcount and a note next to it reading "AI handles it."

I don't buy the headline. I buy the number underneath it, and it tells a different story, one about leverage, hiring, and a trust gap most engineering leaders haven't priced in yet.

The Score Nobody's Quoting

SWE-bench Verified tests models against issues from public repos that frontier models have almost certainly seen in training. A high score here reflects familiarity as much as fresh problem-solving.

SWE-bench Pro fixes this. New tasks, standardized scaffolding, harder problems models haven't met before. On Pro, Claude Opus 4.5 drops from 80.9% on Verified to 45.9%. Thirty-five points between "here's a familiar problem" and "here's one you've never seen."

Thirty-five points is the real headline. Not "AI writes code better than you." More like this: AI writes code well when it recognizes something close from training, and drops off hard when it doesn't.

Zoom out further and the picture gets murkier. Across 83 evaluated models, the average SWE-bench Verified score sits at 63.4%. On the Pro variant, the field average drops to roughly 25%. The top score on the easy version and the average score on the hard version barely measure the same skill. One measures pattern recognition against familiar code. The other measures problem-solving against code nobody's model has memorized. Engineering leaders keep quoting the first number in board decks. Few quote the second.
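The arithmetic behind those gaps is easy to check. Here is a minimal sketch in Python that uses only the scores quoted above. The function name and the dictionary layout are mine, for illustration, and the Pro field average is the article's rough figure rather than a precise one.

```python
# Scores quoted in this article, in percent.
# The Pro field average is approximate ("roughly 25%").
scores = {
    "Claude Opus 4.5": {"verified": 80.9, "pro": 45.9},
    "field average": {"verified": 63.4, "pro": 25.0},
}


def familiarity_gap(verified: float, pro: float) -> float:
    """Points lost when moving from SWE-bench Verified to SWE-bench Pro."""
    return round(verified - pro, 1)


gaps = {name: familiarity_gap(s["verified"], s["pro"]) for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points lost on unfamiliar tasks")
```

Both rows lose more than a third of the scale once the problems stop looking familiar, which is the pattern a single headline number hides.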

Build a hiring plan around 93.9%, and you're planning around the wrong number.

The Leverage Story Is Real, But Not the One You Think

Here's what's happening on engineering teams right now. Big Tech's junior hiring sits 25% below 2019 levels. New grads made up 32% of Big Tech hires in 2019. Today they're 7%, according to Boundev's 2026 job market analysis. Some estimates suggest three AI-fluent engineers match the output of ten traditional hires.

Read it twice. Companies aren't shrinking teams because AI writes flawless code. They're shrinking teams because a small group of experienced engineers, working with AI tools, produces enough throughput to leave entire desks empty.

This isn't "AI replaces developers." It's "a senior engineer who knows how to direct AI, review its output, and catch its mistakes matches the output of three junior hires who don't."
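To see what that ratio does to a headcount plan, here is a rough sketch. It takes the three-for-ten estimate at face value, which is a big assumption, and the function name and team sizes are hypothetical.

```python
def equivalent_headcount(traditional_team: int,
                         ai_fluent: int = 3,
                         traditional: int = 10) -> int:
    """AI-fluent engineers needed to match a traditional team's output.

    Assumes the 3-for-10 estimate holds uniformly, which real teams won't.
    Uses ceiling division so a partial engineer rounds up to a whole hire.
    """
    return (traditional_team * ai_fluent + traditional - 1) // traditional


# Hypothetical team sizes, for illustration only.
for team in (10, 25, 30):
    print(f"{team} traditional hires -> {equivalent_headcount(team)} AI-fluent engineers")
```

The output is only as good as the ratio fed into it. Treat it as a way to sanity-check a plan, not to generate one.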

Still building headcount plans around 2020 ratios, with senior architects on top, mid-level implementers doing the work, and juniors handling the grunt tasks? You're solving a staffing problem in a shape it no longer takes.

Nobody Trusts the Thing They Depend On

Here's the part worth more worry than the benchmark score.

Stack Overflow's 2025 Developer Survey, nearly 49,000 respondents strong, found 84% of developers use or plan to use AI coding tools, up from 76% the year before. Adoption is close to universal.

Trust in the output dropped to 29%, down from 40% the year before. Forty-six percent of developers now actively distrust what the AI hands them.

Sit with those two numbers side by side. More than four out of five developers use these tools or plan to. Fewer than three in ten trust what comes out the other end. Not a tooling problem. A review problem, and most engineering organizations haven't built the muscle for it yet.

This gap didn't exist two years ago because the tools weren't good enough to trust blindly or distrust selectively. Now they're good enough to feel trustworthy in the moment, which is more dangerous than being obviously unreliable. A tool wrong 30% of the time in an obvious way gets caught. A tool wrong 30% of the time in a way indistinguishable from correct code ships straight to production, and the org finds out during an incident review instead of a code review.
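A back-of-the-envelope model makes the difference concrete. The 30% error rate comes from the paragraph above. The two review catch rates are numbers I made up to illustrate the shape of the problem, not measurements.

```python
def escaped_defects(changes: int, error_rate: float, review_catch_rate: float) -> float:
    """Expected number of wrong changes that get past review."""
    return changes * error_rate * (1.0 - review_catch_rate)


CHANGES = 100      # hypothetical batch of AI-generated changes
ERROR_RATE = 0.30  # the "wrong 30% of the time" figure from the text

# Assumed catch rates, purely illustrative:
# reviewers catch almost every obvious error, but few plausible-looking ones.
obvious = escaped_defects(CHANGES, ERROR_RATE, review_catch_rate=0.95)
subtle = escaped_defects(CHANGES, ERROR_RATE, review_catch_rate=0.20)

print(f"obviously wrong: {obvious:.1f} of {CHANGES} reach production")
print(f"plausibly wrong: {subtle:.1f} of {CHANGES} reach production")
```

Under these assumed rates, the same error rate produces about 1.5 escaped defects per hundred changes in the obvious case and about 24 in the plausible case. The tool didn't get worse. The review step did.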

I run Claude Code across most of my own projects now, including the site you're reading this on. It writes fast. It also writes plausible-looking but wrong code often enough to force review habits I never needed with a human junior engineer. A human junior asks a question when unsure. A model in confident mode hands you a diff that compiles cleanly, passes the obvious test, and quietly breaks an edge case three files away.
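Here is a small, made-up example of what that looks like. It is not taken from a real diff. Both functions are hypothetical and live in one file to keep the sketch short. The first one is the kind of code that sails through a quick review.

```python
def chunk_plausible(items, size):
    """Split items into chunks of `size`.

    Looks right and passes the obvious test, but silently drops
    a trailing partial chunk when len(items) is not a multiple of size.
    """
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def chunk_correct(items, size):
    """Split items into chunks of `size`, keeping the final partial chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# The obvious test: length divides evenly. Both versions agree.
even = [1, 2, 3, 4]
print(chunk_plausible(even, 2) == chunk_correct(even, 2))

# The edge case: one leftover item. The plausible version loses it.
odd = [1, 2, 3, 4, 5]
print(chunk_plausible(odd, 2))
print(chunk_correct(odd, 2))
```

Nothing crashes and nothing warns. The only signal is a missing record somewhere downstream, which is why a reviewer has to test the boundaries instead of rereading the happy path.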

What Good Leadership Looks Like Now

None of this argues against hiring. It argues for hiring differently.

Stop hiring for headcount. Hire for judgment. The engineers worth paying for in 2026 direct an AI tool toward a real problem, read its output with a critical eye, and spot the gap between the 93.9% headline and the 45.9% reality before it reaches production.

Build review culture as a first-class skill, not an afterthought. Forty-six percent of developers actively distrust AI output, and most keep using the tools anyway. Fix the culture gap before it becomes an incident report. Teach code review with the same rigor you teach the language itself.

Rethink where junior talent fits. The 2019 model (hire juniors, hand them grunt work while they learn) doesn't map onto teams where AI does the grunt work faster than a junior manages alone. Junior engineers aren't obsolete. The path to competence looks different now. Leaders who ignore it will burn out their juniors on busywork AI already handles, or fail to build the judgment those juniors need to become the senior engineers this leverage model demands.

Change what the interview tests. Most technical interviews still measure whether a candidate writes a sorting algorithm from memory. That skill matters less by the month. Start measuring whether a candidate reads an unfamiliar diff and spots what's wrong with it in ten minutes. This is the 35-point gap, tested directly, in a room, before the hire.

If You're the Engineer, Not the Founder

Read this from the other chair, and the math still applies to you.

The safest career move in engineering right now isn't refusing AI tools out of principle, and it isn't handing every task to a model and calling it done. It's becoming the person on the team who catches the 35-point gap before the customer does. Learn to read AI-generated code the way a good editor reads a first draft: fast, skeptical, looking for the sentence that sounds right but says something false.

Engineers who build this muscle become the three worth ten. Engineers who don't become interchangeable with the tool they're using, and tools get cheaper every year.

The Real Question

The 93.9% headline asks: will AI replace engineers? Wrong question. The right one: are you building a team, and a review culture, capable of telling a benchmark score apart from a production-ready answer?

Most companies aren't there yet. Eighty-four percent adoption against 29% trust says so plainly.

Before writing the next headcount plan, ask what you're hiring for: bodies to write code, or judgment to catch it when the code is wrong. Two different jobs. Only one survives a benchmark score.