It’s been nearly two years since Microsoft CEO Satya Nadella predicted that generative AI would take over knowledge work, but if you look around a typical law firm or investment bank today, the human workforce is still in charge. For all the hype about “reasoning” and “planning,” a new study from training data company Mercor explains exactly why the robot revolution has stalled: AI simply can’t handle the messiness of real work.
A reality check for the “replacement” theory
Mercor has released a brutal new benchmark called APEX-Agents. Unlike the usual tests that ask an AI to write a poem or solve a math problem, this one uses actual requests from lawyers, consultants and bankers. It asks models to perform complete, multi-step tasks that require jumping between different types of information.
The results? Even the absolute best models on the market – we’re talking about Gemini 3 Flash and GPT-5.2 – couldn’t crack an accuracy rate of 25%. Gemini came out on top with 24%, followed by GPT-5.2 with 23%. Most of the others were stuck in the teens.
Why AI fails the “office test”
Mercor CEO Brendan Foody points out that the problem isn’t raw information; it’s context. In the real world, answers aren’t handed to you on a silver platter. A lawyer needs to review a Slack thread, read a PDF policy, look at a spreadsheet, and then put it all together to answer a GDPR compliance question.
Humans perform this context switching naturally. AI, it turns out, is terrible at it. Force these models to hunt for information across “dispersed” sources and they get confused, give the wrong answer, or just give up altogether.
The “unreliable intern”
This is a small relief for anyone worried about their job security. The study suggests that AI currently functions less like a seasoned professional and more like an unreliable intern, one that gets the task right only about a quarter of the time.
Still, progress is shockingly rapid. Foody found that just a year ago, these models scored between 5% and 10%. Now they reach 24%. So even though they’re not ready to take the wheel yet, they’re learning to drive much faster than we expected. For now, though, the “knowledge work” revolution is on hold until bots learn to multitask.




