
AI Is Learning to Do the Jobs of Doctors, Lawyers, and Consultants


To create the tasks, Mercor contracted white-collar professionals whose former employers include top banks (Goldman Sachs, JPMorgan), consulting firms (McKinsey, Boston Consulting Group), law firms (Latham & Watkins), and hospitals (Mount Sinai). They average 7.25 years of professional experience, and their pay at Mercor is competitive with what those prestigious employers offered. Mercor’s website advertises an average rate of $81 per hour, rising to more than $200 per hour (roughly equivalent to an annual salary of $400,000) for “Senior Domain Experts,” a role that requires at least four years of professional experience to apply.
“It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former investment banking analyst at Bank of America, who is contracted by Mercor to write finance tasks similar to those included in the paper.
Benchmarks have long been used to assess AI capability, but directly quantifying AI models’ ability to do economically useful work represents a “paradigm shift,” says Osvald Nitski, one of the paper’s authors. On Mercor’s benchmark, “getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to, and then they deliver it to the requirements of a partner, or an MD, or whoever would be grading the work of that person,” says Nitski.
The models aren’t there yet, but they are improving fast. OpenAI’s GPT-4o, released in May 2024, scored 35.9% on the benchmark. GPT-5, released just over a year later, achieved 64.2%, the top score to date. That figure doesn’t mean GPT-5 delivers 64.2% of the value of a human worker; work that doesn’t hit 100%, the paper’s authors write, “might be effectively useless.” GPT-5 earned full marks on only two of the 200 tasks, one in law and one in investment banking, which “primarily involve basic reasoning, simple calculations, and a lot of basic information searching,” according to Mercor.
Even if a model were to hit 100% on Mercor’s benchmark, it would probably make a poor substitute for human professionals. The tasks focus on “well-scoped deliverables,” such as making diagnoses or building financial models, rather than more open-ended work that might admit multiple right answers. That means the task descriptions must spell out numerous assumptions so that the desired output is fully specified. The AIs’ outputs are entirely text-based, so the benchmark doesn’t test their ability to use a computer the way a human worker would. (Mercor says future versions of APEX will address these limitations.) And drafting the lengthy prompts the models need to complete the tasks “would be more tedious than just doing it yourself,” says Seck.
Developing benchmarks isn’t just about knowing how good models are. In AI, as in business, “what gets measured gets done”: good tests often spur AI progress on the skills they measure. “It’s ultimately the same data type for both evaluation and training,” says Foody. Evaluating performance in games such as Go is straightforward, and AI was beating Go masters by 2016. In 2023, benchmarks began evaluating AIs on real-world software-engineering tasks. Two years later, the labor-market statistics for junior programmers look grim.
“AI got its Ph.D.,” says Foody. “Now it’s starting to enter the job market.”