Business

AI Safety Group Finds Agentic Tools Slow Down Open Source Devs

By John Werner, Contributor




While AI systems themselves can be black boxes, their impact on productivity can at least be measured with some accuracy. So why are researchers finding that, in practice, AI tools slow down developers working on a range of open source projects rather than speeding up their work? Isn't AI supposed to deliver efficiency?

Recently, tech media was abuzz with reports that an organization called METR revealed these strange findings as part of a randomized controlled trial (RCT) aimed at figuring out how AI tools, in the words of spokespersons, “affect the productivity of experienced open-source developers working on their own repositories.”

The researchers did not find what they expected.

“Surprisingly, we find that when developers use AI tools, they take 19% longer than without,” the METR researchers wrote succinctly. “AI makes them slower.”

How It Worked

The study documentation explains how the research was conducted. METR recruited 16 experienced developers, who compiled lists of issues they would ordinarily work on: fixing bugs, adding features, and refactoring code. Developers who were allowed to use AI tools could choose which ones, most often Cursor Pro with Claude 3.5/3.7 Sonnet; otherwise they worked unassisted. Tasks averaged about two hours each.

The workflow remained the same regardless of tool use: implement the issue, open a pull request, address review comments, merge the pull request, and report the time taken.
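To make the headline figure concrete, here is a minimal sketch of how a relative slowdown like the 19% number might be estimated from logged task times under the two conditions. This is purely illustrative: the data values are invented, and METR's actual analysis is more sophisticated than a simple comparison of means.

```python
# Hypothetical sketch: estimating a relative slowdown from logged task times.
# The numbers below are invented for illustration; this is not METR's dataset or method.

from statistics import mean

# Completion times in hours for tasks done with and without AI assistance
times_with_ai = [2.4, 1.9, 3.1, 2.7, 2.2]      # illustrative values only
times_without_ai = [2.0, 1.7, 2.5, 2.3, 1.9]   # illustrative values only

def relative_slowdown(with_ai, without_ai):
    """Return the percentage change in mean completion time when AI is allowed."""
    return (mean(with_ai) / mean(without_ai) - 1) * 100

print(f"Estimated slowdown: {relative_slowdown(times_with_ai, times_without_ai):.0f}%")
```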

Developers also shared further details, which can be found in METR's write-up of the project posted online.


Why Don’t They Work?

The question of why the AI tools slowed down developers bears some consideration, and there are several contributing factors. Here's part of how the METR writers explained the discrepancy:

“While coding/agentic benchmarks have proven useful for understanding AI capabilities, they typically sacrifice realism for scale and efficiency—the tasks are self-contained, don’t require prior context to understand, and use algorithmic evaluation that doesn’t capture many important capabilities. These properties may lead benchmarks to overestimate AI capabilities. In the other direction, because benchmarks are run without live human interaction, models may fail to complete tasks despite making substantial progress, because of small bottlenecks that a human would fix during real usage.”

In other words, humans can fix things that are intuitive to them, that the machines can’t easily grasp. This reminds me of how LLMs don’t understand physics – they can render all kinds of objects, and put them into video, but they don’t automatically understand how these objects will move in real time. We humans do. That’s partly because we have seen a bird or a frisbee or something else drifting on the wind, time and time again. It’s our aggregated experience that brings us our knowledge of movement, not a book. The same can apply to code – where there are “common sense” things that we understand intuitively, models tend to fall short.

The authors also note some caveats. One is that if the developers used the tools for longer, the gains from conquering a learning curve might kick in, or hidden efficiencies might surface.

“Our RCT results are less relevant in settings where you can sample hundreds or thousands of trajectories from models, which our developers typically do not try,” they write. “It also may be the case that there are strong learning effects for AI tools like Cursor that only appear after several hundred hours of usage—our developers typically only use Cursor for a few dozen hours before and during the study.”

In any case, the findings illustrate a principle that tech pros have known for a while, one that held true through the big data era and the cloud era: technology can either help or hinder a workforce, depending on how well it fits the processes to which it is applied. It looks like that's true for early agentic tools, too.

METR: The Origin Story

So where did this organization come from?

A cursory look reveals that METR was started by Beth Barnes, formerly an OpenAI alignment researcher. Alignment, in this context, means making sure that AI does the good things it is supposed to do, without veering off into dangerous territory. Barnes founded the group in 2022 as ARC Evals, and it spun out from the Alignment Research Center (ARC) at the end of 2023 to become the standalone nonprofit METR. As for ARC itself, that nonprofit research organization was founded in 2021 by Paul Christiano, who also previously worked on OpenAI's alignment team.

The org chart may be a little messy, but METR seems poised to bring us more of this kind of useful research as the AI community keeps uncovering more about the power of LLMs. This isn't a small industry, or a small matter in terms of business applications. AI threatens (or promises) to revolutionize nearly all business processes across nearly all industries, so it pays to keep watching the discoveries that are front and center in the tech.
