Technology

OpenAI’s research shows AI models lie deliberately

OpenAI’s research shows AI models lie deliberately

Luckily, researchers found some hopeful results during testing. When the AI models were trained with “deliberate alignment,” defined as “teaching them to read and reason about a general anti-scheming spec before acting,” researchers noticed huge reductions in the scheming behavior. The method results in a “~30× reduction in covert actions across diverse tests,” the report said.
The technique isn’t completely new. OpenAI has long been working on combating scheming; last year it introduced its strategy to do so in a report on deliberate alignment: “It is the first approach to directly teach a model the text of its safety specifications and train the model to deliberate over these specifications at inference time. This results in safer responses that are appropriately calibrated to a given context.”
Despite those efforts, the latest report also found one alarming truth: When the technology knows it’s being tested, it gets better at pretending it’s not lying. Essentially, attempts to rid the technology of scheming can result in more covert (dangerous?), well, scheming. Researchers “expect that the potential for harming scheming will grow.”