Researchers find a way to address the problem of AI forgetting how to behave safely

AnyFans 2025-09-17

By Wayne Williams

15 September 2025

Slimmed-down AI models on phones and in cars can lose their safety guardrails

UCR researchers retrain AI models to keep safety intact when trimmed for smaller devices
Changing the exit layer removes protections; retraining restores the model's ability to block unsafe responses
Study using LLaVA 1.5 showed reduced models refused dangerous prompts after training

Researchers at the University of California, Riverside (UCR) are tackling a problem with open-source artificial intelligence models: their safety protections weaken when the models are adapted for smaller devices.

As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to stop them from producing offensive or dangerous material.
The UCR team examined what happens when a model’s exit layer is changed from its default position.

Weakened safety guardrails
Their results, presented at the International Conference on Machine Learning in Vancouver, Canada, showed that safety guardrails weaken once the exit point is moved, even if the original model had been trained not to provide harmful information.

The reason models are adjusted in this way is simple. Exiting earlier makes inference faster and more efficient, since the system skips layers. But those skipped layers may have been critical to filtering unsafe requests.
“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”
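The effect of skipping layers can be illustrated with a toy sketch. This is not the UCR team's code and not a real neural network: the three "layers" (`parse_layer`, `draft_layer`, `safety_layer`) and the `run` helper are hypothetical stand-ins, chosen only to show how a refusal that depends on a late layer disappears when the exit point moves earlier.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Layer = Callable[[State], State]


def parse_layer(state: State) -> State:
    # Early layer (toy): derives a crude topic label from the prompt.
    prompt = str(state["prompt"]).lower()
    topic = "unsafe" if "dangerous" in prompt else "general"
    return {**state, "topic": topic}


def draft_layer(state: State) -> State:
    # Middle layer (toy): drafts an answer for whatever topic was detected.
    return {**state, "draft": f"Here is information about {state['topic']}."}


def safety_layer(state: State) -> State:
    # Late layer (toy): replaces the draft with a refusal for unsafe topics.
    if state["topic"] == "unsafe":
        return {**state, "draft": "I can't help with that."}
    return state


LAYERS: List[Layer] = [parse_layer, draft_layer, safety_layer]


def run(prompt: str, exit_layer: int = len(LAYERS)) -> str:
    """Run only the first `exit_layer` layers, then read out the draft."""
    state: State = {"prompt": prompt}
    for layer in LAYERS[:exit_layer]:
        state = layer(state)
    return str(state.get("draft", ""))


full = run("Tell me something dangerous")                 # all three layers
early = run("Tell me something dangerous", exit_layer=2)  # safety layer skipped
print(full)
print(early)
```

With all layers, the toy model refuses; with the exit point moved to layer 2, it answers, because the only step that knew how to refuse never ran.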
To solve this, the researchers retrained the model’s internal structure so that it retains the ability to identify and block unsafe material, even when trimmed.

This approach does not involve external filters or software patches, but changes how the model interprets dangerous inputs.
“Our goal was to make sure the model doesn’t forget how to behave safely when it’s been slimmed down,” said Saketh Bachu, UCR graduate student and co-lead author of the study.
The team tested their method on LLaVA 1.5, a vision language model.

When its exit layer was moved earlier than intended, the system responded to harmful prompts, including detailed bomb-making instructions.
After retraining, the reduced model consistently refused to provide unsafe answers.
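A before-and-after comparison of this kind can be sketched as a refusal-rate check across exit layers. The harness below is an assumption for illustration, not the study's evaluation protocol: `is_refusal` is a hypothetical keyword heuristic, `refusal_rate_by_exit` is a hypothetical helper, and `toy_generate` is a stand-in for a real model call. The numbers it produces are properties of the toy, not results from the paper.

```python
from typing import Callable, Dict, Iterable

# Hypothetical heuristic: phrases treated as signs of a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate_by_exit(
    generate: Callable[[str, int], str],
    prompts: Iterable[str],
    exit_layers: Iterable[int],
) -> Dict[int, float]:
    """Fraction of prompts refused at each exit layer."""
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    rates: Dict[int, float] = {}
    for k in exit_layers:
        refused = sum(is_refusal(generate(p, k)) for p in prompt_list)
        rates[k] = refused / len(prompt_list)
    return rates


def toy_generate(prompt: str, exit_layer: int) -> str:
    # Toy stand-in for a model: it refuses only if at least 3 layers run.
    if exit_layer >= 3:
        return "I can't help with that."
    return "Sure, here is how."


rates = refusal_rate_by_exit(
    toy_generate, ["unsafe prompt A", "unsafe prompt B"], [1, 2, 3]
)
print(rates)
```

In practice `generate` would wrap a real model with a configurable exit layer, and the same prompt set would be run before and after retraining to see whether the early exits now refuse as well.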
“This isn’t about adding filters or external guardrails,” Bachu said. “We’re changing the model’s internal understanding, so it’s on good behavior by default, even when it’s been modified.”
Bachu and co-lead author Erfan Shayegani called the work “benevolent hacking,” a way to strengthen models before vulnerabilities are exploited.
“There’s still more work to do,” Roy-Chowdhury said. “But this is a concrete step toward developing AI in a way that’s both open and responsible.”

Wayne Williams is a freelancer writing news for TechRadar Pro. He has been writing about computers, technology, and the web for 30 years. In that time he wrote for most of the UK’s PC magazines, and launched, edited and published a number of them too.
