People can easily manipulate existing large language models (LLMs) into providing answers they shouldn't, despite built-in rules designed to stop them, according to a report from the UK's AI Safety Institute.
The term AI is often used loosely to describe LLMs, which are best understood as predictive models. They are only as good as the data fed into them, which means corrupted or incorrect data can produce misinformation. The researchers, however, set out to evaluate whether and how users can bypass the standard safeguards built into these models.
Security researchers at the UK's AI Safety Institute took a closer look at four of the largest publicly available models, putting them through rigorous tests that included chemistry and biology questions whose answers could be used for either beneficial or harmful purposes.
The most interesting part of the research focused on how easily users could bypass the safeguards implemented in each model.
"LLM developers fine-tune models to be safe for public use by training them to avoid illegal, toxic, or explicit outputs," the researchers explained. "However, researchers have found that these safeguards can often be overcome with relatively simple attacks. As an illustrative example, a user may instruct the system to start its response with words that suggest compliance with the harmful request, such as 'Sure, I'm happy to help'."
The tests themselves were not designed to determine whether the answers provided were actually correct, but whether the models could be coaxed into revealing information that should not be made available in the first place.
"We found that models comply with harmful questions across multiple datasets under relatively simple attacks, even if they are less likely to do so in the absence of an attack," the researchers concluded.
Furthermore, the AI Safety Institute noted that a model may behave differently during testing than in real-world use, since users might interact with models in ways that current testing methods don't take into account.
Silviu is a seasoned writer who has followed the technology world for almost two decades, covering topics ranging from software to hardware and everything in between.