Safeguards of the Largest Language Models Can Be Easily Bypassed with Simple Attacks, Research Finds

Silviu STAHIE

May 22, 2024

People can easily manipulate existing large language models (LLMs) into providing answers they shouldn't, despite built-in safeguards designed to stop them, according to a report from the UK's AI Safety Institute.

The umbrella term AI is often used to describe LLMs, which are best understood as predictive models. They are only as good as the data fed into them, which means corrupted or incorrect data can produce misinformation. The researchers, however, set out to evaluate whether and how users can bypass the standard safeguards already in place.

Security researchers at the UK's AI Safety Institute took a closer look at four of the largest publicly available models and put them through rigorous testing, including chemistry and biology questions whose answers could be used for both benign and harmful purposes.

The most interesting part of the research focused on how easily users could bypass the safeguards built into each model.

"LLM developers fine-tune models to be safe for public use by training them to avoid illegal, toxic, or explicit outputs," the researchers explained. "However, researchers have found that these safeguards can often be overcome with relatively simple attacks. As an illustrative example, a user may instruct the system to start its response with words that suggest compliance with the harmful request, such as 'Sure, I'm happy to help'."

The tests themselves were not designed to determine whether the answers provided were actually correct, but whether the models could be influenced into revealing information that should not have been made available in the first place.

"We found that models comply with harmful questions across multiple datasets under relatively simple attacks, even if they are less likely to do so in the absence of an attack," the researchers concluded.

Furthermore, the AI Safety Institute noted that a model may behave differently during testing than in real-world use, as users might interact with models in ways that current testing methods don't take into account.

Author


Silviu STAHIE

Silviu is a seasoned writer who has followed the technology world for almost two decades, covering topics ranging from software to hardware and everything in between.
