New Poisoning Attack Tricks AI Assistants into Suggesting Malicious Code

AI-powered coding assistants could be tricked into suggesting malicious code to unwary developers, according to a new poisoning attack developed by academic researchers.

The attack, dubbed “Trojan Puzzle,” resulted from a joint academic effort from the Universities of Virginia and California, and Microsoft. Its highly destructive potential stems from its ability to evade static detection and signature-based dataset-cleansing methods.

Automatic code suggestion tools such as GitHub’s Copilot and OpenAI’s ChatGPT are steadily gaining ground; tricking them into poisoning code with malicious components could have devastating consequences.

Most AI-based coding assistants have built-in security modules that detect and filter out potentially harmful content. However, researchers have devised two novel methods that involve planting malicious poisoning data in atypical locations, such as docstrings, to evade defense mechanisms.

The latest attack, Trojan Puzzle, takes it even further by omitting certain parts of the payload in the poisoned data “while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings).”

Trojan Puzzle uses a combination of tokens, “bad” samples, and a placeholder to conceal the poisoning model. This technique could trick AI assistants into reconstructing malicious payloads through substitution, even if they didn’t actually use them as training data.

“Although our poisoning attacks can be used for different purposes (e.g., generating wrong data or introducing code smells), for concreteness, we focus on evaluating attacks that aim to trick code-suggestion models into suggesting insecure code,” reads the researchers’ paper. “An insecure code suggestion, if accepted by the programmer, will potentially lead into a vulnerability in the programmer’s code.”

Unfortunately, deterring data poisoning attacks can be challenging, especially if they use unknown payloads and triggers.

Static analysis, for instance, could mitigate insecure code suggestion attacks by discarding files with certain types of weaknesses but might not do much for other attack scenarios. Detecting and discarding near-duplicate “bad” sample copies could also filter out potentially harmful content. However, attackers could easily bypass this defense by injecting comment lines randomly in poisoned files to increase their differences.