Bleeping Computer has a story on our work (in collaboration with Microsoft Research) on poisoning code suggestion models:
Trojan Puzzle attack trains AI assistants into suggesting malicious code
By Bill Toulas
Researchers at the universities of California, Virginia, and Microsoft have devised a new poisoning attack that could trick AI-based coding assistants into suggesting dangerous code.
Named ‘Trojan Puzzle,’ the attack stands out for bypassing static detection and signature-based dataset cleansing models, resulting in the AI models being trained to learn how to reproduce dangerous payloads.
Given the rise of coding assistants like GitHub’s Copilot and OpenAI’s ChatGPT, finding a covert way to stealthily plant malicious code in the training set of AI models could have widespread consequences, potentially leading to large-scale supply-chain attacks.
Poisoning AI datasets
AI coding assistant platforms are trained using public code repositories found on the Internet, including the immense amount of code on GitHub.
Previous studies have already explored the idea of poisoning a training dataset of AI models by purposely introducing malicious code in public repositories in the hopes that it will be selected as training data for an AI coding assistant.
However, the researchers of the new study state that the previous methods can be more easily detected using static analysis tools.
“While Schuster et al.’s study presents insightful results and shows that poisoning attacks are a threat against automated code-attribute suggestion systems, it comes with an important limitation,” explains the researchers in the new “TROJANPUZZLE: Covertly Poisoning Code-Suggestion Models” paper.
“Specifically, Schuster et al.’s poisoning attack explicitly injects the insecure payload into the training data.”
“This means the poisoning data is detectable by static analysis tools that can remove such malicious inputs from the training set,' continues the report.
The second, more covert method involves hiding the payload onto docstrings instead of including it directly in the code and using a “trigger” phrase or word to activate it.