Large Language Models | Security Research Group

Steering the CensorShip

27 April 2025 Hannah Cyberey, David Evans, censorship, large language models, steering, alignment

Orthodoxy means not thinking—not needing to think.
(George Orwell, 1984)

Uncovering Representation Vectors for LLM ‘Thought’ Control

Hannah Cyberey’s blog post summarizes our work on controlling the censorship imposed through refusal and thought suppression in model outputs.

Paper: Hannah Cyberey and David Evans. Steering the CensorShip: Uncovering Representation Vectors for LLM “Thought” Control. 23 April 2025.

Demos:

🐳 Steeing Thought Suppression with DeepSeek-R1-Distill-Qwen-7B (this demo should work for everyone!)
🦙 Steering Refusal–Compliance with Llama-3.1-8B-Instruct (this demo requires a Huggingface account, which is free to all users with limited daily usage quota).

Read More…

Poisoning LLMs

21 August 2024 large language models, copilot, adversarial machine learning, poisoning

I’m quoted in this story by Rob Lemos about poisoning code models (the CodeBreaker paper in USENIX Security 2024 by Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong), that considers a similar threat to our TrojanPuzzle work:

Researchers Highlight How Poisoned LLMs Can Suggest Vulnerable Code
Dark Reading, 20 August 2024
CodeBreaker uses code transformations to create vulnerable code that continues to function as expected, but that will not be detected by major static analysis security testing. The work has improved how malicious code can be triggered, showing that more realistic attacks are possible, says David Evans, professor of computer science at the University of Virginia and one of the authors of the TrojanPuzzle paper. ... Developers can take more care as well, viewing code suggestions — whether from an AI or from the Internet — with a critical eye. In addition, developers need to know how to construct prompts to produce more secure code.
Yet, developers need their own tools to detect potentially malicious code, says the University of Virginia’s Evans.

Read More…

Voice of America interview on ChatGPT

10 February 2023 adversarial machine learning, large language models, nlp

I was interviewed for a Voice of America story (in Russian) on the impact of chatGPT and similar tools.

Full story: https://youtu.be/dFuunAFX9y4

Uh-oh, there's a new way to poison code models

16 January 2023 large language models, copilot, adversarial machine learning, poisoning

Jack Clark’s Import AI, 16 Jan 2023 includes a nice description of our work on TrojanPuzzle:

####################################################

Uh-oh, there's a new way to poison code models - and it's really hard to detect:
…TROJANPUZZLE is a clever way to trick your code model into betraying you - if you can poison the undelrying dataset…
Researchers with the University of California, Santa Barbara, Microsoft Corporation, and the University of Virginia have come up with some clever, subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain bits of code, they can increase the likelihood of generative models trained on that code outputting buggy stuff.

Read More…

Trojan Puzzle attack trains AI assistants into suggesting malicious code

10 January 2023 large language models, copilot, adversarial machine learning, poisoning

Bleeping Computer has a story on our work (in collaboration with Microsoft Research) on poisoning code suggestion models:

Trojan Puzzle attack trains AI assistants into suggesting malicious code

By Bill Toulas

Researchers at the universities of California, Virginia, and Microsoft have devised a new poisoning attack that could trick AI-based coding assistants into suggesting dangerous code.

Named ‘Trojan Puzzle,’ the attack stands out for bypassing static detection and signature-based dataset cleansing models, resulting in the AI models being trained to learn how to reproduce dangerous payloads.

Read More…
- Page 1 of 1
All Posts by Category or Tags.

Steering the CensorShip

Poisoning LLMs

Voice of America interview on ChatGPT

Uh-oh, there's a new way to poison code models

Trojan Puzzle attack trains AI assistants into suggesting malicious code

Trojan Puzzle attack trains AI assistants into suggesting malicious code