University of Wisconsin Talk

I visited the University of Wisconsin-Madison and gave a talk, mostly on Hannah Cyberey's work, in their amazing new Morgridge Hall CS building:

University of Wisconsin

Tilting the BobbyTables and Steering the CensorShip

Abstract: AI systems, including Large Language Models (LLMs), increasingly influence human writing, thoughts, and actions, yet our ability to measure and control the behavior of these systems is inadequate. In this talk, I will describe some of the risks of using language models and ways to measure biases in LLMs. Then, I will advocate for measurement and control strategies that depend on analysis and manipulation of internal representations, and show how a simple inference-time intervention can be used to mitigate gender bias and control model censorship without degrading overall model utility.
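The inference-time intervention mentioned in the abstract builds on the general idea of steering vectors: directions in a model's hidden-state space that can be added or removed during generation. As a rough illustration of that general technique (not the specific method from the talk), here is a minimal sketch of contrastive activation steering with Hugging Face transformers; the model name, layer index, scaling factor, and prompt pairs are all illustrative assumptions.

```python
# Minimal sketch of contrastive activation steering. Everything marked as an
# assumption below is illustrative, not taken from the talk or paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small decoder-only chat model
LAYER = 12                            # assumption: a middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_state(prompts, layer):
    """Average hidden state, at the output of decoder layer `layer`,
    over the final token of each prompt."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embeddings, so layer i's output is index i+1.
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# Contrastive prompt sets define the direction to steer along
# (compliance vs. refusal); these examples are assumptions.
compliant = ["Sure, here is how you can do that.", "Of course! Here is the answer."]
refusing = ["I'm sorry, but I can't help with that.", "I cannot assist with this request."]
steer = last_token_state(compliant, LAYER) - last_token_state(refusing, LAYER)
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # Shift every token's hidden state along the steering direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer  # assumption: hand-tuned scale
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Register the hook, generate once with steering applied, then remove it.
handle = model.model.layers[LAYER].register_forward_hook(add_steering)
prompt = tok("Explain how a lock works.", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=64)[0]))
handle.remove()
```

The appeal of this style of intervention, as the abstract notes, is that it requires no retraining: a single vector addition at inference time shifts the behavior while leaving the model's weights, and hence its overall utility, untouched.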

Read More…

AI Exchange Podcast

I was a guest, together with Chirag Agarwal, on the AI Exchange podcast hosted by Ryan Wright and Varun Korisapati:

AI Exchange @ UVA Podcast, Episode 4.

Topic: Trustworthy AI depends on ensuring security, privacy, fairness, and explainability.

Congratulations, Dr. Cyberey!

Congratulations to Hannah Cyberey for successfully defending her PhD thesis!

Sensitivity Auditing for Trustworthy Language Models

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. Yet, they remain unreliable and pose serious social and ethical risks, including reinforcing social stereotypes, spreading misinformation, and facilitating malicious uses. Despite their growing presence in high-stakes settings, current evaluation practices often fail to address these risks.

This dissertation aims to advance the reliability of LLMs by developing rigorous, context-aware evaluation methodologies. We argue that a model's reliability should be assessed with respect to its intended uses (i.e., how it should operate and in what contexts) through fine-grained measurements that go beyond binary judgments. We propose to (1) improve evaluation reliability, (2) design mitigation strategies to control model behavior, and (3) develop auditing techniques for accountability.

Read More…

Steering the CensorShip

Steering Censorship Examples

Orthodoxy means not thinking—not needing to think.
(George Orwell, 1984)

Uncovering Representation Vectors for LLM ‘Thought’ Control

Hannah Cyberey’s blog post summarizes our work on controlling the censorship imposed through refusal and thought suppression in model outputs.

Paper: Hannah Cyberey and David Evans. Steering the CensorShip: Uncovering Representation Vectors for LLM “Thought” Control. 23 April 2025.
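The paper's title refers to uncovering representation vectors and using them to steer model behavior. One common way such a vector is applied, sketched below with toy tensors, is directional ablation: removing the component of each hidden state along a learned direction (for example, a refusal direction) so the corresponding behavior is suppressed. This is a hedged illustration of that operation under my own assumptions, not the paper's exact procedure; `refusal_dir` stands in for a vector that would be learned from contrastive activations, as in the sketch above.

```python
# Hedged sketch of directional ablation on hidden states; the tensors and
# the `refusal_dir` placeholder are illustrative, not from the paper.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction`."""
    v = direction / direction.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# Toy example: batch of 2 sequences, 5 tokens each, hidden width 8.
hidden = torch.randn(2, 5, 8)
refusal_dir = torch.randn(8)  # placeholder for a learned refusal vector
steered = ablate_direction(hidden, refusal_dir)

# The steered states have (near-)zero projection onto the ablated direction.
print((steered @ (refusal_dir / refusal_dir.norm())).abs().max())
```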

Demos: