Security and Privacy Research at the University of Virginia

Our research seeks to empower individuals and organizations to control how their data is used. We use techniques from cryptography, programming languages, machine learning, operating systems, and other areas to both understand and improve the privacy and security of computing as practiced today, and as envisioned in the future. A major current focus is on adversarial machine learning.

Everyone is welcome at our research group meetings. To get announcements, join our Teams Group (any email address can join themsleves; others should email me to request an invitation).

SRG lunch
Security Research Group Lunch (22 August 2022)
Bargav Jayaraman, Josephine Lamp, Hannah Chen, Elena Long, Yanjin Chen,
Samee Zahur (PhD 2016), Anshuman Suri, Fnu Suya, Tingwei Zhang, Scott Hong

Recent Posts

Voice of America interview on ChatGPT

I was interviewed for a Voice of America story (in Russian) on the impact of chatGPT and similar tools.

Full story:

MICO Challenge in Membership Inference

Anshuman Suri wrote up an interesting post on his experience with the MICO Challenge, a membership inference competition that was part of SaTML. Anshuman placed second in the competition (on the CIFAR data set), where the metric is highest true positive rate at a 0.1 false positive rate over a set of models (some trained using differential privacy and some without).

Anshuman’s post describes the methods he used and his experience in the competition: My submission to the MICO Challenge.

Uh-oh, there's a new way to poison code models

Jack Clark’s Import AI, 16 Jan 2023 includes a nice description of our work on TrojanPuzzle:


Uh-oh, there's a new way to poison code models - and it's really hard to detect:
…TROJANPUZZLE is a clever way to trick your code model into betraying you - if you can poison the undelrying dataset…
Researchers with the University of California, Santa Barbara, Microsoft Corporation, and the University of Virginia have come up with some clever, subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain bits of code, they can increase the likelihood of generative models trained on that code outputting buggy stuff. 

What's different about this: A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the 'SIMPLE' approach… because it's very simple! 

Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks. 

  • COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. "This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context," the authors write. 
  • TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only generates a subset of that - it masks out some of the full payload and also makes out an equivalent bit of text in a 'trigger' phrase elsewhere in the file. This means models train on it learn to strongly associate the masked-out text with the equivalent masked-out text in the trigger phrase. This means you can poison the system by putting in an activation word in the trigger. Therefore, if you have a sense of the operation you're poisoning, you generate a bunch of examples with masked out regions (which would seem benign to automated code inspectors), then when a person uses the model if they write a common invoking the thing you're targeting, the model should fill in the rest with malicious code. 

Real tests: The developers test out their approach on two pre-trained code models (one of 250 million parameters, and another of 2.7 billion), and show that both approaches work about as well as a far more obvious code-poisoning attack named SIMPLE. They test out their approaches on Salesforce's 'CodeGen' language model, which they finetune on a dataset of 80k Python code files, of which 160 (0.2%) are poisoned. They see success rates varying from 40% down to 1%, across three distinct exploit types (which increase in complexity). 
Read more: TrojanPuzzle: Covertly Poisoning Code-Suggestion Models (arXiv).


Trojan Puzzle attack trains AI assistants into suggesting malicious code

Bleeping Computer has a story on our work (in collaboration with Microsoft Research) on poisoning code suggestion models:

Trojan Puzzle attack trains AI assistants into suggesting malicious code

By Bill Toulas

Person made of jigsaw puzzle pieces

Researchers at the universities of California, Virginia, and Microsoft have devised a new poisoning attack that could trick AI-based coding assistants into suggesting dangerous code.

Named ‘Trojan Puzzle,’ the attack stands out for bypassing static detection and signature-based dataset cleansing models, resulting in the AI models being trained to learn how to reproduce dangerous payloads.

Given the rise of coding assistants like GitHub’s Copilot and OpenAI’s ChatGPT, finding a covert way to stealthily plant malicious code in the training set of AI models could have widespread consequences, potentially leading to large-scale supply-chain attacks.

Poisoning AI datasets

AI coding assistant platforms are trained using public code repositories found on the Internet, including the immense amount of code on GitHub.

Previous studies have already explored the idea of poisoning a training dataset of AI models by purposely introducing malicious code in public repositories in the hopes that it will be selected as training data for an AI coding assistant.

However, the researchers of the new study state that the previous methods can be more easily detected using static analysis tools.

“While Schuster et al.’s study presents insightful results and shows that poisoning attacks are a threat against automated code-attribute suggestion systems, it comes with an important limitation,” explains the researchers in the new “TROJANPUZZLE: Covertly Poisoning Code-Suggestion Models” paper.

“Specifically, Schuster et al.’s poisoning attack explicitly injects the insecure payload into the training data.”

“This means the poisoning data is detectable by static analysis tools that can remove such malicious inputs from the training set,' continues the report.

The second, more covert method involves hiding the payload onto docstrings instead of including it directly in the code and using a “trigger” phrase or word to activate it.

Full Article

Dissecting Distribution Inference

(Cross-post by Anshuman Suri)

Distribution inference attacks aims to infer statistical properties of data used to train machine learning models. These attacks are sometimes surprisingly potent, as we demonstrated in previous work.

KL Divergence Attack

Most attacks against distribution inference involve training a meta-classifier, either using model parameters in white-box settings (Ganju et al., Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations, CCS 2018), or using model predictions in black-box scenarios (Zhang et al., Leakage of Dataset Properties in Multi-Party Machine Learning, USENIX 2021). While other black-box were proposed in our prior work, they are not as accurate as meta-classifier-based methods, and require training shadow models nonetheless (Suri and Evans, Formalizing and Estimating Distribution Inference Risks, PETS 2022).

We propose a new attack: the KL Divergence Attack. Using some sample of data, the adversary computes predictions on local models from both distributions as well as the victim’s model. Then, it uses the prediction probabilities to compute KL divergence between the victim’s models and the local models to make its predictions. Our attack outperforms even the current state-of-the-art white-box attacks.

We observe several interesting trends across our experiments. One striking example is that with varying task-property correlation. While intuition suggests increasing inference leakage with increasing correlation between the classifier's task and the property being inferred, we observe no such trend:
Graph of accuracy for properties with different correlation
Distinguishing accuracy for different task-property pairs for Celeb-A dataset for varying correlation. Task-property correlations are: $\approx 0$ (Mouth Slightly Open-Wavy Hair), $\approx 0.14$ (Smiling-Female), $\approx 0.28$ (Female-Young), and $\approx 0.42$ (Mouth Slightly Open-High Cheekbones).

Impact of adversary’s knowledge

We evaluate inference risk while relaxing a variety of implicit assumptions of the adversary;s knowledge in black-box setups. Concretely, we evaluate label-only API access settings, different victim-adversary feature extractors, and different victim-adversary model architectures.

Victim Model Adversary Model
RF LR MLP$_2$ MLP$_3$
Random Forest (RF) 12.0 1.7 5.4 4.9
Linear Regression (LR) 13.5 25.9 3.7 5.4
Two-layer perceptron (MLP$_2$) 0.9 0.3 4.2 4.3
Three-layer perceptron (MLP$_3$) 0.2 0.3 4.0 3.8

Consider inference leakage for the Census19 dataset (table above with mean $n_{leaked}$ values) as an example. Inference risk is significantly higher when the adversary uses models with learning capacity similar to the victim, like both using one of (MLP$_2$, MLP$_3$) or (RF, MLP). Interestingly though, we also observe a sharp increase in inference risk when the victim uses models with low capacity, like LR and RF instead of multi-layer perceptrons.


Finally, we evaluate the effectiveness of some empirical defenses, most of which add noise to the training process.

For instance while inference leakage reduces when the victim utilizes DP, most of the drop in effectiveness comes from a mismatch in the victim’s and adversary’s training environments:

Distinguishing accuracy for different for Census19 (Sex). Attack accuracy drops with stronger DP guarantees i.e. decreasing privacy budget $\epsilon$.

Compared to an adversary that does not use DP, there is a clear increase in inference risk (mean $n_{leaked}$ increases to 2.9 for $\epsilon=1.0$, and 4.8 for $\epsilon=0.12$ compared to 4.2 without any DP noise).

Our exploration of potential defenses also reveals a strong connection between model generalization and inference risk (as apparent below, for the case of Celeb-A), suggesting that the defenses that do seem to work are attributable to poor model performance, and not something special about the defense itself (like adversarial training or label noise).
Mean distinguishing accuracy on Celeb-A (Sex), for varying number of training epochs for victim models. Shaded regions correspond to error bars. Distribution inference risk increases as the model trains, and then starts to decrease as the model starts to overfit.


The general approach to achieve security and privacy for machine-learning models is to add noise, but our evaluations suggest this approach is not a principled or effective defense against distribution inference. The main reductions in inference accuracy that result from these defenses seem to be due to the way they disrupt the model from learning the distribution well.

Paper: Anshuman Suri, Yifu Lu, Yanjin Chen, David Evans. Dissecting Distribution Inference. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 8-10 February 2023.