Manipulating Transfer Learning for Property Inference
Transfer learning is a popular method to train deep learning models
efficiently. By reusing parameters from upstream pre-trained models,
the downstream trainer can use fewer computing resources to train
downstream models, compared to training models from scratch.
The figure below shows the typical process of transfer learning for
vision tasks:
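In code, this workflow looks roughly like the following minimal PyTorch sketch; the frozen ResNet-18 backbone, linear head, and hyperparameters are illustrative assumptions, not necessarily the setup used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Upstream: a pre-trained feature extractor (backbone choice is illustrative).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()              # drop the upstream classification head
for p in backbone.parameters():
    p.requires_grad = False              # assume the extractor is kept frozen

# Downstream: attach and tune a small task-specific head on local data.
head = nn.Linear(512, 10)                # 512 = ResNet-18 feature dimension
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def tuning_step(x, y):
    with torch.no_grad():
        feats = backbone(x)              # reuse upstream features as-is
    loss = nn.functional.cross_entropy(head(feats), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```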
However, the nature of transfer learning can be exploited by a
malicious upstream trainer, leading to severe risks to the downstream
trainer.
Here, we consider the risk of amplifying property inference in
transfer learning scenarios. The malicious upstream trainer in this
scenario produces a crafted pre-trained model designed to enable
inference of a particular property of the downstream tuning data used
to train a downstream model.
The attack process is illustrated below:
The main idea of the attack is to manipulate the upstream model
(feature extractor) to purposefully generate activations in
different distributions for samples with and without the target
property. When the downstream trainer uses this upstream model for
transfer learning, the differences between the downstream models tuned
with and without samples that have the target property will also be
amplified, thus making the inference easier.
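As a toy illustration of this intuition (purely synthetic numbers, not the paper's construction), suppose the upstream trainer crafts one feature-extractor channel to fire only on samples with the target property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of one manipulated feature-extractor channel:
# crafted to stay near zero on ordinary samples and fire strongly on
# samples with the target property (numbers are made up for illustration).
acts_without = np.clip(rng.normal(0.0, 0.05, size=1000), 0, None)
acts_with    = np.clip(rng.normal(3.0, 0.50, size=1000), 0, None)

# Anything downstream that depends on this channel (its gradients, the
# tuned weights attached to it, the model's confidence scores) inherits
# this separation, which is what the inference attack exploits.
print(f"mean activation without property: {acts_without.mean():.3f}")
print(f"mean activation with property:    {acts_with.mean():.3f}")
```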
The adversary can then conduct the inference attacks with white-box access
(e.g., by manually inspecting the downstream models) or black-box API
access (e.g., using meta-classifiers).
Zero Activation Attack
Upstream Manipulation. In this attack, the upstream model is manipulated
so that certain parameters in the downstream model will not be updated if
the tuning data do not have the target property (e.g., the feature
extractor produces zero activations on some secret-secreting parameters
and hence, by the chain rule, zero gradients in downstream training), but
will be updated if some tuning data have the property (e.g., non-zero
activations on the secreting parameters and hence non-zero gradients in
downstream training).
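The chain-rule argument can be seen in a minimal sketch, assuming a toy three-dimensional feature vector whose last coordinate is the manipulated, secret-secreting one (not the paper's actual architecture):

```python
import torch
import torch.nn as nn

# Hypothetical feature-extractor outputs for one sample: the last coordinate
# is the "secret-secreting" one, manipulated to be zero unless the input has
# the target property (toy values, not the paper's construction).
feats_no_prop   = torch.tensor([[0.7, 1.2, 0.0]])
feats_with_prop = torch.tensor([[0.7, 1.2, 2.5]])

head = nn.Linear(3, 2, bias=False)       # downstream layer being tuned
target = torch.tensor([1])

for name, feats in [("without property", feats_no_prop),
                    ("with property   ", feats_with_prop)]:
    head.zero_grad()
    loss = nn.functional.cross_entropy(head(feats), target)
    loss.backward()
    # Gradient on the weights attached to the secret-secreting coordinate:
    print(name, head.weight.grad[:, 2])

# Without the property the printed gradients are exactly zero (the chain rule
# multiplies the upstream activation, which is 0, into every term), so those
# weights never move during downstream tuning; with the property they do.
```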
Property Inference on Downstream Model. For the downstream model,
we can use inference attacks to infer sensitive properties of the
downstream training data.
In white-box settings, where the attacker has complete knowledge of the
model, in addition to evaluating standard white-box meta-classifier
based attacks (white-box meta-classifier), we propose two new
methods: directly comparing the actual values of the secreting
parameters before and after downstream training (the Difference
attack), or analyzing their variance in the final tuned model (the
Variance attack).
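A hedged sketch of the decision rules behind these two tests, with illustrative thresholds and with the secret-secreting parameters passed in as plain tensors (the paper's actual attacks are more involved):

```python
import torch

def difference_attack(secreting_before, secreting_after, tol=1e-6):
    """Flag the property as present if the secret-secreting parameters
    moved at all between the pre-trained and tuned models (toy threshold)."""
    return (secreting_after - secreting_before).abs().max().item() > tol

def variance_attack(secreting_after, threshold):
    """Flag the property from the tuned model alone, using the spread of
    the secret-secreting parameters (threshold calibrated by the attacker)."""
    return secreting_after.var().item() > threshold
```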
In the black-box setting with API access, attackers can employ
existing black-box methods, such as black-box meta-classifier based
approaches (black-box meta-classifier) and tests based on the confidence
scores returned for queried samples (Confidence score).
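A rough sketch of a confidence-score test under API-only access; the query_api function, probe set, and summary statistic here are placeholders rather than the paper's exact procedure:

```python
import numpy as np

def confidence_score_test(query_api, probe_inputs, threshold):
    """query_api(x) is assumed to return the deployed model's confidence
    scores for input x (the only thing API access provides). The decision
    threshold would be calibrated on shadow models tuned with and without
    the target property."""
    top_scores = np.array([np.max(query_api(x)) for x in probe_inputs])
    return top_scores.mean() > threshold   # True => infer property present
```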
Results. The results are summarized in the graphs
above. Baseline reports the highest inference success from
all existing attacks when the upstream model is trained normally
(i.e., without any manipulation). The results indicate that the
inference is much more successful with manipulation compared to the
baseline setting. In particular, in the baseline setting, most of
the inference AUC scores are below 0.7. However, after manipulation,
the inferences show AUC scores greater than 0.89 even when only 0.1%
(10 out of 10 000) of the downstream samples have the target
property. Moreover, the results achieve perfect scores (AUC score >
0.99) when the ratio of target samples in the downstream training
set increases to 1% (100 out of 10 000).
Stealthier Attack. The above results apply to settings
where there are no active defenses inspecting the pre-trained
models. We find that when the victim deploys defenses,
the above strategy can be easily spotted, either by inspecting the
abnormal number of zero activations in the downstream models or by
leveraging existing backdoor detection strategies originally designed
to detect abnormal backdoor samples. To
circumvent this issue, we designed a stealthier version of the
attack that no longer relies on zero activations to distinguish
between training data with and without the property, and that also evades
state-of-the-art backdoor detection strategies. The stealthier
attack sacrifices a little of the property inference's effectiveness,
but is still significantly more successful than the
baseline setting without manipulation, indicating the significant
privacy risk exposed by transfer learning and motivating future
research into defending against these types of attacks.
Paper
Yulong Tian, Fnu Suya, Anshuman Suri, Fengyuan Xu, David Evans. Manipulating Transfer Learning for Property Inference. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, 18–22 June 2023. [arXiv]
Code: https://github.com/yulongt23/transfer-inference
I was interviewed for a Voice of America story (in Russian) on the impact of ChatGPT and similar tools.
Full story: https://youtu.be/dFuunAFX9y4
Anshuman Suri wrote up an interesting
post on his experience with the MICO
Challenge, a membership inference
competition that was part of SaTML. Anshuman
placed second in the competition (on the CIFAR data set), where the
metric is the highest true positive rate at a 0.1 false positive rate over
a set of models (some trained using differential privacy and some
without).
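For reference, that metric can be computed from per-example membership scores roughly as follows (a scikit-learn sketch; function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fixed_fpr(true_membership, membership_scores, target_fpr=0.1):
    """True positive rate at a fixed false positive rate (0.1 in MICO)."""
    fpr, tpr, _ = roc_curve(true_membership, membership_scores)
    return float(np.interp(target_fpr, fpr, tpr))
```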
Anshuman’s post describes the methods he used and his experience in
the competition: My submission to the MICO
Challenge.
Jack Clark’s Import AI, 16 Jan 2023 includes a nice description of our work on TrojanPuzzle:
####################################################
Uh-oh, there's a new way to poison code models - and it's really hard to detect:
…TROJANPUZZLE is a clever way to trick your code model into betraying you - if you can poison the underlying dataset…
Researchers with the University of California, Santa Barbara, Microsoft Corporation, and the University of Virginia have come up with some clever, subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain bits of code, they can increase the likelihood of generative models trained on that code outputting buggy stuff.
What's different about this: A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the 'SIMPLE' approach… because it's very simple!
Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks.
- COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. "This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context," the authors write.
- TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only generates a subset of that - it masks out some of the full payload and also masks out an equivalent bit of text in a 'trigger' phrase elsewhere in the file. This means models trained on it learn to strongly associate the masked-out text with the equivalent masked-out text in the trigger phrase. This means you can poison the system by putting an activation word in the trigger. Therefore, if you have a sense of the operation you're poisoning, you generate a bunch of examples with masked-out regions (which would seem benign to automated code inspectors), then when a person uses the model, if they write a command invoking the thing you're targeting, the model should fill in the rest with malicious code.
Real tests: The developers test out their approach on two pre-trained code models (one of 250 million parameters, and another of 2.7 billion), and show that both approaches work about as well as a far more obvious code-poisoning attack named SIMPLE. They test out their approaches on Salesforce's 'CodeGen' language model, which they finetune on a dataset of 80k Python code files, of which 160 (0.2%) are poisoned. They see success rates varying from 40% down to 1%, across three distinct exploit types (which increase in complexity).
Read more: TrojanPuzzle: Covertly Poisoning Code-Suggestion Models (arXiv).
####################################################