Robust or Private? (Model Inversion Part II)

Felipe Mejia
Published in Gab41
Jan 21, 2020

In a previous post we discussed how a simple model memorizes averages of the data in its weights. That analysis used a single fully connected layer, where the memorized averages could be read directly from the model's weights. Once the dataset increases in variability, however, recovering the average of a class becomes meaningless. Below you can see how this average changes across datasets, from the small AT&T faces dataset to Tiny ImageNet. Extracting the average of an AT&T class yields an image that looks just like the person in that class, while the Tiny ImageNet average goldfish is an orange blur and the CIFAR-10 average dog is just gray.

Class averages, from left to right: AT&T person ID 42, MNIST digit 6, CelebA, Tiny ImageNet goldfish, CIFAR-10 dog

Another factor is that in deeper machine learning models the memorization is distributed across the layers and is more complicated to extract. Simply looking at the model's weights is no longer an option, which leads us to more sophisticated model inversion attacks. In this blog post we discuss how to extend model inversion to large models such as ResNet and to larger, more diverse datasets such as CIFAR-10.

TL;DR

Cyphercat explainer video

Model inversion attacks

One of the most common model inversion attacks is the gradient-based attack from Fredrikson et al. [1]. The basic idea is to feed random noise through the model being attacked (the target model) and backpropagate the loss, but instead of updating the weights, we update the input image. In other words, rather than optimizing the weights to minimize the loss, we optimize the image to minimize the loss, i.e., we generate the image the model considers the most likely sample of a class. The algorithm is the same as the projected gradient descent (PGD) attack found in the adversarial attack literature; the difference is that in PGD the starting input is a real image, while in model inversion it is random numbers.

Model inversion attack
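To make the procedure concrete, here is a minimal PyTorch sketch of that optimization loop, assuming a frozen classifier `target_model`; the optimizer, learning rate, step count, and initialization options are illustrative choices rather than the exact settings we used.

```python
import torch
import torch.nn.functional as F

def invert_class(target_model, target_class, shape=(1, 3, 32, 32),
                 steps=100, lr=0.1, init="zeros", device="cpu"):
    """Gradient-based model inversion: optimize the *input* so the frozen
    target model assigns high probability to target_class."""
    target_model.eval()
    if init == "zeros":
        x = torch.zeros(shape, device=device, requires_grad=True)   # blank canvas
    else:
        x = torch.randn(shape, device=device, requires_grad=True)   # noisy canvas
    label = torch.tensor([target_class], device=device)
    opt = torch.optim.Adam([x], lr=lr)   # the optimizer updates x, not the weights

    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(target_model(x), label)
        loss.backward()            # gradients flow back to the input image
        opt.step()
        x.data.clamp_(0.0, 1.0)    # keep pixel values in a valid range

    return x.detach()
```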

Shokri et al. [3] demonstrated that this method is not very effective on larger datasets. We did our best to tune the learning rate, optimization scheme, and initialization, and our best results are shown below; they are still not very good. We saw improvement when initializing with a homogeneous input instead of random numbers and keeping the number of iterations to around 100. Initializing with a noisy input led to a noisier output, while initializing with all zeros gave a cleaner one. You can think of it as starting a painting from a blank canvas rather than a noisy one.

From left to right: Shokri et al. result, our results with different initializations, and DeepDream-style model inversion, all for the automobile class.

The main issue with this type of model inversion is that the retrieved image is very noisy. To combat this we borrowed the DeepDream technique of optimizing at multiple scales. We first optimize the image at a 4x4 spatial resolution, which focuses the reconstruction on large-scale features, i.e., what macro changes lead the model to think the image is a bird. Once that optimization is complete, we upsample to 8x8 and optimize again to fill in higher-resolution details. This is repeated until the final 32x32 resolution is reached. As shown above, this improves the results, but the reconstructions are still noisy.

Deep Dream optimization
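A sketch of this coarse-to-fine schedule, reusing the same kind of inversion loop as above. Upsampling the low-resolution canvas to the classifier's 32x32 input size before scoring is one implementation choice, and the scale schedule and step counts are assumptions, not our exact settings.

```python
import torch
import torch.nn.functional as F

def multiscale_invert(target_model, target_class, final_size=32, start_size=4,
                      steps_per_scale=100, lr=0.1, device="cpu"):
    """Coarse-to-fine inversion: optimize a 4x4 canvas, upsample to 8x8,
    and so on until the final 32x32 resolution is reached."""
    target_model.eval()
    label = torch.tensor([target_class], device=device)
    x = torch.zeros(1, 3, start_size, start_size, device=device)
    size = start_size

    while True:
        x = x.clone().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps_per_scale):
            opt.zero_grad()
            # the classifier expects full-resolution inputs, so upsample before scoring
            x_full = F.interpolate(x, size=final_size, mode="bilinear", align_corners=False)
            loss = F.cross_entropy(target_model(x_full), label)
            loss.backward()
            opt.step()
            x.data.clamp_(0.0, 1.0)
        if size >= final_size:
            break
        size *= 2
        # carry the coarse result up to the next resolution and keep refining it
        x = F.interpolate(x.detach(), size=size, mode="bilinear", align_corners=False)

    return x.detach()
```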

Adversarial attacks

The core problem with these model inversion attacks is that the solutions they find are adversarial samples in the input space. Adversarial attacks have recently gained attention as examples of how a small modification to the input can lead to a large change in the output. For example, below is an image of a goldfish with added noise that the model classifies as a boa constrictor. This kind of adversarial noise can also be produced by model inversion attacks.

Adversarial attack of goldfish to boa constrictor
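For reference, here is a minimal PGD attack sketch in PyTorch; the epsilon, step size, and step count are placeholder values, not the ones used for the goldfish example above.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Projected gradient descent: search for a perturbation within an
    eps-ball around x that maximizes the classification loss."""
    model.eval()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()            # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                               # keep pixels valid
    return x_adv.detach()
```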

To find what the model has actually memorized, we need to remove the adversarial inputs from the model, or at least reduce them. There is a great deal of research into how best to defend against adversarial attacks, but we found adversarial training to be the most effective defense. The premise of this defense is to attack the model at each training iteration and use the resulting adversarial examples as the training data. After training, the model has learned to correctly classify adversarial examples within a certain perturbation radius.
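A sketch of one adversarial training step, reusing the `pgd_attack` helper above; this is the standard PGD-based recipe, and details such as the attack budget and the eval/train mode switch are assumptions, not a transcript of our training code.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y):
    """One optimization step on adversarial examples instead of the clean batch."""
    model.eval()                              # craft attacks against the current weights
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()                           # this time the gradients update the weights
    optimizer.step()
    return loss.item()
```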

CIFAR-10

We trained three models on the CIFAR-10 dataset, two traditionally trained (VGG16 and ResNet) and one adversarially trained (ResNet), and then attacked them with the PGD-style model inversion attack described above. The results are shown below, and the difference in reconstructions is night and day. By reducing the number of adversarial examples through adversarial training, the reconstructions become clearer and more detailed, and the images the model memorized can be identified much more easily.

PGD model inversion reconstruction for different models

One interesting difference between the memorization of deep models and that of shallow models is the number of samples that are memorized. In the previous post we discussed how the model memorized the average of each class. A deep model has enough capacity to both cluster and memorize, which means we can generate higher-fidelity images that are not blurred by variations in pose, and we can generate a greater number of distinct samples, as shown below.

Generating different samples of birds
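One simple way to reproduce this behaviour with the `invert_class` sketch from earlier is to restart the optimization from different random canvases; `robust_resnet` is a placeholder for the adversarially trained model, and class index 2 is CIFAR-10's bird class.

```python
# Hypothetical usage: each random starting canvas converges to a different
# memorized mode of the bird class (index 2 in CIFAR-10).
samples = [invert_class(robust_resnet, target_class=2, init="noise")
           for _ in range(8)]
```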

The next question is which actual data points are being memorized. To answer it, we computed the cosine similarity between the outputs of the target model's final convolutional layer for our reconstructed images and for every image in the CIFAR-10 dataset. This is how we learned that the strange long-necked blue bird we generated is actually real: it is called a cassowary, and there are numerous samples of it in the dataset. For traditionally trained models, the nearest samples to the generated images show little structure, suggesting that adversarially trained models not only generate higher-fidelity images but also focus more on particular samples or clusters.

Nearest samples in dataset
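A sketch of that nearest-neighbour lookup; `feature_extractor` stands for the target model truncated after its last convolutional block (for a torchvision ResNet, something like `torch.nn.Sequential(*list(model.children())[:-1])`), and the batching is simplified for clarity.

```python
import torch
import torch.nn.functional as F

def nearest_training_images(feature_extractor, reconstruction, dataset_images, k=5):
    """Rank dataset images by the cosine similarity of their convolutional
    features to the features of a single reconstructed image."""
    feature_extractor.eval()
    with torch.no_grad():
        query = feature_extractor(reconstruction).flatten(1)   # (1, d)
        feats = feature_extractor(dataset_images).flatten(1)   # (N, d)
        sims = F.cosine_similarity(feats, query, dim=1)        # (N,)
        top = sims.topk(k)
    return top.indices, top.values                             # most similar training images
```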

Inverting facial expressions

Extracting images from a dataset of animals and objects may not be too concerning; most people are more than happy to show off pictures of their dogs or cats. But what about individual faces? The same methods extend to models trained on facial images. We tested this on the Kaggle facial expression recognition (FER) challenge, which consists of facial images grouped into seven emotion classes (angry, disgust, fear, happy, sad, surprise, and neutral). The model is not trained to recognize any specific individual in the dataset, only to classify emotions, yet we observed that in some cases it memorizes certain individuals more than others. We trained the same models as before, one adversarially trained and one traditionally trained; the results are shown below.

Image from the FER dataset on the left and the model-inverted image for the surprise class on the right

Although the facial reconstructions are not perfect, they do reveal information about the original individuals. As training techniques continue to improve, models become more capable, and with that capability come new vulnerabilities. Adversarial training is one example: by making models more robust, it also makes them easier to interpret, and therefore their secrets easier to extract. If you are interested in diving deeper, check out our paper [2].

[1] M. Fredrikson, S. Jha, and T. Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333. ACM, 2015.

[2] F. A. Mejia, P. Gamble, Z. Hampel-Arias, M. Lomnitz, N. Lopatina, L. Tindall, and M. A. Barrios. Robust or Private? Adversarial Training Makes Models More Vulnerable to Privacy Attacks. arXiv preprint arXiv:1906.06449, 2019.

[3] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 3–18. IEEE, 2017.

Lab41 is a Silicon Valley challenge lab where experts from the U.S. Intelligence Community (IC), academia, industry, and In-Q-Tel come together to gain a better understanding of how to work with — and ultimately use — big data.

Learn more at lab41.org and follow us on Twitter: @_lab41
