Share your Model instead of your Data!

Our paper "Share your Model instead of your Data: Privacy Preserving Mimic Learning for Ranking", with Hosein Azarbonyad, Jaap Kamps, and Maarten de Rijke, has been accepted at Neu-IR: SIGIR Workshop on Neural Information Retrieval (NeuIR'17). \o/

In this paper, we aim to lay the groundwork for the idea of sharing a privacy preserving model instead of sensitive data in IR applications. The idea is that researchers in industry can share the knowledge learned from actual users' data with the academic community, leading to better collaboration among all researchers in the field.

Deep neural networks demonstrate undeniable success in several fields and employing them is taking off for information retrieval problems. It has been shown that supervised neural network models perform better as the training dataset grows bigger and becomes more diverse. Information retrieval is an experimental and empirical discipline; thus, having access to large-scale real datasets is essential for designing effective IR systems. However, in many information retrieval tasks, due to the sensitivity of users' data and privacy issues, not all researchers have access to large-scale datasets for training their models.

Much research has been done on the general problem of preserving the privacy of sensitive data in IR applications, where the question is: how should we design effective IR systems without damaging users' privacy?

One solution so far is to anonymize the data, hiding the identities of users. However, there is no guarantee that the anonymized data will be as effective as the original data.

With machine learning-based approaches, sharing a trained model instead of the original data has turned out to be an option for transferring knowledge. The idea of mimic learning is to use a model trained on the signals from the original training data to annotate a large set of unlabeled data, and then to use these annotations as training signals for a new model. It has been shown, for many tasks in computer vision and natural language processing, that we can transfer knowledge this way and that the newly trained models perform as well as models trained on the original training data.
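As a concrete illustration, here is a minimal mimic-learning sketch in Python. The data, labels, and scikit-learn models are stand-ins of our own, not the paper's ranker or datasets: a teacher trained on private data labels public unlabeled data, and a student is trained on those labels alone.

```python
# Minimal mimic-learning sketch (synthetic stand-ins, not the paper's setup).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X_private = rng.randn(1000, 20)                # sensitive training data
y_private = (X_private[:, 0] > 0).astype(int)  # its (private) labels
X_public = rng.randn(5000, 20)                 # unlabeled public data

# Teacher: trained directly on the sensitive data.
teacher = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
teacher.fit(X_private, y_private)

# Knowledge transfer: the teacher's predictions become the student's
# training signal; the student never sees the private data itself.
y_transfer = teacher.predict(X_public)
student = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
student.fit(X_public, y_transfer)
```

Only the student is ever shared; the private examples and their labels stay behind.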

However, trained models can expose private information from the dataset they have been trained on. Hence, the problem of preserving the privacy of the data turns into the problem of preserving the privacy of the model. Modeling privacy in machine learning is a challenging problem and there has been much research in this area. Preserving the privacy of deep learning models is even more challenging, as there are more parameters to be safeguarded. Some work has studied the vulnerability of deep neural networks offered as a service, where interaction with the model is only via an input-output black box. Others have proposed approaches to protect privacy against an adversary with full knowledge of the training mechanism and access to the model's parameters.

More recently, Papernot et al.1 proposed a semi-supervised method for transferring the knowledge of deep learning models trained on private data. Their setup learns privacy-preserving student models by transferring knowledge from an ensemble of teachers trained on disjoint subsets of the data, for which privacy guarantees are provided.

In our paper, we investigate the possibility of mimic learning for document ranking and study techniques aimed at preserving privacy in mimic learning for this task. Generally, we address two research questions:

  • RQ1: Can we use mimic learning to train a neural ranker?
  • RQ2: Are privacy preserving mimic learning methods effective for training a neural ranker?

To address the first research question, we simply test whether we can train a neural ranker with the output of another neural ranker as the supervision signal.

To investigate the second question, we apply the idea of knowledge transfer for deep neural networks from private training data, proposed by Papernot et al.

The model is illustrated in the figure above. It is, in fact, a private aggregation of teacher ensembles based on the teacher-student paradigm, designed to preserve the privacy of training data. First, the sensitive training data is divided into n partitions. Then, on each partition, an independent neural network model is trained as a teacher. Once the teachers are trained, an aggregation step uses majority voting to generate a single global prediction. Laplacian noise is injected into each teacher's prediction before aggregation. This noise is what protects privacy, because it obfuscates the vulnerable cases where teachers disagree. The aggregated teacher can be considered a differentially private API: we submit an input and it returns a privacy preserving label.
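To make the aggregation step concrete, here is a small sketch of the noisy voting mechanism in Python (our own illustration of the idea, not the paper's code): per-teacher votes are counted, Laplacian noise with scale 1/gamma is added to each count, and the noisy maximum becomes the privacy preserving label.

```python
# Sketch of noisy majority voting over teacher predictions (illustrative only).
import numpy as np

def noisy_aggregate(teacher_votes, n_labels, gamma, rng):
    """teacher_votes: per-teacher predicted labels for one input.
    gamma sets the Laplace scale (1/gamma); smaller gamma means more
    noise and, in turn, a stronger privacy guarantee."""
    counts = np.bincount(teacher_votes, minlength=n_labels).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=n_labels)
    return int(np.argmax(counts))

rng = np.random.RandomState(42)
votes = np.array([1, 1, 0, 1, 0])  # hypothetical votes from 5 teachers
label = noisy_aggregate(votes, n_labels=2, gamma=0.5, rng=rng)
```

Note the trade-off controlled by gamma: more noise better hides the influence of any individual teacher (and hence any individual data partition), at the cost of noisier labels for the student.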

In some circumstances, for efficiency reasons, the model needs to be deployed on the user's device. To generate a shareable model in which the privacy of the training data is preserved, we can train an additional model, called the student model. The student model has access to unlabeled public data during training. The unlabeled public data is annotated using the aggregated teacher, transferring knowledge from teachers to the student model in a privacy preserving fashion. This way, if an adversary tries to recover the training data by inspecting the parameters of the student model, in the worst case, only the public training instances with privacy preserving labels from the aggregated teacher are revealed. The privacy guarantee of this approach is formally proved using the differential privacy framework.
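For reference, the guarantee is stated in the standard (ε, δ)-differential privacy framework (our summary of the definition, not a derivation from the paper): a randomized mechanism M, here the noisy aggregated teacher, is (ε, δ)-differentially private if for any two datasets d and d′ differing in a single training example, and any set S of possible outputs,

```latex
\Pr[\mathcal{M}(d) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(d') \in S] + \delta
```

Intuitively, adding or removing any single user's example can change the distribution over labels the aggregated teacher emits only by a bounded amount.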

In our experiments, we showed that a student ranker trained on a dataset labeled with the predictions of a teacher model can perform almost as well as the teacher itself. This demonstrates the potential of mimic learning for the ranking task, which can overcome the lack of large datasets for ad-hoc IR and open up future research in this direction. We also showed that with the privacy preserving mimic learning framework, not only is the privacy of users guaranteed, but we can also achieve acceptable performance.

For more details on the experiments and results, please take a look at our paper.

  1. Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. 2017. Semi-supervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations (ICLR'17).