I've started an internship at Apple in San Francisco. I am working with the Siri Core ML team on learning disentangled representations for natural language 🙂
We are organizing the "Learning From Noisy/Limited Data for Information Retrieval" workshop, which is co-located with SIGIR 2018. This is the first edition of the workshop, and its goal is to bring together researchers from industry, where data is plentiful but noisy, with researchers from academia, where data is sparse but clean, to discuss solutions to these related problems.
We invited contributions on the following topics:
- Learning from noisy data for IR
- Learning from automatically constructed data
- Learning from implicit feedback data, e.g., click data
- Distant or weak supervision and learning from IR heuristics
- Unsupervised and semi-supervised learning for IR
- Transfer learning for IR
- Incorporating expert/domain knowledge to improve learning-based IR models
- Learning from labeled features
- Incorporating IR axioms to improve machine learning models
Marc Najork is going to give a fantastic keynote on "Using biased data for learning-to-rank", and we have a set of great papers (including mine :P) that are going to be presented at the workshop, as well as a discussion panel with wonderful panelists from both industry and academia.
Save the date on your calendar!
Our paper "Fidelity-Weighted Learning", with Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf, has been accepted at the Sixth International Conference on Learning Representations (ICLR 2018). \o/
Fidelity-weighted learning (FWL) is a semi-supervised student-teacher approach for training deep neural networks using weakly-labeled data. It modulates the parameter updates to a student network, which is trained on the task we care about, on a per-sample basis, according to the posterior confidence in each label's quality as estimated by a Bayesian teacher, which has access to a rather small amount of high-quality labels.
The success of deep neural networks to date depends strongly on the availability of labeled data, which is costly and not always easy to obtain. Usually, it is much easier to obtain small quantities of high-quality labeled data and large quantities of unlabeled data. The problem of how to best integrate these two different sources of information during training is an active pursuit in the field of semi-supervised learning, and here, with FWL, we propose an idea to address this question.
Learning from samples of variable quality
For a large class of tasks, it is also easy to define one or more so-called “weak annotators”: additional (albeit noisy) sources of weak supervision based on heuristics, or “weaker”, biased classifiers trained on, e.g., non-expert crowd-sourced data or data from related domains. While easy and cheap to generate, it is not immediately clear if and how these additional weakly-labeled data can be used to train a stronger classifier for the task we care about. More generally, in almost all practical applications, machine learning systems have to deal with data samples of variable quality. For example, in a large dataset of images, only a small fraction of samples may be labeled by experts, while the rest may be crowd-sourced using, e.g., Amazon Mechanical Turk. In addition, in some applications, labels are intentionally perturbed due to privacy concerns.
Assuming we can obtain a large set of weakly-labeled data in addition to a much smaller training set of “strong” labels, the simplest approach is to expand the training set by including the weakly-supervised samples (all samples are equal). Alternatively, one may pretrain on the weak data and then fine-tune on observations from the true function or distribution (which we call strong data). Indeed, a small amount of expert-labeled data can be augmented in such a way by a large set of raw data, with labels coming from a heuristic function, to train a more accurate neural ranking model. The downside is that such approaches are oblivious to the amount or source of noise in the labels.
All labels are equal, but some labels are more equal than others, just like Orwell's animals.
Inspired by George Orwell, Animal Farm, 1945.
We argue that treating weakly-labeled samples uniformly (i.e. each weak sample contributes equally to the final classifier) ignores potentially valuable information of the label quality. Instead, we propose Fidelity-Weighted Learning (FWL), a Bayesian semi-supervised approach that leverages a small amount of data with true labels to generate a larger training set with confidence-weighted weakly-labeled samples, which can then be used to modulate the fine-tuning process based on the fidelity (or quality) of each weak sample. By directly modeling the inaccuracies introduced by the weak annotator in this way, we can control the extent to which we make use of this additional source of weak supervision: more for confidently-labeled weak samples close to the true observed data, and less for uncertain samples further away from the observed data.
How does fidelity-weighted learning work?
We propose a setting consisting of two main modules:
- One is called the student and is in charge of learning a suitable data representation and performing the main prediction task,
- The other is the teacher which modulates the learning process by modeling the inaccuracies in the labels.
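The two modules above can be sketched on a toy problem. This is a minimal numpy illustration of the FWL idea, not the paper's exact model: the teacher is a small Gaussian process regressor fit on the few strong labels, the student is a random-feature regressor, and the `fidelity` mapping from the teacher's posterior variance to a sample weight is a hypothetical stand-in for the paper's scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=1.0):
    """RBF kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

# --- teacher: a GP fit on the small strong set -------------------------------
X_strong = rng.uniform(-3, 3, (20, 1))   # few expert labels of y = sin(x)
y_strong = np.sin(X_strong[:, 0])

K = rbf(X_strong, X_strong) + 1e-2 * np.eye(20)
K_inv = np.linalg.inv(K)

X_weak = rng.uniform(-3, 3, (200, 1))    # plenty of weakly-labeled points
Ks = rbf(X_weak, X_strong)
mu = Ks @ K_inv @ y_strong               # teacher's soft labels for weak data
var = np.clip(1.0 - np.einsum('ij,jk,ik->i', Ks, K_inv, Ks), 1e-8, None)

# fidelity per weak sample: confident (low-variance) samples weigh more
fidelity = np.exp(-var / var.mean())     # hypothetical confidence mapping

# --- student: random-feature regressor, loss weighted by fidelity ------------
W = rng.normal(size=(1, 32)); b = rng.uniform(0, 2 * np.pi, 32)
def feats(X): return np.cos(X @ W + b)

Phi = feats(X_weak)
# weighted ridge regression: each weak sample contributes in proportion
# to the teacher's confidence in its (re)label
w = np.linalg.solve(Phi.T @ (fidelity[:, None] * Phi) + 1e-3 * np.eye(32),
                    Phi.T @ (fidelity * mu))
```

The key point is the last step: the student never sees the raw weak labels uniformly; its loss is modulated per sample by the teacher's posterior confidence.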
Our paper "Learning to Learn from Weak Supervision by Full Supervision", with Sascha Rothe, and Jaap Kamps, has been accepted at NIPS2017 Workshop on Meta-Learning (MetaLearn 2017). \o/
Using weak or noisy supervision is a straightforward approach to increasing the size of the training data, and it has been shown that the output of heuristic methods can be used as a weak or noisy signal, along with a small amount of labeled data, to train neural networks. This is usually done by pre-training the network on the weak data and fine-tuning it with the true labels. However, these two independent stages do not leverage the full capacity of information from the true labels, and using noisy labels of lower quality often brings little to no improvement. This issue is tackled by noise-aware models, where denoising the weak signal is part of the learning process.
We propose a meta-learning approach in which we train two networks: a target network, which plays the role of the learner and uses a large set of weakly annotated instances to learn the main task, and a confidence network, which plays the role of the meta-learner and is trained on a small human-labeled set to estimate confidence scores. These scores define the magnitude of the weight updates to the target network during the back-propagation phase. The goal of the confidence network, trained jointly with the target network, is to calibrate the learning rate of the target network for each instance in the batch. I.e., the weights $w$ of the target network at step $t$ are updated as follows:

$$w_{t+1} = w_t - \eta_t \, c_\theta(x_i, \tilde{y}_i) \, \nabla \mathcal{L}(\hat{y}_i, \tilde{y}_i)$$

where $\eta_t$ is the global learning rate, $\mathcal{L}(\hat{y}_i, \tilde{y}_i)$ is the loss of predicting $\hat{y}_i$ for an input $x_i$ when the label is $\tilde{y}_i$; $c_\theta(x_i, \tilde{y}_i)$ is a scoring function learned by the confidence network, taking the input instance $x_i$ and its noisy label $\tilde{y}_i$. Thus, we can effectively control the contribution to the parameter updates of the target network from weakly labeled instances, based on how reliable their labels are according to the confidence network, learned on a small supervised set.
Our setup requires running a weak annotator to label a large amount of unlabeled data, which is done at pre-processing time. For many tasks, it is possible to use a simple heuristic to generate weak labels. This set is then used to train the target network. In contrast, a small human-labeled set is used to train the confidence network. The general architecture of the model is illustrated in the figure below:
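The update rule above can be illustrated with a deliberately tiny example. This is a hedged sketch, not the paper's model: the task is 1-D linear regression, and the confidence score is an oracle-style proxy standing in for the trained confidence network, just to show how the score scales each per-sample gradient step.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy task: the true relation is y = 2x; weak labels are corrupted to
# varying degrees, mimicking a weak annotator of variable reliability
X_weak = rng.normal(0, 1, 300)
noise_level = rng.uniform(0, 1, 300)          # per-sample corruption amount
y_weak = 2 * X_weak + noise_level * rng.normal(0, 2, 300)

# stand-in for c(x, y~): in the paper this score comes from the trained
# confidence network; here we use an oracle proxy purely for illustration
confidence = 1.0 - noise_level

theta, eta = 0.0, 0.1
for x, y, c in zip(X_weak, y_weak, confidence):
    pred = theta * x
    grad = (pred - y) * x      # gradient of 0.5 * (pred - y)**2 w.r.t. theta
    theta -= eta * c * grad    # update magnitude scaled by the confidence score
```

Unreliable samples (low `c`) barely move the parameters, while clean samples drive the learning, which is exactly the calibration role the confidence network plays.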
Another ILPSer, Christophe Van Gysel, has just defended his PhD dissertation on "Remedies against the Vocabulary Gap in Information Retrieval". I've designed his thesis cover:
and here is the bookmark:
This post is about the project I've done in collaboration with Aliaksei Severyn, Sascha Rothe, and Jaap Kamps, during my internship at Google Research.
Deep neural networks have shown impressive results in many tasks in computer vision, natural language processing, and information retrieval. However, their success is conditioned on the availability of exhaustive amounts of labeled data, while for many tasks such data is not available. Hence, unsupervised and semi-supervised methods are becoming increasingly attractive.
Using weak or noisy supervision is a straightforward approach to increasing the size of the training data. In one of my previous posts, I talked about how to beat your teacher, which provides insight into how to train a neural network model using only the output of a heuristic model as a supervision signal, eventually performing better than that heuristic model. Assuming that, as is often the case, besides a lot of unlabeled (or weakly labeled) data there is a small amount of training data with strong (true) labels, i.e. a semi-supervised setup, here I'll talk about how to learn from a weak teacher and avoid its mistakes.
This is usually done by pre-training the network on the weak data and fine-tuning it with the true labels. However, these two independent stages do not leverage the full capacity of information from the true labels. For instance, in the pre-training stage, there is no handle to control the extent to which the weakly-labeled data contribute to the learning process, even though they can be of very different quality.
In this post, I'm going to talk about our proposed idea which is a semi-supervised method that leverages a small amount of data with true labels along with a large amount of data with weak labels. Our proposed method has three main components:
- A weak annotator, which can be a heuristic model, a weak classifier, or even humans via crowdsourcing, and is employed to annotate a massive amount of unlabeled data.
- A target network, which uses the large set of instances annotated by the weak annotator to learn the main task.
- A confidence network, which is trained on a small human-labeled set to estimate confidence scores for instances annotated by the weak annotator. We train the target network and the confidence network in a multi-task fashion.
In a joint learning process, the target network and the confidence network try to learn a suitable representation of the data, and this layer is shared between them as a two-way communication channel. The target network learns to predict the label of a given input under the supervision of the weak annotator. At the same time, the output of the confidence network, i.e. the confidence scores, defines the magnitude of the weight updates to the target network, with respect to the loss computed from the weak annotator's labels, during the target network's back-propagation phase. This way, the confidence network helps the target network avoid the mistakes of its teacher, i.e. the weak annotator, by down-weighting the updates from weak labels that do not look reliable to the confidence network.
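A compact numpy sketch of this architecture follows. It is a simplification under stated assumptions: the shared representation is a fixed random-feature layer standing in for the jointly learned encoder, both heads are fit in closed form rather than by joint back-propagation, and the weak annotator is a made-up heuristic for a toy target function.

```python
import numpy as np

rng = np.random.default_rng(2)

# shared representation layer (random features standing in for the
# jointly learned encoder shared by both networks)
W = rng.normal(0, 1, (1, 16)); b = rng.uniform(0, 2 * np.pi, 16)
def encode(x):
    return np.cos(x[:, None] * W[0] + b)

# a biased heuristic "weak annotator" for the true relation y = x**2
def weak_annotator(x): return 1.5 * np.abs(x)

X_weak = rng.uniform(-2, 2, 400)         # large weakly-annotated set
y_weak = weak_annotator(X_weak)

X_clean = rng.uniform(-2, 2, 40)         # small human-labeled set
y_clean = X_clean ** 2

# confidence head: fit on the clean set to predict how reliable the
# weak annotator is for a given input (1 = reliable, near 0 = noisy)
err = np.abs(weak_annotator(X_clean) - y_clean)
conf_target = np.exp(-err)
Phi_c = encode(X_clean)
w_conf = np.linalg.solve(Phi_c.T @ Phi_c + 1e-2 * np.eye(16),
                         Phi_c.T @ conf_target)

# target head: trained on weak labels, with each sample's contribution
# down-weighted by the confidence head's score
Phi_t = encode(X_weak)
scores = np.clip(Phi_t @ w_conf, 0.0, 1.0)
w_tgt = np.linalg.solve(Phi_t.T @ (scores[:, None] * Phi_t) + 1e-2 * np.eye(16),
                        Phi_t.T @ (scores * y_weak))
```

Because both heads read from the same representation, improving the encoder for the main task also sharpens the confidence estimates, which is the two-way communication channel described above.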
To make it more convenient for those who are going to attend CIKM this year to follow up on our research, we decided to prepare a list of our contributions. I've designed a bookmark for that:
Understanding a document in order to assess its relevance to a query, based only on the document content, appears to be a familiar goal for the information retrieval community; however, this problem has remained largely intractable, despite repeated attacks over many years. Meanwhile, people are able to assess relevance quite well, though unfamiliar topics and complex documents can defeat them. This assessment may require the ability to understand language, images, document structure, videos, audio, and functional elements. In turn, understanding of these elements is built on background information about the world, such as human behavior patterns, and even more fundamental truths such as the existence of time, space, and people. All this comes naturally to people, but not to computers!
Recently, large-scale machine learning has altered the landscape. Deep learning has greatly advanced machine understanding of images and language. Since document and query understanding incorporate these elements, deep learning can hold great promise. But it comes with a drawback: general purpose representations (like CNNs for images) have proved somewhat elusive for text. In particular, embeddings act as a distributed representation not just of semantic information but also application-specific learnings, which are hard to transfer. In short, conditions seem right for a renewed attempt on the fundamental document understanding problem.
What is document understanding?
In order to think about the way we can approach this problem, I think we should first answer some questions: Can we understand documents? What is "understanding"? Getting at the true meaning of a document? Okay, but then what is "meaning"? How do we even approach such an ill-defined goal?
With the emergence of deep learning, representation learning has been a hot topic in recent years. There is a large variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data and then use these features for some task. Generating representations for documents is still one of the key challenges in information retrieval. In search, the main task is to determine whether a document is relevant to a query or not. So we want to process the document in order to produce a representation of it that preserves our ability to judge relevance while stripping away nonessential data.
Most of the time, for efficiency reasons, this is done as a preprocessing stage where we generate a query-independent document representation, and it has one straightforward motivation: processing documents online would be terribly expensive. Note that this is not to suggest that query-dependent document representations are not useful; for instance, we can train a neural network that predicts the relevance of a query-document pair based on words near hits, which could be both efficient and effective.
Since deep learning has turned out to be a tool with which we can renew our attempts at representing documents, it is worth asking again: what makes for useful document representations? Of course, representations that lead to better search results are appreciated, but that is not enough. Here, in this write-up, I'm going to briefly discuss some characteristics that I think a useful query-independent document representation for the task of search should possess.
Good representations of documents should ideally satisfy the following properties for maximal utility in information retrieval:
- Semantic: The representation should be the same or similar even if the text is rewritten in a different manner as long as it has the same meaning. In other words, it should be resilient to paraphrases. (more…)