# Beating the Teacher: Neural Ranking Models with Weak Supervision

Our paper "Neural Ranking Models with Weak Supervision", with Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft, has been accepted as a long paper at the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). \o/

This paper is the outcome of my pet project during my internship at Google Research.

Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not yet been observed in ranking for information retrieval. The reason may be the complexity of the ranking problem: it is not obvious how to learn from queries and documents when no supervised signal is available. In our paper, we propose to train neural ranking models using weak supervision, where labels are obtained automatically, without human annotators or any external resources such as click data.

To this end, we use the output of a well-known unsupervised ranking model, such as BM25, as the weak supervision signal. We then train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and with different input representations, from encoding query-document pairs into dense or sparse vectors to using word embeddings. Our findings also suggest that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data, which can be easily obtained from unsupervised IR models.
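
To make the idea of a weak label concrete, here is a minimal sketch of a BM25 scorer whose output is used directly as the training target. The parameter values (`k1`, `b`), the IDF variant, the toy documents, and the function name `bm25_scores` are all assumptions made for illustration; they are not the exact setup from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1, b, and the smoothed IDF used here are common defaults (assumptions),
    not necessarily the settings used in the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # exact term matching: unseen terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy collection (hypothetical); each score becomes a weak label for (query, doc).
docs = [
    "neural ranking models for retrieval".split(),
    "weak supervision for neural networks".split(),
    "cooking pasta at home".split(),
]
query = "neural ranking".split()
weak_labels = bm25_scores(query, docs)
```

No human judgment is involved at any point: the labels come for free from the query, the documents, and collection statistics, which is why they can be produced for millions of queries.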

We have three main research questions:

• RQ1. Is it possible to learn a neural ranker that exhibits superior generalization capabilities using only labels provided by a completely unsupervised IR model, such as BM25, as the weak supervision signal?
• RQ2. What input representation and learning objective are most suitable for learning in such a setting?
• RQ3. Can a supervised learning model benefit from a weak supervision step, especially when labeled data is limited?

### Ranking Architectures

We tried three neural ranking models:

1. Score Model: This architecture is a point-wise ranking model that learns to predict retrieval scores for query-document pairs. More formally, the goal is to learn a scoring function that determines the retrieval score of a document for a query, so we can simply map the problem to a linear regression problem.
2. Rank Model: In this model, as in the previous one, the goal is to learn a scoring function for a given query-document pair. However, unlike the previous model, we do not aim to learn a calibrated scoring function. Instead, we use a pairwise scenario during training, in which two point-wise models share parameters and are updated to minimize a pairwise loss. During inference, since the two models are identical, we take one of them as the final scoring function and use the trained model in a point-wise fashion.
3. RankProb Model: The third architecture is based on a pair-wise scenario during both training and inference. This model learns a ranking function that, given a query and two documents, $d_1$ and $d_2$, predicts the probability that $d_1$ is ranked higher than $d_2$ with respect to the query. We can map this problem to a logistic regression problem.
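
The three objectives can be sketched as plain loss functions. This is an illustration under stated assumptions rather than the paper's exact formulation: squared error for the Score model, a hinge loss with a hypothetical `margin` for the Rank model, and cross-entropy against a soft target built from the two weak scores for the RankProb model. All function and parameter names are made up for this example.

```python
import math

def score_loss(pred, weak_score):
    """Score model: point-wise squared error against the weak (e.g. BM25) score."""
    return (pred - weak_score) ** 2

def rank_loss(pred_1, pred_2, weak_1, weak_2, margin=1.0):
    """Rank model: pairwise hinge loss.

    Only the order of the weak scores is used, not their magnitudes,
    so the model is not asked to reproduce calibrated scores.
    """
    sign = 1.0 if weak_1 > weak_2 else -1.0
    return max(0.0, margin - sign * (pred_1 - pred_2))

def rankprob_loss(prob_1_over_2, weak_1, weak_2):
    """RankProb model: cross-entropy between the predicted probability that
    document 1 outranks document 2 and a soft target from the weak scores.

    The target weak_1 / (weak_1 + weak_2) is an assumption for this sketch.
    """
    total = weak_1 + weak_2
    target = weak_1 / total if total > 0 else 0.5
    p = min(max(prob_1_over_2, 1e-12), 1 - 1e-12)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Predictions that agree with the weak ordering incur little or no loss.
agree = rank_loss(3.0, 1.0, weak_1=10.0, weak_2=4.0)
disagree = rank_loss(1.0, 3.0, weak_1=10.0, weak_2=4.0)
```

Note how `rank_loss` is already zero once the two predictions are ordered correctly with enough margin, whatever the absolute values of the weak scores are. That is what lets the pairwise models avoid fitting the teacher's exact scores.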

### Input Representations

We also explore three definitions of the input layer representation, which maps an input query-document pair into a fixed-size vector that is then fed into the fully connected layers:

1. Dense Vector Representation: A conventional dense feature vector representation that contains various statistics describing the input query-document pair. In particular, we build a dense feature vector composed of the features used by BM25, so that the network can fit the function described by the BM25 formula when it receives exactly the same inputs.
2. Sparse Vector Representation: We move away from a fully featurized representation that contains only aggregate statistics and let the network perform feature extraction for us. In particular, we build a bag-of-words representation by extracting term frequency vectors of the query, the document, and the collection, and feed the network with the concatenation of these three vectors.
3. Embedding: The major weakness of the previous input representation is that words are treated as discrete units, which prevents the network from performing soft matching between semantically similar words in queries and documents. In this input representation, we rely on word embeddings to obtain a more powerful representation of queries and documents that can bridge the lexical chasm. We therefore use a bag-of-embeddings representation, averaged with learned weights.

These input representations define how much capacity is given to the network to extract discriminative signal from the training data and thus result in different generalization behavior of the networks.
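
As a rough illustration of the second and third representations, the sketch below builds a concatenated term-frequency vector and a weighted bag-of-embeddings vector. The vocabulary, the collection counts, the two-dimensional embeddings, and the softmax normalization of the per-term weights are all assumptions chosen to keep the example small; in a real model the embeddings and weights would be trainable parameters.

```python
import math
from collections import Counter

def sparse_input(query, doc, collection_tf, vocab):
    """Concatenate term-frequency vectors of the query, document, and collection."""
    q_tf, d_tf = Counter(query), Counter(doc)
    return ([q_tf[t] for t in vocab]
            + [d_tf[t] for t in vocab]
            + [collection_tf.get(t, 0) for t in vocab])

def bag_of_embeddings(tokens, embeddings, weights):
    """Average token embeddings with per-term weights.

    Weights are normalized with a softmax over the tokens (an assumption
    for this sketch), so they sum to one.
    """
    exp_w = [math.exp(weights[t]) for t in tokens]
    total = sum(exp_w)
    dim = len(next(iter(embeddings.values())))
    return [sum(w / total * embeddings[t][i] for w, t in zip(exp_w, tokens))
            for i in range(dim)]

# Hypothetical toy data.
vocab = ["neural", "ranking", "pasta"]
collection_tf = {"neural": 40, "ranking": 25, "pasta": 7}
x_sparse = sparse_input(["neural", "ranking"],
                        ["neural", "neural", "pasta"],
                        collection_tf, vocab)

embeddings = {"neural": [1.0, 0.0], "ranking": [0.0, 1.0]}
weights = {"neural": 0.0, "ranking": 0.0}  # equal weights -> plain average
x_embed = bag_of_embeddings(["neural", "ranking"], embeddings, weights)
```

The sparse vector grows with the vocabulary and still treats every word as its own dimension, while the embedding vector stays fixed-size and lets similar words end up close to each other.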

Any combination of the ranking architectures and input representations presented in this section can be used to build a ranking model. We train our networks using more than six million queries and documents from two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that employing proper objective functions and letting the networks learn the input representation from weakly supervised data (RankProb Model + Embedding) leads to impressive performance, with over 13% and 35% MAP improvements over the BM25 model on the Robust and the ClueWeb collections, respectively.

> This is truly awesome, since we have only used BM25 as the supervisor to train a model that performs better than BM25 itself!
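
For readers less familiar with the metric, here is a small sketch of how MAP (mean average precision) and an improvement over a baseline can be computed, assuming the percentages are relative improvements. The rankings, relevance sets, and MAP values below are invented for illustration and are not results from the paper.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Hypothetical rankings for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["b", "a"], {"a"}),                 # AP = 1/2
]
map_score = mean_average_precision(runs)
gain = relative_improvement(0.25, 0.20)  # made-up MAP values
```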

### Why does that work?

The point is that although "exact term matching" is not enough to capture the notion of relevance, it is an important feature in retrieval and ranking, and BM25 is a relatively effective term-matching method. We provide our neural networks with examples that are (weakly) labeled based on exact term matching, but we do not provide the reason for the assigned labels (which is term matching). This way, we let the network go beyond this signal and see the "relevance" in these examples from perspectives other than just term matching.

> "The best teachers are those who show you where to look, but don't tell you what to see." (Alexandra K. Trenfor)

This enables the model to find a reason for the provided labels that is not necessarily the reason behind the BM25 judgments, and to infer relevance when term matching is not the proper indicator of relevance. For instance, learning an embedding representation helps the network capture semantic matching and detect semantic relevance, whereas BM25 fails when semantic matching is the only reason for relevance.

### Take-home Messages

Here, I briefly summarize the general take-home messages from our experiments that may help you train your own model with weak supervision:

> Main idea: leverage large amounts of unlabeled data to infer "weak" labels, and use that signal to train supervised models as if we had the ground-truth labels.

• Define an objective that keeps your model from getting stuck on the imperfections of the weakly annotated data (in our case, learn the ranking instead of a calibrated score).
• Let the network decide on the representation and extract the features. Feeding the network with featurized input kills the model's creativity!
• With feature-engineered input data, you are more likely to overfit and lose generalization.
• If you have enough training data, your network can learn global statistics of the data just by seeing individual local instances.
• If you have enough data, you can learn embeddings that are better fitted to your task by updating them based only on the objective of the downstream task. But you need a lot of data: THANKS TO WEAK SUPERVISION!
• Having non-linearity in neural networks does not help much when representation learning is not part of the model.
• The main strength of deep neural networks, their ability to learn effective representations, kicks in only when your network is deep enough.

For more details about the results and analysis, please take a look at our paper:

• M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. "Neural Ranking Models with Weak Supervision". In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'17), 2017 [arXiv].