SIGIR2017 Tutorial on "Neural Networks for Information Retrieval"

We will be giving a full-day tutorial on "Neural Networks for Information Retrieval", with Tom Kenter, Alexey Borisov, Christophe Van Gysel, Maarten de Rijke, and Bhaskar Mitra, at The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2017). \o/

The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR research. It covers key architectures, as well as the most promising future directions. It is structured as follows:

  • Basic concepts:
    • The main concepts involved in neural systems will be covered, such as backpropagation, distributed representations/embeddings, convolutional layers, recurrent networks, sequence-to-sequence models, dropout, loss functions, and optimization schemes like Adam.
  • Semantic matching:
    • Different methods for supervised, semi- and unsupervised learning for semantic matching will be discussed.
  • Learning to Rank with Neural Networks:
    • Feature-based models for representation learning, ranking objectives and loss functions, and training a neural ranker under different levels of supervision are going to be discussed.
  • Modeling user behavior with Neural Networks:
    • Probabilistic graphical models, neural click models, and modeling biases using neural networks will be described.
  • Generating Models:
    • Ideas on machine reading, question answering, conversational IR, and dialogue systems will be covered.

Join us at SIGIR2017 🙂 The material from our SIGIR 2017 tutorial on Neural Networks for Information Retrieval (NN4IR) is available online.

Modeling Retrieval Problem using Neural Networks

Despite the buzz surrounding deep neural network (DNN) models for information retrieval, the literature still lacks a systematic investigation of how the retrieval problem can be modeled with neural networks in general.
Modeling the retrieval problem in the context of neural networks means deciding how we frame the problem in terms of the essential components of a neural network: what we use as the objective function, which kind of architecture we employ, how we feed the data to the network, and so on.

In this post, I present different general architectures that can be considered for modeling the retrieval problem. First, I provide a categorization of models based on their objective function, and then I discuss the different approaches with regard to their inference time. Note that in the figures I use a fully connected feed-forward neural network, but it can be replaced by more complex or more expressive neural models such as LSTMs or CNNs.

Categorizing Models by the Type of Objective Function

In terms of the objective function that is defined to be optimized, the retrieval problem can be formulated in the neural network framework in several general ways: Retrieval as Regression, Retrieval as Ranking, and Retrieval as Classification. I am going to explain these models and discuss their pros and cons.

Retrieval as Regression

The first architecture frames retrieval as a scoring problem, which can be phrased as regression. In the regression model (left-most model in the figure above), given the query q and the document d, we aim at generating a score, which could, for example, be interpreted as the probability that document d is relevant to query q. In this model, the network learns to produce calibrated scores, which are then used to rank documents. This model is also referred to as the pointwise model in the learning-to-rank literature.
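As a minimal illustration of this regression framing, the sketch below trains a linear pointwise scorer with a sigmoid output on hand-made query-document feature vectors. The features, data, and hyperparameters are toy assumptions; in practice the linear scorer would be replaced by a deeper network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pointwise_scorer(examples, lr=0.5, epochs=200):
    """examples: list of (query-document feature vector, relevance label in {0, 1}).
    Trains a linear model with log loss by plain gradient descent."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)  # calibrated score
            g = p - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def score(w, b, x):
    """Relevance score for one query-document pair; used to rank documents."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Ranking then amounts to sorting the candidate documents of a query by their scores.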


Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Our paper "Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity", with Hosein Azarbonyad, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at The 39th European Conference on Information Retrieval (ECIR'17). \o/

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology. It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population d as the expected distance between two randomly selected elements of the population:

div(d) = \sum_{i=1}^{T} \sum_{j=1}^{T} p_ip_j \delta(i,j),

where p_i and p_j are the proportions of categories i and j in the population and \delta(i, j) is the distance between i and j.
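Given topic proportions and a distance between topics, div(d) can be computed directly. The sketch below uses the Jensen-Shannon distance between topic-word distributions as \delta(i, j); that particular choice of distance is an assumption for illustration, not prescribed by the formula.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two word distributions (lists of probs)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def topical_diversity(topic_props, topic_word_dists):
    """div(d) = sum_i sum_j p_i p_j delta(i, j), with topics as categories."""
    T = len(topic_props)
    return sum(
        topic_props[i] * topic_props[j] * js_distance(topic_word_dists[i], topic_word_dists[j])
        for i in range(T)
        for j in range(T)
    )
```

A document that concentrates its mass on one topic, or on very similar topics, gets a diversity near zero; spreading mass over distant topics pushes the score up.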

This notion of diversity has been adapted to quantify the topical diversity of a text document. Words are considered elements, topics are categories, and a document is a population. When using topic modeling for measuring the topical diversity of a text document d, we can model elements based on the probability of a word w given d, P(w|d), categories based on the probability of w given topic t, P(w|t), and populations based on the probability of t given d, P(t|d). In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse:

  1. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., P(w|d) is sparse).
  2. Second, each topic is assumed to contain only a small set of topic-specific words (i.e., P(w|t) is sparse).
  3. Finally, each document is assumed to deal with a few topics only (i.e., P(t|d) is sparse).

When approximated using currently available methods,  P(w|t) and P(t|d) are often dense rather than sparse. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low quality P(t|d) distributions.

Different topic re-estimation approaches. TM is a topic modeling approach such as LDA. DR is document re-estimation, TR is topic re-estimation, and TAR is topic assignment re-estimation.


Telling how to narrow it down: Effect of Browsing Path Recommendation on Exploratory Search

Our paper "Telling how to narrow it down: Effect of Browsing Path Recommendation on Exploratory Search", with Glorianna Jagfeld, Hosein Azarbonyad, Alex Olieman, Jaap Kamps, Maarten Marx, has been accepted as a short paper at The ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR'17). \o/

Several information needs require sophisticated human-computer interaction and currently remain unsolved or poorly supported by major search applications. One of these cases is exploratory search, which refers to search tasks that are open-ended, multi-faceted, and iterative, such as learning or topic investigation. This type of search often occurs in a domain unknown or poorly known to the searchers, which can make it hard for them to formulate proper queries for retrieving useful documents.

Exploratory search is composed of two main activities: exploratory browsing and focused searching. Exploratory browsing refers to activities that aim at better defining the information need and raising the understanding of the information space. Focused searching corresponds to activities like query refinement and result comparison after the information need has been shaped more clearly. Based on this composition, an exploratory search system needs to provide its users with a connected space of information to browse and investigate, as well as facilities to adjust the focus of their search towards useful documents.

Using structured data to organize unstructured information is one of the promising approaches for supporting complex search tasks, including exploratory search. Structure in the data provides overviews at different levels of abstraction and empowers the users to explore the data from different points of view. However, it may still be difficult for users to find useful paths of exploration, and clues can help them.

The main aim of this research is to investigate user behavior in exploratory search when a recommendation engine for browsing paths is provided along with the browsing system. To do so, we employed the ExPoSe-Browser (Exploratory Political Search Browser) as the baseline system and built a recommendation engine as a supplementary feature. We conducted a user study involving exploratory search tasks, which revealed general differences in the browsing behavior of the subjects using the two systems.


Poison Pills and Antidotes: Inoculating Relevance Feedback

"Poison Pills and Antidotes: Inoculating Relevance Feedback" is an article published in Amsterdam Science Magazine as one of the cool contributions of our CIKM2016 paper. We also have an extended abstract describing this part, "Inoculating Relevance Feedback Against Poison Pills", with Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra and Maarten Marx, accepted for presentation at DIR2016.

Relevance Feedback (RF) is a common approach for enriching queries to improve retrieval performance, given a set of judged documents whose relevance is either explicitly assessed by the user or implicitly inferred from user behavior.
Although it has been shown that, on average, overall retrieval performance improves after relevance feedback, for some topics employing certain relevant documents may decrease the average precision of the initial run. This is mostly because a feedback document is only partially relevant and contains off-topic terms which, when added to the query as expansion terms, hurt retrieval performance. Relevant documents that hurt retrieval performance after feedback are called "poison pills". In this article, we discuss the effect of poison pills on relevance feedback and present Significant Words Language Models as an approach for estimating a feedback model that tackles this problem.

Significant Words Language Models (SWLM) are a family of models aiming to estimate, for a set of documents, a model in which all, and only, the significant shared terms are captured (see here and here). This makes the models not only distinctive, but also supported by all the documents in the set.

Put loosely, SWLM iteratively removes two types of words from the model: general words, i.e., common words used frequently across all the documents, and page-specific words, i.e., words mentioned in some of the relevant documents, but not the majority of them (see the figure above). This approach prevents noise words from interfering with the relevance feedback, and thus successfully improves retrieval performance by protecting relevance feedback against poison pills.
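A rough sketch of this idea, not the exact estimation procedure from the paper: model each feedback term as generated by a mixture of a significant-words component, a corpus background model, and a document-specific model, then re-estimate the significant-words component with EM-style soft counts. The mixture weights and the toy background model are illustrative assumptions.

```python
from collections import Counter

def significant_words_model(docs, background, iters=30, w_sig=0.4, w_bg=0.4, w_doc=0.2):
    """docs: list of token lists (the feedback set).
    background: corpus-level P(w); absorbs general words.
    Each document's own ML model absorbs its page-specific words,
    leaving the significant-words model with the mutually supported terms."""
    doc_models = []
    for d in docs:
        c, n = Counter(d), len(d)
        doc_models.append({t: c[t] / n for t in c})
    seen = {t for d in docs for t in d}
    theta = {t: 1.0 / len(seen) for t in seen}  # uniform init
    for _ in range(iters):
        soft = Counter()
        for d, dm in zip(docs, doc_models):
            for t in d:
                p_sig = w_sig * theta.get(t, 0.0)
                p_bg = w_bg * background.get(t, 1e-9)
                p_doc = w_doc * dm.get(t, 0.0)
                soft[t] += p_sig / (p_sig + p_bg + p_doc)  # E-step responsibility
        total = sum(soft.values())
        theta = {t: soft[t] / total for t in soft}  # M-step re-estimation
    return theta
```

Terms that are frequent in the background end up explained by the background component, terms unique to single documents by the document-specific components, and the remaining shared terms dominate the estimated model.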



The Healing Power of Poison: Helpful Non-relevant Documents in Feedback

Our paper "The Healing Power of Poison: Helpful Non-relevant Documents in Feedback", with Samira Abnar and Jaap Kamps, has been accepted as a short paper at The 25th ACM International Conference on Information and Knowledge Management (CIKM'16). \o/

Often, the only difference between a medicine and a poison is the dose. Some substances are extremely toxic, and therefore, are primarily known as a poison. Yet, even poisons can have medicinal value.

Paracelsus, Father of Toxicology

Query expansion based on feedback information is one of the classic approaches for improving the performance of information retrieval systems, especially when the user's information need is too complex to express precisely in a few keywords.

True Relevance Feedback (TRF) systems try to enrich the user query using a set of judged documents whose relevance is assessed either explicitly by the user or implicitly inferred from user behavior. However, this information is not always available. Alternatively, Pseudo Relevance Feedback (PRF) methods, also called blind relevance feedback, assume that the top-ranked documents in the initially retrieved results are all relevant and use them to build the feedback model.
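A minimal PRF sketch in the spirit of relevance models (RM1): treat the top-k retrieved documents as relevant and weight each document's term distribution by its retrieval score. The documents and scores below are toy assumptions, not data from the paper.

```python
from collections import Counter

def prf_expansion_model(ranked_docs, scores, k=3):
    """ranked_docs: token lists in retrieval order; scores: their retrieval scores.
    Estimates an expansion model P(w|R) proportional to sum_d P(w|d) * score(d)
    over the top-k documents, then normalizes it to a distribution."""
    model = Counter()
    for doc, s in list(zip(ranked_docs, scores))[:k]:
        counts, length = Counter(doc), len(doc)
        for w, f in counts.items():
            model[w] += s * f / length  # score-weighted document language model
    total = sum(model.values())
    return {w: v / total for w, v in model.items()}
```

The highest-probability terms of this model are then added to the original query as expansion terms.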

Normally, feedback documents annotated as relevant are considered beneficial for feedback, while feedback documents annotated as non-relevant are expected to be poisonous, i.e., they supposedly decrease the performance of feedback systems if used as positive feedback. Based on this assumption, some TRF methods use non-relevant documents as negative feedback, and some PRF methods try to avoid using these documents. For example, some PRF methods attempt to detect non-relevant documents in order to be robust against their noise, or use only part of their content in the feedback procedure, such as selected passages. Although PRF methods use non-relevant documents, they do not directly intend to take advantage of them as helpful documents. In other words, most of the time, removing non-relevant documents from the feedback set of PRF methods leads to better performance.


Luhn Revisited: Significant Words Language Models

Our paper "Luhn Revisited: Significant Words Language Models", with Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra and Maarten Marx, has been accepted as a long paper at The 25th ACM International Conference on Information and Knowledge Management (CIKM'16). \o/

One of the key factors affecting search quality is the fact that our queries are ultra-short statements of our complex information needs. Query expansion has proven to be an effective technique for bringing agreement between the user's information need and relevant documents. Taking feedback information into account is a common approach for enriching the representation of queries and consequently improving retrieval performance.

In True Relevance Feedback (TRF), given a set of judged documents either explicitly assessed by the user or implicitly inferred from user behavior, the system tries to enrich the user query to improve retrieval performance. However, feedback information is not available in most practical settings. An alternative approach is Pseudo Relevance Feedback (PRF), also called blind relevance feedback, which uses the top-ranked documents in the initially retrieved result list for feedback.

The main goal of feedback systems is to extract a feedback model, which represents the relevant documents. However, although documents in the feedback set contain relevant information, there is always also non-relevant information. For instance, in PRF, some documents in the feedback set might be non-relevant, or in TRF, some documents, despite the fact that they are relevant, may act like poison pills by hurting the performance of feedback systems, since they also contain off-topic information. Such non-relevant information can distract the feedback model by adding bad expansion terms, leading to topic drift.

It has been shown that, based on this observation, existing feedback systems are able to improve retrieval performance only if the feedback documents are not only relevant but also have a dedicated interest in the topic. Given that we should anticipate documents with a broader topic or multiple topics in the feedback set, taking advantage of feedback documents requires a robust and effective method to prevent topic drift caused by accidental, non-relevant terms brought in by particular documents in the feedback set.

We introduce a variant of significant words language models (SWLM) to extract a language model of feedback documents that captures the essential terms representing a mutual notion of relevance, i.e., a representation of characteristic terms which is supported by all the feedback documents. The general idea of SWLM is inspired by the early work of Luhn, in which he argues that significant words can be extracted by avoiding both common and rare observations. More precisely, Luhn assumed that frequency data can be used to measure the significance of words for representing a document. Considering Zipf's law, he devised a simple counting technique for finding significant words: he specified two cut-offs, an upper and a lower, to exclude non-significant words.
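Luhn's cut-off idea can be sketched in a few lines. The lower count and the upper frequency fraction below are illustrative placeholders, not Luhn's original thresholds:

```python
from collections import Counter

def luhn_significant_words(tokens, lower=2, upper_frac=0.2):
    """Keep words whose frequency lies between a lower cut-off (drops rare
    observations) and an upper cut-off (drops common observations).
    The upper cut-off is taken here as a fraction of the total token count."""
    counts = Counter(tokens)
    upper = max(lower, int(upper_frac * len(tokens)))
    return {w for w, f in counts.items() if lower <= f <= upper}
```

Words above the upper cut-off behave like the general words of SWLM, and words below the lower cut-off like accidental, document-specific ones; the band in between holds the candidates for significant terms.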