With the emergence of deep learning, representation learning has become a hot topic in recent years. There is a large variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data and then use these features for some task. Generating representations for documents is still one of the key challenges in information retrieval. In search, the main task is to determine whether a document is relevant to a query or not. So we want to process the document in order to produce a representation of it that preserves our ability to judge relevance while stripping away nonessential data.
Most of the time, for efficiency reasons, this is done as a preprocessing stage in which we generate query-independent document representations. The motivation is straightforward: processing documents online would be terribly expensive. This is not to suggest that query-dependent document representations are not useful. For instance, we can train a neural network that predicts the relevance of a document to a query based on the words near hits, which could be both efficient and effective.
Since deep learning has turned out to be a tool with which we can renew our attempts at representing documents, it is worth asking the question again: "What makes for a useful document representation?" Of course, representations that lead to better search results are appreciated, but that is not enough. In this write-up, I'm going to briefly discuss some characteristics that I think a useful query-independent document representation for the task of search should possess.
Good representations of documents should ideally satisfy the following properties for maximal utility in information retrieval:
Semantic: The representation should be the same or similar even if the text is rewritten in a different manner as long as it has the same meaning. In other words, it should be resilient to paraphrases.
Users interact with search engines during search sessions and try to direct their search by submitting a sequence of queries. Based on these interactions, search engines provide a prominent feature, in which they assist their users to formulate their queries to better represent their intent during Web search by providing suggestions for the next query.
Query suggestion can address the need to disambiguate user queries, making the direction of the search clearer for both the user and the search engine.
It might help users by providing a precise and succinct query when they are not familiar with the specific terminology or when they lack understanding of the internal vocabulary and structures in order to be able to formulate an effective query. It has been shown that in general, query suggestion accelerates search satisfaction by either diving deeper into the current search direction or by moving to a different aspect of a search task.
In this paper, we aim to lay the groundwork for the idea of sharing a privacy-preserving model instead of sensitive data in IR applications. This suggests that researchers from industry could share the knowledge learned from actual users’ data with the academic community, leading to better collaboration among all researchers in the field.
Deep neural networks demonstrate undeniable success in several fields and employing them is taking off for information retrieval problems. It has been shown that supervised neural network models perform better as the training dataset grows bigger and becomes more diverse. Information retrieval is an experimental and empirical discipline, thus, having access to large-scale real datasets is essential for designing effective IR systems. However, in many information retrieval tasks, due to the sensitivity of the data from users and privacy issues, not all researchers have access to large-scale datasets for training their models.
Much research has been done on the general problem of preserving the privacy of sensitive data in IR applications, where the question is how should we design effective IR systems without damaging users’ privacy?
One of the solutions so far is to anonymize the data and try to hide the identity of users. However, there is no guarantee that the anonymized data will be as effective as the original data.
Using machine learning-based approaches, sharing the trained model instead of the original data has turned out to be an option for transferring knowledge. The idea of mimic learning is to use a model that is trained based on the signals from the original training data to annotate a large set of unlabeled data and use these labels as training signals for training a new model. It has been shown, for many tasks in computer vision and natural language processing, that we can transfer knowledge this way and the newly trained models perform as well as the model trained on the original training data.
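The mimic learning idea described above can be sketched in a few lines. This is only a toy illustration under loud assumptions: the "models" are one-feature linear scorers fit by gradient descent, the "private" and "public" data are synthetic, and the names fit_linear, predict, and mimic_learning_demo are hypothetical helpers rather than anything from the paper. A real setup would use a neural ranker and real unlabeled queries and documents.

```python
import random


def fit_linear(xs, ys, epochs=2000, lr=0.5):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(model, x):
    """Score a single feature value with a fitted (w, b) model."""
    w, b = model
    return w * x + b


def mimic_learning_demo(seed=0):
    rng = random.Random(seed)

    # Step 1: the teacher is trained on sensitive data that cannot be shared.
    # (Synthetic stand-in: one matching feature in [0, 1] and a relevance score.)
    private_x = [rng.uniform(0.0, 1.0) for _ in range(50)]
    private_y = [2.0 * x + 0.5 for x in private_x]
    teacher = fit_linear(private_x, private_y)

    # Step 2: the teacher annotates a larger set of unlabeled, shareable data.
    public_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
    teacher_labels = [predict(teacher, x) for x in public_x]

    # Step 3: the student is trained only on the teacher's labels and never
    # sees the private data.
    student = fit_linear(public_x, teacher_labels)
    return teacher, student


if __name__ == "__main__":
    teacher, student = mimic_learning_demo()
    print("teacher:", teacher)
    print("student:", student)
```

In this toy case the student can recover the teacher almost exactly because both share the same simple model family. Note that the sketch shows only the knowledge transfer step; it does not by itself provide any privacy guarantee.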
Knowledge graphs and other hierarchical domain ontologies hold great promise for complex information seeking tasks, yet their massive size defies the standard and effective way smaller hierarchies are used as a static navigation structure in faceted search or standard website navigation. As a result, we see only limited use of knowledge bases in entity surfacing for navigational queries, and fail to realize their full potential to empower search. Seeking information in structured environments consists of two main activities: exploratory browsing and focused searching.
Exploratory browsing refers to activities aimed at better defining the information need and increasing the level of understanding of the information space, while focused searching includes activities such as query rewriting and comparison of results, which are performed after the information need has been made more concrete. Based on the interplay of these two actions, a search system is supposed to provide a connected space of information for the users to navigate, as well as search to adjust the focus of their browsing towards useful content.
In our paper, we introduce the concept of Search Powered Navigation (SPN), which enables users to combine navigation with query-based searching in a structured information space, and offers a way to find a balance between exploration and exploitation. We hypothesize that SPN enables users to exploit the semantic structure of a large knowledge base in an effective way. We test this hypothesis by conducting a user study in which users are engaged in exploratory search activities, and we investigate the effect of SPN on the variability in users’ behavior and experience. We employed an exploratory search system on parliamentary data in two modes, pure navigation and search powered navigation, and tested two types of tasks, broad- and focused-topic tasks.
Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not yet been observed in ranking for information retrieval. The reason may be the complexity of the ranking problem, as it is not obvious how to learn from queries and documents when no supervised signal is available. In our paper, we propose to train neural ranking models using weak supervision, where labels are obtained automatically without human annotators or any external resources, e.g., click data.
To this aim, we use the output of a known unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We studied their effectiveness under various learning scenarios: point-wise and pair-wise models, and using different input representations: from encoding query-document pairs into dense/sparse vectors to using word embedding representation. Our findings also suggest that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
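As a rough illustration of how BM25 output could serve as a weak supervision signal, the sketch below scores a toy corpus with a common BM25 variant and turns score comparisons into pair-wise training instances. Everything specific here is an assumption for illustration: the whitespace tokenizer, the parameter values k1=1.2 and b=0.75, the smoothed IDF form, and the helper names bm25_scores and pairwise_weak_labels. It is not the exact setup used in the paper.

```python
import math
from collections import Counter
from itertools import combinations


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25.

    Uses naive lowercase whitespace tokenization and a smoothed IDF
    that is always non-negative (both are illustrative choices).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        length_norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def pairwise_weak_labels(query, docs):
    """Build (query, doc_a, doc_b, label) tuples from BM25 scores.

    label is 1 if BM25 prefers doc_a over doc_b and 0 otherwise.
    Ties carry no preference signal and are skipped.
    """
    scores = bm25_scores(query, docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        if scores[i] == scores[j]:
            continue
        pairs.append((query, docs[i], docs[j], 1 if scores[i] > scores[j] else 0))
    return pairs


if __name__ == "__main__":
    corpus = [
        "neural ranking models for search",
        "cooking pasta at home",
        "ranking ranking documents",
    ]
    for pair in pairwise_weak_labels("neural ranking", corpus):
        print(pair)
```

A point-wise variant would instead keep each (query, document, score) triple and train the ranker to predict the BM25 score directly.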
We have three main research questions:
RQ1. Is it possible to learn a neural ranker only from labels provided by a completely unsupervised IR model such as BM25, as the weak supervision signal, that will exhibit superior generalization capabilities?
RQ2. What input representation and learning objective is most suitable for learning in such a setting?
RQ3. Can a supervised learning model benefit from a weak supervision step, especially in cases when labeled data is limited?
The aim of this full-day tutorial is to give a clear overview of current tried-and-trusted neural methods in IR and how they benefit IR research. It covers key architectures, as well as the most promising future directions. It is structured as follows:
The main concepts involved in neural systems will be covered, such as backpropagation, distributed representations/embeddings, convolutional layers, recurrent networks, sequence-to-sequence models, dropout, loss functions, and optimization schemes like Adam.
Different methods for supervised, semi- and unsupervised learning for semantic matching will be discussed.
Learning to Rank with Neural Networks:
Feature-based models for representation learning, ranking objectives and loss functions, and training a neural ranker under different levels of supervision are going to be discussed.
Modeling user behavior with Neural Networks:
Probabilistic graphical models, neural click models, and modeling biases using neural networks will be described.
Ideas on the machine reading task, question answering, conversational IR, and dialogue systems will be covered.
Hmm... got excited? Join us at SIGIR 2017 🙂
The material from our SIGIR 2017 tutorial on Neural Networks for Information Retrieval (NN4IR) is available online at http://nn4ir.com.