To make it more convenient for those who are going to attend CIKM this year to follow up on our research, we decided to prepare a list of our contributions. I've designed a bookmark for that:
Document understanding, for the purpose of assessing the relevance of a document or passage to a query based only on the document content, is a familiar goal for the information retrieval community. However, this problem has remained largely intractable despite repeated attacks over many years. People, meanwhile, are able to assess relevance quite well, though unfamiliar topics and complex documents can defeat them. This assessment may require the ability to understand language, images, document structure, videos, audio, and functional elements. In turn, understanding of these elements is built on background information about the world, such as human behavior patterns, and even more fundamental truths such as the existence of time, space, and people. All this comes naturally to people, but not to computers!
Recently, large-scale machine learning has altered the landscape. Deep learning has greatly advanced machine understanding of images and language. Since document and query understanding incorporate these elements, deep learning holds great promise here. But it comes with a drawback: general-purpose representations (like CNNs for images) have proved somewhat elusive for text. In particular, embeddings act as a distributed representation not just of semantic information but also of application-specific learnings, which are hard to transfer. In short, conditions seem right for a renewed attempt on the fundamental document understanding problem.
What is document understanding?
In order to think about the way we can approach this problem, I think we should first answer some questions: Can we understand documents? What is "understanding"? Getting at the true meaning of a document? Okay, but then what is "meaning"? How do we even approach such an ill-defined goal?
With the emergence of deep learning, representation learning has been a hot topic in recent years. There is a large variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data and then use these features for some task. Generating representations for documents is still one of the key challenges in information retrieval. In search, the main task is to determine whether a document is relevant to a query or not. So we want to process the document in order to produce a representation of it that preserves our ability to judge relevance while stripping away nonessential data.
Most of the time, for efficiency reasons, this is done as a preprocessing stage in which we generate a query-independent document representation. The motivation is straightforward: processing documents online would be terribly expensive. This is not to suggest that query-dependent document representations are not useful. For instance, we can train a neural network that predicts the relevance of a document to a query based on the words near hits, which could be both efficient and effective.
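To make the offline/online split concrete, here is a minimal sketch. It assumes a deliberately simple representation (L2-normalised term counts) rather than a learned one, and the function names `represent`, `build_index`, and `search` are hypothetical, chosen only for illustration. The point is that documents are processed once, ahead of time, and only the short query is processed when a search arrives.

```python
import math
from collections import Counter


def represent(text):
    """Hypothetical query-independent representation: L2-normalised term counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def build_index(docs):
    """Offline stage: represent every document once, before any query arrives."""
    return {doc_id: represent(text) for doc_id, text in docs.items()}


def search(query, index):
    """Online stage: process only the query, then compare it with stored vectors."""
    q = represent(query)
    scores = {
        doc_id: sum(w * vec.get(term, 0.0) for term, w in q.items())
        for doc_id, vec in index.items()
    }
    # Highest cosine similarity first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection, invented for this example.
docs = {
    "d1": "neural networks for information retrieval",
    "d2": "thesis cover design for a phd defense",
}
index = build_index(docs)            # done once, offline
ranking = search("neural retrieval", index)  # done per query, online
```

A learned representation would replace `represent`, but the division of labour between `build_index` and `search` would stay the same.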
Since deep learning has turned out to be a tool with which we can renew our attempts at representing documents, it is worth asking again: "What makes for a useful document representation?" Of course, representations that lead to better search results are appreciated, but that is not enough. In this write-up, I'm going to talk briefly about some characteristics that I think a useful query-independent document representation for search should possess.
Good representations of documents should ideally satisfy the following properties for maximal utility in information retrieval:
- Semantic: The representation should be the same or similar even if the text is rewritten in a different manner as long as it has the same meaning. In other words, it should be resilient to paraphrases. (more…)
A couple of weeks ago, I attended the MILA Deep Learning Summer School (DLSS) from June 26th to July 1st and the Reinforcement Learning Summer School (RLSS) from July 3rd to 5th, 2017, organized by Yoshua Bengio and Aaron Courville. You can find information about the lectures here. In the following, I will share some of "my" highlights from the summer schools.
Different types of learning problems
The first day of the summer school was on general topics in machine learning and neural networks. Doina Precup gave a talk that was a gentle refresher on the general concepts of machine learning, and then Hugo Larochelle covered the basics of neural networks. In the second part of his talk, Hugo started by dividing learning problems into different types, based on the data and settings of the problem during training and inference. Based on his grouping, each learning problem can be classified into one of these categories:
Our paper "Learning to Attend, Copy, and Generate for Session-Based Query Suggestion", with Sascha Rothe, Enrique Alfonseca, and Pascal Fleury, has been accepted as a long paper at the International Conference on Information and Knowledge Management (CIKM'17). This paper is the outcome of my internship at Google Research. \o/
Users interact with search engines during search sessions and try to direct their search by submitting a sequence of queries. Based on these interactions, search engines provide a prominent feature, in which they assist their users to formulate their queries to better represent their intent during Web search by providing suggestions for the next query.
Query suggestion might address the need to disambiguate user queries, making the direction of the search clearer for both the user and the search engine.
It might help users by providing a precise and succinct query when they are not familiar with the specific terminology or when they lack understanding of the internal vocabulary and structures in order to be able to formulate an effective query. It has been shown that in general, query suggestion accelerates search satisfaction by either diving deeper into the current search direction or by moving to a different aspect of a search task.
We're going to organize the "Neural Networks for Information Retrieval" tutorial at SIGIR2017, Tokyo, Japan. You can read about our tutorial at http://nn4ir.com/.
I've tried to design (kind of) a logo for our tutorial:
Our paper "Share your Model instead of your Data: Privacy Preserving Mimic Learning for Ranking", with Hosein Azarbonyad, Jaap Kamps, and Maarten de Rijke, has been accepted at Neu-IR: SIGIR Workshop on Neural Information Retrieval (NeuIR'17). \o/
In this paper, we aim to lay the groundwork for the idea of sharing a privacy-preserving model instead of sensitive data in IR applications. This would let researchers from industry share the knowledge learned from actual users’ data with the academic community, leading to better collaboration among all researchers in the field.
Deep neural networks demonstrate undeniable success in several fields and employing them is taking off for information retrieval problems. It has been shown that supervised neural network models perform better as the training dataset grows bigger and becomes more diverse. Information retrieval is an experimental and empirical discipline, thus, having access to large-scale real datasets is essential for designing effective IR systems. However, in many information retrieval tasks, due to the sensitivity of the data from users and privacy issues, not all researchers have access to large-scale datasets for training their models.
Much research has been done on the general problem of preserving the privacy of sensitive data in IR applications, where the question is how should we design effective IR systems without damaging users’ privacy?
One of the solutions so far is to anonymize the data and try to hide the identity of users. However, there is no guarantee that the anonymized data will be as effective as the original data.
Using machine learning-based approaches, sharing the trained model instead of the original data has turned out to be an option for transferring knowledge. The idea of mimic learning is to use a model that is trained based on the signals from the original training data to annotate a large set of unlabeled data and use these labels as training signals for training a new model. It has been shown, for many tasks in computer vision and natural language processing, that we can transfer knowledge this way and the newly trained models perform as well as the model trained on the original training data.
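The mimic learning loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method from the paper: it uses a one-dimensional nearest-centroid classifier as a stand-in for a ranking model, the data is synthetic, and the helper names `train_centroids` and `predict` are hypothetical. A teacher is trained on the private labelled data, the teacher annotates public unlabelled inputs, and a student is trained only on those annotations.

```python
import random


def train_centroids(examples):
    """Fit a toy nearest-centroid classifier on (feature, label) pairs."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


rng = random.Random(0)

# Sensitive labelled data (synthetic here): this never leaves the data owner.
private_data = [(rng.gauss(0.0, 1.0), 0) for _ in range(200)]
private_data += [(rng.gauss(4.0, 1.0), 1) for _ in range(200)]
teacher = train_centroids(private_data)

# Public unlabelled inputs, annotated by the teacher instead of by users.
public_inputs = [rng.uniform(-3.0, 7.0) for _ in range(1000)]
mimic_labelled = [(x, predict(teacher, x)) for x in public_inputs]

# The student sees only teacher-labelled public data, never private_data.
student = train_centroids(mimic_labelled)

# Fraction of probe points on which student and teacher agree.
probe = [i / 10 for i in range(-30, 71)]
agreement = sum(predict(teacher, x) == predict(student, x) for x in probe) / len(probe)
```

Note that this sketch shows only the knowledge-transfer step. Sharing a trained model does not by itself guarantee privacy, so any privacy-preserving mechanism would have to be added on top.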
Our paper "On Search Powered Navigation", with Glorianna Jagfeld, Hosein Azarbonyad, Alex Olieman, Jaap Kamps, Maarten Marx, has been accepted as a short paper at . \o/
I've designed the thesis cover for one of the ILPSers, Aleksandr Chuklin, who has just defended his PhD. His thesis is about "Understanding and Modeling Users of Modern Search Engines".