Learning Useful Document Representations for Search

With the emergence of deep learning, representation learning has become a hot topic in recent years. There is a large variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data and then use those features for some task. Generating representations for documents is still one of the key challenges in information retrieval. In search, the main task is to determine whether a document is relevant to a query. So we want to process the document to produce a representation that preserves our ability to judge relevance while stripping away nonessential data.

Most of the time, for efficiency reasons, this is done as a preprocessing stage in which we generate a query-independent document representation. The motivation is straightforward: processing documents online would be terribly expensive. Note that this is not to suggest that query-dependent document representations are not useful; for instance, we can train a neural network that predicts the relevance of a query-document pair based on the words near hits, which could be both efficient and effective.

Since deep learning has given us new tools with which to renew our attempts at representing documents, it is worth asking again: what makes for a useful document representation? Representations that lead to better search results are of course welcome, but that alone is not enough. In this write-up, I will briefly discuss some characteristics that I think a useful query-independent document representation for the task of search should possess.

Good representations of documents should ideally satisfy the following properties for maximal utility in information retrieval:

  1. Semantic: The representation should be the same, or at least similar, even if the text is rewritten differently, as long as it has the same meaning. In other words, it should be resilient to paraphrasing.
  2. Similarity/Relevance measure: A similarity measure between two document representations should be computable. This would, in a sense, be the inverse of a distance measure [1]. Alternatively, we can consider the similarity between a query representation and a document representation, which measures whether the intent of the query is covered by the document.
  3. Composability: Composability allows us to combine representations in various ways; for instance, we can compose sentence-level representations to get a representation for a paragraph, and further compose those to get a representation for a document. This way, we treat documents as volumes rather than points. As another application of composability, we can generate a query representation from its (pseudo-)relevant documents by performing pseudo-relevance feedback, either online or offline. It is particularly nice to come up with representations that compose via linear combination, as it is simple to understand and play with.
  4. Similarity is congruent with relevance to queries: In other words, if we have relevance judgments for the results of a query, relevant results should be more similar to one another than to irrelevant results, and irrelevant results should be dissimilar to relevant ones.
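
Properties (2) and (3) are easy to make concrete once representations are fixed-dimensional vectors. The sketch below, with made-up toy vectors and plain Python, computes cosine similarity between two representations and composes sentence-level vectors into a paragraph-level one via a linear combination:

```python
import math

def cosine_similarity(u, v):
    """Similarity between two representations (property 2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compose(vectors, weights=None):
    """Linear combination of representations (property 3),
    e.g. sentence vectors -> paragraph vector. Defaults to a
    plain average."""
    if weights is None:
        weights = [1.0 / len(vectors)] * len(vectors)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(dim)]

# Toy sentence vectors, made up purely for illustration.
s1 = [0.9, 0.1, 0.0]
s2 = [0.7, 0.3, 0.0]
paragraph = compose([s1, s2])   # average: [0.8, 0.2, 0.0]
print(cosine_similarity(s1, paragraph))
```

The same `compose` call can be applied again, paragraph vectors into a document vector, which is exactly the appeal of linear composition.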

There are also some additional optional properties that are nice to have:

  1. Common representation for queries and documents: If queries are representable in the same space as documents, then a similarity measure between the query representation and the document representation could be indicative of relevance, although this is hard to satisfy in practice if a point representation is used. The reason this property is not strictly necessary is that, in practice, we could derive a query representation from properties (3) and (4): for example, we could represent each query by a suitable linear combination of the representations of the search results from a first pass, or we could compute offline representations of past search queries using a similar trick and predict the representation of a new user query by learning from the representations of similar past queries. Conversely, we could start with query representations and derive document representations from them. However, a document is much richer in directly available content signals than a query, so it seems reasonable to expect more useful representations for both documents and queries by starting from document representations.
  2. Transparency: The representation should be transparent and debuggable. Effective embedding representations learned with neural models may not be as understandable, but having insight into, and a handle to probe, the representations would help a lot in improving them.
  3. Hierarchical: The semantics of larger and larger blocks of text might be best represented hierarchically. In formal writing, the title serves as a high-level summary, each paragraph should be self-coherent and support the root concept, and each sentence should express a single "thought." When comparing queries to documents, different granularities serve different needs. For example, a factoid query needs to find a short, trustworthy passage (maybe even a single sentence); the overall topic of the page is related, but less useful than determining whether the query is satisfied (i.e., whether the passage has the answer). At the other end of the spectrum, broad information seeking (such as exploratory search) probably cares more about the overall topic of the page than about each supporting sentence.
  4. Language agnostic: If multiple documents have the exact same semantics but are in different languages, they should probably have a similar representation.
  5. Represents more than just natural language: In web search, documents are more than text: they have structure, they include images, and so on. We should be able to cover these with a similar representation.
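
The first-pass trick mentioned under property (1) can be sketched concretely. The snippet below (all vectors, scores, and the function name are made up for illustration) derives a query representation as a score-weighted linear combination of the representations of its top first-pass results, in the spirit of pseudo-relevance feedback:

```python
def query_representation_from_feedback(doc_vectors, scores):
    """Sketch of deriving a query representation from first-pass
    results: a linear combination of result representations,
    weighted by their (normalized) retrieval scores."""
    total = sum(scores)
    weights = [s / total for s in scores]
    dim = len(doc_vectors[0])
    return [sum(w * d[i] for w, d in zip(weights, doc_vectors))
            for i in range(dim)]

# Made-up representations and scores of the top-3 first-pass results.
docs = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
scores = [3.0, 2.0, 1.0]
q = query_representation_from_feedback(docs, scores)
# q is pulled toward the highest-scored results.
```

The same shape of computation works offline over past queries: store the derived vectors, then estimate a new query's representation from its nearest past neighbors.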

I think it makes sense to keep the aforementioned properties in mind while trying to learn representations for documents, and to assess the learned representations not only in terms of improvement in the final results, but also in terms of the extent to which they satisfy these properties.
In another post, I have discussed "document understanding for information retrieval", which would be a great target to aim for while thinking about document representations.

  [1] Cosine similarity is an exception: its complement (one minus cosine similarity) is not a true distance metric, but cosine similarity has been shown to be tremendously useful for measuring similarity in many representations.