Document understanding, in the sense of assessing the relevance of a document or passage to a query based only on the document's content, is a familiar goal for the information retrieval community. However, the problem has remained largely intractable despite repeated attacks over many years. Meanwhile, people can assess relevance quite well, though unfamiliar topics and complex documents can defeat them. This assessment may require the ability to understand language, images, document structure, videos, audio, and functional elements. In turn, understanding these elements builds on background information about the world, such as human behavior patterns, and on even more fundamental truths such as the existence of time, space, and people. All this comes naturally to people, but not to computers!
Recently, large-scale machine learning has altered the landscape. Deep learning has greatly advanced machine understanding of images and language. Since document and query understanding incorporate these elements, deep learning holds great promise. But it comes with a drawback: general-purpose representations (like those CNNs provide for images) have proved somewhat elusive for text. In particular, embeddings act as a distributed representation not just of semantic information but also of application-specific knowledge, which is hard to transfer. Still, conditions seem right for a renewed attempt on the fundamental document understanding problem.
What is document understanding?
To think about how to approach this problem, we should first answer some questions: Can we understand documents? What is "understanding"? Getting at the true meaning of a document? Okay, but then what is "meaning"? How do we even approach such an ill-defined goal?
As an initial disclaimer, the goal here is not to understand documents in some abstract sense, but rather to understand documents well enough to perform a particular task: determining whether the document is relevant to a query. This assumption simplifies the problem. In other words, we don’t need to understand a calculus tutorial well enough to compute integrals; rather, only well enough to determine whether it is relevant for queries such as "integrate rational functions". Of course, we can hope that pursuing this narrow goal will produce insights that will someday allow us to understand documents in a broader sense, but even for this narrower goal, specific problem instances seem to be extraordinarily difficult, with a lot of challenges. As an example of such a challenge, how can we map a long natural-language text into a format suitable for large-scale machine learning without destroying useful information? For example, if we supply the text to a deep neural network as a bag of tokens, we discard both low-level information (the characters that make up the tokens) and higher-level information (word order) before deep learning enters the picture.
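To make the bag-of-tokens point concrete, here is a minimal sketch showing the word-order half of that loss. The `bag_of_tokens` helper and the example sentences are illustrative assumptions, not part of any particular system; a real pipeline would use a proper tokenizer.

```python
from collections import Counter


def bag_of_tokens(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (hypothetical helper).

    The result records which tokens occur and how often, but not their order.
    """
    return Counter(text.lower().split())


# Two texts with the same words in a different order.
original = "integrate rational functions by partial fractions"
shuffled = "fractions by partial integrate functions rational"

# The strings differ, yet their bags are identical, so a model that only
# sees the bag cannot tell them apart.
print(original == shuffled)                                # False
print(bag_of_tokens(original) == bag_of_tokens(shuffled))  # True
```

The same effect makes "dog bites man" and "man bites dog" indistinguishable, which is exactly the kind of information a relevance model might need.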
How can a search engine successfully fake document understanding?
I do not know what search engines actually do, but I can imagine how they could successfully fake document understanding. Picture a search engine as a person who works in a library full of Persian books but does not read or speak the language. A visitor asks for a book about, say, "یادگیری عمیق". The search engine has no idea what this means. But he searches the shelves for books with those two words, particularly in prominent places such as titles or chapter names. He places, say, 10 candidates on the counter. The visitor frowns at a few, but then smiles at another and takes that one to read. Later, when someone else asks for a book about "یادگیری عمیق", he shows the smile-inducing book and holds back the frown-inducing ones. After employing this strategy for some years, he becomes a pretty good recommender of Persian books, despite having no clue what any of them actually say. Of course, he might mishandle complex or new topics. The key idea is that he substitutes an ability to interpret and remember human reactions for the ability to read the text himself [1].
This may well be how things work, since search engines only need to provide users with documents relevant to their queries, even if they cannot understand the documents themselves. To achieve this, a search engine can combine two very simple techniques. The first is keyword-oriented heuristics: ensure that all important query terms (or their synonyms) appear in the document, and prefer documents where these terms appear often, prominently, close to one another, and so on. The second is favoring documents to which people react positively, as evidenced by signals such as clicks.
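The two techniques above can be sketched in a few lines. This is only an illustration under stated assumptions, not how any real search engine scores results: the `Doc` fields, the `TITLE_WEIGHT` constant, the smoothed click rate, and the way the two signals are multiplied are all hypothetical choices, and synonyms and term proximity are left out for brevity.

```python
from dataclasses import dataclass

TITLE_WEIGHT = 2.0  # assumed bonus for a query term appearing in the title


@dataclass
class Doc:
    title: str
    body: str
    clicks: int = 0       # times users chose this result (hypothetical log)
    impressions: int = 0  # times this result was shown


def keyword_score(query: str, doc: Doc) -> float:
    """Keyword heuristic: all terms must appear; reward frequency and prominence."""
    terms = query.lower().split()
    title = doc.title.lower().split()
    body = doc.body.lower().split()
    if not all(t in title or t in body for t in terms):
        return 0.0
    frequency = sum(body.count(t) for t in terms)
    prominence = sum(t in title for t in terms)
    return frequency + TITLE_WEIGHT * prominence


def click_rate(doc: Doc) -> float:
    """Smoothed click-through rate, so unseen documents get a neutral 0.5."""
    return (doc.clicks + 1) / (doc.impressions + 2)


def rank(query: str, docs: list[Doc]) -> list[Doc]:
    """Order matching documents by keyword score times remembered reaction."""
    scored = [(keyword_score(query, d) * click_rate(d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored if s > 0]


docs = [
    Doc("Deep learning basics", "deep learning uses deep networks", 2, 10),
    Doc("Deep learning in practice", "deep learning for practitioners", 8, 10),
    Doc("Gardening", "deep soil and shallow roots"),
]
for d in rank("deep learning", docs):
    print(d.title)
```

In this toy data the second document has fewer keyword matches than the first but a much better click record, so it ranks first; the gardening page is dropped because it lacks one of the query terms. Nothing in the code reads the documents in any meaningful sense.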
What is wrong with fake document understanding?
To fake an understanding of documents, search engines recall human reactions to those documents. The reaction they mostly rely on in web ranking is clicks on search results. This user feedback can be extremely helpful, and it is central to the strategy I described above, but it has problems:
- People sometimes click on engaging or higher-ranked results whose content is irrelevant.
- Clicks are based on limited information. Users click on documents before reading them, so clicks are perhaps driven more strongly by the presentation of results on the search page than by the content of the underlying documents.
- Fresh content lacks clicks.
So although click-through information is probably the best form of user feedback available for document understanding, it has serious drawbacks. In particular, understanding documents to explain user clicks is different from understanding documents for the purpose of assessing relevance. In machine learning terms, the training data does not precisely match the task.
How can we benefit from document understanding?
Although faking document understanding might look sufficient, real document understanding would bring benefits in several areas, including but not limited to:
- Better ranking of fresh, recently changed, and rarely seen documents.
- Better performance on verbose natural language queries, which may be more common in communication with an assistant. Classic relevance assessment mechanisms perform relatively poorly in these areas.
- Seeing documents from a human perspective, which can lead to new features and even new applications. For instance, we can presumably save people time by presenting results in a manner that makes the case for relevance clear.
From Memorization to Generalization
I think the first step in moving from fake document understanding to the real thing is to move from pure memorization toward generalization. A search engine might become a good Persian librarian purely by memorizing which books people like and dislike on each topic. But he can do even better by noting patterns in the behaviors he observes. For example, he might notice that books in a particular series always elicit a frown, no matter what the library visitor specifically requested, so he should probably stop suggesting those. In the same way, a search engine can memorize past user behavior in the hope of predicting future behavior. Going beyond this, a search engine can generalize; that is, find patterns in past user behavior that help to anticipate behavior in new situations. In other words, a search engine can learn not only highly specific rules of the form "show document D for query Q", but also rules that are more widely applicable.
As a starting point, then, we can read "understanding" as "memorization with the ability to generalize", that is, explaining user behavior in terms of the stimuli to which people actually respond. In this light, we are currently well short of understanding, but the goal is not hopelessly out of reach.
Fully understanding documents is probably far out of reach. For example, coping with humor and complex inferences would probably require true artificial intelligence. So, for a long time to come, we must to some extent continue to "fake" document understanding through memorization while trying to increase generalization. But we should also keep moving in a direction that reduces our reliance on fake document understanding!
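The difference between memorizing and generalizing can be sketched with the librarian himself. The `Librarian` class below is a hypothetical toy, not a description of any real ranking system: it first looks for a memorized reaction to the exact (query, book) pair and, failing that, falls back on a broader pattern, here the average reaction to the book's series. The neutral default of 0.5 is also an assumption.

```python
from collections import defaultdict


class Librarian:
    """Toy recommender: exact memory first, then a series-level pattern."""

    def __init__(self):
        # Memorization: reactions recorded per (query, book) pair.
        self.by_query_book = defaultdict(list)
        # Generalization: reactions pooled per series, across all queries.
        self.by_series = defaultdict(list)

    def observe(self, query, book, series, smiled):
        reaction = 1.0 if smiled else 0.0
        self.by_query_book[(query, book)].append(reaction)
        self.by_series[series].append(reaction)

    def predict(self, query, book, series):
        """Estimate the chance of a smile for this query and book."""
        seen = self.by_query_book.get((query, book))
        if seen:  # specific rule: "show book B for query Q"
            return sum(seen) / len(seen)
        pattern = self.by_series.get(series)
        if pattern:  # wider rule: "this series tends to please or disappoint"
            return sum(pattern) / len(pattern)
        return 0.5  # nothing known: assumed neutral prior


librarian = Librarian()
librarian.observe("deep learning", "book A", "series X", smiled=False)
librarian.observe("statistics", "book B", "series X", smiled=False)
librarian.observe("deep learning", "book C", "series Y", smiled=True)

# A new query and a new book, but from the series that always draws frowns.
print(librarian.predict("optimization", "book D", "series X"))
```

Pure memorization would have nothing to say about "book D", since nobody has ever asked for it; the series-level pattern lets the librarian hold it back anyway.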
[1] As a philosophical aside, the Persian librarian may call to mind Searle’s "Chinese Room" argument. There, a system that contains no real understanding of Chinese is able to pass a Turing Test in Chinese. With one substitution, there is perhaps a reasonable analogy: the goal of search engines is not to pass the Turing Test, but rather a "Librarian Test"; that is, given a question on any topic, find the most useful documents on that topic.