Document understanding, in the sense of assessing the relevance of a document or passage to a query based only on the document's content, is a familiar goal for the information retrieval community. Yet the problem has remained largely intractable despite repeated attacks over many years. People, by contrast, assess relevance quite well, though unfamiliar topics and complex documents can defeat them. This assessment may require the ability to understand language, images, document structure, video, audio, and functional elements. Understanding these elements in turn rests on background knowledge about the world, such as human behavior patterns, and on even more fundamental truths such as the existence of time, space, and people. All this comes naturally to people, but not to computers!
Recently, large-scale machine learning has altered the landscape. Deep learning has greatly advanced machine understanding of images and language. Since document and query understanding incorporate these elements, deep learning holds great promise. But it comes with a drawback: general-purpose representations (like CNNs for images) have proved somewhat elusive for text. In particular, embeddings act as a distributed representation not just of semantic information but also of application-specific learning, which is hard to transfer. Still, conditions seem right for a renewed attempt on the fundamental document understanding problem.
What is document understanding?
Before deciding how to approach this problem, I think we should first answer some questions: Can we understand documents? What is "understanding"? Is it getting at the true meaning of a document? Okay, but then what is "meaning"? How do we even approach such an ill-defined goal?