Luhn Revisited: Significant Words Language Models

Our paper "Luhn Revisited: Significant Words Language Models", with Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra and Maarten Marx, has been accepted as a long paper at The 25th ACM International Conference on Information and Knowledge Management (CIKM'16). \o/

One of the key factors affecting search quality is that our queries are ultra-short statements of complex information needs. Query expansion has proven to be an effective technique for bridging the gap between the user's information need and relevant documents. Taking feedback information into account is a common approach for enriching the representation of queries and, consequently, improving retrieval performance.

In True Relevance Feedback (TRF), given a set of judged documents, either explicitly assessed by the user or implicitly inferred from user behavior, the system tries to enrich the user query to improve retrieval performance. However, feedback information is not available in most practical settings. An alternative approach is Pseudo Relevance Feedback (PRF), also called blind relevance feedback, which uses the top-ranked documents in the initially retrieved result list as the feedback set.

The main goal of feedback systems is to extract a feedback model that represents the relevant documents. However, although documents in the feedback set contain relevant information, they always contain non-relevant information as well. For instance, in PRF, some documents in the feedback set might be non-relevant, and in TRF, some documents, despite being relevant, may act like poison pills that hurt the performance of feedback systems, since they also contain off-topic information. Such non-relevant information can pollute the feedback model with bad expansion terms, leading to topic drift.

Based on this observation, it has been shown that existing feedback systems are able to improve retrieval performance only if the feedback documents are not merely relevant, but have a dedicated interest in the topic. Given that we should anticipate documents with a broader topic or multiple topics in the feedback set, taking advantage of feedback documents requires a robust and effective method to prevent topic drift caused by accidental, non-relevant terms brought in by particular documents in the feedback set.

We introduce a variant of significant words language models (SWLM) to extract a language model of feedback documents that captures the essential terms representing a mutual notion of relevance, i.e. a representation of characteristic terms which are supported by all the feedback documents. The general idea of SWLM is inspired by the early work of Luhn [1], in which he argues that significant words can be extracted by avoiding both common observations and rare observations. More precisely, Luhn assumed that frequency data can be used to measure the significance of words for representing a document. Considering Zipf's Law, he devised a simple counting technique for finding significant words: he specified two cut-offs, an upper and a lower, to exclude non-significant words.
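As a concrete sketch of Luhn's counting technique, here is a minimal implementation; the cut-off values are arbitrary illustrations, not Luhn's actual thresholds:

```python
from collections import Counter

def luhn_significant_words(tokens, lower_cutoff=2, upper_cutoff=50):
    """Keep words whose frequency falls between two cut-offs:
    words above the upper cut-off are too common, words below the
    lower cut-off too rare, to contribute significantly to content."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if lower_cutoff <= c <= upper_cutoff}

# Illustrative toy document: a very common word, two mid-frequency
# words, and a hapax legomenon.
tokens = ["the"] * 100 + ["prize"] * 10 + ["nobel"] * 8 + ["xyz"]
print(luhn_significant_words(tokens))  # only "prize" and "nobel" survive
```

In practice, picking the two cut-offs per document or collection is exactly the brittle part that later model-based approaches replace.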

There have been efforts to bring this idea into feedback systems, such as mixture models and parsimonious language models. These approaches improve the feedback model by eliminating the effect of common terms, but instead of fixed frequency cut-offs they use a more principled estimation. Hiemstra et al. stated the following in their paper [2]:

[. . . ] our approach bears some resemblance with early work on information retrieval by Luhn, who specifies two word frequency cut-offs, an upper and a lower to exclude non-significant words. The words exceeding the upper cut-off are considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the document. Unlike Luhn, we do not exclude rare words and we do not have simple frequency cut-offs [. . . ]

In a way, we complete the cycle by implementing Luhn's vision. We introduce a meaningful interpretation of both specificity and generality, alongside significance, in the context of the feedback problem, and propose an effective way of establishing a representation consisting of significant words, by parsimonizing the feedback model toward not only the common observations but also the rare observations.

Generally speaking, SWLM is a language model estimated from the set of feedback documents that is "specific" enough to distinguish the features of the feedback documents from other documents, by removing general terms, and at the same time "general" enough to capture all the shared features of the feedback documents as the notion of relevance, by excluding document-specific terms. To estimate SWLM, it is assumed that terms in the feedback documents are drawn from a mixture of three models:

  1. General model, representative of common observations
  2. Specific model, representative of partial observations
  3. Significant words model, a latent model representing the notion of relevance
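A minimal EM sketch of estimating such a three-component mixture, assuming the general and specific models are fixed and only the latent significant-words model is re-estimated; the mixture weights, iteration count, and smoothing constant are illustrative choices, not the paper's actual settings:

```python
from collections import Counter

def estimate_swlm(feedback_docs, general_lm, specific_lm,
                  lam_sig=0.6, lam_gen=0.3, lam_spec=0.1, iters=30):
    """EM sketch: each term occurrence is drawn from the fixed general
    model, the fixed specific model, or the latent significant-words
    model; only the latter is re-estimated."""
    counts = Counter(t for doc in feedback_docs for t in doc)
    total = sum(counts.values())
    # Initialise the latent model with maximum-likelihood estimates.
    swlm = {w: c / total for w, c in counts.items()}
    for _ in range(iters):
        expected = {}
        for w, c in counts.items():
            p_sig = lam_sig * swlm[w]
            p_gen = lam_gen * general_lm.get(w, 1e-9)
            p_spec = lam_spec * specific_lm.get(w, 1e-9)
            # E-step: posterior that an occurrence of w came from the
            # latent significant-words model.
            gamma = p_sig / (p_sig + p_gen + p_spec)
            expected[w] = c * gamma
        norm = sum(expected.values())
        # M-step: renormalise expected counts into a distribution.
        swlm = {w: e / norm for w, e in expected.items()}
    return swlm

# Toy example: "the" is collection-general, "arafat" is specific to one
# document, "nobel" is shared across all feedback documents.
docs = [["the", "nobel", "prize"], ["the", "nobel", "winner"],
        ["the", "nobel", "arafat"]]
general_lm = {"the": 0.5, "nobel": 0.01, "prize": 0.01,
              "winner": 0.01, "arafat": 0.01}
specific_lm = {"arafat": 0.8, "prize": 0.1, "winner": 0.1}
swlm = estimate_swlm(docs, general_lm, specific_lm)
```

Under this sketch, the shared term "nobel" ends up with higher probability in the latent model than either the general term "the" or the document-specific term "arafat".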

The latent significant words model is then extracted and used as the feedback language model. Let me illustrate the model with an example:

[Figure: language models estimated for topic 374, "Nobel prize winners", of TREC Robust04]

The above figure shows an example of estimating language models from the set of the top seven relevant documents retrieved for topic 374, "Nobel prize winners", of the TREC Robust04 test collection. Terms in each list are selected from the top 50 terms of the models estimated after stop-word removal. Standard-LM is the language model estimated using MLE, treating the feedback documents as a single document. SMM is the language model estimated using the simple mixture model, one of the most powerful feedback approaches, which tries to remove background terms from the feedback model. General-LM denotes the probability of terms being common, based on their overall occurrence in the collection, and Specific-LM the probability of terms being specific within the feedback set, i.e. frequent in one of the feedback documents but not the others. The way General-LM and Specific-LM are estimated is discussed in detail in the paper. The last model in the figure is SWLM, the latent model extracted with regard to General-LM and Specific-LM using our proposed approach.

As can be seen, by considering feedback documents as a mixture of a feedback model and a collection model, SMM penalizes some general terms like "time" and "year" by decreasing their probabilities.

However, since some frequent words in the feedback set are not frequent in the whole collection, their probabilities are boosted, like "Palestinian" and "Arafat", even though they are not good indicators for the whole feedback set. The point is that although these terms are frequently observed, they occur only in some feedback documents, not most of them, which means they are in fact "specific" terms, not significant terms. By estimating both the general model and the specific model and taking them into consideration, SWLM controls the contribution of each feedback document to the feedback model, based on its merit, and prevents the estimated model from being affected by indistinct or off-topic terms, resulting in a significant model that reflects the notion of relevance.
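For illustration, here is one plausible way to estimate the two fixed component models; these are simple stand-ins, not necessarily the paper's exact estimators. General-LM comes from overall collection frequencies, and Specific-LM scores how concentrated a term's probability mass is in a single feedback document:

```python
from collections import Counter

def general_and_specific_lm(feedback_docs, collection_counts):
    """Illustrative estimators: General-LM from collection term
    frequencies; Specific-LM rewards terms frequent in one feedback
    document but rare in the rest."""
    coll_total = sum(collection_counts.values())
    general_lm = {w: c / coll_total for w, c in collection_counts.items()}

    # Per-document maximum-likelihood language models.
    doc_lms = []
    for doc in feedback_docs:
        counts = Counter(doc)
        n = sum(counts.values())
        doc_lms.append({w: c / n for w, c in counts.items()})

    vocab = {w for lm in doc_lms for w in lm}
    specific = {}
    for w in vocab:
        probs = [lm.get(w, 0.0) for lm in doc_lms]
        # A term is "specific" when its mass concentrates in a single
        # document: score it by max in-document probability minus mean.
        specific[w] = max(probs) - sum(probs) / len(probs)
    norm = sum(specific.values()) or 1.0
    specific_lm = {w: s / norm for w, s in specific.items()}
    return general_lm, specific_lm
```

On a toy feedback set where "nobel" appears in every document but "arafat" in only one, this scoring assigns "arafat" a much higher specific probability than "nobel", which is exactly the behavior the feedback model needs to discount.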

The main aim of this research is to develop an approach to estimate a robust model from a set of documents that captures all, and only, the essential shared commonalities of these documents. Taking the feedback task in information retrieval as the application, we break this down into three concrete research questions:

  • RQ1: How can significant words language models be estimated for a set of feedback documents, capturing a mutual notion of relevance?
  • RQ2: How effective are significant words language models in (pseudo) relevance feedback?
  • RQ3: How do significant words language models prevent the feedback model from being affected by non-relevant terms of non-relevant or partially relevant feedback documents?

In the paper, we address these questions one by one. For more details, please take a look at the paper:

  • Mostafa Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. "Luhn Revisited: Significant Words Language Models". To appear in the proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM'16), 2016.
  1. H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, 1958.
  2. Djoerd Hiemstra, Stephen Robertson, and Hugo Zaragoza. 2004. Parsimonious language models for information retrieval. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '04).
