Poison Pills and Antidots: Inoculating Relevance Feedback

"Poison Pills and Antidots: Inoculating Relevance Feedback", an article published in Amsterdam Science Magazine as one of the cool contributions of our CIKM2016 paper. We also have a extended abstract describing this part, accepted to be presented in DIR2016 "Inoculating Relevance Feedback Against Poison Pills", with Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra and Maarten Marx.

Relevance Feedback (RF) is a common approach for enriching queries, given a set of explicitly or implicitly judged documents either explicitly assessed by the user or implicitly inferred from user behavior,
to improve the performance of the retrieval.
Although it has been shown that on average, the overall performance of retrieval will be improved after relevance feedback, for some topics, employing some relevant documents may decrease the average precision of the initial run. This is mostly because the feedback document is partially relevant and contains off-topic terms which adding them to the query as expansion terms results in loosing the retrieval performance. These relevant documents that hurt the performance of retrieval after feedback are called "poison pills''. In this article, we discuss the effect of poison pills on the relevance feedback and present Significant Words Language Models as an approach for estimating feedback model to tackle this problem.

Significant Words Language Models are family of models aiming to estimate models for a set of documents so that all, and only, the significant shared terms are captured in the models(see here and here). This makes these models to be not only distinctive, but also supported by all the documents in the set.

poisonpillsPut loosely, SWLM iteratively removes two types of words from the model: general words, i.e., common words used frequently across all the documents, and page-specific words, i.e., words mentioned in some of the relevant documents, but not the majority of them (see the above Figure). This approach prevents noise words interfering with the relevance feedback, and thus successfully improves the retrieval performance by protecting relvance feedback against Poison Pills.



We have got the "Best paper Award" in The ACM International Conference on the Theory of Information Retrieval (ICTIR2016) for our paper "On Horizontal and Vertical Separation in Hierarchical Text Classification". \o/

Please take a look at my posts on "From Probability Ranking Principle to Strong Separation Principle", and "Two-dimensional separability in Hierarchical Text Classification") for more information on our paper. Also here are my presentation slides.

The Healing Power of Poison: Helpful Non-relevant Documents in Feedback

Our paper "The Healing Power of Poison: Helpful Non-relevant Documents in Feedback", with Samira Abnar and Jaap Kamps, has been accepted as a short paper at The 25th ACM International Conference on Information and Knowledge Management (CIKM'16). \o/

Often, the only difference between a medicine and a poison is the dose. Some substances are extremely toxic, and therefore, are primarily known as a poison. Yet, even poisons can have medicinal value.

Paracelsus, Father of Toxicology

cikm2016_posterQuery expansion based on feedback information is one of the classic approaches for improving the performance of information retrieval systems, especially when the user information need are complex to express precisely in a few keywords.

True Relevance Feedback (TRF) systems try to enrich the user query using a set of judged documents, that their relevance is assessed either explicitly by the user or implicitly inferred from the user behavior. However, this information is not always available. Alternatively, Pseudo Relevance Feedback (PRF) methods, also called blind relevance feedback, assumes that the top-ranked documents in the initial retrieved results are all relevant and use them for the feedback model.

Normally feedback documents that are annotated as relevant are considered to be beneficial for feedback and feedback documents that are annotated as non-relevant are expected to be poisonous, i.e. they supposedly decrease the performance of the feedback systems if they are used as positive feedback. Based on this assumption, some of the TRF methods, use non-relevant documents as negative feedback and some PRF methods try to avoid using these documents. For example, some PRF methods attempt on detecting non-relevant documents in order for being robust against their noises, or they manage to partially use their content in the feedback procedure, like some of their passages. Although PRF methods use non-relevant documents, they do not directly intend to take advantage of them as helpful documents. In other words, most of the time, removing non-relevant documents from the feedback set of PRF methods leads to a better performance.


Luhn Revisited: Significant Words Language Models

Our paper "Luhn Revisited: Significant Words Language Models", with Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra and Maarten Marx, has been accepted as a long paper at The 25th ACM International Conference on Information and Knowledge Management (CIKM'16). \o/

On of the key factors affecting search quality is the fact that our queries are ultra-short statements of our complex information needs. Query expansion has been proven to be an effective technique to bring agreement between user information need and relevant documents. Taking feedback information into account is a common approach for enriching the representation of queries and consequently improving retrieval performance.

In True Relevance Feedback (TRF), given a set of judged documents either explicitly assessed by the user or implicitly inferred from user behavior, the system tries to enrich the user query to improve the performance of the retrieval. However, feedback information is not available in most practical settings. An alternate approach is Pseudo Relevance Feedback (PRF), also called blind relevance feedback, which uses the top-ranked documents in the initial retrieved result list for the feedback.

The main goal of feedback systems is to extract a feedback model, which represents the relevant documents. However, although documents in the feedback set contain relevant information, there is always also non-relevant information. For instance, in PRF, some documents in the feedback set might be non-relevant, or in TRF, some documents, despite the fact that they are relevant, may act like poison pills by hurting the performance of feedback systems, since they also contain off-topic information. Such non-relevant information can distract the feedback model by adding bad expansion terms, leading to topic drift.

It has been shown that based on this observation, existing feedback systems are able to improve the performance of the retrieval if feedback documents are not only relevant, but also have a dedicated interest in the topic. Given that we should anticipate documents with a broader topic or multiple topics in the feedback set, taking advantage of feedback documents requires a robust and effective method to prevent topic drift caused by accidental, non-relevant terms brought in by particular documents in the feedback set.

We introduce a variant of significant words language models (SWLM) to extract a language model of feedback documents that captures the essential terms representing a mutual notion of relevance, i.e. a representatiPicture_3on of characteristic terms which are supported by all the feedback documents. The general idea of SWLM is inspired by the early work of Luhn1, in which he argues that to extract significant words by avoiding both common observations and rare observations. More precisely, Luhn assumed that frequency data can be used to measure the significance of words to represent a document. Considering Zipf’s Law, he simply devised a counting technique for finding significant words. He specified two cut-offs, an upper and lower, to exclude non-significant words.


On Horizontal and Vertical Separation in Hierarchical Text Classification

Our paper "On Horizontal and Vertical Separation in Hierarchical Text Classification", with Hosein Azarbonyad, Jaap Kamps, and Maarten Marx, has been accepted as a long paper at The ACM International Conference on the Theory of Information Retrieval (ICTIR'16). \o/

Hierarchy is an effective and common way of representing information and many real-world textual data can be organized in this way. Organizing data in a hierarchical structure is valuable since it determines relationships in the data at different levels of resolution and picks out different categories relevant to each of the different layers of memberships. In a hierarchical structure, a node at any layer could be an indicator of a document, a person, an organization, a category, an ideology, and so on, which we refer to them as “hierarchical entities”. Taking advantage of the structure in the hierarchy requires a proper way for modeling and representing entities, taking their relation in the hierarchy into consideration.

celegansneural_nested_mdlThere are two types of dependencies in the hierarchies:

  1. Horizontal dependency, which refers to the relations of entities in the same layer. (A simple example would be the dependency between siblings which have some commonalities in terms of being descendants of the same entity.)
  2. Vertical dependency, which addresses the relations between ancestors and descendants in the hierarchy. (For example the relation between root and other entities.)

Due to the existence of two-dimensional dependencies between entities in the hierarchy, modeling them regardless of their relationships might result in overlapping models which are not capable of making different entities distinguishable. The overlapping models are not favorable since when the data representations are not well-separated, classification and retrieval systems are less likely to work well. Thus, two-dimensional separability, i.e. horizontal and vertical separability, is one of the key requirements of hierarchical classification. (more…)

From Probability Ranking Principle to Strong Separation Principle

Separability is a highly desirable property for constructing and operating autonomous information systems, and especially classifiers. In this post, I present a step by step argument which shows that based on the classification principles, having better separability in the feature space leads to better accuracy in the classification results.

Based on the Probability Ranking Principle (PRP) presented by Robertson1, Lewis2  has formulated a variant of PRP for binary classification:

For a given set of items presented to a binary classification system, there exists a classification of the items such that the probability of class membership for all items assigned to the class is greater than or equal to the probability of class membership for all items not assigned to the class, and the classification has optimal expected effectiveness.

Since in many applications, autonomous systems need to decide how to classify an individual item in the absence of entire items set, Lewis has extended the PRP to the Probability Threshold Principle (PTP):

For a given effectiveness measure, there exists a threshold p, 0<p<1, if all and only those items with the probability of class membership greater than p are assigned to the class, the expected effectiveness of the classification will be the best possible for that set of items.

PTP in fact discusses optimizing the effectiveness of classifiers by making items separable regarding their probability of class membership, which is a discussion on “separability in the score space”. Based on PTP, optimizing a threshold for separating items is a theoretically trivial task, however, there are practical difficulties. The main difficulty refers to the fact that retrieval models are not necessarily capable of measuring actual probabilities of relevance for documents, so they do not guarantee to generate a set of scores from which the optimum cutoff can be inferred. In this regard, a great deal of work has been done on analyzing the score distribution over relevant and non-relevant documents to utilize this information for finding the appropriate threshold between relevant and non-relevant documents.


Building a multi-domain comparable corpus using a learning to rank method

Our paper "Building a multi-domain comparable corpus using a learning to rank method", with Razieh Rahimi, Azadeh Shakery, Javid Dadashkarimi, Mozhdeh Ariannezhad, and Hossein Nasr Esfahani has been published at the Journal of Natural Language Engineering. \o/

Multilingual nature of the Web makes interlingual translation a crucial requirement for information management applications. Bilingual humanly constructed dictionaries are the basic available resources for translation, but they do not provide sufficient coverage for translation purpose due to out of vocabulary words, phrases and neologisms. This naturally motivates the use of data-driven translation systems. Machine translation systems need bilingual training data, which plays a central role in the accuracy of translations provided by the systems. Thus, providing appropriate training data is an important factor for machine translation systems. Parallel corpora are high quality training data that can be provided for such translation systems. However, building these corpora with suitable coverage of different domains is a costly task.

HELLO in eight different languages

Comparable corpora are key resources for translation systems concerning languages and domains lacking linguistic resources. Aligned documents in a comparable corpus describe the same or similar topics. Loose alignments of comparable corpora compared to parallel ones can be created at lower costs. This facilitates building domain-specific comparable corpora or expanding translation resources for language pairs with limited linguistic resources. Comparable corpora can be used to directly obtain translation knowledge, or to train machine translation systems by extraction of parallel sentences/phrases. The most common approach for building comparable corpora is based on Cross-Language Information Retrieval (CLIR). For each document in one corpus (referred to as the source document), a driver query is generated. Documents of another corpus (referred to as target documents) are then ranked with respect to the driver query using a model for cross-lingual information retrieval. In the next step, source and target documents with a similarity score higher than a predefined threshold constitute an alignment. Finally, some heuristics, such as matched named entities, publication dates, and ratio of document lengths, are used to select most reliable alignments. (more…)

Two-Way Parsimonious Classification Models for Evolving Hierarchies

Our paper "Two-Way Parsimonious Classification Models for Evolving Hierarchies", with Hosein Azarbonyad, Jaap Kamps, and Maarten Marx, has been accepted as a long paper at the Conference and Labs of the Evaluation Forum (CLEF'16). \o/

Modern web data is highly structured in terms of entities and relations from large knowledge bases, geo-temporal references, and social network structure. These knowledge bases contain many facts and entities in a graph or hierarchical structure, making it possible to express concepts at different levels of abstraction. However, due to the dynamic nature of data, their structure may evolve over time. For example, in a hierarchy, nodes can be removed or added or even transfer across the hierarchy. Thus, modeling objects in the evolving structures and building robust classifiers for them is notoriously hard and requires employing a set of solid features from the data, which are not affected by these kinds of changes.


For example, assume we would build a classifier for the “US president” over recent data, then a standard classifier would not distinguish the role in office from the person who is the current president, leading to obvious issues after the elections in 2016. In other words, if we can separate the model of the function from the model of the person fulfilling it, for example by abstracting over several presidents, that more general model would in principle be robust over time. (more…)

Significant Words Representations of Entities

My doctoral consortium submission on "Significant Words Representations of Entities", has been accepted at the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2016). \o/

Transforming the data into a suitable representation is the first key step of data analysis, and the performance of any data-oriented method is heavily depending on it. We study questions on how we can best learn representations for textual entities that are: 1) precise, 2) robust against noisy terms, 3) transferable over time, and 4) interpretable by human inspection. Inspired by the early work of Luhn1, we propose significant words language models of a set of documents that capture all, and only, the significant shared terms from them. We adjust the weights of common terms that are already well explained by the document collection as well as the weight of incidental rare terms that are only explained by specific documents, which eventually results in having only the significant terms left in the model.

Significant Words Language Model


Transformation of raw data to a representation that can be effectively exploited is motivated by the fact that data-oriented methods often require input that is convenient to process. In this research, we introduce significant words language models (SWLM) as a family of models aiming to learn representations for the set of documents that are not affected by neither general properties nor specific properties.

The general idea of SWLM is inspired by the early work of Luhn, in which he argues that to extract significant words by avoiding both common observations and rare observations. In order to estimate SWLM, we assume that terms in the each document in the set are drawn from three models:

  1. General model, representative of common observation,
  2. Specific model, representative of partial observation, and
  3. Significant Words model, which is the latent model representing the significant characteristics of the whole set.

Then, we try to extract the latent significant words model. (more…)