Significant Words Representations of Entities

My doctoral consortium submission, "Significant Words Representations of Entities", has been accepted at the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016). \o/

Transforming data into a suitable representation is the first key step of data analysis, and the performance of any data-oriented method depends heavily on it. We study how we can best learn representations for textual entities that are: 1) precise, 2) robust against noisy terms, 3) transferable over time, and 4) interpretable by human inspection. Inspired by the early work of Luhn [1], we propose significant words language models of a set of documents, which capture all, and only, the significant terms shared across the set. We adjust the weights of common terms that are already well explained by the document collection, as well as the weights of incidental rare terms that are only explained by specific documents, so that eventually only the significant terms are left in the model.

Significant Words Language Model


Transforming raw data into a representation that can be effectively exploited is motivated by the fact that data-oriented methods often require input that is convenient to process. In this research, we introduce significant words language models (SWLM), a family of models that aim to learn representations of a set of documents that are affected by neither general properties nor specific properties of the individual documents.

The general idea of SWLM is inspired by the early work of Luhn, who argued that significant words can be extracted by avoiding both common observations and rare observations. To estimate SWLM, we assume that the terms in each document in the set are drawn from three models:

  1. General model, representative of common observations,
  2. Specific model, representative of partial observations, and
  3. Significant Words model, which is the latent model representing the significant characteristics of the whole set.

Then, we try to extract the latent significant words model.
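To make this concrete, here is a minimal sketch of how such a three-component mixture could be estimated with expectation-maximization (EM). It assumes fixed mixture weights, maximum-likelihood specific models per document, and a precomputed collection model; the function and parameter names are illustrative rather than taken from the papers.

```python
from collections import Counter

def estimate_swlm(doc_term_counts, collection_model,
                  lambdas=(0.5, 0.3, 0.2), iters=50):
    """EM sketch: estimate a latent 'significant words' model for a set
    of documents, assuming every term occurrence is drawn from one of
    three sources: the latent significant model, a general (collection)
    model, or a document-specific model."""
    lam_sig, lam_gen, lam_spec = lambdas

    # Document-specific models: maximum-likelihood estimate per document.
    specific = []
    for counts in doc_term_counts:
        total = sum(counts.values())
        specific.append({t: c / total for t, c in counts.items()})

    # Initialize the significant model from the pooled documents.
    pooled = Counter()
    for counts in doc_term_counts:
        pooled.update(counts)
    total = sum(pooled.values())
    significant = {t: c / total for t, c in pooled.items()}

    for _ in range(iters):
        expected = Counter()
        for counts, spec in zip(doc_term_counts, specific):
            for term, count in counts.items():
                p_sig = lam_sig * significant.get(term, 0.0)
                p_gen = lam_gen * collection_model.get(term, 1e-12)
                p_spec = lam_spec * spec.get(term, 0.0)
                # E-step: posterior probability that this occurrence
                # came from the latent significant model.
                expected[term] += count * p_sig / (p_sig + p_gen + p_spec)
        # M-step: renormalize the expected counts into a distribution.
        norm = sum(expected.values())
        significant = {t: v / norm for t, v in expected.items()}

    return significant
```

Under this setup, common terms end up mostly explained by the collection model and incidental rare terms by the per-document specific models, so the remaining probability mass concentrates on the terms the documents significantly share.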

Applications of the SWLM

The proposed approach is generally applicable to any system that requires estimating an effective model of the significant features of a group of objects. So far, we have employed the model in three main applications:

Group Profiling for Content Customization:

We have proposed using SWLM to extract an "abstract" group-level latent model from a group of users that captures all, and only, the essential features of the whole group. We employed the resulting models in the task of contextual suggestion and observed improvements in customization performance. For more information and details, take a look at this post or our papers.

(Pseudo-)Relevance Feedback:

We have presented a variant of SWLM, Regularized SWLM (RSWLM), for estimating a robust language model for a set of feedback documents by incorporating information from the query. We have conducted extensive experiments on the effectiveness of RSWLM for (pseudo-)relevance feedback and demonstrated that it captures the essential terms representing the mutual notion of relevance. For more information and details, take a look at this post or our paper:

  • M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx. "Luhn Revisited: Significant Words Language Models". To appear in the proceedings of the ACM International Conference on Information and Knowledge Management (CIKM '16), 2016.
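As a rough illustration of the idea, the sketch below anchors the significant words estimate of the feedback set to the query. The linear interpolation and the `mu` parameter are assumptions for illustration, not the exact regularization from the paper; it reuses the `estimate_swlm` sketch from the previous section.

```python
from collections import Counter

def regularized_swlm(feedback_doc_counts, collection_model,
                     query_terms, mu=0.4, **kw):
    """Sketch: bias the significant words model of a set of
    (pseudo-)relevant feedback documents toward the query, so the
    estimate stays close to the notion of relevance the query
    expresses. Illustrative only."""
    # Plain SWLM of the feedback set (estimate_swlm as sketched above).
    swlm = estimate_swlm(feedback_doc_counts, collection_model, **kw)

    # Maximum-likelihood query model.
    q_counts = Counter(query_terms)
    q_total = sum(q_counts.values())
    query_model = {t: c / q_total for t, c in q_counts.items()}

    # Interpolate: mu controls how strongly the query regularizes
    # the feedback model.
    vocab = set(swlm) | set(query_model)
    return {t: (1 - mu) * swlm.get(t, 0.0) + mu * query_model.get(t, 0.0)
            for t in vocab}
```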

Hierarchical Classification:

We have also extended SWLM to Hierarchical SWLM (HSWLM) to estimate proper models for hierarchical entities, taking their position in the hierarchy into consideration. We have employed HSWLM in the task of hierarchical classification and observed that, since the estimated models of entities in the hierarchy are both horizontally and vertically separable, they are precise, robust, and transferable over time. For more details, please refer to my post on classification models for evolving hierarchies and my post on separation properties in hierarchical data, or take a look at our papers.
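To give a feel for the hierarchical case, here is a simplified, hedged sketch of a single node update, assuming the node's ancestors are pooled into one general component and its children into one specific component; the actual HSWLM iterates such updates over the whole hierarchy, and the fixed weights and names here are illustrative.

```python
from collections import Counter

def hswlm_node_update(node_term_counts, ancestor_model, child_models,
                      lambdas=(0.5, 0.3, 0.2), iters=50):
    """Sketch: estimate one hierarchy node's model so that terms already
    explained by its ancestors (too general) or by its children (too
    specific) are pushed out, leaving the node's own significant terms.
    For the root, ancestor_model can be the collection model; for a
    leaf, child_models is empty."""
    lam_node, lam_anc, lam_child = lambdas

    # Pool the node's documents and initialize its model by ML estimation.
    pooled = Counter()
    for counts in node_term_counts:
        pooled.update(counts)
    total = sum(pooled.values())
    node_model = {t: c / total for t, c in pooled.items()}

    # Average the children's models into one 'specific' component
    # (an illustrative simplification).
    child_mix = Counter()
    for cm in child_models:
        for t, p in cm.items():
            child_mix[t] += p / len(child_models)

    for _ in range(iters):
        expected = Counter()
        for term, count in pooled.items():
            p_node = lam_node * node_model.get(term, 0.0)
            p_anc = lam_anc * ancestor_model.get(term, 1e-12)
            p_child = lam_child * child_mix.get(term, 0.0)
            # E-step: share of this term attributed to the node itself.
            expected[term] += count * p_node / (p_node + p_anc + p_child)
        # M-step: renormalize into the node's updated model.
        norm = sum(expected.values())
        node_model = {t: v / norm for t, v in expected.items()}

    return node_model
```

Roughly speaking, repeating these updates over all nodes moves terms shared by siblings up to their parent and terms specific to children down, which is what makes the resulting models separable both vertically and horizontally.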

Besides applying the model in further applications, there are several other interesting directions to pursue in this research. These include using SWLM as an analytical tool to investigate and better understand the data, extending the method to be applicable to non-textual features, and employing the general idea of SWLM in representation learning systems such as embedding methods.

For more information, refer to the long version of my SIGIR-DC submission (5 pages). To see the extended abstract (one page), please take a look at this article.

  1. H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, 1958.