Our paper "Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity", with Hosein Azarbonyad, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at the 39th European Conference on Information Retrieval (ECIR'17). \o/
Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology. It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population as the expected distance between two randomly selected elements of the population:
$$\mathit{div} = \sum_{i} \sum_{j} p_i \, p_j \, \delta(i, j),$$
where $p_i$ and $p_j$ are the proportions of categories $i$ and $j$ in the population and $\delta(i, j)$ is the distance between $i$ and $j$.
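To make the definition concrete, here is a minimal sketch of the expected-distance computation in plain Python. The function name `rao_diversity`, the toy proportions, and the 0/1 distance matrix are illustrative assumptions and are not taken from the paper.

```python
def rao_diversity(p, delta):
    """Expected distance between two elements drawn independently
    from a population with category proportions p.

    p     : list of category proportions (should sum to 1)
    delta : square matrix (list of lists), delta[i][j] is the
            distance between categories i and j
    """
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    n = len(p)
    return sum(p[i] * p[j] * delta[i][j] for i in range(n) for j in range(n))


# Toy population with three categories (assumed values).
proportions = [0.5, 0.3, 0.2]

# Assumed distance: 0 for identical categories, 1 otherwise.
distance = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

print(rao_diversity(proportions, distance))
print(rao_diversity([1.0, 0.0, 0.0], distance))
```

With a 0/1 distance the measure reduces to one minus the sum of squared proportions, so a population concentrated in a single category has diversity zero. A graded distance between categories lets closely related categories contribute less than unrelated ones.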
This notion of diversity has been adapted to quantify the topical diversity of a text document. Words are considered elements, topics are categories, and a document is a population. When using topic modeling to measure the topical diversity of a text document $d$, we can model elements based on the probability of a word $w$ given $d$, $P(w \mid d)$, categories based on the probability of $w$ given topic $t$, $P(w \mid t)$, and populations based on the probability of $t$ given $d$, $P(t \mid d)$. In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse.
- First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., $P(w \mid d)$ is sparse).
- Second, each topic is assumed to contain only some topic-specific related words (i.e., $P(w \mid t)$ is sparse).
- Finally, each document is assumed to deal with a few topics only (i.e., $P(t \mid d)$ is sparse).
When approximated using currently available methods, $P(w \mid t)$ and $P(t \mid d)$ are often dense rather than sparse. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low-quality distributions.
In this research, we propose HiTR, a hierarchical re-estimation process for making the distributions $P(w \mid d)$, $P(w \mid t)$, and $P(t \mid d)$ more sparse. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation, we use the concept of parsimonization to extract only essential parameters of each distribution.
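The idea of parsimonization can be illustrated with a small sketch in the style of parsimonious language models: an EM loop shifts probability mass away from items that a background (collection-wide) model already explains well, then prunes items that fall below a threshold. This is a simplified illustration under stated assumptions, not the exact HiTR procedure; the function name `parsimonize`, the mixing weight `lam`, the pruning `threshold`, and the toy counts and background probabilities are all hypothetical.

```python
def parsimonize(counts, background, lam=0.1, threshold=1e-4, iterations=50):
    """Re-estimate a distribution so that only items not already
    explained by the background model keep probability mass.

    counts     : dict item -> raw count in the document (or topic)
    background : dict item -> probability under the collection model
    lam        : assumed weight of the specific model in the mixture
    threshold  : assumed cut-off below which an item is dropped
    """
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}  # maximum-likelihood start

    for _ in range(iterations):
        # E-step: expected count of each item attributed to the specific model.
        e = {}
        for w, c in counts.items():
            pw = p.get(w, 0.0)
            if pw == 0.0:
                continue  # item was pruned earlier
            e[w] = c * lam * pw / ((1 - lam) * background[w] + lam * pw)

        norm = sum(e.values())
        if norm == 0.0:
            break

        # M-step: renormalize, prune small items, renormalize again.
        p = {w: v / norm for w, v in e.items()}
        p = {w: v for w, v in p.items() if v >= threshold}
        kept = sum(p.values())
        if kept == 0.0:
            break
        p = {w: v / kept for w, v in p.items()}

    return p


# Toy document: frequent general words plus a few topical words (assumed values).
doc_counts = {"the": 10, "of": 8, "topic": 4, "diversity": 3}
collection = {"the": 0.5, "of": 0.4, "topic": 0.05, "diversity": 0.05}

sparse = parsimonize(doc_counts, collection)
print(sorted(sparse))
```

In this toy setting the general words "the" and "of" lose their mass and are pruned, while the topical words remain. The same kind of update could in principle be applied at each level of a hierarchy, using a suitable background model per level.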
Our main contributions are:
- We propose a hierarchical re-estimation process for topic models to address two main problems in estimating topical diversity of text documents, using a biologically inspired definition of diversity.
- We study the efficacy of each level of re-estimation, and improve the accuracy of estimating topical diversity, outperforming the current state-of-the-art on a publicly available dataset commonly used for evaluating document diversity.
For more details on the results of experiments, please take a look at our paper:
- H. Azarbonyad, Mostafa Dehghani, T. Kenter, M. Marx, J. Kamps and M. de Rijke. “Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity”. In Proceedings of the European Conference on Information Retrieval (ECIR’17), 2017.