Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Our paper "Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity", with Hosein Azarbonyad, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at the 39th European Conference on Information Retrieval (ECIR'17). \o/

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology. It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population d as the expected distance between two randomly selected elements of the population:

div(d) = \sum_{i=1}^{T} \sum_{j=1}^{T} p_i p_j \delta(i, j),

where p_i and p_j are the proportions of categories i and j in the population and \delta(i, j) is the distance between i and j.
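To make the formula concrete, here is a minimal Python sketch that evaluates the double sum directly. The function name `diversity`, the three-category population, and the unit distances are assumptions made for illustration; they do not come from the paper.

```python
def diversity(p, delta):
    """Expected distance between two elements drawn independently from p.

    p:     list of category proportions, summing to 1
    delta: square matrix (list of lists) of pairwise category distances
    """
    T = len(p)
    return sum(p[i] * p[j] * delta[i][j] for i in range(T) for j in range(T))


# Hypothetical population with three categories.
p = [0.5, 0.3, 0.2]

# Assumed distances: 0 between a category and itself, 1 otherwise.
delta = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

score = diversity(p, delta)
print(round(score, 4))
```

With unit distances between distinct categories, the sum reduces to 1 - \sum_i p_i^2, which is a convenient sanity check. A population concentrated in a single category scores 0.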

This notion of diversity has been adapted to quantify the topical diversity of a text document. Words are considered elements, topics are categories, and a document is a population. When using topic modeling to measure the topical diversity of a text document d, we can model elements based on the probability of a word w given d, P(w|d), categories based on the probability of w given topic t, P(w|t), and populations based on the probability of t given d, P(t|d). In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse.

  1. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., P(w|d) is sparse).
  2. Second, each topic is assumed to contain only some topic-specific words (i.e., P(w|t) is sparse).
  3. Finally, each document is assumed to deal with a few topics only (i.e., P(t|d) is sparse).

When approximated using currently available methods, P(w|t) and P(t|d) are often dense rather than sparse. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low-quality P(t|d) distributions.
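A small numeric illustration of why density matters for the diversity score: suppose a document is really about one topic, but the estimated P(t|d) leaks some mass to three other topics. The numbers and the unit topic distances below are hypothetical, chosen only to show the effect.

```python
def expected_distance(p, delta):
    # Same double sum as div(d): sum over i, j of p_i * p_j * delta(i, j).
    return sum(pi * pj * delta[i][j]
               for i, pi in enumerate(p)
               for j, pj in enumerate(p))


T = 4
# Assumed distances: 0 on the diagonal, 1 between distinct topics.
delta = [[0.0 if i == j else 1.0 for j in range(T)] for i in range(T)]

sparse_ptd = [1.0, 0.0, 0.0, 0.0]  # hypothetical sparse P(t|d)
dense_ptd = [0.7, 0.1, 0.1, 0.1]   # hypothetical dense P(t|d)

sparse_score = expected_distance(sparse_ptd, delta)
dense_score = expected_distance(dense_ptd, delta)
print(sparse_score, round(dense_score, 4))
```

The sparse estimate yields a diversity of 0, while the dense one yields a clearly positive score for the same single-topic document.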

Figure: Different topic re-estimation approaches. TM is a topic modeling approach such as LDA, DR is document re-estimation, TR is topic re-estimation, and TAR is topic assignment re-estimation.

In this research, we propose HiTR, a hierarchical re-estimation process for making the distributions P(w|d), P(w|t) and P(t|d) sparser. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation, we use the concept of parsimonization to extract only essential parameters of each distribution.
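As a rough illustration of parsimonization, the sketch below re-estimates a document's word distribution with the EM updates commonly used for parsimonious language models: each word's count is split between the document model and a collection-wide background model, and words whose re-estimated probability falls below a threshold are dropped. This post does not spell out HiTR's exact update rules, so the function name `parsimonize`, the mixing weight `lam`, the `threshold`, and the toy counts and background probabilities are all assumptions.

```python
def parsimonize(counts, background, lam=0.1, threshold=1e-4, iterations=50):
    """Re-estimate P(w|d), removing mass that the background model explains.

    counts:     dict word -> term frequency in the document
    background: dict word -> P(w|C), the collection-wide probability
    lam:        assumed weight of the document model in the mixture
    threshold:  assumed cut-off below which a word is dropped
    """
    total = sum(counts.values())
    p_doc = {w: c / total for w, c in counts.items()}  # maximum-likelihood start

    for _ in range(iterations):
        # E-step: expected count of each word attributed to the document model.
        expected = {}
        for w, p in p_doc.items():
            doc_part = lam * p
            bg_part = (1.0 - lam) * background[w]
            expected[w] = counts[w] * doc_part / (doc_part + bg_part)

        # M-step: normalize the expected counts into a distribution.
        norm = sum(expected.values())
        p_doc = {w: e / norm for w, e in expected.items()}

        # Prune words below the threshold and renormalize the survivors.
        p_doc = {w: p for w, p in p_doc.items() if p >= threshold}
        norm = sum(p_doc.values())
        p_doc = {w: p / norm for w, p in p_doc.items()}

    return p_doc


# Hypothetical document counts and collection probabilities.
counts = {"the": 3, "of": 2, "topic": 5, "diversity": 5}
background = {"the": 0.07, "of": 0.04, "topic": 0.0005, "diversity": 0.0001}

sparse_doc = parsimonize(counts, background)
print(sorted(sparse_doc))
```

In this toy setting the collection-wide words "the" and "of" are pruned, and the remaining probability mass is concentrated on "topic" and "diversity". The same idea could in principle be applied at the other levels, with topics or topic assignments in place of words, but the details are in the paper rather than in this sketch.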

Our main contributions are:

  • We propose a hierarchical re-estimation process for topic models to address two main problems in estimating topical diversity of text documents, using a biologically inspired definition of diversity.
  • We study the efficacy of each level of re-estimation, and improve the accuracy of estimating topical diversity, outperforming the current state of the art on a publicly available dataset commonly used for evaluating document diversity.

For more details on the results of experiments, please take a look at our paper: