Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Our paper "Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity", with Hosein Azarbonyad, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at The 39th European Conference on Information Retrieval (ECIR'17). \o/

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology. It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population $X$ as the expected distance between two randomly selected elements of the population:

$$\mathrm{div}(X) = \sum_{i} \sum_{j} p_i \, p_j \, \delta(i, j),$$

where $p_i$ and $p_j$ are the proportions of categories $i$ and $j$ in the population and $\delta(i, j)$ is the distance between $i$ and $j$.
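As a toy illustration (a minimal Python sketch; the proportions and distance matrix below are hypothetical, not taken from the paper), Rao's diversity is simply a quadratic form over the category proportions:

```python
import numpy as np

def rao_diversity(p, delta):
    """Rao's diversity: the expected distance between two elements drawn
    at random from the population, given category proportions p and a
    pairwise category distance matrix delta."""
    p = np.asarray(p, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return float(p @ delta @ p)  # sum_i sum_j p_i * p_j * delta(i, j)

# Hypothetical example: three categories with pairwise distances in [0, 1].
proportions = [0.5, 0.3, 0.2]
distances = [[0.0, 0.8, 0.6],
             [0.8, 0.0, 0.4],
             [0.6, 0.4, 0.0]]
print(rao_diversity(proportions, distances))  # higher values = more diverse
```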

This notion of diversity has been adapted to quantify the topical diversity of a text document. Words are considered elements, topics are categories, and a document is a population. When using topic modeling to measure the topical diversity of a text document $d$, we can model elements based on the probability of a word $w$ given $d$, $P(w|d)$, categories based on the probability of $w$ given a topic $t$, $P(w|t)$, and populations based on the probability of $t$ given $d$, $P(t|d)$. In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse.

  1. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., $P(w|d)$ is sparse).
  2. Second, each topic is assumed to contain only some topic-specific related words (i.e., $P(w|t)$ is sparse).
  3. Finally, each document is assumed to deal with a few topics only (i.e., $P(t|d)$ is sparse).
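To make the mapping concrete, here is a minimal sketch of how these three distributions can be read off a fitted topic model, using gensim's LdaModel on a hypothetical toy corpus (the corpus, topic count, and variable names are illustrative, not the setup from the paper):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical toy corpus; in practice this would be a real document collection.
texts = [["topic", "model", "diversity", "document"],
         ["word", "distribution", "topic", "sparse"],
         ["document", "word", "model", "estimation"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# P(w|t): one distribution over the vocabulary per topic (rows of the topic-word matrix).
p_w_given_t = lda.get_topics()

# P(t|d): topic proportions of one document (kept dense here on purpose).
p_t_given_d = lda.get_document_topics(corpus[0], minimum_probability=0.0)

# P(w|d): maximum-likelihood estimate from the term frequencies in the document.
doc_len = sum(count for _, count in corpus[0])
p_w_given_d = {dictionary[wid]: count / doc_len for wid, count in corpus[0]}
```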

When approximated using currently available methods, $P(w|t)$ and $P(t|d)$ are often dense rather than sparse. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low-quality $P(t|d)$ distributions.

Figure: Different topic re-estimation approaches. TM is a topic modeling approach, e.g., LDA. DR is document re-estimation, TR is topic re-estimation, and TAR is topic assignment re-estimation.

In this research, we propose HiTR, a hierarchical re-estimation process for making the distributions $P(w|d)$, $P(w|t)$, and $P(t|d)$ more sparse. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation, we use parsimonization to extract only the essential parameters of each distribution.
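To give a flavour of the idea, the sketch below implements the generic EM procedure commonly used for parsimonious language models; the function name, the weight lam, the pruning threshold, and the iteration count are illustrative assumptions, not the exact re-estimation steps of HiTR:

```python
def parsimonize(term_freqs, background, lam=0.1, threshold=1e-4, iters=50):
    """EM-style parsimonization sketch: keep probability mass on terms that
    the background (collection) model explains poorly, and prune the rest.

    term_freqs: dict term -> frequency in the document (or topic).
    background: dict term -> probability under the collection model.
    lam:        weight of the specific model (1 - lam goes to the background).
    """
    # Initialize with the maximum-likelihood estimate.
    total = sum(term_freqs.values())
    p = {w: tf / total for w, tf in term_freqs.items()}

    for _ in range(iters):
        # E-step: expected counts attributed to the specific model.
        e = {w: term_freqs[w] * (lam * p[w]) /
                ((1 - lam) * background.get(w, 0.0) + lam * p[w])
             for w in p}
        # M-step: renormalize; mass on terms falling below the threshold is dropped.
        norm = sum(e.values())
        p = {w: ew / norm for w, ew in e.items() if ew / norm >= threshold}
    return p

# Hypothetical usage: "the" is largely explained by the background model and is pruned.
tf = {"the": 3, "topic": 8, "parsimonization": 5}
bg = {"the": 0.07, "topic": 0.001, "parsimonization": 0.00001}
print(parsimonize(tf, bg))
```

In HiTR, the same general recipe is applied at the three levels shown in the figure above: documents (DR), topics (TR), and topic assignments (TAR).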

Our main contributions are:

  • We propose a hierarchical re-estimation process for topic models to address two main problems in estimating topical diversity of text documents, using a biologically inspired definition of diversity.
  • We study the efficacy of each level of re-estimation and improve the accuracy of topical diversity estimation, outperforming the current state of the art on a publicly available dataset commonly used for evaluating document diversity.

For more details on the experimental results, please take a look at our paper: