# Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Our paper "Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity", with Hosein Azarbonyad, Tom Kenter, Maarten Marx, Jaap Kamps, and Maarten de Rijke, has been accepted as a long paper at the 39th European Conference on Information Retrieval (ECIR'17). \o/

Quantitative notions of topical diversity in text documents are useful in several contexts, e.g., to assess the interdisciplinarity of a research proposal or to determine the interestingness of a document. An influential formalization of diversity has been introduced in biology. It decomposes diversity in terms of elements that belong to categories within a population and formalizes the diversity of a population $d$ as the expected distance between two randomly selected elements of the population:

$\mathrm{div}(d) = \sum_{i=1}^{T} \sum_{j=1}^{T} p_i^d \, p_j^d \, \delta(i, j)$

where $p_i^d$ and $p_j^d$ are the proportions of categories $i$ and $j$ in the population, $T$ is the number of categories, and $\delta(i, j)$ is the distance between $i$ and $j$.
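As a minimal sketch of this definition, the expected distance between two randomly selected elements can be computed as a double sum over category pairs. The function name `rao_diversity` and the toy proportions and distances below are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def rao_diversity(proportions: Sequence[float],
                  distance: Sequence[Sequence[float]]) -> float:
    """Expected distance between two elements drawn independently at random.

    proportions[i] is the share of category i in the population;
    distance[i][j] is the distance between categories i and j.
    """
    n = len(proportions)
    return sum(
        proportions[i] * proportions[j] * distance[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical toy data: three categories, where 0 and 1 are close
# to each other and 2 is far from both.
toy_distance = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
    [0.9, 0.9, 0.0],
]

focused = rao_diversity([1.0, 0.0, 0.0], toy_distance)  # a single category
similar = rao_diversity([0.5, 0.5, 0.0], toy_distance)  # two close categories
distant = rao_diversity([0.5, 0.0, 0.5], toy_distance)  # two distant categories
```

In this toy setting, a population concentrated on one category has zero diversity, and a population split over two distant categories scores higher than one split over two similar categories.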

This notion of diversity has been adapted to quantify the topical diversity of a text document: words are considered elements, topics are categories, and a document is a population. When using topic modeling to measure the topical diversity of a text document $d$, we can model elements based on the probability of a word $w$ given $d$, $P(w \mid d)$; categories based on the probability of $w$ given topic $t$, $P(w \mid t)$; and populations based on the probability of $t$ given $d$, $P(t \mid d)$. In probabilistic topic modeling, at estimation time, these distributions are usually assumed to be sparse.

1. First, the content of a document is assumed to be generated by a small subset of words from the vocabulary (i.e., $P(w \mid d)$ is sparse).
2. Second, each topic is assumed to contain only some topic-specific related words (i.e., $P(w \mid t)$ is sparse).
3. Finally, each document is assumed to deal with a few topics only (i.e., $P(t \mid d)$ is sparse).

When approximated using currently available methods, $P(w \mid t)$ and $P(t \mid d)$ are often dense rather than sparse. Dense distributions cause two problems for the quality of topic models when used for measuring topical diversity: generality and impurity. General topics mostly contain general words and are typically assigned to most documents in a corpus. Impure topics contain words that are not related to the topic. Generality and impurity of topics both result in low-quality $P(t \mid d)$ distributions.

In this research, we propose HiTR, a hierarchical re-estimation process for making the distributions $P(w \mid d)$, $P(w \mid t)$ and $P(t \mid d)$ sparser. We re-estimate the parameters of these distributions so that general, collection-wide items are removed and only salient items are kept. For the re-estimation, we use the concept of parsimonization to extract only essential parameters of each distribution.
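To make the idea of parsimonization more concrete, here is a minimal sketch in the style of parsimonious language models: an EM loop that credits each item's count partly to a specific model and partly to a general background model, then prunes items whose re-estimated probability falls below a threshold. The function name `parsimonize`, the parameters `lam` and `threshold`, and the toy counts are assumptions for illustration; the exact update rules and settings used at each level of HiTR are described in the paper and may differ.

```python
from typing import Dict


def parsimonize(counts: Dict[str, float],
                background: Dict[str, float],
                lam: float = 0.1,
                threshold: float = 1e-4,
                iterations: int = 50) -> Dict[str, float]:
    """Re-estimate a distribution so that mass explained by the background is removed.

    counts:     raw item counts for the specific model (e.g., words in a document)
    background: general, collection-wide probabilities for the same items
    lam:        assumed mixing weight of the specific model
    threshold:  items with a re-estimated probability below this are dropped
    """
    total = sum(counts.values())
    model = {w: c / total for w, c in counts.items()}  # maximum-likelihood start
    for _ in range(iterations):
        # E-step: share of each count credited to the specific model.
        expected = {}
        for w, p in model.items():
            specific = lam * p
            general = (1.0 - lam) * background.get(w, 0.0)
            expected[w] = counts[w] * specific / (specific + general)
        # M-step: renormalize, prune low-probability items, renormalize again.
        norm = sum(expected.values())
        model = {w: e / norm for w, e in expected.items()}
        model = {w: p for w, p in model.items() if p >= threshold}
        norm = sum(model.values())
        model = {w: p / norm for w, p in model.items()}
    return model


# Hypothetical toy document and background probabilities
# (a partial view of a larger vocabulary).
doc_counts = {"the": 10, "of": 8, "topic": 4, "diversity": 3}
collection = {"the": 0.5, "of": 0.4, "topic": 0.001, "diversity": 0.001}
sparse = parsimonize(doc_counts, collection)
```

In this toy run, the frequent but general words are explained by the background and eventually pruned, so only the salient words keep probability mass. Under the same assumptions, the routine could be applied to word-document, word-topic, and topic-document distributions by choosing a suitable background for each.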

Our main contributions are:

• We propose a hierarchical re-estimation process for topic models to address two main problems, generality and impurity of topics, in estimating the topical diversity of text documents using a biologically inspired definition of diversity.
• We study the efficacy of each level of re-estimation and improve the accuracy of estimating topical diversity, outperforming the current state of the art on a publicly available dataset commonly used for evaluating document diversity.

For more details on the results of experiments, please take a look at our paper: