Reconstructing Conversation Threads in Email Corpora

Emails are not completely independent of each other, because an email may be written in response to another one. Detecting these dependencies can improve the quality of email data analysis. An email conversation thread is defined as "a topic-centric discussion unit that is composed of exchanged emails among the same group of people by reply or forwarding". Two structures can be assumed for conversation threads: a linear structure and a tree structure. In the linear structure, emails that belong to the same conversation are detected and arranged in chronological order, forming a flat sequence. However, because users can reply to any preceding email, a conversation often branches into several lines of discussion, so the thread is naturally tree-shaped. In the tree structure, a conversation thread is a rooted tree in which the first email is the root and replies or forwards appear as children of the email they respond to.
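The difference between the two structures can be illustrated with a minimal Python sketch. The `Email` class and its fields (`msg_id`, `sent_at`, `parent_id`) are hypothetical names chosen for this illustration, and the reply links are assumed to be already known; they are not part of any dataset format described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Email:
    msg_id: str
    sent_at: int               # hypothetical timestamp, e.g. seconds since epoch
    parent_id: Optional[str]   # None for the first email of the thread


def linear_thread(emails):
    """Flat view: message ids in chronological order."""
    return [e.msg_id for e in sorted(emails, key=lambda e: e.sent_at)]


def tree_thread(emails):
    """Tree view: (root id, mapping from an email id to the ids of its replies)."""
    children = defaultdict(list)
    root = None
    for e in sorted(emails, key=lambda e: e.sent_at):
        if e.parent_id is None:
            root = e.msg_id
        else:
            children[e.parent_id].append(e.msg_id)
    return root, dict(children)


# A small thread in which both "b" and "c" reply to "a", and "d" replies to "b".
thread = [
    Email("a", 1, None),
    Email("b", 2, "a"),
    Email("c", 3, "a"),
    Email("d", 4, "b"),
]

print(linear_thread(thread))  # ['a', 'b', 'c', 'd']
print(tree_thread(thread))    # ('a', {'a': ['b', 'c'], 'b': ['d']})
```

The linear view loses the information that "c" and "d" belong to different branches, which is exactly what the tree view keeps.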


We have proposed two different approaches to reconstructing conversation threads in an email corpus.

The first approach uses an evolutionary algorithm. We cast the problem of finding conversation threads as an optimization problem in which each conversation thread is represented by a single rooted tree, and the goal is to find the best forest (a "jungle") of rooted trees in the space of all possible rooted trees. We use Genetic Programming with a multi-objective fitness function to solve this problem.
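To make the search space more concrete, here is a rough Python sketch of how a candidate forest could be encoded and compared. It is only an illustration: the parent-vector encoding, the two objectives (content similarity and time gap), and the mutation operator are all assumptions made for this example and are not taken from the paper.

```python
import random


def is_valid_forest(parents):
    """parents[i] is the index of email i's parent, or None if email i is a root.
    Emails are assumed to be sorted by time, so a parent must have a smaller index."""
    return all(p is None or 0 <= p < i for i, p in enumerate(parents))


def objectives(parents, similarity, sent_at):
    """Two illustrative objectives (assumptions, not the paper's):
    mean content similarity over reply links (higher is better) and
    mean time gap over reply links (lower is better)."""
    links = [(i, p) for i, p in enumerate(parents) if p is not None]
    if not links:
        return 0.0, 0.0
    mean_sim = sum(similarity[i][p] for i, p in links) / len(links)
    mean_gap = sum(sent_at[i] - sent_at[p] for i, p in links) / len(links)
    return mean_sim, mean_gap


def dominates(a, b):
    """Pareto dominance for (maximize similarity, minimize gap)."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b


def mutate(parents, rng):
    """Re-attach one randomly chosen non-first email to an earlier email,
    or turn it into a root. Requires at least two emails."""
    child = list(parents)
    i = rng.randrange(1, len(child))
    child[i] = rng.choice([None] + list(range(i)))
    return child


# Toy data: three emails with made-up similarities and timestamps.
sent_at = [0, 10, 20]
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
chain = [None, 0, 1]   # email 2 replies to email 1
star = [None, 0, 0]    # email 2 replies to email 0

print(dominates(objectives(chain, similarity, sent_at),
                objectives(star, similarity, sent_at)))
```

An evolutionary search would keep a population of such candidates, apply operators like `mutate`, and retain the candidates that are not dominated by any other.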

For more details, please read this paper:

Our second approach to email conversation reconstruction is supervised. We introduce two new feature-enriched learning methods, LExLinC (Learning to Extract Linear Structures of Conversations) and LExTreC (Learning to Extract Tree Structures of Conversations), to reconstruct the linear and tree structures of conversation threads in email data.

In LExLinC, the problem is mapped to a graph clustering problem: first a weighted semantic network of emails is created, and then the emails are grouped into threads using a graph clustering algorithm.
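As a rough illustration of grouping emails over a weighted network, the Python sketch below drops weak edges and treats the remaining connected components as threads. This is a deliberately simple stand-in and not the clustering algorithm used in LExLinC; the function name `cluster_threads`, the edge weights, and the threshold are assumptions made for the example.

```python
def cluster_threads(n_emails, weighted_edges, threshold):
    """Group emails 0..n_emails-1 into threads.

    weighted_edges is an iterable of (i, j, weight) tuples. Edges whose weight
    is below the threshold are ignored; the connected components of what
    remains are returned as sorted lists of email indices.
    """
    parent = list(range(n_emails))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, w in weighted_edges:
        if w >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for e in range(n_emails):
        groups.setdefault(find(e), []).append(e)
    return sorted(groups.values())


# Made-up similarity weights between five emails.
edges = [(0, 1, 0.9), (1, 2, 0.7), (2, 3, 0.2), (3, 4, 0.8)]
print(cluster_threads(5, edges, threshold=0.5))  # [[0, 1, 2], [3, 4]]
```

Sorting each resulting group by sending time would then give the linear structure of each thread.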


In LExTreC, we break the problem down into finding paths of arguments in emails, treat it as a retrieval problem (Qmail/Dmail), and employ "Learning to Rank" to learn a model that fuses information from different features.
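The ranking step can be pictured with a small Python sketch. It reads Qmail as the query email and Dmail as a candidate earlier email, which is an interpretation rather than something stated here. The three features, the weight values, and the linear scoring function are hypothetical; in a Learning to Rank setting the weights would be learned from labeled threads instead of being fixed by hand.

```python
def rank_candidates(features_by_candidate, weights):
    """Return candidate email ids sorted by fused score, best first.

    features_by_candidate maps a candidate id to its feature values for one
    query email; weights holds one weight per feature.
    """
    def score(feats):
        return sum(w * f for w, f in zip(weights, feats))

    return sorted(
        features_by_candidate,
        key=lambda c: score(features_by_candidate[c]),
        reverse=True,
    )


# Hypothetical features: (text similarity, shared participants, time closeness).
weights = [0.6, 0.3, 0.1]  # assumed to come from a trained ranking model
candidates = {
    "m1": [0.2, 1.0, 0.5],
    "m2": [0.9, 1.0, 0.1],
    "m3": [0.4, 0.0, 0.9],
}

print(rank_candidates(candidates, weights))  # ['m2', 'm1', 'm3']
```

The top-ranked candidate is then the most plausible email that the query email responds to.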


Then, we reconstruct conversation trees by integrating explicit and implicit information from the extracted arguments.


For more details, please take a look at our paper:
