Email Datasets

To evaluate the effectiveness of my proposed conversation thread reconstruction method on multilingual emails, I created several datasets as part of my Master's thesis. Below are brief descriptions of these datasets and links for downloading them:

The ConThread-BC3 Corpus

Description: The ConThread-BC3 Corpus is a specially prepared portion of the W3C corpus that consists of 40 conversation threads. ConThread-BC3 is a copy of the BC3 email dataset annotated with conversation thread information. Furthermore, the body text of every email has been manually segmented into main content and quoted parts, and the quoted parts are annotated with "quote level" information.

The Multilingual-BC3 Corpus

Description: The Multilingual-BC3 Corpus is a dataset comprising multilingual conversations that took place via email. Multilingual-BC3 is not a genuinely multilingual dataset: it was constructed by manually translating parts of the BC3 dataset. To simulate real multilingual conversations, these parts were translated according to a particular policy. The Multilingual-BC3 conversations are in two languages, Persian and English.

The BC3-Network Corpus

Description: The BC3-Network Corpus is a directed graph representing the social network of email communication. In this graph, nodes represent participants, and for each email in the dataset, the node that represents the sender of the email is connected to the nodes that represent the recipients of that email.
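
The graph structure described above can be illustrated with a minimal Python sketch. It is only an illustration, not the format in which the corpus is distributed: the record fields `sender` and `recipients`, the function name `build_email_graph`, and the example addresses are all hypothetical, and the sketch assumes that each edge points from the sender to a recipient.

```python
from collections import defaultdict


def build_email_graph(emails):
    """Build a directed graph as an adjacency map: sender -> set of recipients.

    `emails` is an iterable of dicts with hypothetical keys
    "sender" (str) and "recipients" (list of str).
    """
    graph = defaultdict(set)
    for email in emails:
        sender = email["sender"]
        graph[sender]  # touch the key so the sender exists as a node
        for recipient in email["recipients"]:
            graph[sender].add(recipient)  # assumed edge direction: sender -> recipient
            graph[recipient]  # make sure the recipient exists as a node too
    return dict(graph)


# Made-up example records, not taken from the corpus.
emails = [
    {"sender": "alice@example.org",
     "recipients": ["bob@example.org", "carol@example.org"]},
    {"sender": "bob@example.org",
     "recipients": ["alice@example.org"]},
]

graph = build_email_graph(emails)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Storing the neighbours in a set means that repeated emails between the same pair of participants yield a single edge; if the number of emails exchanged matters, a counter per edge could be used instead.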
