ConThread-BC3 v1.0: Conversation Threads Annotated BC3 Email Dataset

The ConThread-BC3 Corpus was compiled by Mostafa Dehghani at the Intelligent Information Systems Lab, University of Tehran. It is built on data from the BC3 Dataset.

The BC3 email dataset is a special preparation of a portion of the W3C corpus and consists of 40 conversation threads. ConThread-BC3 is a copy of the BC3 email dataset annotated with conversation thread information. This information was extracted from the W3C email dataset and manually cleaned. Furthermore, the body text of each email has been manually segmented into main content and quoted parts, and the quoted parts are annotated with "quote level" information.
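
To illustrate what "quote level" means, here is a minimal Python sketch. It assumes the common plain-text email convention in which each leading ">" marks one level of quoting. The corpus annotations were produced manually, and their storage format is not described here, so the function names below (quote_level, segment_body) are hypothetical and not part of the corpus tooling.

```python
import re

# Assumption: quoted lines start with one or more '>' markers,
# optionally separated by whitespace (e.g. "> > text" or ">> text").
_QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")


def quote_level(line: str) -> int:
    """Return the number of leading '>' markers (0 means main content)."""
    match = _QUOTE_PREFIX.match(line)
    return match.group(0).count(">") if match else 0


def segment_body(body: str) -> list:
    """Pair every line of an email body with its quote level."""
    return [(quote_level(line), line) for line in body.splitlines()]


if __name__ == "__main__":
    sample = "I agree.\n> Should we ship it?\n> > Is the draft ready?"
    for level, text in segment_body(sample):
        print(level, text)
```

A heuristic like this only approximates the manual segmentation, since real emails also quote text through attribution lines and forwarded-message blocks that carry no ">" prefix.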


ConThread-BC3 Statistics:

  • Number of all Conversation Threads: 40
  • Number of Emails: 261
  • Average Number of Emails per Thread: 6.525
  • Number of Emails in the Longest Conversation Thread: 11
  • Maximum Node Degree: 6
  • Maximum Conversation Tree Depth: 6
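
The average above is simply 261 / 40 = 6.525. For the tree-based figures, the following Python sketch shows one way such numbers could be computed for a single thread. It rests on assumptions that this description does not confirm: each thread is given as a hypothetical mapping from email ID to parent email ID (None for the thread's first email), node degree means the number of direct replies, and depth is counted in reply hops from the root.

```python
from collections import defaultdict
from typing import Dict, Optional


def thread_stats(parent_of: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Compute size, max node degree, and max depth of one reply tree.

    parent_of maps each email ID to the ID of the email it replies to,
    or None for the root. This input shape is an assumption, not the
    corpus's actual file format.
    """
    children = defaultdict(list)
    roots = []
    for email_id, parent_id in parent_of.items():
        if parent_id is None:
            roots.append(email_id)
        else:
            children[parent_id].append(email_id)

    # Iterative depth-first walk; depth counts reply hops from the root.
    max_depth = 0
    stack = [(root, 0) for root in roots]
    while stack:
        email_id, depth = stack.pop()
        max_depth = max(max_depth, depth)
        for child in children.get(email_id, []):
            stack.append((child, depth + 1))

    # Degree here means number of direct replies to an email.
    max_degree = max((len(c) for c in children.values()), default=0)
    return {
        "emails": len(parent_of),
        "max_degree": max_degree,
        "max_depth": max_depth,
    }


if __name__ == "__main__":
    toy_thread = {"m1": None, "m2": "m1", "m3": "m1", "m4": "m2"}
    print(thread_stats(toy_thread))
    print(261 / 40)
```

If degree were instead defined to include the link to the parent, or depth counted in emails rather than hops, the results would each be one higher for non-trivial threads.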

Citing the ConThread-BC3 Corpus:

When citing or discussing the ConThread-BC3 corpus, please reference these papers:


Thanks to Arash Koushkestani for making this corpus preparation possible!


Creative Commons License
The ConThread-BC3 Corpus by Mostafa Dehghani is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at

Here you can download ConThread-BC3-1.0


If you have any questions, ideas or suggestions, please do not hesitate to contact me!