The BC3 email dataset is a special preparation of a portion of the W3C corpus that consists of 40 conversation threads. ConThread-BC3 is a copy of the BC3 email dataset annotated with conversation thread information. This information was extracted from the W3C email dataset and manually cleaned. Furthermore, the body text of each email has been manually segmented into main content and quoted parts, and the quoted parts are annotated with "quote level" information.
- Number of all Conversation Threads: 40
- Number of Emails: 261
- Average Number of Emails per Thread: 6.525
- Number of Emails in The Longest Conversation Thread: 11
- Maximum Node Degree: 6
- Max Conversation Tree Depth: 6
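The statistics above can be recomputed from the thread annotations. The following is a minimal sketch, not the tooling used to build the corpus: it assumes each thread has already been loaded into a hypothetical dictionary mapping an email ID to its parent's ID (`None` for the thread root), since the on-disk format of ConThread-BC3 is not described here. It also assumes that "node degree" means the number of direct replies to an email and that depth is counted in reply edges from the root; the figures above may use different conventions.

```python
from collections import defaultdict


def thread_stats(parents):
    """Stats for one thread; `parents` maps email id -> parent id (None for the root).

    The `parents` layout is a hypothetical in-memory form, not the corpus file format.
    """
    children = defaultdict(list)
    roots = []
    for email_id, parent_id in parents.items():
        if parent_id is None:
            roots.append(email_id)
        else:
            children[parent_id].append(email_id)

    # Assumption: node degree = number of direct replies to an email.
    max_degree = max((len(replies) for replies in children.values()), default=0)

    # Iterative depth-first walk; assumption: depth counts reply edges from the root.
    depth = 0
    stack = [(root, 0) for root in roots]
    while stack:
        email_id, level = stack.pop()
        depth = max(depth, level)
        stack.extend((reply, level + 1) for reply in children[email_id])

    return {"emails": len(parents), "max_degree": max_degree, "depth": depth}


def corpus_stats(threads):
    """Aggregate per-thread stats over a non-empty list of `parents` dictionaries."""
    per_thread = [thread_stats(t) for t in threads]
    total_emails = sum(s["emails"] for s in per_thread)
    return {
        "threads": len(per_thread),
        "emails": total_emails,
        "avg_emails_per_thread": total_emails / len(per_thread),
        "longest_thread": max(s["emails"] for s in per_thread),
        "max_degree": max(s["max_degree"] for s in per_thread),
        "max_depth": max(s["depth"] for s in per_thread),
    }


# Toy data for illustration only; these are not threads from the corpus.
toy_threads = [
    {"a": None, "b": "a", "c": "a", "d": "b"},
    {"x": None, "y": "x"},
]
print(corpus_stats(toy_threads))
```

Applied to the full corpus, the average reported above follows directly from the first two counts: 261 emails over 40 threads gives 261 / 40 = 6.525.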
Citing the ConThread-BC3 Corpus:
When citing or discussing the ConThread-BC3 corpus, please reference these papers:
- Mostafa Dehghani, A. Shakery, M. Asadpour, and A. Koushkestani, "A Learning Approach for Email Conversation Thread Reconstruction", Journal of Information Science (JIS), Volume 39, Issue 6, 2013, pp. 846-863. [ACM-DL Link]
- Mostafa Dehghani, M. Asadpour, and A. Shakery, "An Evolutionary-Based Method for Reconstructing Conversation Threads in Email Corpora", In Proceedings of the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM'12), 2012. [ACM-DL Link]
Thanks to Arash Koushkestani for making this corpus preparation possible!
The ConThread-BC3 Corpus by Mostafa Dehghani is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Based on a work at http://www.cs.ubc.ca/labs/lci/bc3.html. Here you can download ConThread-BC3-1.0
If you have any questions, ideas or suggestions, please do not hesitate to contact me!