Text alignment is a sub-task of plagiarism detection. Its aim is to detect pairs of related regions in two distinct documents. How the two regions are related, and how long they should be, depends on the application. When text alignment is used for plagiarism detection, the regions may be related not only conceptually and semantically but also in their lexical and grammatical structure. As for length, in plagiarism detection it is reasonable to require regions at least as long as a short paragraph, whereas in parallel corpus construction a region may be as short as a single sentence.
As a team from the IIS lab at the University of Tehran, we participated in the PAN2014 plagiarism detection challenge. Like previous work, our approach maps text alignment to the problem of subsequence matching. We have built a framework that lets us combine different feature types and different strategies for merging the features.
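To make the subsequence-matching view concrete, here is a minimal sketch of one common way to produce candidate matches: index the word n-grams of the source document and look up the n-grams of the suspicious document. It is only an illustration, not the framework's actual code. The function names, whitespace tokenization, and the default `n=3` are all assumptions.

```python
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return (position, n-gram) pairs for a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [(i, tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]


def seed_matches(suspicious, source, n=3):
    """Return sorted (suspicious_pos, source_pos) pairs whose word n-grams are identical.

    Hypothetical helper: positions are token offsets, and exact n-gram equality
    stands in for whatever feature comparison is actually used.
    """
    # Index every n-gram of the source document by its token position.
    index = defaultdict(list)
    for pos, gram in word_ngrams(source, n):
        index[gram].append(pos)

    # Look up each n-gram of the suspicious document in that index.
    matches = []
    for pos, gram in word_ngrams(suspicious, n):
        for src_pos in index.get(gram, []):
            matches.append((pos, src_pos))
    return sorted(matches)
```

Each returned pair is a point in a two-dimensional space (position in the suspicious document, position in the source document). Runs of nearby points suggest an aligned region.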
We propose two different solutions that relax the comparison of two documents so that semantic relations between them are taken into account. The first defines a new feature type that carries semantic information about its document. The second is a new method for comparing features that considers their semantic relations. Finally, we apply the DBSCAN clustering algorithm to merge features that lie in the same neighborhood in both the source and the suspicious document. Our experiments indicate that different feature sets are suitable for detecting different types of plagiarism.
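The merging step can be illustrated with a small, self-contained DBSCAN over matched positions. This is a sketch under stated assumptions, not the implementation used in our system. It assumes each match is a `(suspicious_pos, source_pos)` pair of token offsets and that distance is Chebyshev distance. The `eps` and `min_pts` values are arbitrary, and `merge_regions` is a hypothetical helper name.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D points using Chebyshev distance (an assumption).

    Returns one label per point; -1 marks noise. min_pts counts the point itself.
    """
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if max(abs(xi - xj), abs(yi - yj)) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


def merge_regions(matches, eps=5, min_pts=2):
    """Turn each cluster of matches into a pair of (start, end) spans; noise is dropped."""
    labels = dbscan(matches, eps, min_pts)
    clusters = {}
    for point, label in zip(matches, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    regions = []
    for pts in clusters.values():
        susp = [p[0] for p in pts]
        src = [p[1] for p in pts]
        regions.append(((min(susp), max(susp)), (min(src), max(src))))
    return sorted(regions)
```

For example, `merge_regions([(10, 100), (12, 103), (15, 105), (80, 20), (82, 22), (300, 500)])` groups the first three matches into one region and the next two into another, and discards the isolated last match as noise. Because a cluster must be dense in both coordinates, matches that are close in one document but far apart in the other are not merged.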
For more details about our approach, see the following article:
- S. Abnar, M. Dehghani, H. Zamani, and A. Shakery, "Expanded N-Grams for Semantic Text Alignment", Notebook for PAN at Conference and Labs of the Evaluation Forum (PAN'14), 2014, pp. 928-938.