This project is one of the Persian Corpus Creation projects in the IIS Lab, University of Tehran.
Comparable corpora have been identified as a key resource for obtaining translation knowledge. Because the available comparable corpora tend to be domain specific, more attention is needed to creating multi-domain corpora. This project constructs a Persian-English comparable corpus from Web data, based on topic criteria. We introduce a new approach to creating a domain-spanning corpus. The method extracts queries in different domains with the help of the hierarchies provided by the Open Directory Project (ODP). The comparable candidates in the target language are the documents that search engines retrieve in response to the translated queries. Because publication dates are not available, the proposed method employs a Learning to Rank algorithm to align documents in different languages. The algorithm uses cross-lingual features to score alignment candidates, and the model is learned from a small parallel corpus. Experimental results on the parallel corpora and on the created comparable corpora support the validity of the proposed method.
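The alignment step can be illustrated with a minimal Python sketch. It is not the project's implementation: the two features (dictionary coverage and length ratio), the toy bilingual lexicon, the pairwise perceptron update, and all function names are assumptions chosen only to show how a ranking model trained on known-aligned pairs could score alignment candidates with cross-lingual features.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy bilingual lexicon: source term -> possible English translations.
Lexicon = Dict[str, Set[str]]


def features(src: Sequence[str], tgt: Sequence[str], lexicon: Lexicon) -> List[float]:
    """Two illustrative cross-lingual features for a (source, target) document pair."""
    tgt_set = set(tgt)
    # Fraction of source tokens with at least one translation found in the target.
    covered = sum(1 for tok in src if lexicon.get(tok, set()) & tgt_set)
    coverage = covered / len(src) if src else 0.0
    # Ratio of the shorter document length to the longer one (1.0 means equal length).
    length_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt), 1)
    return [coverage, length_ratio]


def score(weights: Sequence[float], feats: Sequence[float]) -> float:
    """Linear scoring function over the feature vector."""
    return sum(w * f for w, f in zip(weights, feats))


def train_pairwise(
    training: Sequence[Tuple[Sequence[str], Sequence[str], Sequence[Sequence[str]]]],
    lexicon: Lexicon,
    epochs: int = 10,
    lr: float = 0.1,
) -> List[float]:
    """Pairwise perceptron: push the aligned document above each non-aligned one.

    Each training item is (source, aligned_target, [non_aligned_targets]),
    as could be derived from a small parallel corpus.
    """
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for src, pos, negs in training:
            f_pos = features(src, pos, lexicon)
            for neg in negs:
                f_neg = features(src, neg, lexicon)
                if score(weights, f_pos) <= score(weights, f_neg):
                    weights = [w + lr * (p - n) for w, p, n in zip(weights, f_pos, f_neg)]
    return weights


def rank_candidates(
    src: Sequence[str],
    candidates: Sequence[Sequence[str]],
    weights: Sequence[float],
    lexicon: Lexicon,
) -> List[int]:
    """Return candidate indices ordered from best to worst alignment score."""
    scored = [(score(weights, features(src, c, lexicon)), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy data with transliterated Persian tokens (illustrative only).
lexicon = {"ketab": {"book"}, "khaneh": {"house", "home"}}
source_doc = ["ketab", "khaneh"]
aligned_doc = ["book", "house"]
unrelated_doc = ["car", "road"]

weights = train_pairwise([(source_doc, aligned_doc, [unrelated_doc])], lexicon)
ranking = rank_candidates(source_doc, [unrelated_doc, aligned_doc], weights, lexicon)
```

In this toy run the aligned document is ranked first because only the coverage feature separates the two candidates. A real system would use richer features and an established Learning to Rank library.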
The results of this project are reported in the following paper:
- R. Rahimi, A. Shakery, J. Dadashkarimi, M. Ariannezhad, M. Dehghani, and H. N. Esfahani. "Building a Multi-Domain Comparable Corpus Using a Learning to Rank Method". Journal of Natural Language Engineering, Cambridge University Press, Volume 22, Special Issue 04, 2016, pp. 627-653.