Building a multi-domain comparable corpus using a learning to rank method

Our paper "Building a multi-domain comparable corpus using a learning to rank method", written with Razieh Rahimi, Azadeh Shakery, Javid Dadashkarimi, Mozhdeh Ariannezhad, and Hossein Nasr Esfahani, has been published in the Journal of Natural Language Engineering. \o/

The multilingual nature of the Web makes interlingual translation a crucial requirement for information management applications. Manually constructed bilingual dictionaries are the most basic resources available for translation, but out-of-vocabulary words, phrases, and neologisms keep them from providing sufficient coverage. This naturally motivates data-driven translation systems. Machine translation systems need bilingual training data, which plays a central role in the accuracy of the translations they produce; providing appropriate training data is therefore an important factor for machine translation. Parallel corpora are high-quality training data for such systems. However, building these corpora with suitable coverage of different domains is costly.

HELLO in eight different languages

Comparable corpora are key resources for translation systems targeting languages and domains that lack linguistic resources. Aligned documents in a comparable corpus describe the same or similar topics. Because these alignments are looser than those of parallel corpora, they can be created at lower cost. This makes it practical to build domain-specific comparable corpora or to expand translation resources for language pairs with limited linguistic resources. Comparable corpora can be used to obtain translation knowledge directly, or to train machine translation systems after parallel sentences/phrases are extracted from them.

The most common approach for building comparable corpora is based on Cross-Language Information Retrieval (CLIR). For each document in one corpus (the source document), a driver query is generated. Documents of the other corpus (target documents) are then ranked against the driver query using a cross-lingual retrieval model. Next, source and target documents whose similarity score exceeds a predefined threshold constitute an alignment. Finally, heuristics such as matched named entities, publication dates, and the ratio of document lengths are used to select the most reliable alignments.
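The threshold-based CLIR pipeline described above can be sketched as follows. The term-frequency driver query and the overlap score here are illustrative stand-ins for whichever query-generation and retrieval models a real system would use:

```python
from collections import Counter

def driver_query(source_doc, dictionary, k=10):
    """Form a driver query from the most frequent source terms that
    have a dictionary translation (a common term-frequency heuristic)."""
    counts = Counter(source_doc.split())
    query = []
    for term, _ in counts.most_common():
        if term in dictionary:
            query.extend(dictionary[term])  # translated query terms
        if len(query) >= k:
            break
    return query[:k]

def overlap_score(query, target_doc):
    """Toy retrieval score: fraction of query terms found in the target."""
    terms = set(target_doc.split())
    return sum(t in terms for t in query) / max(len(query), 1)

def align(source_docs, target_docs, dictionary, score, threshold):
    """Rank target documents for each source document and keep the
    best pair when its CLIR score exceeds the threshold."""
    alignments = []
    for sid, sdoc in source_docs.items():
        query = driver_query(sdoc, dictionary)
        ranked = sorted(
            ((score(query, tdoc), tid) for tid, tdoc in target_docs.items()),
            reverse=True,
        )
        best_score, best_tid = ranked[0]
        if best_score >= threshold:
            alignments.append((sid, best_tid, best_score))
    return alignments
```

In a full system the reliability heuristics (publication dates, length ratios, and so on) would filter these candidate alignments further.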

Using different criteria to measure the similarity between source and target documents improves the accuracy of alignments. However, aligning documents by considering CLIR scores, named entities, and other available features (such as publication dates) separately provides no flexibility or control over the impact of each feature on document similarities. This matters because how reliably a heuristic or feature indicates the degree of document similarity depends on the language pair of the corpora. For example, features such as named entities, numbers, and events are important indicators of comparability for languages sharing many cognates. On the other hand, when the alphabets of the two languages are entirely different, matching named entities and events can introduce as much noise as matching content words, so these features are not as reliable for such languages as they are for closely related ones.


In this research, we propose an approach to building high-quality comparable corpora. The approach employs a learning-based ranking algorithm to derive an alignment model, and training this model requires cross-lingual features and training data. We derive features based on cross-lingual information retrieval, cross-lingual document similarity, and named entities. To provide appropriate training data for the learning-to-rank algorithm, we simulate a comparable corpus from an available parallel corpus, taking into account two properties of comparable corpora: the varying degrees of similarity between alignments and the differing lengths of aligned documents.
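The simulation step could look roughly like the sketch below, which degrades parallel document pairs so the targets resemble comparable (rather than parallel) counterparts. The exact procedure in the paper differs; `drop_prob` and the noise splicing are hypothetical knobs for varying the degree of similarity and the document lengths:

```python
import random

def simulate_comparable(parallel_pairs, drop_prob=0.3, noise_docs=None, seed=0):
    """Degrade parallel pairs (lists of sentences) into loosely aligned
    ones by dropping target sentences and splicing in off-topic text."""
    rng = random.Random(seed)
    noise_docs = noise_docs or []
    simulated = []
    for src_sents, tgt_sents in parallel_pairs:
        # Drop some target sentences to weaken the alignment ...
        kept = [s for s in tgt_sents if rng.random() > drop_prob]
        # ... and optionally splice in off-topic sentences to change length.
        if noise_docs:
            extra = rng.choice(noise_docs)
            kept.extend(rng.sample(extra, k=min(1, len(extra))))
        # Never emit an empty target document.
        simulated.append((src_sents, kept or tgt_sents[:1]))
    return simulated
```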

The proposed method has the following advantages:

  1. Our proposed approach facilitates the integration of any signal on the similarity of a specific language pair into the process of building comparable corpora.
  2. Adopting a learning-to-rank method for ranking target documents with respect to a source document enables learning the weights of various similarity features for a specific language pair.
  3. The incorporated cross-lingual features allow target documents to be ranked not only by matching translated driver queries, but also by cross-lingual document similarities and named entities.
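A minimal sketch of points (2) and (3): a linear scoring function over illustrative cross-lingual features, with a pairwise-perceptron-style update standing in for the actual learning-to-rank algorithm used in the paper:

```python
def features(clir_score, cosine_sim, src_entities, tgt_entities):
    """Cross-lingual feature vector for a (source, target) document
    pair: retrieval score, document similarity, named-entity overlap."""
    union = len(src_entities | tgt_entities) or 1
    return [clir_score, cosine_sim, len(src_entities & tgt_entities) / union]

def rank_targets(candidates, weights):
    """Score candidate targets with a linear model and rank them;
    the weights would be learned separately per language pair."""
    scored = {tid: sum(w * f for w, f in zip(weights, fv))
              for tid, fv in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

def pairwise_update(weights, pos_fv, neg_fv, lr=0.1):
    """One pairwise-ranking step: if the aligned (positive) target is
    not scored above a non-aligned one, nudge the weights toward it."""
    margin = sum(w * (p - n) for w, p, n in zip(weights, pos_fv, neg_fv))
    if margin <= 0:
        weights = [w + lr * (p - n)
                   for w, p, n in zip(weights, pos_fv, neg_fv)]
    return weights
```

Because the weights are learned, an unreliable signal for a given language pair (say, named-entity matching across unrelated alphabets) can be assigned a small weight automatically rather than being hard-coded as a heuristic.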

We exploit the proposed approach to build two comparable corpora. The first is built from websites listed in the Open Directory Project (ODP) across different domains, with a collection of documents crawled from Persian websites serving as the target collection. The second is built from two independent news collections: BBC news in English and Hamshahri news in Persian.

For more details, please take a look at our paper: