Mixed-language and multilingual text on the Web is growing rapidly. Processing this type of data poses additional challenges compared to monolingual text.
As part of my MSc thesis, Email Management in Multilingual Environments, I focused on finding an efficient way to measure the similarity of mixed-language and multilingual documents.
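In order to process multilingual emails, we need to deal with two types of multilinguality:
- Intra-Email Multilinguality (mixed-language emails, where a single email contains more than one language)
- Inter-Email Multilinguality (different emails written in different languages)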
To process multilingual email data, or multilingual documents in general, within the language modeling framework, we proposed to estimate a “Mixed Language Model” for multilingual or mixed-language documents. In loose terms, this method integrates translation knowledge into the document representation. The new representation of a document is independent of its language and can be used with any monolingual document processing method.
The Mixed Language Model for a document is a distribution over terms that assigns probabilities to all terms of all languages in the collection. Suppose that a multilingual collection $C$ contains vocabulary in $n$ different languages, $l_1$ to $l_n$.
We define the Probabilistic Count:
and we estimate the document's Mixed Language Model using this formula:
Here is an example of how the estimation process works:
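As a rough illustration of the idea, the sketch below builds a language-independent term distribution in Python. It is not the exact formulation from the paper. It assumes that the probabilistic count of a term is the sum of the counts of the document's terms, each weighted by the probability of translating into that term, and that the model is obtained by normalizing these counts. The translation table, its probabilities, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical translation table: TRANSLATION[source][target] = p(target | source).
# Each term also maps to itself so that it keeps some mass in its own language.
TRANSLATION = {
    "house": {"house": 0.5, "maison": 0.5},
    "maison": {"maison": 0.5, "house": 0.5},
    "cat": {"cat": 0.5, "chat": 0.5},
    "chat": {"chat": 0.5, "cat": 0.5},
}


def probabilistic_counts(tokens, translation):
    """Spread each observed term count over its translations (assumed definition)."""
    counts = Counter(tokens)
    pc = {}
    for source, count in counts.items():
        # Assumption: a term with no table entry keeps all of its mass.
        for target, prob in translation.get(source, {source: 1.0}).items():
            pc[target] = pc.get(target, 0.0) + count * prob
    return pc


def mixed_language_model(tokens, translation):
    """Normalize probabilistic counts into a distribution over terms of all languages."""
    pc = probabilistic_counts(tokens, translation)
    total = sum(pc.values())
    if total == 0:
        return {}
    return {term: value / total for term, value in pc.items()}


english_doc = ["house", "cat", "house"]
mixed_doc = ["maison", "cat", "house"]  # mixes French and English

model_en = mixed_language_model(english_doc, TRANSLATION)
model_mixed = mixed_language_model(mixed_doc, TRANSLATION)

for term in sorted(model_en):
    print(term, round(model_en[term], 3), round(model_mixed[term], 3))
```

With this toy table, the English document and the mixed-language document end up with the same distribution over English and French terms, which is the sense in which the representation is independent of the document's language. A real system would replace the toy table with translation probabilities estimated from a dictionary or parallel data, and would likely add smoothing.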
More details about this model are available in this paper:
Razieh Rahimi, Azadeh Shakery, and Irwin King. “Multilingual information retrieval in the language modeling framework.” Information Retrieval Journal 18.3 (2015): 246-281.