Mixed-Language and Multilingual Document Processing

The amount of mixed-language and multilingual text on the Web is growing rapidly. Processing this type of data poses additional challenges compared to monolingual text.

As part of my MSc thesis, Email Management in Multilingual Environments, I focused on finding an efficient way to measure the similarity of mixed-language and multilingual documents.

To process multilingual emails, we need to deal with two types of multilinguality:

• Intra-Email Multilinguality (mixed-language emails)
• Inter-Email Multilinguality (different emails written in different languages)

To process multilingual email data, or more generally multilingual documents, in a language modeling framework, we proposed estimating a “Mixed Language Model” for multilingual or mixed-language documents. In loose terms, this method integrates translation knowledge into the document representation. The new representation of a document is independent of its language and can be used with any monolingual document processing method.

The Mixed Language Model for a document is a distribution over terms that assigns probabilities to all terms of all languages in the collection. Suppose that a multilingual collection ($Latex formula$) contains vocabulary in ($Latex formula$) different languages ($Latex formula$) ($Latex formula$ to $Latex formula$).

We define the Probabilistic Count:

$Latex formula$

and we estimate $Latex formula$, i.e., the Mixed Language Model, using this formula:

$Latex formula$
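
The general idea can be illustrated with a small Python sketch. It is only one plausible reading, not the exact formulas from the thesis: it assumes that the probabilistic count of a term is its raw count in the document plus the counts of the document's other terms weighted by a translation probability, and that the Mixed Language Model is the normalized probabilistic count. The translation table, its values, and the function names are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical translation table (made-up values, for illustration only):
# TRANSLATION[source_term][target_term] = assumed P(target_term | source_term)
TRANSLATION = {
    "house": {"haus": 0.9, "heim": 0.1},
    "haus": {"house": 0.8, "home": 0.2},
}


def probabilistic_counts(tokens, translation):
    """Assumed form of the probabilistic count.

    Each observed term contributes its raw count to itself and
    count * P(target | source) to each of its translations.
    """
    raw = Counter(tokens)
    counts = Counter()
    for term, count in raw.items():
        counts[term] += count
        for target, prob in translation.get(term, {}).items():
            counts[target] += count * prob
    return counts


def mixed_language_model(tokens, translation):
    """Normalize the probabilistic counts into a distribution over terms."""
    counts = probabilistic_counts(tokens, translation)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty document: no distribution to estimate
    return {term: count / total for term, count in counts.items()}


# An English document also receives probability mass on German terms.
model = mixed_language_model(["house", "house", "garden"], TRANSLATION)
```

Under these assumptions the resulting distribution covers terms of both languages, so an English and a German document can be compared with the same monolingual similarity measure.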

Here is an example of how the estimation process works: