Mixed-Language and Multilingual Document Processing

Mixed-language and multilingual text is growing rapidly on the Web. Processing this kind of data poses additional challenges compared to monolingual text.

As part of my MSc thesis, Email Management in Multilingual Environments, I focused on finding an efficient way to measure the similarity of mixed-language and multilingual documents.

To process multilingual emails, we need to deal with two types of multilinguality:

  • Intra-Email Multilinguality (mixed-language emails)
  • Inter-Email Multilinguality

To process multilingual email data, or multilingual documents in general, within the language modeling framework, we proposed estimating a "Mixed Language Model" for multilingual or mixed-language documents. Loosely speaking, this method integrates translation knowledge into the document representation. The new representation is independent of the document's language and can be used in any monolingual document processing method.

The Mixed Language Model for a document is a distribution over terms that assigns probabilities to all terms of all languages in the collection. Suppose that the multilingual collection C contains vocabulary in N different languages l_i (i = 1, ..., N).

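We define the probabilistic count of a term w in a document D, where c(u, D) is the count of term u in D and p(w|u) is the probability of translating u into w: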

c_{p}(w, D) = \sum_{u \in D}{p(w|u)c(u,D)}

and we estimate the Mixed Language Model, p_{ML}(w|\hat{\theta}_{D}), using this formula:

p_{ML}(w|\hat{\theta}_{D}) = \frac{c_{p}(w, D)}{N|D|}
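
One way to read the N|D| denominator is as the total probabilistic count mass of the document. This is only a sketch under two assumptions that are not stated above: the translation probabilities of each term u sum to 1 within the vocabulary of each language (written here with the hypothetical symbol V_{l_i}), and |D| is the total number of term occurrences in D. Under those assumptions:

\sum_{w} c_{p}(w, D) = \sum_{u \in D}{c(u,D) \sum_{i=1}^{N}{\sum_{w \in V_{l_i}}{p(w|u)}}} = N \sum_{u \in D}{c(u,D)} = N|D|

so dividing by N|D| makes the probabilities p_{ML}(w|\hat{\theta}_{D}) sum to 1 over all terms of all languages.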

Here is an example of how the estimation process works:
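
As a minimal sketch of the two formulas above, the following Python snippet estimates a Mixed Language Model for a tiny mixed English-French document. The translation table, its probabilities, and the function names are all hypothetical and chosen only for illustration; the sketch also assumes that each term translates to itself with probability 1 in its own language.

```python
from collections import Counter, defaultdict

# Hypothetical toy translation table: TRANSLATION_PROBS[u][w] = p(w | u).
# Assumption: each term translates to itself with probability 1 in its own
# language, and its translations into the other language sum to 1.
TRANSLATION_PROBS = {
    "house": {"house": 1.0, "maison": 0.75, "domicile": 0.25},
    "chat": {"chat": 1.0, "cat": 1.0},
}
NUM_LANGUAGES = 2  # N: English and French in this toy collection


def probabilistic_counts(doc_terms, translation_probs):
    """c_p(w, D) = sum over u in D of p(w | u) * c(u, D)."""
    term_counts = Counter(doc_terms)  # c(u, D)
    c_p = defaultdict(float)
    for u, count_u in term_counts.items():
        for w, prob in translation_probs[u].items():
            c_p[w] += prob * count_u
    return dict(c_p)


def mixed_language_model(doc_terms, translation_probs, num_languages):
    """p_ML(w | theta_D) = c_p(w, D) / (N * |D|)."""
    c_p = probabilistic_counts(doc_terms, translation_probs)
    normalizer = num_languages * len(doc_terms)  # N * |D|
    return {w: count / normalizer for w, count in c_p.items()}


# A mixed-language document with two English tokens and one French token.
document = ["house", "house", "chat"]
model = mixed_language_model(document, TRANSLATION_PROBS, NUM_LANGUAGES)
for term, prob in sorted(model.items()):
    print(f"{term}: {prob:.3f}")
```

In this toy setup the resulting distribution covers terms of both languages, regardless of which language each token of the document was written in, and the probabilities sum to 1.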


More details about this model are available in this paper:

Razieh Rahimi, Azadeh Shakery, and Irwin King. "Multilingual information retrieval in the language modeling framework." Information Retrieval Journal, 18(3): 246-281, 2015.

