Persian Linguistics Resources

Persian

Persian is an Iranian language within the Indo-Iranian branch of the Indo-European language family. Farsi (فارسی, fārsi) is the local name of the language in Iran and is sometimes used in English in place of "Persian". Parsi (پارسی, pārsi) is a variant of this name. There are approximately 110 million Persian speakers worldwide, and the language has official status in Iran, Afghanistan and Tajikistan.

Datasets:

Dictionaries:

Anthologies:

Normalizer:

  • UT-PersianNormalizer Version 0.2
    • Description: UT-PersianNormalizer is a simple text-cleaning tool written in Java for normalizing Persian text.
      In UT-PersianNormalizer-0.2:

      • Letters that have more than one form (e.g. letters shared by Arabic and Persian) are converted to the Persian form, using the Persian Unicode code points.
      • Arabic and English digits are converted to Persian digits.
      • All types of space character are replaced with a single, uniform space character.
      • Text symbols (such as ?, %, -) are converted to their standard Persian forms.
    • Author: Mostafa Dehghani
      Download: UT-PersianNormalizer 0.2
      NOTE: UT-PersianNormalizer is open source and licensed under the MIT license.
  • PrePer: Persian Pre-processor
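
The normalization steps listed for UT-PersianNormalizer above can be illustrated with a minimal Java sketch. This is not the tool's actual source code or API: the class name PersianNormalizerSketch, the normalize method and the particular character table are hypothetical and chosen only for illustration, and the real tool may cover many more characters and symbols.

```java
import java.util.Map;

// Hypothetical illustration of the kind of normalization described above.
// This is NOT the UT-PersianNormalizer implementation; the mapping table
// below is an assumed, deliberately small sample.
public class PersianNormalizerSketch {

    // Assumed sample of letter and symbol mappings to Persian forms.
    private static final Map<Character, Character> CHAR_MAP = Map.of(
        '\u064A', '\u06CC', // Arabic yeh -> Persian yeh
        '\u0649', '\u06CC', // Arabic alef maksura -> Persian yeh
        '\u0643', '\u06A9', // Arabic kaf -> Persian kaf (keheh)
        '?', '\u061F',      // Latin question mark -> Arabic-script question mark
        '%', '\u066A',      // Latin percent sign -> Arabic-script percent sign
        ',', '\u060C'       // Latin comma -> Arabic-script comma
    );

    public static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                // English (ASCII) digit -> Persian digit
                c = (char) ('\u06F0' + (c - '0'));
            } else if (c >= '\u0660' && c <= '\u0669') {
                // Arabic-Indic digit -> Persian digit
                c = (char) ('\u06F0' + (c - '\u0660'));
            } else if (Character.isSpaceChar(c)) {
                // Any Unicode space separator (e.g. no-break space) -> plain space.
                // The zero-width non-joiner is not a space separator, so it is kept.
                c = ' ';
            } else {
                // Letters and symbols from the sample table; others pass through.
                c = CHAR_MAP.getOrDefault(c, c);
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Calling PersianNormalizerSketch.normalize on a string that contains an Arabic yeh, ASCII digits and a no-break space would return the same text with a Persian yeh, Persian digits and ordinary spaces. A character-by-character table lookup keeps the sketch easy to extend, but it assumes every mapping is one character to one character.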

Tokenizer:

Stemmers:

POS Taggers:

Other:

Learn to read/write/speak Persian: