Persian is an Iranian language within the Indo-Iranian branch of the Indo-European languages. Farsi (فارسی : fārsi) is the local name of the language in Iran and is sometimes used in English instead of the word Persian when referring to the language. Parsi (پارسی : pārsi) is a variant of this name. There are approximately 110 million Persian speakers worldwide, with the language holding official status in Iran, Afghanistan and Tajikistan.
Datasets:
- Hamshahri Dataset
- Bijankhan Corpus
- dotIR Collection
- IIS English-Persian Comparable Corpus
- TEP English-Persian Parallel Corpus
Dictionaries:
Anthologies:
Normalizer:
- UT-PersianNormalizer Version 0.2
Description: UT-PersianNormalizer is a simple text-cleaning tool, developed in Java, for normalizing Persian text.
In UT-PersianNormalizer-0.2:
- All letters that have different styles (e.g. the letters shared by Arabic and Persian) are converted to their Persian form with Persian Unicode code points.
- Arabic and English digits are all changed to Persian digits.
- All types of space are replaced with a single standard space character.
- All text symbols (such as ?, %, -, ..) are converted to standard Persian style.
Author: Mostafa Dehghani
Download: UT-PersianNormalizer 0.2
NOTE: UT-PersianNormalizer is open source and released under the MIT license.
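To make the normalization steps listed for UT-PersianNormalizer concrete, here is a minimal sketch in Python. It is not the tool's actual Java code: the function name `normalize`, the specific letter and symbol mappings, and the set of space characters are all assumptions chosen for illustration, and the real tool may cover more or different cases.

```python
import re

# Assumed letter mapping: Arabic forms of letters that Persian writes differently.
ARABIC_TO_PERSIAN_LETTERS = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Digit ranges: Persian (U+06F0..U+06F9), Arabic (U+0660..U+0669), ASCII.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
LATIN_DIGITS = "0123456789"

# Assumed symbol mapping; the tool's own table is not documented here.
SYMBOLS = {
    "?": "\u061F",  # ARABIC QUESTION MARK
    ",": "\u060C",  # ARABIC COMMA
    ";": "\u061B",  # ARABIC SEMICOLON
    "%": "\u066A",  # ARABIC PERCENT SIGN
}

# One translation table covering letters, digits, and symbols.
_TABLE = str.maketrans({
    **ARABIC_TO_PERSIAN_LETTERS,
    **dict(zip(ARABIC_DIGITS, PERSIAN_DIGITS)),
    **dict(zip(LATIN_DIGITS, PERSIAN_DIGITS)),
    **SYMBOLS,
})

# Runs of horizontal space variants (tab, no-break space, en/em spaces, ...).
# The zero-width non-joiner (U+200C) is deliberately NOT included.
_SPACES = re.compile("[ \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+")


def normalize(text: str) -> str:
    """Apply the character mappings, then collapse space runs to one space."""
    text = text.translate(_TABLE)
    return _SPACES.sub(" ", text)


if __name__ == "__main__":
    print(normalize("\u0643\u062A\u0627\u0628\u00A0\u00A0123?"))
```

One design choice in this sketch: the zero-width non-joiner (U+200C) is left untouched, since Persian orthography uses it inside words and it is not a space. Whether UT-PersianNormalizer treats it the same way is not stated above.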
- PrePer: Persian Pre-processor
Tokenizer:
Stemmers:
POS Taggers:
- TagPer: Persian Language Model for HunPoS
- Ferdowsi University Persian NLP plugin for GATE
- Persian POS-tagger based on Structured SVM
Other:
- Dadegan Research Group
- Question Answering System for Persian (includes some useful components for Persian language processing)
- Persian Simple/Compound Verbs
- Persian Dependency Parser
Learn to read/write/speak Persian: