Jon Dehdari



Some old Persian NLP projects I used to work on:

Here's how some of the programs fit together:


Using an input text like "من کتاب‌های تو را نمی‌بینم" in Unicode HTML decimal form

Orthography

You can convert the text from one character-set encoding to another, including to and from: Romanized, ArabTeX, Windows-1256, ISIRI 3342, UTF-8, and Unicode HTML numeric entities.

Morphology

Syntax

Further information on Perstem

The command "perl perstem.pl --help" gives the following usage information:

Usage:    perl perstem.pl [options] < input > output

Function:  Persian (Farsi) stemmer, morphological analyzer, transliterator,
           and partial part-of-speech tagger.

Options:
 -f, --form          Output forms as one of the following:
                          dict: as they appear in a dictionary (default)
                          linked: show all morphemes, linked together
                          unlinked: show all morphemes as separate tokens
                          untouched: don't stem/analyze; mostly for char-set conversion
     --flush            Autoflush buffer output after every line
 -h, --help             Print this usage
 -i, --input      Input character encoding type {cp1256,isiri3342,ncr,
                        translit,utf8} (default: utf8)
     --irreg-stem {0|1} Resolve irregular present-tense verb stems to their
                        past-tense stems (eg. kon ~ kar).  (default: 1 == true)
 -n, --noroman          Delete all non-Arabic script characters (eg. HTML tags)
 -o, --output     Output character encoding type {arabtex,cp1256,
                        isiri3342,ncr,translit,utf8} (default: utf8)
 -p, --pos              Tag inflected words for parts of speech
     --pos-sep    Separate words from their parts of speech by 
                        (default: "/" )
 -r, --recall           Increase recall by parsing ambiguous affixes; may lower
                        precision
     --skip-comments    Skip commented-out lines, without printing them
 -s, --stem             Return only word stems
 -t, --tokenize {0|1}   Tokenize punctuation (default: 1 == true)
 -u, --unvowel          Remove short vowels
 -v, --version          Print version
 -z, --zwnj {0|1}       Insert Zero Width Non-Joiners where they should be (default: 1 == true)