Persian NLP

Orthography

You can convert text from one character encoding to another. The supported formats are romanized text, Windows-1256, ISIRI 3342, UTF-8, and HTML numeric character references, plus ArabTeX as an output-only format.

For example, from romanized text to UTF-8:

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl --form untouched -i translit

and the output:

من کتاب‌های تو را نمی‌بینم

Or from Perso-Arabic script HTML numeric character references to ArabTeX:

echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl -d -i unihtml -o arabtex

and the output:

mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm

This could then be inserted into a LaTeX document as:

  \documentclass{article}
  \usepackage{arabtex}
  
  \begin{document}
  \setfarsi
  \novocalize
  \< mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm >
  \end{document}

Morphology

You can morphologically analyze Perso-Arabic text directly:

echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl --form linked

and the output:

mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m

Or you can analyze this as romanized text:

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -i translit

and the output:

mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
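
If you have stored the linked analysis and later want plain morpheme tokens, the link markers can be stripped with sed. This is a rough sketch, not perstem's own logic: the function name strip_links is hypothetical, and the marker conventions ("_+" before a suffix, "+_" after a prefix, "-" for a zero-width non-joiner) are inferred only from the sample output above.

```shell
# Hypothetical helper: turn a linked romanized analysis into
# space-separated morphemes. Marker meanings are assumed from the
# sample output, not taken from perstem's documentation.
strip_links() {
    sed -e 's/_[+]/ /g' -e 's/[+]_/ /g' -e 's/-//g'
}

printf '%s\n' 'mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m' | strip_links
# prints: mn ktAb hA e tu rA n mi bin m
```

The hyphen removal would also delete hyphens that are part of ordinary punctuation, so this is only suitable for analyzed word tokens.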

You can analyze the text with part-of-speech tags (only for words containing inflectional morphemes):

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -p -i translit

and the output:

mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S
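
For downstream processing it can help to put each token and its tags in separate columns. The following is a minimal sketch under two assumptions: the separator is the default "/", and no token contains a "/" of its own. The function name split_pos is made up for illustration.

```shell
# Hypothetical helper: read tagged output on stdin and print one token
# per line as "token<TAB>tags", using "-" when a token has no tags.
# Assumes the default "/" separator between a word and its tags.
split_pos() {
    tr ' ' '\n' | awk -F/ 'NF { print $1 "\t" (NF > 1 ? $2 : "-") }'
}

printf '%s\n' 'mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S' | split_pos
```

For the sample sentence this gives five lines, with tags on the second and fifth tokens only. If you change the separator with --pos-sep, the awk field separator has to change to match.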

You can stem the words (hence the name):

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit

and the output:

mn ktAb tu rA didn

You can remove morpheme links:

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit -f unlinked

and the output:

mn ktAb hA e tu rA n mi di m

Syntax

The unlinked output can then be piped into the Link Grammar syntax parser:

echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit -f unlinked | link-grammar fa_home/data/4.0.dict

and the output is:

    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp

You can input Perso-Arabic script directly:

echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl -o translit -f unlinked | link-grammar fa_home/data/4.0.dict

which will result in the same output:

    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp

Further information on Perstem

The command "perl perstem.pl --help" gives the following usage information:


Usage:    perl perstem.pl [options] < input > output

Function:  Persian (Farsi) stemmer, morphological analyzer, transliterator,
           and partial part-of-speech tagger.

Options:
 -f, --form <type>      Output forms as one of the following:
                          dict: as they appear in a dictionary (default)
                          linked: show all morphemes, linked together
                          unlinked: show all morphemes as separate tokens
                          untouched: don't stem/analyze; mostly for char-set conversion
     --flush            Autoflush buffer output after every line
 -h, --help             Print this usage
 -i, --input <type>     Input character encoding type {cp1256,isiri3342,ncr,
                        translit,utf8} (default: utf8)
     --irreg-stem {0|1} Resolve irregular present-tense verb stems to their
                        past-tense stems (eg. kon ~ kar).  (default: 1 == true)
 -n, --noroman          Delete all non-Arabic script characters (eg. HTML tags)
 -o, --output <type>    Output character encoding type {arabtex,cp1256,
                        isiri3342,ncr,translit,utf8} (default: utf8)
 -p, --pos              Tag inflected words for parts of speech
     --pos-sep <char>   Separate words from their parts of speech by <char>
                        (default: "/" )
 -r, --recall           Increase recall by parsing ambiguous affixes; may lower
                        precision
     --skip-comments    Skip commented-out lines, without printing them
 -s, --stem             Return only word stems
 -t, --tokenize {0|1}   Tokenize punctuation (default: 1 == true)
 -u, --unvowel          Remove short vowels
 -v, --version          Print version
 -z, --zwnj {0|1}       Insert Zero Width Non-Joiners where they should be (default: 1 == true)
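
When perstem is called from scripts, it can be convenient to reject an unknown input encoding before starting perl. The wrapper below is only a sketch: both function names are hypothetical, the list of encodings is copied from the usage text above, and perstem.pl is assumed to be in the current directory.

```shell
# Encoding names copied from the usage text for -i/--input.
is_valid_input_encoding() {
    case "$1" in
        cp1256|isiri3342|ncr|translit|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: the first argument is the input encoding, the
# remaining arguments are passed to perstem.pl unchanged.
perstem_checked() {
    enc=$1
    shift
    if ! is_valid_input_encoding "$enc"; then
        echo "unsupported input encoding: $enc" >&2
        return 2
    fi
    perl perstem.pl -i "$enc" "$@"
}

# Usage (not run here):
#   echo "mn ktAb-hAi tu rA nmi-binm" | perstem_checked translit -o translit
```

The check only guards the input side. The output list differs (it also includes arabtex), so an output check would need its own list.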

Jon Dehdari
