Some old Persian NLP projects I used to work on:
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl --form untouched -i translit
and the output:
من کتابهای تو را نمیبینم
echo "من کتابهای تو را نمیبینم" | perl perstem.pl -d -i unihtml -o arabtex
and the output:
mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm
This could then be inserted into a LaTeX document as:
\documentclass{article}
\usepackage{arabtex}
\begin{document}
\setfarsi
\novocalize
\< mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm >
\end{document}
echo "من کتابهای تو را نمیبینم" | perl perstem.pl --form linked
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -i translit
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -p -i translit
and the output:
mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S
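The tagged, linked output above can be taken apart programmatically. The following is a minimal Python sketch, not part of perstem itself. It assumes, judging only from the sample output, that morphemes are joined by `+_` after a prefix and `_+` before a suffix, that tags are joined by `+`, and that the word is separated from its tags by the default `/`. The function name `parse_token` is hypothetical.

```python
import re

# Assumption inferred from the sample output: "+_" follows a prefix
# and "_+" precedes a suffix inside one linked token.
LINK = re.compile(r"\+_|_\+")


def parse_token(token, pos_sep="/"):
    """Split one linked token into (morphemes, tags).

    pos_sep mirrors the --pos-sep default ("/"). Tokens without a
    tag part, such as "mn", yield an empty tag list.
    """
    word, sep, tags = token.partition(pos_sep)
    morphemes = LINK.split(word)
    return morphemes, (tags.split("+") if sep else [])


if __name__ == "__main__":
    line = "mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S"
    for tok in line.split():
        print(parse_token(tok))
```

Splitting on whitespace first and then on the link markers keeps the sketch independent of the character encoding chosen with `-o`.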
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit
and the output:
mn ktAb tu rA didn
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f unlinked
and the output:
mn ktAb hA e tu rA n mi di m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit -f unlinked | link-grammar fa_home/data/4.0.dict
and the output is:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
echo "من کتابهای تو را نمیبینم" | perl perstem.pl -o translit -f unlinked | link-grammar fa_home/data/4.0.dict
which will result in the same output:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
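The bottom line of the link-grammar diagram lists each token as `word.subscript`, where the subscript selects a dictionary entry. As a rough illustration, the Python sketch below splits that line into pairs. The helper name `split_subscripts` is hypothetical, and the sketch assumes the subscript is whatever follows the last dot in a token.

```python
def split_subscripts(word_line):
    """Return (word, subscript) pairs from a link-grammar word line.

    Tokens without a dot, such as LEFT-WALL, get None as subscript.
    """
    pairs = []
    for token in word_line.split():
        word, dot, subscript = token.rpartition(".")
        if dot:
            pairs.append((word, subscript))
        else:
            pairs.append((token, None))
    return pairs


if __name__ == "__main__":
    line = ("LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc "
            "n.vmn mi.vmd bin.vp m.vmp")
    for word, subscript in split_subscripts(line):
        print(word, subscript)
```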
Usage: perl perstem.pl [options] < input > output

Function: Persian (Farsi) stemmer, morphological analyzer, transliterator,
          and partial part-of-speech tagger.

Options:
  -f, --form              Output forms as one of the following:
                            dict:      as they appear in a dictionary (default)
                            linked:    show all morphemes, linked together
                            unlinked:  show all morphemes as separate tokens
                            untouched: don't stem/analyze; mostly for
                                       char-set conversion
      --flush             Autoflush buffer output after every line
  -h, --help              Print this usage
  -i, --input             Input character encoding type {cp1256,isiri3342,ncr,
                            translit,utf8} (default: utf8)
      --irreg-stem {0|1}  Resolve irregular present-tense verb stems to their
                            past-tense stems (e.g. kon ~ kar) (default: 1 == true)
  -n, --noroman           Delete all non-Arabic script characters (e.g. HTML tags)
  -o, --output            Output character encoding type {arabtex,cp1256,
                            isiri3342,ncr,translit,utf8} (default: utf8)
  -p, --pos               Tag inflected words for parts of speech
      --pos-sep           Separator placed between words and their parts of
                            speech (default: "/")
  -r, --recall            Increase recall by parsing ambiguous affixes; may lower
                            precision
      --skip-comments     Skip commented-out lines, without printing them
  -s, --stem              Return only word stems
  -t, --tokenize {0|1}    Tokenize punctuation (default: 1 == true)
  -u, --unvowel           Remove short vowels
  -v, --version           Print version
  -z, --zwnj {0|1}        Insert Zero Width Non-Joiners where they should be
                            (default: 1 == true)
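For calling the script from another program, something like the following Python sketch may be a starting point. It only uses the `-f`, `-i`, `-o` and `-p` options that appear in the examples and usage text above. The function names `perstem_argv` and `run_perstem`, and the assumption that `perstem.pl` sits in the current directory, are illustrative and not part of perstem.

```python
import subprocess

# Values taken from the usage text for -f/--form.
FORMS = {"dict", "linked", "unlinked", "untouched"}


def perstem_argv(form="dict", input_enc="utf8", output_enc="utf8",
                 pos=False, script="perstem.pl"):
    """Build the argument list for one perstem.pl invocation."""
    if form not in FORMS:
        raise ValueError(f"unknown form: {form!r}")
    argv = ["perl", script, "-f", form, "-i", input_enc, "-o", output_enc]
    if pos:
        argv.append("-p")
    return argv


def run_perstem(text, **options):
    """Pipe text through perstem.pl and return its standard output.

    Assumes perl and the script are available; raises
    CalledProcessError if the script exits with a non-zero status.
    """
    result = subprocess.run(perstem_argv(**options), input=text,
                            capture_output=True, encoding="utf-8",
                            check=True)
    return result.stdout


if __name__ == "__main__":
    print(perstem_argv(form="linked", input_enc="translit", pos=True))
```

Passing the options as a list rather than a shell string avoids quoting problems with Persian text and with the `|` in values such as `{0|1}`.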