Some old Persian NLP projects I used to work on:
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl --form untouched -i translit
and the output:
من کتابهای تو را نمیبینم
echo "من کتابهای تو را نمیبینم" | perl perstem.pl -d -i unihtml -o arabtex
and the output:
mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm
This could then be inserted into a LaTeX document as:
\documentclass{article}
\usepackage{arabtex}
\begin{document}
\setfarsi
\novocalize
\< mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm >
\end{document}
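The wrapping step can also be scripted. The following is a minimal sketch, not part of perstem: the function name `arabtex_doc` is hypothetical, and it simply reproduces the document skeleton shown above around each ArabTeX-encoded line it reads from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of perstem): wrap ArabTeX-encoded lines read
# from stdin in the minimal LaTeX document shown above and print it to stdout.
arabtex_doc() {
  # Preamble, one item per line; %s keeps the backslashes literal.
  printf '%s\n' '\documentclass{article}' '\usepackage{arabtex}' \
                '\begin{document}' '\setfarsi' '\novocalize'
  # read -r keeps backslashes such as \hspace intact; the "|| [ -n ... ]"
  # part also handles a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -n "$line" ]; then
      printf '\\< %s >\n' "$line"
    fi
  done
  printf '%s\n' '\end{document}'
}

# Example with the ArabTeX string from above as input.
printf '%s\n' 'mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm' | arabtex_doc
```

In practice the output of the `-o arabtex` command above could be piped into the function and redirected to a `.tex` file.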
echo "من کتابهای تو را نمیبینم" | perl perstem.pl --form linked
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -i translit
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f linked -p -i translit
and the output:
mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S
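For downstream scripts, the linked and tagged output above can be split back into its pieces. The following is a minimal sketch, not part of perstem: the function name `split_linked` is hypothetical. It assumes, based only on the examples above, that `_+` and `+_` are the morpheme link markers and that everything after the first `/` is the tag string (the default `--pos-sep`).

```shell
#!/bin/sh
# Hypothetical helper (not part of perstem): read "linked" output with POS tags
# on stdin and print one line per token: its morphemes, a tab, then its tags.
# Assumption: "_+" and "+_" are link markers, "/" is the POS separator.
split_linked() {
  tr -s ' ' '\n' | awk '
    NF == 0 { next }                      # skip empty lines
    {
      word = $0
      tags = "-"                          # "-" means the token carries no tag
      slash = index($0, "/")
      if (slash > 0) {
        word = substr($0, 1, slash - 1)
        tags = substr($0, slash + 1)
      }
      gsub(/_[+]|[+]_/, " ", word)        # link markers become single spaces
      printf "%s\t%s\n", word, tags
    }'
}

echo "mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S" | split_linked
```

For the token `ktAb_+-hA_+e/N+PL+EZ` this prints `ktAb -hA e`, a tab, and `N+PL+EZ`. The sketch would need adjusting if a different `--pos-sep` value is used or if a token itself contains `/`.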
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit
and the output:
mn ktAb tu rA didn
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -f unlinked
and the output:
mn ktAb hA e tu rA n mi di m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -i translit -o translit -f unlinked | link-grammar fa_home/data/4.0.dict
and the output is:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
echo "من کتابهای تو را نمیبینم" | perl perstem.pl -o translit -f unlinked | link-grammar fa_home/data/4.0.dict
which will result in the same output:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
Usage: perl perstem.pl [options] < input > output

Function: Persian (Farsi) stemmer, morphological analyzer, transliterator,
          and partial part-of-speech tagger.

Options:
  -f, --form            Output forms as one of the following:
                          dict: as they appear in a dictionary (default)
                          linked: show all morphemes, linked together
                          unlinked: show all morphemes as separate tokens
                          untouched: don't stem/analyze; mostly for char-set conversion
  --flush               Autoflush buffer output after every line
  -h, --help            Print this usage
  -i, --input           Input character encoding type {cp1256,isiri3342,ncr,
                          translit,utf8} (default: utf8)
  --irreg-stem {0|1}    Resolve irregular present-tense verb stems to their
                          past-tense stems (eg. kon ~ kar). (default: 1 == true)
  -n, --noroman         Delete all non-Arabic script characters (eg. HTML tags)
  -o, --output          Output character encoding type {arabtex,cp1256,
                          isiri3342,ncr,translit,utf8} (default: utf8)
  -p, --pos             Tag inflected words for parts of speech
  --pos-sep             Separate words from their parts of speech by the given
                          separator (default: "/")
  -r, --recall          Increase recall by parsing ambiguous affixes; may lower
                          precision
  --skip-comments       Skip commented-out lines, without printing them
  -s, --stem            Return only word stems
  -t, --tokenize {0|1}  Tokenize punctuation (default: 1 == true)
  -u, --unvowel         Remove short vowels
  -v, --version         Print version
  -z, --zwnj {0|1}      Insert Zero Width Non-Joiners where they should be
                          (default: 1 == true)