Jon Dehdari


Corpora

Iranian Persian / فارسی

VOA Corpus

Combined years 2003–2008, original UTF-8 Perso-Arabic script, 7.9 million words, 10 MB, license: public domain, .xz file

Kayhan Corpus

2005, transliterated, 19 million words, 24 MB, messy, license: public domain in most countries outside Iran, .xz file
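
Both Persian corpora above are distributed as .xz files. A minimal sketch of reading one without decompressing it to disk first, assuming the archive holds a single UTF-8 plain-text file and that splitting on whitespace is an acceptable approximation of a word (the function name and the file name in the usage comment are hypothetical, not taken from this page):

```python
import lzma
from collections import Counter


def count_tokens(path, encoding="utf-8"):
    """Stream an .xz-compressed text file and count whitespace-separated tokens."""
    counts = Counter()
    # lzma.open in text mode decompresses and decodes line by line,
    # so the whole corpus never has to fit in memory at once.
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(line.split())
    return counts


# Hypothetical usage; substitute the path of the file you downloaded:
# counts = count_tokens("voa_persian.txt.xz")
# print(sum(counts.values()), "tokens,", len(counts), "distinct types")
# print(counts.most_common(10))
```

Token totals obtained this way may differ from the word counts quoted above, since those depend on how the corpus was tokenized.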

English

Buckeye Pronunciation Dictionary

Similar to the CMU Pronunciation Dictionary, but the transcriptions are derived from a speech corpus rather than from intuition. Includes occurrence counts and mean length of utterance.

Dari / دری

VOA Corpus (small)

This corpus is in the public domain.
Combined years (82k words)

Pashto / پښتو

VOA Corpus (small)

This corpus is in the public domain.
Combined years (62k words)

Urdu / اردو

VOA Corpus

This corpus is in the public domain.
Combined years (4 million words)