Iranian Persian / فارسی
VOA Corpus
Combined years 2003-2008, original UTF-8 Perso-Arabic script, 7.9 million words, 10MB, license: public domain, .xz file
Kayhan Corpus
2005, transliterated, 19 million words, 24MB, messy, license: public domain in most countries outside Iran, .xz file
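Both corpora are distributed as .xz files, so they can be read as a stream without unpacking them to disk first. The following is a minimal sketch, not a reference tool. It assumes the archive holds a single plain-text file, that the text is UTF-8 (stated above for the VOA corpus only), and that splitting on whitespace is an acceptable stand-in for word segmentation. The function name `count_tokens` and the command-line usage are hypothetical. Totals obtained this way may differ from the word counts listed above, which may have been produced with a different tokenizer.

```python
import lzma
import sys
from collections import Counter


def count_tokens(path, encoding="utf-8"):
    """Stream an .xz-compressed text file and count whitespace-delimited tokens.

    Assumption: the archive contains plain text in the given encoding.
    Undecodable bytes are replaced rather than raising an error.
    """
    counts = Counter()
    # mode="rt" makes lzma.open decompress and decode line by line.
    with lzma.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            # str.split() with no argument splits on any run of whitespace.
            counts.update(line.split())
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python count_tokens.py <corpus-file>.xz
    token_counts = count_tokens(sys.argv[1])
    print("total tokens:", sum(token_counts.values()))
    print("distinct tokens:", len(token_counts))
    for token, freq in token_counts.most_common(10):
        print(freq, token)
```

Streaming keeps memory use bounded by the vocabulary size rather than by the size of the decompressed text. For Persian in the original script, whitespace splitting treats words joined by a zero-width non-joiner as a single token, which may or may not be what a given application wants.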