Statistical language modeling, which assigns probabilities to linguistic utterances, is a vital component of machine translation, automatic speech recognition, information retrieval, and many other language technologies. In this seminar we will discuss different types of language models and how they are used in various applications. These include n-gram models (with various smoothing techniques) as well as skip-gram, class-based, factored, topic-based, and neural-network-based approaches. We will also look at how these models perform across different language typologies and at different training-set sizes.
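To make the simplest family on this list concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing, one of the classic smoothing techniques the seminar covers. This is an illustrative toy, not course material; all function names and the tiny corpus are invented for the example.

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences,
    padding each sentence with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])          # contexts only
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one smoothed P(w | w_prev): every bigram count is
    incremented by 1, so unseen pairs get nonzero probability."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_counts(corpus)
vocab = {w for sent in corpus for w in sent} | {"</s>"}

p_seen = bigram_prob("the", "cat", unigrams, bigrams, len(vocab))    # 2/7
p_unseen = bigram_prob("cat", "dog", unigrams, bigrams, len(vocab))  # 1/6
```

Without smoothing, the unseen bigram ("cat", "dog") would receive probability zero and zero out the score of any sentence containing it; add-one smoothing trades a little probability mass from seen events to avoid exactly that failure, which is the shortcoming more refined techniques (Good-Turing, Kneser-Ney) address.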
This seminar will be followed by a project seminar in which you will work in small groups to identify a shortcoming of an existing language model, devise a novel modification to overcome it, compare your method experimentally against the existing baseline, and discuss the results. It'll be fun.