Jon Dehdari










Introduction to
Machine Translation

Instructor: Dr. Jon Dehdari

with help from Dr. Raphaël Rubino and Marcos Zampieri, ABD

Summer 2016

Class: Room 2.14 of A2.2
Mondays 16:00 (s.t.) - 18:00
Class starts the 25th of April
Term paper (Hausarbeit) due the 30th of September
Offices: FR4.6: 1.15, Building A2.2; DFKI: 1.11, Building D3.1

This course gives an introduction to machine translation. Since the audience for this class includes both computational linguists and translators, we will cover the theory of how machine translation works as well as how to use it in everyday work.

Outline

  1. Machine Translation Overview
  2. Noisy Channel Model
    • Concepts: encoding/decoding, probability, prior & posterior probabilities
    • Applications: machine translation, speech recognition, spellchecking, OCR, many others
    • Noisy Channel Model Formula
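For reference, the noisy channel formula is usually written as below. This is the standard textbook form, not necessarily the notation used in the course slides; the symbols are assumptions: f is the foreign (source) sentence and e is a candidate translation.

```latex
\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f)
        = \operatorname*{arg\,max}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)
```

Here P(e) is the prior (the language model), P(f | e) is the channel or translation model, and P(e | f) is the posterior. The denominator P(f) can be dropped because it is the same for every candidate e.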
  3. Install and Prepare Text Corpora
    The following is adapted from the Moses documentation:
    1. Create a directory/folder to store your corpora (text files)
      mkdir ~/corpora
      					
    2. Go to that directory
      cd ~/corpora
      
    3. Download big parallel corpora
      wget -c http://www.statmt.org/wmt15/training-parallel-nc-v10.tgz
      wget -c http://www.statmt.org/wmt15/dev-v2.tgz
      
    4. Unpack the compressed corpora
      tar zxvf training-parallel-nc-v10.tgz
      tar zxvf dev-v2.tgz
      
    5. Tokenize data (separate punctuation from words)
      ~/moses/scripts/tokenizer/tokenizer.perl -l de \
        < news-commentary-v10.de-en.de \
        > news-commentary-v10.de-en.tok.de
      ~/moses/scripts/tokenizer/tokenizer.perl -l en \
        < news-commentary-v10.de-en.en \
        > news-commentary-v10.de-en.tok.en
      
    6. Train a truecaser. This helps Moses learn which words should start with an uppercase letter, such as nouns in German.
      ~/moses/scripts/recaser/train-truecaser.perl \
        --model truecase-model.de \
        --corpus news-commentary-v10.de-en.tok.de
      ~/moses/scripts/recaser/train-truecaser.perl \
        --model truecase-model.en \
        --corpus news-commentary-v10.de-en.tok.en
      
    7. Truecase the input data, so that words appear in their normal case. For example, a sentence-initial "Das" becomes "das".
      ~/moses/scripts/recaser/truecase.perl \
        --model truecase-model.de \
        < news-commentary-v10.de-en.tok.de \
        > news-commentary-v10.de-en.tok.truecase.de
      ~/moses/scripts/recaser/truecase.perl \
        --model truecase-model.en \
        < news-commentary-v10.de-en.tok.en \
        > news-commentary-v10.de-en.tok.truecase.en
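To make the tokenization and truecasing steps more concrete, here is a minimal Python sketch of the idea. It is a toy illustration and not the algorithm used by the Moses scripts: the function names, the regular expression, and the three-sentence corpus are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

def toy_tokenize(line):
    """Separate punctuation from words (a crude stand-in for tokenizer.perl)."""
    return re.findall(r"\w+|[^\w\s]", line)

def train_toy_truecaser(sentences):
    """For each word, remember its most frequent casing in non-initial position."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for tok in tokens[1:]:          # skip the sentence-initial token
            counts[tok.lower()][tok] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

def toy_truecase(tokens, model):
    """Rewrite only the first token to its most frequent casing, if known."""
    if not tokens:
        return tokens
    first = model.get(tokens[0].lower(), tokens[0])
    return [first] + tokens[1:]

# Hypothetical mini-corpus, for illustration only.
corpus = [toy_tokenize(s) for s in [
    "Das Parlament besitzt das Haus.",
    "Wir sehen das Parlament.",
    "Parlamente sind wichtig.",
]]
model = train_toy_truecaser(corpus)
truecased = toy_truecase(toy_tokenize("Das Haus ist alt."), model)
print(truecased)
```

The sentence-initial "Das" is lowercased because "das" is its most frequent form inside sentences, while "Haus" keeps its capital letter.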
      
  4. Language Models
    • Concepts: bigram, trigram, & n-gram, history, perplexity, backoff, interpolation
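The concepts above can be sketched with a tiny bigram language model. This is a minimal illustration under stated assumptions, not how a production toolkit works: it uses simple linear interpolation of a bigram and an add-one-smoothed unigram estimate (rather than backoff), and the weight lam=0.7 and the toy sentences are invented for the example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def interpolated_prob(word, history, unigrams, bigrams, lam=0.7):
    """P(word | history) = lam * bigram estimate + (1 - lam) * unigram estimate."""
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1                        # one extra slot for unseen words
    p_uni = (unigrams[word] + 1) / (total + vocab)   # add-one smoothing
    p_bi = bigrams[(history, word)] / unigrams[history] if unigrams[history] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

def perplexity(sentences, unigrams, bigrams, lam=0.7):
    """2 to the power of the average negative log2 probability per token."""
    log_sum, n = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for history, word in zip(padded, padded[1:]):
            log_sum += math.log2(interpolated_prob(word, history, unigrams, bigrams, lam))
            n += 1
    return 2 ** (-log_sum / n)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(train)
ppl_seen = perplexity(train, unigrams, bigrams)
ppl_unseen = perplexity([["a", "bird", "flew"]], unigrams, bigrams)
print(ppl_seen, ppl_unseen)
```

Here the "history" of a bigram model is just the previous word. Lower perplexity means the model finds the text less surprising, so the training sentences score lower than the unseen one.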
  5. Word Alignment
    • Overview, slides 17-30
    • Additional slides
    • Download the word alignment program MGIZA:
      cd ~/moses
      wget https://github.com/moses-smt/mgiza/archive/master.zip
      mv master.zip mgiza.zip
      					
    • Unzip the file
      unzip mgiza.zip
      mv mgiza-master mgiza
      cd mgiza/mgizapp
      					
    • Compile the program. This requires cmake and a compiler (gcc or clang).
      cmake .
      make -j 4
      cp scripts/*.{sh,py,pl} bin/
      					
    • After running EMS (below), here is a sample alignment from the dataset, in ~/corpora/training/giza.1/en-de.A3.final.gz:
      elected parliaments do not own our liberties .
      NULL ({ }) gewählte ({ 1 }) Parlamente ({ 2 }) besitzen ({ 5 }) unsere ({ 6 }) Freiheiten ({ 7 }) nicht ({ 3 4 }) . ({ 8 })
      					

      [Figure: word alignment illustration]
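The sample alignment above can be read programmatically. The following is a minimal sketch, assuming the two-line layout shown in the sample: the first line is the English sentence, and the second line lists each German word followed by the 1-based positions of the English words aligned to it. The function name and the regular expression are assumptions for illustration, not part of MGIZA.

```python
import re

def parse_a3_line(line):
    """Return (tokens, links); links[i] holds the 1-based positions aligned to tokens[i]."""
    pairs = re.findall(r"(\S+) \(\{([\d ]*)\}\)", line)
    tokens = [tok for tok, _ in pairs]
    links = [[int(i) for i in idx.split()] for _, idx in pairs]
    return tokens, links

english = "elected parliaments do not own our liberties .".split()
a3_line = ("NULL ({ }) gewählte ({ 1 }) Parlamente ({ 2 }) besitzen ({ 5 }) "
           "unsere ({ 6 }) Freiheiten ({ 7 }) nicht ({ 3 4 }) . ({ 8 })")

tokens, links = parse_a3_line(a3_line)

# Map each German word (skipping the NULL token) to its aligned English words.
aligned = {tok: [english[i - 1] for i in idx]
           for tok, idx in zip(tokens, links) if tok != "NULL"}
for german, english_words in aligned.items():
    print(german, "->", " ".join(english_words))
```

Note that "nicht" is aligned to two English words, "do not", and that "besitzen" points to position 5, which reflects the different word order in the two languages.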
  6. Moses Experiment Management System (EMS)
    1. Download the example configuration file:
      cd ~/corpora
      wget http://jon.dehdari.org/teaching/uds/moses/moses_ems_nc10.conf
      
    2. Edit the configuration file:
      1. nano moses_ems_nc10.conf
      2. Change the first few lines according to your computer's setup
      3. Type Ctrl/Strg + o, then Enter, to save the file
      4. Type Ctrl/Strg + x to exit the Nano text editor
    3. Do a dry run (Probelauf):
      ~/moses/scripts/ems/experiment.perl -config moses_ems_nc10.conf
    4. Do a real Moses run:
      nice ~/moses/scripts/ems/experiment.perl -config moses_ems_nc10.conf -exec
    5. The last command can take several hours, so you can leave it running overnight. The output for the test set will be in the file ~/corpora/evaluation/newstest2009.truecased.number, where number is the run number of the experiment.
  7. Phrase-based MT
    • Concepts: phrase extraction, log-linear model, weights
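Phrase extraction can be sketched on the sample alignment from the Word Alignment section. This is a simplified version of the usual consistency check, under stated assumptions: it does not extend phrases over unaligned words (every word in this example is aligned), and the function name, the 0-based link set, and max_len=5 are choices made for this illustration rather than Moses settings.

```python
def extract_phrase_pairs(src, tgt, links, max_len=5):
    """Collect phrase pairs consistent with the word alignment.

    links is a set of (src_index, tgt_index) pairs, 0-based.
    """
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tgt_points = [t for s, t in links if s1 <= s <= s2]
            if not tgt_points:
                continue
            t1, t2 = min(tgt_points), max(tgt_points)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: no word in the target span may link outside the source span.
            if all(s1 <= s <= s2 for s, t in links if t1 <= t <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "gewählte Parlamente besitzen unsere Freiheiten nicht .".split()
tgt = "elected parliaments do not own our liberties .".split()
links = {(0, 0), (1, 1), (2, 4), (3, 5), (4, 6), (5, 2), (5, 3), (6, 7)}

phrase_pairs = extract_phrase_pairs(src, tgt, links)
for german, english in phrase_pairs:
    print(german, "|||", english)
```

The pair "nicht ||| do not" is extracted, but no pair starts from "Freiheiten nicht" alone, because the English span it would cover also contains "own" and "our", which are aligned to German words outside that span.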
  8. Hierarchical MT
    • Concepts: synchronous grammar, hierarchical phrase extraction


    [Figure: German-English syntax/hierarchical machine translation illustration using hangers]

    [Figure: French-English syntax/hierarchical machine translation illustration using hangers]

    [Figure: German-English hierarchical phrase extraction example]
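As a rough sketch of hierarchical phrase extraction, a synchronous rule can be formed by cutting a smaller phrase pair out of a larger one and leaving a shared gap symbol on both sides. The helper names, the gap symbol X1, and the filler pair "sehen den Film / see the film" are hypothetical and used only for illustration; real systems add many constraints that are omitted here.

```python
def replace_once(tokens, sub, symbol):
    """Replace the first occurrence of the token list sub in tokens with symbol."""
    n = len(sub)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == sub:
            return tokens[:i] + [symbol] + tokens[i + n:]
    raise ValueError("sub-phrase not found")

def make_rule(big_pair, sub_pair, symbol="X1"):
    """Turn a phrase pair into a synchronous rule by cutting out a sub-phrase pair."""
    src = replace_once(big_pair[0].split(), sub_pair[0].split(), symbol)
    tgt = replace_once(big_pair[1].split(), sub_pair[1].split(), symbol)
    return src, tgt

def apply_rule(rule, filler_pair, symbol="X1"):
    """Fill the gap on both sides of the rule with the same phrase pair."""
    src, tgt = rule
    fill_src, fill_tgt = filler_pair[0].split(), filler_pair[1].split()
    new_src = [w for tok in src for w in (fill_src if tok == symbol else [tok])]
    new_tgt = [w for tok in tgt for w in (fill_tgt if tok == symbol else [tok])]
    return " ".join(new_src), " ".join(new_tgt)

big = ("besitzen unsere Freiheiten nicht", "do not own our liberties")
sub = ("besitzen unsere Freiheiten", "own our liberties")
rule = make_rule(big, sub)
print(rule)

# Hypothetical filler pair, to show that the rule generalizes the reordering.
print(apply_rule(rule, ("sehen den Film", "see the film")))
```

The resulting rule places the gap before "nicht" in German and after "do not" in English, so the reordering is captured once and reused for any phrase pair that fills the gap.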



  9. Computer-aided Translation


Term Paper / Hausarbeit

Due September 30th, 2016. Feel free to email me if you have any questions.




Die Umgangssprache ist ein Teil des menschlichen Organismus und nicht weniger kompliziert als dieser.
–Ludwig Wittgenstein      

(Machine Translation: "Maria did not slap the green witch." –Wittgenstein)