Jon Dehdari










Seminar on the
Moses Machine Translation System

Instructor: Dr. Jon Dehdari

Summer 2015

Class: Room 2.14 of A2.2
Mondays 16:00 (s.t.) - 18:00
Starts the 27th of April
Offices: FR4.6: 1.15, Building A2.2; DFKI: 1.11, Building D3.1

This course gives an overview of the Moses machine translation system. Since the audience for this class includes both computational linguists and translators, we will cover both the theory of how it works and how to use it in everyday work.

Note that the installation of Moses can be quite difficult, and it may take a while to get the software running on your computer. Be patient, and remove all hammers from the vicinity of your computer!

Outline

  1. Course Organization
  2. Installation
    Moses runs on the command line. This is no problem if your computer runs Linux. If you run Mac OS X, the only change you need to make is to run curl -O wherever the instructions use wget. If your computer runs Windows, you will first need to install Cygwin.
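Because the only difference between the Linux and Mac OS X instructions is the download command, a small shell function can hide it. The following is a minimal sketch and not part of Moses; the function name fetch is made up for illustration.

```shell
# Hypothetical helper, not part of Moses: download one URL into the
# current directory with whichever tool is installed.
fetch() {
  if [ "$#" -ne 1 ]; then
    echo "usage: fetch URL" >&2
    return 2
  fi
  if command -v wget >/dev/null 2>&1; then
    wget -c "$1"          # -c resumes an interrupted download
  elif command -v curl >/dev/null 2>&1; then
    curl -O "$1"          # -O keeps the remote file name
  else
    echo "fetch: neither wget nor curl is installed" >&2
    return 1
  fi
}
```

With this function defined, fetch followed by a URL could stand in for either wget -c or curl -O in the commands on this page.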
    • Binary Packages - The Windows version depends on Cygwin. After you have installed Cygwin, an easy way to install new packages is to first run the following (once):
      echo 'alias cyg-get="/cygdrive/c/cygwin64/setup-x86_64.exe -q -P "' >> ~/.bashrc
      source ~/.bashrc
      
      Then install the following packages (the list must be comma-separated, with no spaces):
      cyg-get gcc-core,gcc-g++,git,libboost_thread,libboost_system,libboost-devel,wget,zip,unzip,graphviz,imagemagick,gv,make,cmake,automake,nano
      
      Now, download and install Moses:
      For Cygwin on Windows:
      cd
      wget -c http://www.statmt.org/moses/RELEASE-3.0/binaries/cygwin-64bit/cygwin-64bit.tgz
      tar zxvf cygwin-64bit.tgz
      mv cygwin-64bit moses
      				
      For Mac OS X (Yosemite):
      cd
      curl -O http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/macosx-yosemite.tgz
      tar zxvf macosx-yosemite.tgz
      mv macosx-yosemite moses
      				
      For Linux (or FreeBSD):
      cd
      wget -c http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/linux-64bit.tgz
      tar zxvf linux-64bit.tgz
      mv linux-64bit moses
      				
    • Compilation is another option for advanced users
    After you install Moses you can try it out on a very small sample dataset:
    cd moses
    wget -c http://www.statmt.org/moses/download/sample-models.tgz
    tar zxvf sample-models.tgz
    cd sample-models
    ~/moses/bin/moses -f phrase-model/moses.ini < phrase-model/in | tee out
    		
    The above command takes the sentence in the file in, which is »das ist ein kleines haus«, prints its English translation, "this is a small house", to the screen, and saves it in the file out.
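To confirm that the sample run worked, the contents of the file out can be compared with the expected sentence. This is a minimal sketch; the function name check_output is hypothetical, and trailing whitespace is stripped before comparing on the assumption that the decoder may add some.

```shell
# Hypothetical helper: succeed if a one-line output file matches the
# expected sentence, ignoring trailing whitespace.
check_output() {
  # $1 = file written by the decoder, $2 = expected sentence
  actual=$(sed -e 's/[[:space:]]*$//' "$1")   # drop trailing whitespace
  [ "$actual" = "$2" ]
}
```

For the sample above, one could run: check_output out "this is a small house" && echo OK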
  3. Machine Translation Overview
  4. Noisy Channel Model
    • Concepts: encoding/decoding, probability, prior & posterior probabilities
    • Applications: machine translation, speech recognition, spellchecking, OCR, many others
    • Noisy Channel Model Formula
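For reference, the noisy channel model as it is usually written for machine translation is shown here. The notation is an assumption rather than something fixed by this page: f stands for the source (foreign) sentence and e for a candidate translation.

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \arg\max_{e} P(f \mid e)\, P(e)
```

Here P(e) is the prior (the language model), P(f | e) is the channel or translation model, and P(e | f) is the posterior. P(f) can be dropped because it does not depend on e.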
  5. Install and Prepare Text Corpora
    The following is adapted from the Moses documentation:
    1. Create a directory/folder to store your corpora (text files)
      mkdir ~/corpora
      					
    2. Go to that directory
      cd ~/corpora
      
    3. Download big parallel corpora
      wget -c http://www.statmt.org/wmt15/training-parallel-nc-v10.tgz
      wget -c http://www.statmt.org/wmt15/dev-v2.tgz
      
    4. Extract the compressed corpora
      tar zxvf training-parallel-nc-v10.tgz
      tar zxvf dev-v2.tgz
      
    5. Tokenize data (separate punctuation from words)
      ~/moses/scripts/tokenizer/tokenizer.perl -l de \
        < news-commentary-v10.de-en.de \
        > news-commentary-v10.de-en.tok.de
      ~/moses/scripts/tokenizer/tokenizer.perl -l en \
        < news-commentary-v10.de-en.en \
        > news-commentary-v10.de-en.tok.en
      
    6. Train a truecaser. This helps Moses learn which words should start with an uppercase letter, such as nouns in German.
      ~/moses/scripts/recaser/train-truecaser.perl \
        --model truecase-model.de \
        --corpus news-commentary-v10.de-en.tok.de
      ~/moses/scripts/recaser/train-truecaser.perl \
        --model truecase-model.en \
        --corpus news-commentary-v10.de-en.tok.en
      
    7. Truecase the input data, so that each word appears in its normal case. For example, a sentence-initial "Das" becomes "das".
      ~/moses/scripts/recaser/truecase.perl \
        --model truecase-model.de \
        < news-commentary-v10.de-en.tok.de \
        > news-commentary-v10.de-en.tok.truecase.de
      ~/moses/scripts/recaser/truecase.perl \
        --model truecase-model.en \
        < news-commentary-v10.de-en.tok.en \
        > news-commentary-v10.de-en.tok.truecase.en
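A parallel corpus is aligned line by line, so after the steps above the German and English files should still have the same number of lines. The following is a minimal sanity check; the function name same_line_count is made up for illustration.

```shell
# Hypothetical helper: succeed if two files have the same number of lines.
same_line_count() {
  # $(( ... )) normalizes the padded numbers that wc prints on some systems
  n1=$(( $(wc -l < "$1") ))
  n2=$(( $(wc -l < "$2") ))
  [ "$n1" -eq "$n2" ]
}
```

For example: same_line_count news-commentary-v10.de-en.tok.truecase.de news-commentary-v10.de-en.tok.truecase.en || echo "line counts differ"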
      
  6. Language Models
    • Concepts: bigram, trigram, & n-gram, history, perplexity, backoff, interpolation
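To make the bigram concept concrete, the sketch below counts bigrams in tokenized text with awk. It is only an illustration and not how Moses or its language model toolkits work internally; the function name bigram_counts and the sentence boundary markers <s> and </s> are assumptions.

```shell
# Hypothetical helper: read tokenized text on stdin (one sentence per line)
# and print "count word1 word2" for every bigram, most frequent first.
bigram_counts() {
  awk 'NF {
         prev = "<s>"                      # sentence start marker
         for (i = 1; i <= NF; i++) {
           count[prev " " $i]++            # history = previous word
           prev = $i
         }
         count[prev " </s>"]++             # sentence end marker
       }
       END { for (b in count) print count[b], b }' |
    sort -k1,1nr -k2
}
```

For example, printf 'das ist ein haus\ndas ist gut\n' | bigram_counts reports the bigram "das ist" with a count of 2. Since "das" occurs twice as a history, the maximum likelihood estimate of P(ist | das) in this toy corpus is 2/2 = 1.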
  7. Word Alignment
    • Overview, slides 17-30
    • Additional slides
    • Download the word alignment program MGIZA:
      cd ~/moses
      wget https://github.com/moses-smt/mgiza/archive/master.zip
      mv master.zip mgiza.zip
      					
    • Unzip the file
      unzip mgiza.zip
      mv mgiza-master mgiza
      cd mgiza/mgizapp
      					
    • Compile the program. You first need cmake and gcc (or clang).
      cmake .
      make -j 4
      cp scripts/*.{sh,py,pl} bin/
      					
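    • After you run EMS (below), the word alignments will be in ~/corpora/training/giza.1/en-de.A3.final.gz. Here is a sample alignment from the dataset; each German word is followed by the positions of the English words it is aligned to: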
      elected parliaments do not own our liberties .
      NULL ({ }) gewählte ({ 1 }) Parlamente ({ 2 }) besitzen ({ 5 }) unsere ({ 6 }) Freiheiten ({ 7 }) nicht ({ 3 4 }) . ({ 8 })
      					

      word alignment illustration
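The alignment lines in the sample above can be turned into a simple list of links with awk. This is a minimal sketch that assumes the layout shown in the sample (a word, then its positions between "({" and "})"); the function name a3_links is hypothetical.

```shell
# Hypothetical helper: read alignment text on stdin and print one
# "word position" pair per alignment link.
a3_links() {
  awk 'index($0, "({") > 0 {       # skip lines without alignment sets
         i = 1
         while (i <= NF) {
           word = $i
           i += 2                  # move past the word and the opening marker
           while (i <= NF && $i != "})") {
             print word, $i        # one link per listed position
             i++
           }
           i++                     # move past the closing marker
         }
       }'
}
```

For the sample sentence this prints, among others, the pairs "nicht 3" and "nicht 4", which correspond to "do" and "not" in the English sentence. The whole file could be processed with: gzip -dc ~/corpora/training/giza.1/en-de.A3.final.gz | a3_links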
  8. Moses Experiment Management System (EMS)
    1. Download the example configuration file:
      cd ~/corpora
      wget http://jon.dehdari.org/teaching/uds/moses/moses_ems_nc10.conf
      
    2. Edit the configuration file:
      1. nano moses_ems_nc10.conf
      2. Change the first few lines according to your computer's setup
      3. Press Ctrl + O (Strg + O on a German keyboard), then Enter, to save the file
      4. Press Ctrl + X (Strg + X) to exit the Nano text editor
    3. Do a dry run (Probelauf):
      ~/moses/scripts/ems/experiment.perl -config moses_ems_nc10.conf
    4. Do a real Moses run:
      nice ~/moses/scripts/ems/experiment.perl -config moses_ems_nc10.conf -exec
    5. The last command can take several hours to run, so you can just leave it running overnight. The output for the test set will be in the file ~/corpora/evaluation/newstest2009.truecased.number, where number is the number of the experiment run.
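Since the real run takes hours, it can help to detach it from the terminal and keep a log, so that closing the window does not stop the experiment. The following is a minimal sketch using the standard nohup and nice utilities; the function name run_detached and the log file name ems.log are made up for illustration.

```shell
# Hypothetical helper: run a command in the background at low priority,
# immune to hangups, with all output written to a log file.
run_detached() {
  # $1 = log file; remaining arguments = the command to run
  log=$1
  shift
  nohup nice "$@" > "$log" 2>&1 &
  echo $!                          # print the process ID of the job
}
```

For example: run_detached ems.log ~/moses/scripts/ems/experiment.perl -config moses_ems_nc10.conf -exec. Progress can then be followed with tail -f ems.log.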
  9. Phrase-based MT
    • Concepts: phrase extraction, log-linear model, weights
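The log-linear model and its weights are commonly written as follows. The notation is an assumption, not taken from this page: f is the source sentence, e a candidate translation, h_i are feature functions (for example phrase translation scores or a language model score), and lambda_i are their weights.

```latex
P(e \mid f) = \frac{\exp\left( \sum_{i=1}^{n} \lambda_i \, h_i(e, f) \right)}
                   {\sum_{e'} \exp\left( \sum_{i=1}^{n} \lambda_i \, h_i(e', f) \right)}
\qquad
\hat{e} = \arg\max_{e} \sum_{i=1}^{n} \lambda_i \, h_i(e, f)
```

The denominator does not depend on e, so the decoder only needs to maximize the weighted sum of feature values.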
  10. Hierarchical MT
    • Concepts: synchronous grammar, hierarchical phrase extraction


    German-English syntax/hierarchical machine translation illustration using hangers


    French-English syntax/hierarchical machine translation illustration using hangers



    German-English hierarchical phrase extraction example


  11. Computer-aided Translation


Term Paper / Hausarbeit


External Links



Die Umgangssprache ist ein Teil des menschlichen Organismus und nicht weniger kompliziert als dieser.
–Ludwig Wittgenstein      

(Machine Translation: "Maria did not slap the green witch." –Wittgenstein)