Moses for Mere Mortals – Short Tutorial


A translation chain for the real world

Table of Contents

Moses for Mere Mortals Short Tutorial

A. Purpose

A.1. Windows add-ins

A.2. Moses for Mere Mortals (the Linux scripts)

B. Requirements

B.1. System requirements

B.2. Software

C. Overview

D. Installation for users new to Moses

D.1. Demonstration corpus

D.2. First steps with the scripts

E. Using your own corpora

F. create-moses-irstlm-randlm script

G. Understanding the directory structure of $mosesdir

H. Names of the files and of the languages

I. Files needed for the training and what you need to know before training your own corpus

I.1. Where to put the files to be processed

I.2. Need of strictly aligned corpora files

I.3. Do not use spaces in file names

I.4. Corpus files

I.4.1. Using TMX files to create Moses corpus files

I.5. Test files

I.6. Language model file

I.7. Tuning files

I.7.1. maxruns parameter

I.8. Evaluation (testing) files

I.9. Recaser files

J. make-test-moses-irstlm-randlm script

J.1. The ominous control characters

K. train-moses-irstlm-randlm script

K.1. Description of some important parameters

K.2. Greatly increasing the training speed

K.3. Controlling tuning

K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of training already done in previous trainings

K.5. Training of an inverted corpus

K.6. Isolating a training from all the other ones

L. translate-moses-irstlm-randlm script

L.1. Speed

L.2. Reusing tuning weights (only for advanced users)

M. score-moses-irstlm-randlm script

M.1. Two types of scores

N. Utilities

N.1. transfer-training-to-another-location-moses-irstlm-randlm script

O. Windows add-ins

P. Improving quality and speed

Q. Deleting trained corpora

Q.1. You want to erase all the trainings that you have done

Q.2. You want to erase just some of all the trainings that you have done

Q.2.1. Scenario 1: More than one Moses installation available

Q.2.2. Scenario 2: Single Moses installation available

R. New features

S. How to contribute

T. Thanks

U. Author

APPENDIX: default parameters of each of the scripts

1) create-moses-irstlm-randlm script

2) make-test-moses-irstlm-randlm script

3) train-moses-irstlm-randlm script

4) translate-moses-irstlm-randlm script

5) score-moses-irstlm-randlm script

6) transfer-training-to-another-location-moses-irstlm-randlm script

A. Purpose

Moses-for-Mere-Mortals builds a translation chain prototype with Moses + IRSTLM + RandLM (with either the giza-pp or the MGIZA aligner). It can therefore process very large corpora. Its main aims are the following:

1) To help build a translation chain for the real world;

2) To guide the first steps of users that are just beginning to use Moses;

3) To enable a quick evaluation of Moses.

Even though the main thrust of this work centers on Linux (the operating system where Moses runs), translators usually work in an MS Windows environment. Therefore, two add-ins (collectively called Windows add-ins) help to make the bridge from Windows to Linux and then back from Linux to Windows.

This work thus involves both a Linux component (the Moses-for-Mere-Mortals.tar.gz package) and an MS Windows component (the Windows-add-ins.zip package). For a brief overview of the way they interact, please see http://moses-for-mere-mortals.googlecode.com/files/Overview.jpeg.

A.1. Windows add-ins

Translators also usually do not have corpora large enough to get excellent machine translation (MT) results, and the results they do get with MT depend heavily on training with a corpus that is highly representative of the domains they are interested in. The synergy between machine translation and translation memories is not often stressed, but it seems to us that it naturally leads to better results: machine translation can be enriched with the human translations stored in the translation memories; and translation memories, if they do not have a certain match percentage for a given segment, can be complemented with machine translation segments.

Therefore, the Windows add-ins are especially concerned with this synergy between MT and translation memories (namely those in the very well known TMX format). Translation memories are not, however, an obligatory part of this pack: the Linux scripts work with any perfectly aligned corpora files in UTF-8 format with Linux line endings.

You will find a README file with instructions in each of the 2 Windows add-ins:

1) Extract_TMX_Corpus, which converts a whole directory of TMX files into a Moses corpus that can be used for training;

2) Moses2TMX, which converts a batch of Moses translations and their corresponding source language documents into TMX files.

The add-ins will therefore not be mentioned in the rest of this Help/Short Tutorial file.

A.2. Moses for Mere Mortals (the Linux scripts)

Moses for Mere Mortals, the Linux component of this work, automates the installation, the creation of a representative set of test files, the training, the translation and even the scoring tasks. It also comes with a demonstration corpus (too small to do justice to the qualitative results that can be achieved with Moses, but capable of giving a realistic view of the relative duration of the steps involved). For building very large Moses corpora using your own translation memories (*.TMX files), please see http://moses-for-mere-mortals.googlecode.com/files/Extract_TMX_Corpus_1.041.exe. If, on the other hand, you want to transfer Moses translations to a *.TMX translation memory tool (e.g., SDL Trados Translator's Workbench), you can use Moses2TMX (http://moses-for-mere-mortals.googlecode.com/files/Moses2TMX-1.032.exe). Together, these 2 open source programs make the link between Linux (where the corpora are trained and the translations are made) and Windows (where most translators actually use Moses translations).

Users of these scripts should, after having tried the demonstration corpus, be able to immediately get results with the real corpora they are interested in.

These scripts also avoid the destruction of previously trained corpora by the subsequent training of a new corpus and simultaneously try to reuse the relevant parts of previous trainings in the subsequent training of a new corpus.

It is also possible to train corpora where every word is presented together with its respective lemma and part of speech tag (factored training). The present scripts do not cover this type of training.

Moses-for-Mere-Mortals scripts are based on instructions from several sources, especially the http://www.dlsi.ua.es/~mlf/fosmt-moses.html and the http://www.statmt.org/moses_steps.html web pages and the Moses, IRSTLM, RandLM, giza-pp and MGIZA documentation, as well as on research on the available literature on Moses, namely the Moses mailing list (http://news.gmane.org/gmane.comp.nlp.moses.user). The comments transcribe parts of the manuals of all the tools used.

Moses MT System is an open source project under the guidance of Philipp Koehn, University of Edinburgh, and is supported by the European Commission Framework Programme on Research and Technological Development and others.

For information on the general concepts of Statistical Machine Translation, see "Statistical Machine Translation" by Philipp Koehn, Cambridge University Press, 2010.

B. Requirements

B.1. System requirements

Linux distribution (Ubuntu 9.04, 64 bits)

These scripts have been tested in Ubuntu 9.04, 64 bits. Ubuntu 9.10 is still a very recent distribution and we have had some reports that IRSTLM (as well as SRILM) had compilation problems there. We therefore recommend that you use Ubuntu 9.04. If you have already installed Ubuntu 9.10, you can't go back to version 9.04, but you can install the latter version as well in the same computer (see http://ubuntuforums.org/showthread.php?t=1325920 ; if you are new to Linux, this could be a good time to ask for the help of a friend). Parallel training only works in the 64-bit version. The scripts should also work in other Linux distributions with slight changes, but they have not been tested in any other distribution.

Computer:

Minimum 2 GB RAM (preferably much more)

Preferably a fast multiprocessor computer

Disk space: as a rough rule, the disk space needed for corpus training is approximately 100 times the size of the corpus (source plus target files). For example, a corpus whose source and target files together occupy 200 MB may need on the order of 20 GB of free disk space.

B.2. Software

In order to use the scripts, the following packages should already be installed in Ubuntu 9.04:

1) subversion

2) automake

3) autoconf

4) bison

5) boost-build

6) build-essential

7) flex

8) help2man

9) libboost1.37-dev

10) libpthread-stubs0-dev

11) libgc-dev

12) zlibc

13) zlib1g-dev

14) gawk

15) tofrodos

You can install them by selecting the System menu and then the Administration > Synaptic Package Manager command.
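
If you prefer the command line, the same packages can be installed in a single command with apt-get (a sketch assuming Ubuntu 9.04's standard repositories; the package names are the ones listed above):

sudo apt-get install subversion automake autoconf bison boost-build build-essential flex help2man libboost1.37-dev libpthread-stubs0-dev libgc-dev zlibc zlib1g-dev gawk tofrodos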

C. Overview

These scripts have only been tested in an Ubuntu 9.04 64 bits environment. Before using them, you should install the Linux packages upon which they depend (see section B. Requirements).

Note: So as to avoid having to rewrite this Help file every time a script version changes, in what follows the version numbers of the scripts have been omitted (for example, we write create-moses-irstlm-randlm instead of create-moses-irstlm-randlm-1.27).

1) You should start using these scripts by opening the create-moses-irstlm-randlm script and changing some important parameters there that will adapt your installation to your hardware (see section F. create-moses-irstlm-randlm script).

2) Then launch the create-moses-irstlm-randlm script, which will download and compile ***all*** the Moses packages. Since some compilation errors do not stop compilation but can lead to an unusable tool, the create-moses-irstlm-randlm script checks, at the end of the compilation of each tool, whether the appropriate files are present or not.

3) Select a set of corpus files (2 strictly aligned files in UTF-8 format, one in the source language and another in the target language) for Moses training and launch the make-test-moses-irstlm-randlm script, which creates a representative set of segments that will be used for testing the corpus that you will train and erases those segments from the corpus files that will be used for training. This step is not obligatory but is highly recommended.

NOTE 1: Even if you do not have your own corpus for training, you can use the demo corpus that comes with these scripts (you do not need to do anything to have this corpus: the create-moses-irstlm-randlm script takes care of this). It is highly recommended that new Moses users start using these scripts using the demo corpus.

NOTE 2: If you do not have text files in UTF-8 format (necessary for Moses), but you do have translation memories in *.TMX format, you can use the Extract-TMX-Corpus tool (http://code.google.com/p/extract-tmx-corpus/) to create the adequate UTF-8 files from your TMX files.

4) Next, the train-moses-irstlm-randlm script trains a corpus composed exclusively of plain words (non-factored training). This script does include some advanced features, like memory-mapping (for saving memory resources, which is essential for processing large corpora), tuning (in order to get qualitatively better results) and the ability to change certain parameters that can either reduce the processing time or increase the quality of the results.

5) The translate-moses-irstlm-randlm script then translates one or more documents placed in a specific directory.

6) Finally, the score-moses-irstlm-randlm script allows you to score Moses translations against human translations of the same text, giving BLEU and NIST scores for either the whole document or for each segment of the document (depending on the settings that you define).

7) If you want to transfer your trained corpora to someone else or to another Moses installation (even one in the same computer), the transfer-training-to-another-location-moses-irstlm-randlm script helps you do that.

Users are expected to open the scripts and to change the parameters according to their needs (the parameters are preceded by comments that explain their purpose and sometimes even their allowable values and limits; many of those comments are citations of the Help of the several packages that Moses uses and of the Moses manual). These parameters are set at the top of each script in a clearly defined section.

D. Installation for users new to Moses

D.1. Demonstration corpus

Moses-for-Mere-Mortals comes with a small demonstration corpus so that you can quickly see what the scripts can do. This corpus is automatically installed by the create-moses-irstlm-randlm script (no need for you to do anything) and is used by the other scripts.

It was extracted from DGT-TM Acquis available from the Commission's Joint Research Centre website (please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic).

The corpus is small (200 000 segments in the Portuguese and English languages) and the results of its processing cannot be seen as representative of the quality Moses can achieve (especially if you consider that IRSTLM and RandLM are intended to process corpora with several to many millions of segments).

However, a small corpus like this will reveal facts about Moses (like the length of time needed for each of its steps), and it is therefore highly recommended that you start using the scripts with their default settings. Later on, you can change them so that the actual work you are interested in gets done.

If you don't change the default settings in the create-moses-irstlm-randlm and the train-moses-irstlm-randlm scripts, Moses will train this Portuguese-English corpus, which involves 300 000 segments for language model building and 200 000 segments for corpus training. The create-moses-irstlm-randlm script transfers the necessary files to the right place, so that you do not have to do anything.

This corpus had a BLEU score of 0.6702 and a NIST score of 11.6946 with a 5-gram language model. Even though small, it took some time to train in a machine with an Intel i7 720-QM processor and 8 GB of DDR3 RAM (3h 08m 39s without tuning and 10h 02m 20s with tuning limited to a maximum of 10 iterations). The training of a corpus with 6.6 million segments in this same machine took 2 days 22h 59m (without tuning).

D.2. First steps with the scripts

Here you will find a description of how to run the whole demo, with minimal changes to the scripts' parameters. In most cases, you just have to launch the scripts as indicated below.

However, if you want to use your own corpora right away, skip this section and go directly to Section E.

You can launch the scripts from wherever you saved them. However, create-moses-irstlm-randlm (installation) should be launched first, make-test-moses-irstlm-randlm next, train-moses-irstlm-randlm (corpus training) next and translate-moses-irstlm-randlm (actual translation) last. After having Moses translations, you can optionally score them against reference (human) translations with the score-moses-irstlm-randlm script. The $mosesdir parameter of most of the scripts should always have the same value (e.g., $HOME/moses-irstlm-randlm; if your login name is john, then $HOME means /home/john).

1) Download the pack http://moses-for-mere-mortals.googlecode.com/files/Moses-for-Mere-Mortals-0.97.tar.gz to your $HOME directory (if it is not there already), right-click it and, in the contextual menu that appears, click the Extract here... command. Alternatively, in the Linux Terminal, place yourself in the directory that contains this pack and extract its contents with the following command:

tar -xzvf Moses-for-Mere-Mortals-0.97.tar.gz

A new Moses-for-Mere-Mortals directory will be created and the scripts will be placed in its scripts subdirectory: create-moses-irstlm-randlm, make-test-moses-irstlm-randlm, train-moses-irstlm-randlm, translate-moses-irstlm-randlm, score-moses-irstlm-randlm and transfer-training-to-another-location-moses-irstlm-randlm.

2) If your computer has just one processor, you can go directly to step 3. If it has more than 1 processor and if you want Moses to use them, see point 1 of section F and follow the instructions to change the number of processors. This change must be done before you launch the create-moses-irstlm-randlm script.

Tip: You can run the Demo without changing this parameter. When you really start working with Moses, you can install it again with the number of processors you wish, so that you can fully use the potential of your computer.

3) Launch the create-moses-irstlm-randlm script: open the Linux Terminal (console), place yourself in the directory where the scripts are and type:

./create-moses-irstlm-randlm

4) Now you are going to extract from the demo corpus a set of segments that will be used for testing the trained corpus. That same set of segments will be erased from the demo corpus files (which will later be used for training):

./make-test-moses-irstlm-randlm

5) Training time! If your computer has more than 1 processor and you want them to be used right away, please see section K. Then, again in the Terminal, type:

./train-moses-irstlm-randlm

6) You can now translate a text with the trained corpus you created in step 5 (the create-moses-irstlm-randlm script has already put the file to be translated, 100.pt, in the $mosesdir/translation_input directory).

You have to indicate to the translate-moses-irstlm-randlm script the trained corpus that you want to use. In order to do that:

a) Go to the $mosesdir/logs directory and copy the name of the file that indicates the corpus that you have just trained (if this is the first time you are using these scripts and you have done a single training, there will be just one file there);

b) Open the translate-moses-irstlm-randlm script;

c) Set the value of the translate-moses-irstlm-randlm $logfile parameter to the name of the file you identified in the $mosesdir/logs directory (paste that value there).

d) Save your changes.

Now, you can launch the translate-moses-irstlm-randlm script:

./translate-moses-irstlm-randlm

7) If you want, you can score the Moses translation against a reference (human) translation:

./score-moses-irstlm-randlm

Again, the create-moses-irstlm-randlm script has already copied the necessary reference (human) translation to the $mosesdir/translation_reference directory (no need for you to do anything). By default, it will score the whole document and give you the BLEU and NIST scores for the whole document.

You will probably notice that the score you get is much lower than the one you got at the end of the training test. That is normal and serves as a warning: even a well trained corpus with a very good score will perform poorly if it is used to translate segments that come from a domain quite different from the one used for training.

After this first practical experience, you can now read the next section in order to learn how to better control all these processes.

E. Using your own corpora

You should do the same steps as described above for new users, but, before launching each of the scripts, you should change the values of the parameters you are interested in (as you have already done before using the translate-moses-irstlm-randlm script).

So as to ease your task, the description of each of the scripts in this Help/Short Tutorial file starts with a section entitled Vital parameters, which lists the strict minimum of parameters that you should change if you want to train a corpus different from the demonstration corpus.

In fact, in order to train your own corpora, you have to define your own settings and choose your own corpora files and the languages you are interested in. You might also want to change the parameters of Moses or of the packages it uses. In order to do that, before launching the scripts, open them and set the variables defined between

###################################################################################

# The values of the variables that follow should be filled according to your needs: #

###################################################################################

and

###################################################################################

# End of parameters that you should fill #

###################################################################################

Each parameter is preceded by a comment that describes its function and, in some cases, even states the allowable and default values you can use. These comments often consist of extracts of the Help files, readmes or manuals of the several packages used.

Please refer to the sections that follow, which describe each of the scripts and some important info in more detail.

F. create-moses-irstlm-randlm script

Vital parameters: mosesdir, mosesnumprocessors

This is a shell script that creates a Moses system.

1. Go to the Moses-for-Mere-Mortals/scripts directory and open the create-moses-irstlm-randlm script. At the top of the script, you can change several variables that allow you to better adapt it to your own requirements:

$mosesdir: this is the base directory where Moses will be installed (default value: $HOME/moses-irstlm-randlm). You can change both its name (in this case, moses-irstlm-randlm) and its location (in this case, $HOME). This variable is defined in all the scripts and its value should be the same in all of them if you want them to be able to work together (and you do want that!).

$mosesnumprocessors: the number of processors of your computer that you want Moses to use (Moses will be compiled to make better use of them). The default value is 1, but nowadays you can easily find computers with 2, 4 and even 8 processors. If your computer has more than one processor, change this parameter so that it reflects the number of processors that you want to make available for Moses.
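
As an illustration, these two settings at the top of the script might look like the following sketch (the variable names come from this tutorial; the exact surrounding lines in your version of the script may differ, and 4 is merely an example value for a quad-core computer):

mosesdir=$HOME/moses-irstlm-randlm   # base directory where Moses will be installed
mosesnumprocessors=4                 # processors Moses may use (default: 1)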

2. Save your changes. Do not change any other parameter for the time being (later on, after you have run all these scripts, you can start your own very personal experiences).

This script also creates some important directories:

$mosesdir/corpora_for_training: this is the directory where the corpora and the other input files for training are located; you will already find there several files that can be used to run the training demo; you should place here all the files needed for training the corpora you are interested in.

$mosesdir/corpora_trained: this is the directory where the files created during corpus training are kept; please do not touch this directory, since you can destroy the training of one or several corpora;

$mosesdir/logs: this is the directory where the training summary files are located; these files have the name of several variables that will be used by the translate-moses-irstlm-randlm script (and which this latter script will extract for you); these files are very important because they are the only way to indicate to the translate-moses-irstlm-randlm script the trained corpus you want to use for translation;

$mosesdir/toolsdir: this is the directory where both Moses and all the other tools (giza-pp, irstlm and so on) will be installed; this directory will not change during the training and you should not change it.

G. Understanding the directory structure of $mosesdir

The directory structure described in this section results from the execution of all the scripts. These directories are created at the time they are needed; right after installation, for example, not all of them will exist yet.

1. Once you have installed Moses with create-moses-irstlm-randlm, you should put the corpus files that you want to train in the $mosesdir/corpora_for_training directory. You should also place here the files used for creating the language model (if different), for training recasing (if different), for tuning (if any), and for testing the results of training (if any).

2. If you then use the train-moses-irstlm-randlm script for training that corpus, a directory $mosesdir/corpora_trained will be created to store the trained files.

NOTE: Even though you might suppose that the $mosesdir/corpora_trained directory is a vital one (a correct assumption), you are strongly urged not to change any of its contents. This is because, to allow reusing of the work done in previous trainings, it has a complex structure that mixes files from several trainings.

If you change it, you risk destroying not just one, but several trainings. You can, however, use a specific training already done by referring to its logfile (see below), which can be found in the $mosesdir/logs directory. It is also possible to isolate a training from all others (please refer to section K.6 in order to learn how to do that).

3. At the end of the training, a training summary file (logfile) will be created in the $mosesdir/logs directory. This file is very important because its name will be used in the translate-moses-irstlm-randlm script to indicate the trained corpus you want to use for translation.

4. Once a corpus is trained, you can start using it to get actual translations with the translate-moses-irstlm-randlm script. Place the documents to be translated (you can translate one or many documents at the same time) in the $mosesdir/translation_input directory (created by the train-moses-irstlm-randlm script) and then launch the translate-moses-irstlm-randlm script. You should also make sure, before the translation starts, that the files placed in the translation_input directory are indeed adequate for the trained corpus that this script uses (for instance, the languages used for training should match those of the translation; otherwise, you will waste time translating files that shouldn't have been translated using that trained corpus, e.g. because their language is not adequate).

5. Translation can have 2 types of outputs:

a) a normal Moses translation, if you set the $translate_for_tmx parameter to a value different from 1 (default: 0); or

b) a Moses translation especially suited for making translation memories, if you set the $translate_for_tmx parameter to 1.

The normal translation will be located in the $mosesdir/translation_output directory.

The translation intended to build TMX translation memories will appear, together with the corresponding modified input file, in the $mosesdir/translation_files_for_tmx directory.

In both cases, the translation will have the name of the source document plus an ending that corresponds to the target language and a final .moses suffix. This avoids confusion between the source document and its translation and between the Moses translation and a reference translation.

6. A new script (score-moses-irstlm-randlm) enables you to place a reference (that is, human) translation in the translation_reference directory and get the BLEU and NIST scores of the corresponding Moses translations in the translation_scoring directory.

Again, you have 2 choices:

a) get a score for the whole document, if the parameter $score_line_by_line is different from 1; or

b) get a score line by line, with the segments ordered by ascending BLEU score, if the parameter $score_line_by_line is equal to 1.

7. The contents of the $mosesdir/tools directory should not be changed, since it includes all the files needed for Moses to work.

H. Names of the files and of the languages

The names of the files and languages, which are used to create some directory names, should not include spaces or special characters such as asterisks, backslashes or question marks. Try to stick to letters, numbers, dashes, dots and underscores if you want to avoid surprises. Avoid using a dash as the first character of a file name, because most Linux commands will treat it as a switch. If a file name starts with a dot, it will become a hidden file.

I. Files needed for the training and what you need to know before training your own corpus

I.1. Where to put the files to be processed

All the files that are going to be mentioned should be put in $mosesdir/corpora_for_training (as described above, mosesdir is the base Moses system directory, whose default value is $HOME/moses-irstlm-randlm).

I.2. Need of strictly aligned corpora files

Be very sure that the corpus files you use are strictly aligned. Otherwise, you risk getting quite puzzling errors. At the very least, check the number of lines of the source language file and of the target language file, which should be exactly the same. In order to do this, type the following commands in your terminal:

wc -l {name_of_source_language_file}

and

wc -l {name_of_target_language_file}

You may also want to check that the last line of the source language file does correspond to the last line of the target language file.
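
For example, with the demo-style file names used later in this tutorial (hypothetical here), the whole check could look like this:

wc -l corpus1000.pt corpus1000.en   # the two line counts must be identical
tail -n 1 corpus1000.pt             # the last source segment...
tail -n 1 corpus1000.en             # ...should correspond to the last target segment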

I.3. Do not use spaces in file names

The name of all the files used in these scripts should not include spaces.

I.4. Corpus files

At the very least, you need 2 parallel corpus files for training (one source language file and one target language file, where the line number of a segment in the source language file is exactly equal to the line number of its translation in the target language file, that is, those 2 files should be strictly aligned). The way the script works requires that both files share a common $corpusbasename (a variable that you set in the train-moses-irstlm-randlm script), that is, a prefix that identifies the corpus in question (e.g., corpus1000), and a suffix that identifies the language in question (e.g., en, fr or pt). The names of the corpus files should be composed exclusively of this prefix and this suffix. If you want to create a corpus1000 for the pt and en languages (pt being the source language and en the target language in all the examples that follow), you therefore need the following files:

corpus1000.pt

corpus1000.en

I.4.1. Using TMX files to create Moses corpus files

If you do not have separate and strictly aligned files (one in the source language and another in the target language), but you do have *.TMX translation memory files, you can use Extract-TMX-Corpus (http://code.google.com/p/extract-tmx-corpus/) to transform such *.TMX files into files that Moses can use (namely, it converts them to UTF-8 format). This program also does some cleaning that will prevent Moses errors during the training phase.

I.5. Test files

Starting with the present version, a new script, make-test-moses-irstlm-randlm, allows you to use these 2 files to obtain a representative sample of their segments; it also creates 2 new corpus files in which the lines of the segments extracted for testing have been erased. The extracted segments can still occur in those new corpus files, since a segment may occur on more than one line, but, unless your corpus is very repetitive, it is likely that your test files will mostly contain text that no longer exists in the corpus to be trained. The new corpus files created by this script (whose names are given at the end of its execution) should then be the files used for setting the $corpusbasename. If you had started it with the example files given above, it would create 4 new files:

corpus1000.for_train.pt (a file used for corpus training)

corpus1000.for_train.en (a file used for corpus training)

corpus1000.for_test.pt (a file used for testing the trained corpus)

corpus1000.for_test.en (a file used for testing the trained corpus)

I.6. Language model file

You also need a single file for building the language model (this file must contain segments only in the target language). If you want, you can use the target language file used for the training, in which case you do not need a new file for this purpose. However, you might want to use a (possibly much bigger) file for building the language model, since this is a very important building-block that costs comparatively little time to process. A language model file that is bigger than your corpus can lead to somewhat better results. If you want to use a specific file for this purpose, you need a file, created by yourself, whose prefix is arbitrary (e.g., 145000), but whose suffix is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 145000 segments for building the language model, you could name it:

145000.en

I.7. Tuning files

The tuning of a corpus should lead in principle to better translation results. For that, 2 conditions must be met: 1) the tuning parameter must be set to 1 in the train-moses-irstlm-randlm script (this is its default value); 2) you need two files: one for the source language and one for the target language.

Again, you can use the files described in point I.4, in which case you do not need any new files. However, given that tuning is a very, very long process (perhaps the longest of all), you might want to use a set of files with a ***much smaller size*** than those described in point I.4. On the other hand, files that are too small generally lead to a large number of tuning iterations, while files that are too big take more time to process. You have to strike a balance between these two extremes. Such files should have a common arbitrary prefix and, as suffix, the language abbreviations already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:

100tuning.pt

100tuning.en

I.7.1. maxruns parameter

Tuning is a phase that can easily take more time than all the others put together, and its duration can't easily be estimated beforehand, since the number of tuning runs is highly variable from corpus to corpus. The number of runs can be limited through the $maxruns parameter of the train-moses-irstlm-randlm script (default: 10); see section K.3 for details.

I.8. Evaluation (testing) files

If you want to evaluate the training of your corpus, you'll need 3 files: one for the source language, another for the human translation in the target language, and the Moses translation file (in order to create this latter file during training, the runtrainingtest parameter should be set to 1 in the train-moses-irstlm-randlm script; this is its default value).

You can use the files described in point I.4 for that purpose, in which case you need no new files. You can also use files especially created for the purpose, or you can use the make-test-moses-irstlm-randlm script to create them (see note below).

Such files should have a common arbitrary prefix and, as suffix, the language abbreviations already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:

1000.for_test.pt

1000.for_test.en

NOTE: Since the choice of the segments used for the evaluation can considerably affect its results, a script (make-test-moses-irstlm-randlm) was made that divides the corpus files into X sectors and randomly chooses Y segments in each of those sectors, based on random segment line numbers that it generates. The resulting test files are probably more representative of the several contexts and vocabularies of your corpus than a set of consecutive segments would be.

I.9. Recaser files

You also need a (possibly much bigger) file in the target language for recaser training. This is a comparatively quick process and you can therefore invest in it. The recaser training file has a prefix that is arbitrary (e.g., 10000recaser), but a suffix that is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 10000 segments for recaser training, you could name it:

10000recaser.en

J. make-test-moses-irstlm-randlm script

Vital parameters: lang1, lang2, mosesdir, basefilename

This script assumes that a Moses installation has already been done with the create-moses-irstlm-randlm script. It uses two aligned input files, one in the source and another in the target language, whose $basefilename should be equal and which differ by a suffix that indicates their respective languages (e.g., 200000.pt and 200000.en, the $basefilename being, in this case, 200000).

The script produces 2 new sets of files:

1) 2 files (one in the source language and another in the target language) used for testing the trained corpus; for that, it divides the corpus into X $sectors (a parameter that you can define) and then randomly selects Y $segments (another parameter you can define) in each sector. All the selected segments will have different line numbers (no line can be chosen more than once). This procedure offers a better guarantee that the segments used for testing the trained corpus are representative of all the styles and contexts of the corpus than arbitrarily choosing the same number of consecutive segments somewhere in the input files would. These files have a basefilename equal to $basefilename.for_test and will later be used by the train-moses-irstlm-randlm script (e.g., 200000.for_test.pt and 200000.for_test.en).

2) 2 files (one in the source language and another in the target language) that are equal to the starting files, except that the segments used for creating the 2 test files have been erased from them. These files have a basefilename equal to $basefilename.for_train and will later be used by the train-moses-irstlm-randlm script (e.g., 200000.for_train.pt and 200000.for_train.en).
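
As an illustration, the parameters involved might be set as in the following sketch (the names come from this tutorial; the values shown are hypothetical):

mosesdir=$HOME/moses-irstlm-randlm
lang1=pt            # source language abbreviation
lang2=en            # target language abbreviation
basefilename=200000 # common prefix of the two aligned input files
sectors=10          # X: number of sectors the corpus is divided into
segments=100        # Y: segments randomly selected in each sector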

NOTE: if you want to compare the relative results of a change in training parameters, you should execute the training test before and after the change in parameters with the same set of test files (run make-test-* just once and use the test files it creates to test both trainings).

J.1. The ominous control characters

The Moses clean-corpus-n.perl script does not erase control characters, which has led, in our experiments, to a crash during the memory-mapping of a reordering table (with a control-K character), in one case several days after the training had started. Problems with this character have also been described in the Moses mailing list. Therefore, this script (as well as the train-moses-irstlm-randlm script) also substitutes a space for all instances of the control-G, control-H, control-L, control-M and control-K characters in all the files.
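
If you ever need to apply the same cleanup to a file by hand, something like the following should be equivalent (a minimal sketch; the file names are hypothetical):

tr '\a\b\f\r\v' '     ' < 200000.pt > 200000.clean.pt  # control-G/H/L/M/K each become a space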

K. train-moses-irstlm-randlm script

Vital parameters: mosesdir, lang1, lang2, corpusbasename, lmbasename, tuningbasename, testbasename, recaserbasename, reuse, paralleltraining, memmapping, tuning, runtrainingtest, lngmdl, Gram, mgizanumprocessors, maxruns

This script assumes that a Moses installation has already been done with the create-moses-irstlm-randlm script and can optionally use the files created by the make-test-moses-irstlm-randlm script.

Even though it might not cover all the features that you might like to play with (namely those necessary for factored training, which would involve the use of a lemmatizer and/or a part-of-speech tagger), this script does allow you to train a corpus, memory-map your training files (so that Moses uses less RAM), do tuning, do a training test (also with memory-mapping), and get the NIST and BLEU scores of that test. It also makes available all the parameters used by IRSTLM, RandLM, mkcls, GIZA and MGIZA, as well as selected parameters used by the Moses decoder and the Moses scripts. These parameters are all set by default to the values they receive when you use the Moses scripts. If you are new to Moses, do not change them.

If your computer has more than 1 processor, you should also change here the mgizanumprocessors parameter (set by default to 1) to the actual number of processors of your computer that you want to use with MGIZA. Just open the train-moses-irstlm-randlm script, search for the name of this parameter, change it and save your changes.

At the very least, this script will build a language model and train a corpus (unless they already exist, in which case it will not rebuild them).

Other steps are optional: memory-mapping, tuning and testing. In order to set the steps that will be executed, you have to change the parameters at the top of the script. The role of those parameters is also indicated there (in comments that precede each one of them).

The directory structure created by these scripts ensures 2 things: 1) no training will interfere with the files of a previous training; and 2) a posterior training will reuse as much as possible the files created in previous trainings.

At the end of the training, a logfile (training summary) file will be created in the $mosesdir/logs directory. It includes details about the duration of the several phases of training, values that will be used when you translate files based on this trained corpus, a list of the main input files, a list of all the files created during training, a list of all the parameters used and the score of the trained corpus test (if a test was done). The name of this file is most important because it is used by the translate-moses-irstlm-randlm script to select the trained corpus that you want to use for translation.

K.1. Description of some important parameters

1. reuse: if set to 1, any step already done in a previous training will not be redone;

2. memmapping: if set to 1, memory-mapping of the corpus will be done (this will diminish your RAM requirements)

3. tuning: if set to 1, tuning will be done; this should lead in principle to better results, but that does not always happen in practice; tuning can easily take more time than all the other steps combined

4. runtrainingtest: if set to 1, a test of the training will be done with scoring

5. lngmdl: package chosen to build the language model (in these scripts, you have 2 choices: 1 = IRSTLM; 5 = RandLM)

6. Gram: n-gram order; can significantly influence the results; the higher, the better (maximum value: 9), but the execution time will suffer; normally between 3 and 9 (default: 5)
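
Put together, the switches above might be set as in the following sketch (the parameter names come from this tutorial; the defaults shown for tuning, runtrainingtest, Gram and maxruns are the ones this tutorial states, while the other values are merely illustrative):

reuse=1             # do not redo steps already done in previous trainings
memmapping=1        # memory-map the corpus to reduce RAM requirements
tuning=1            # default: 1; can take longer than all other steps combined
runtrainingtest=1   # default: 1; test the training and score it
lngmdl=1            # 1 = IRSTLM; 5 = RandLM
Gram=5              # default: 5; n-gram order, normally between 3 and 9
maxruns=10          # default: 10; maximum number of tuning runs (-1 = unlimited)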

K.2. Greatly increasing the training speed

The parameter mgizanumprocessors determines the number of processors of your computer that will be used during phase 2 (MGIZA) of training. Since this phase is the longest one, you should set this parameter to the number of processors (actual or virtual) of your computer.

K.3. Controlling tuning

Tuning is a phase that can easily take more time than all the others put together. Furthermore, you can't easily estimate its duration beforehand, since the number of its runs is highly variable from corpus to corpus.

Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces some changes to the original Moses script so that the user can control the number of tuning runs through the parameter $maxruns of the train-moses-irstlm-randlm script.

A value of -1 means that an unlimited number of runs is allowed. Any value >= 2 means that tuning will be stopped after that many runs. The default value is 10. Good values lie between 5 and 10 (choose the lower end of this range if you want to speed things up, at the risk of a worse tuning).

K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of training already done in previous trainings

In order to guarantee that the training of a corpus doesn't destroy files from a previously trained corpus and in order to ensure that each training reuses as much as possible the files already created in previous trainings, a complex (and confusing) directory structure was created. This, however, implies that the files of all the previous trainings are dispersed in the $mosesdir/corpora_trained directory. As already stated, this is a directory which you shouldn't change, since, by doing that, you can destroy not just one but even several trainings.

However, a $mosesdir/logs directory exists where you can find a summary of every training you made that describes, among other things, the parameters it used and the files it created. In order to use a trained corpus for translation you just have to copy the name of its log file into the $logfile parameter of the translate-moses-irstlm-randlm script. Nothing else is necessary for that and indeed you can ignore where the trained corpus files are for all practical purposes.

K.5. Training of an inverted corpus

A particular case of reusing previously made files is the training of the inverse of a corpus already trained (suppose en->pt when the pt->en trained corpus already exists). In that case, and if all the training parameters stay equal in both instances, this script detects the situation and reuses the tokenized, cleaned and lowercased files of the previous training, as well as phases 1 and 2 (GIZA or MGIZA, one of the longest steps of training) of the previous training. This can lead to savings of up to 25% of the execution time.

K.6. Isolating a training from all the other ones

You might feel tempted to isolate a training from all the other ones, perhaps because it is very important for you. However, so that a given training does not erase any part of a previous one and so that it can reuse as much as possible the steps already done in previous trainings, the files of the several trainings are interspersed in the $mosesdir/corpora_trained directory. There is, nevertheless, an easy way to isolate a given training that you are going to make from all the other preceding ones, if you insist on that:

1) Rename the $mosesdir/corpora_trained directory to $mosesdir/corpora_trained-bak;

2) Rename the $mosesdir/logs directory to $mosesdir/logs-bak;

3) Do the training of the corpus you want to isolate from all others (this will create new $mosesdir/corpora_trained and $mosesdir/logs that will just contain the trained corpus data);

4) Move the newly created $mosesdir/corpora_trained and the $mosesdir/logs directories to a safe place (outside $mosesdir);

5) Rename the $mosesdir/corpora_trained-bak directory back to $mosesdir/corpora_trained;

6) Rename the $mosesdir/logs-bak directory to $mosesdir/logs.

In order to be able to reuse the training that you isolated, simply repeat steps 1) and 2) and move its corpora_trained and logs directories back to $mosesdir.
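
In the terminal, the whole procedure might look like this sketch (assuming $mosesdir is set in your shell and that /safe/place is a directory of your choosing):

mosesdir=$HOME/moses-irstlm-randlm
mv $mosesdir/corpora_trained $mosesdir/corpora_trained-bak   # step 1
mv $mosesdir/logs $mosesdir/logs-bak                         # step 2
./train-moses-irstlm-randlm                                  # step 3: train the corpus to isolate
mv $mosesdir/corpora_trained /safe/place/                    # step 4
mv $mosesdir/logs /safe/place/
mv $mosesdir/corpora_trained-bak $mosesdir/corpora_trained   # step 5
mv $mosesdir/logs-bak $mosesdir/logs                         # step 6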

L. translate-moses-irstlm-randlm script

Vital parameters: mosesdir, logfile, translate_for_tmx (if this latter parameter is set to 1, then look also minseglen, othercleanings, improvesegmentation, removeduplicates)

This script assumes that Moses with IRSTLM and RandLM has been created with create-moses-irstlm-randlm and that a trained model exists already for the type of language pair you want to translate (which requires you to have already run train-moses-irstlm-randlm).

This script translates, using the trained model that you chose in its $logdir and $logfile parameters, the file or files that you yourself have put in $mosesdir/translation_input. In fact, it will translate in one step all the files that are there.

It is very important that you fill in the $logdir and $logfile parameters correctly, since they are the only way of telling the script which trained corpus you want to use for translation. By default, $logdir has the value $HOME/moses-irstlm-randlm/logs; if you haven't changed this parameter, you just have to go to this directory, identify the file that corresponds to the corpus you want to use and copy its name (omit the path!) into $logfile.

Translation can have 2 types of outputs:

1) A normal Moses translation, if the $translate_for_tmx parameter is set to 0 (default value). The normal translation will be located in the $mosesdir/translation_output directory.

or

2) A Moses translation especially suited for using in a translation memory tool, if you set the $translate_for_tmx parameter to 1. This type of translation will be located, together with the modified input file, in the $mosesdir/translation_files_for_tmx directory. It is especially interesting for those who use machine translation together with translation memories (notably those who just use MT segments when there are no translation memory segments above a certain match percentage).

By default, translate_for_tmx=0, which means the script will do a "normal" translation. This type of translation fully respects the formatting of the original text and therefore keeps long paragraphs, which, according to some sources, do not lead to the best results. That, however, didn't stop us from getting very respectable BLEU and NIST scores, as you can see for yourself if you try the demo corpus.

If you set $translate_for_tmx to 1, then other parameters will be activated:

a) $minseglen: if set to a value greater than 0, all segments shorter than minseglen will be erased; if set to -1, no segments will be erased, whatever their length; default value: -1;

b) $othercleanings: if set to 1, tabulation signs will be replaced by newlines and lines composed only of digits, spaces and parentheses will be removed; default value: 1;

c) $improvesegmentation: if set to 1, replaces any of the characters [:;.!?] followed by a space by that character followed by a newline, deletes empty lines and substitutes double spaces by a single space; default value: 1;

d) $removeduplicates: if set to 1, removes duplicated segments; default value: 1

If you want to do a scoring of the Moses translation (for that, you need to have a reference human translation) and if $translate_for_tmx is set to 1, then you should set $minseglen = -1, $othercleanings = 0, $improvesegmentation = 0 and $removeduplicates = 0 (so that the source document and the reference translation have the same number of segments).

The names of the output files will be equal to those placed in $mosesdir/translation_input except for a suffix that is appended to them with the abbreviation of the target language. Therefore, if you input the file 100.pt you will get a translated 100.pt.en.moses file (if en is the abbreviation of the target language).

Furthermore, both the source document and the Moses translation are also changed so that:

1) The named entities defined in the TMX specification are duly created (e.g., & becomes &amp;);

2) The characters < and > are likewise replaced by the corresponding entities (&lt; and &gt;).

L.1. Speed

Especially with very large trained corpora (several million segments), translation can be slow. According to the Moses manual, to get faster performance than the default Moses settings at roughly the same translation quality, use the parameters $searchalgorithm=1 (default: 0), $cubepruningpoplimit=2000 (default: 1000) and $stack=2000 (default: 100).

You can also try to reduce the latter 2 parameters to values of 500 or less (say, 100) and experiment to determine whether they significantly change the translation quality.
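
For instance, a speed-oriented configuration of the script could look like this sketch (parameter names and values as given above):

searchalgorithm=1        # default: 0; selects the faster search algorithm
cubepruningpoplimit=2000 # default: 1000
stack=2000               # default: 100; try 500 or even 100 for more speed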

L.2. Reusing tuning weights (only for advanced users)

Since tuning is a very long phase and its only useful product is a set of weights that it transfers to the moses.ini file, you could perhaps invest in a single long tuning for each pair of languages that you are interested in and copy those weights from that moses.ini to every other moses.ini created for the same language pair, a very big time-saving trick.

If the files used for tuning are representative of your corpora, they should in principle lead to better results than the default values used when no tuning is done (that is not always the case).

You might be interested in doing this only if the score-moses-irstlm-randlm script shows a significant increase in translation quality after tuning is made. In practice, you should first train a corpus without tuning, translate a representative text and then score that translation with the scoring script. Then, you should retrain the same corpus with tuning and translate that same representative text and score it (since the scripts reuse the previously made steps, the previous training will be reused and you will just do a new tuning and a new training test). You can repeat this for several representative texts. If the scores obtained with tuning are significantly higher than those obtained without tuning, then you can use the tuning weights for all the similar corpora of that language pair.

Depending on your settings, you can have a moses.ini file in $mosesdir/corpora_trained/model, $mosesdir/corpora_trained/memmaps, and $mosesdir/corpora_trained/evaluation. If you want to use the tuning weights, you should change those weights in all those moses.ini files. The weights in question are listed in the [weight-d], [weight-l], [weight-t] and [weight-w] sections of the $mosesdir/corpora_trained/tuning/.../moses.weight-reused.ini file.
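
As an illustration only, the relevant part of a tuned moses.ini might look like the following sketch (the weight values are hypothetical and the number of values per section depends on your configuration):

[weight-d]
0.0514
[weight-l]
0.0822
[weight-t]
0.0269
0.0621
0.0592
0.0231
0.0513
[weight-w]
-0.1198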

M. score-moses-irstlm-randlm script

Vital parameters: mosesdir, lang1, lang2, scoreTMXdocuments, s, r, m, score_line_by_line (if this latter parameter is set to 1, then you should also set the remove_equal parameter), tokenize and lowercase

In some cases, you might want to score the translations you get with the translate-moses-irstlm-randlm script against reference (human) translations that might be available. That can be useful to determine whether some parameter changes lead to improvements, or even to indirectly assess the satisfaction of the human users of translations made with your trained corpus (for that, just take their final translation, made with the help of the Moses translation, as the reference translation and score the Moses translation against it).

For a score to be computed, you need to have a source document ($s parameter), a reference (human) translation of the source document ($r parameter) and a Moses translation of the source document ($m parameter).

As seen in section L, you could have done a normal translation (if $translate_for_tmx = 0 in the translate-moses-irstlm-randlm script) or a translation especially suited for TMX translation memories (if $translate_for_tmx = 1).

If you have done a normal translation, you should set the $scoreTMXdocuments parameter to 0 (its default value).

The $scoreTMXdocuments parameter should be set to 1 if you have previously translated the text with $translate_for_tmx = 1 in the translate-moses-irstlm-randlm script, indicating that you have done a translation especially suited for making translation memories. However, you should know that other settings of this latter script can change the number of lines of the translation (e.g., by removing identical lines), and in that case scoring of such a changed document is not possible. More specifically, that happens if $minseglen is different from -1, $othercleanings different from 0, $improvesegmentation different from 0 ***or*** $removeduplicates different from 0.

The source document should be placed in the $mosesdir/translation_input directory, for normal translations, or in $mosesdir/translation_files_for_tmx, for translations suited for TMX. The reference translation should be put in the $mosesdir/translation_reference directory. The scripts will put the Moses translation in the right location ($mosesdir/translation_output directory, for normal translations, and $mosesdir/translation_files_for_tmx for translations suited for TMX).

M.1. Two types of scores

1) This script produces a NIST and BLEU score for a whole Moses translation if you set its $score_line_by_line parameter to a value different from 1 (default: 0).

2) In some cases, you might want to detect types of segments that are systematically very well or very incorrectly translated by Moses. In such cases, you want a BLEU or a NIST score of each segment translated by Moses. For that, you have to set the $score_line_by_line parameter to 1. In that case, a more detailed output file is produced with 6 fields per line:

1) BLEU score;

2) NIST score;

3) Number of the segment in the source document;

4) Source language segment;

5) Reference (human) translation of that segment;

6) Moses translation of that segment.

Furthermore, when $score_line_by_line is set to 1, this script sorts the segments in ascending order of BLEU score.

In our experience, the $tokenize and $lowercase parameters cause scores to better reflect human judgment. Their default values are therefore set to 1.

The scorer seems to have a problem with some segments consisting of just one word: even in cases where the Moses translation and the reference translation are absolutely identical, the BLEU score is nevertheless zero. When $score_line_by_line is set to 1 and the $remove_equal parameter is set to 1, those segments will not appear in the scoring report. You can, however, easily determine how many they are by subtracting the total number of segments contained in the scoring report from the total number of segments of the scored text.
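That subtraction can be done with a quick shell one-liner (file names invented):

#Number of segments of the scored text that do not appear in the report

echo $(( $(wc -l < translated.txt) - $(wc -l < score-report.txt) ))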

N. Utilities

N.1. transfer-training-to-another-location-moses-irstlm-randlm script

Vital parameters: mosesdirmine, newusername, mosesdirotheruser

This script should be used on the computer on which the training to be transferred was made (not on the target computer).

You might want to exchange trained corpora with other persons. In several of the training files, though, your own $mosesdir directory is written literally (e.g., /home/mary/moses-irstlm-randlm).

This script creates, in the $mosesdir that contains the trainings that you want to transfer (the $mosesdirmine parameter), a new subdirectory $mosesdirmine/corpora_trained_for_another_location/$newusername and places there a copy of the $mosesdirmine/corpora_trained and $mosesdirmine/logs directories that you want to transfer.

In those 2 copied subdirectories, the string that referred literally to the initial location of the trained corpora is replaced by the string that will enable them to be used by another user and/or in another location (you can have several Moses installations on the same computer).

This script will copy all the trainings contained in the $mosesdirmine/corpora_trained directory.

Your original trainings are not affected by this operation.

You can then copy these 2 subdirectories (prepared for being transferred) to the new location or to the new computer where you want them to be used. It is you who has to copy them there manually, for instance via a USB key or an external hard disk, placing them in the $mosesdir directory where you want them to be used, which corresponds to the $mosesdirotheruser parameter.

After you have transferred the corpora to their intended location, you can safely erase the $mosesdirmine/corpora_trained_for_another_location directory.

The $mosesdirmine parameter is the value of your $mosesdir (by default, $HOME/moses-irstlm-randlm) whose trainings you want to transfer. The $mosesdirotheruser parameter is the value of the $mosesdir to which you want to transfer your trainings. The $newusername parameter is the Linux login name of the user to whom you want to transfer your trainings (if you keep your own login, that means you are transferring the trainings to another Moses installation on your own computer).
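As an illustration (the values below are invented):

#$mosesdir whose trainings will be transferred

mosesdirmine=$HOME/moses-irstlm-randlm

#Linux login name of the user receiving the trainings

newusername=mary

#$mosesdir of the receiving Moses installation

mosesdirotheruser=/home/mary/moses-irstlm-randlm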

Since reading and writing to disk can lead to errors, we strongly suggest that you make a backup of the $mosesdirotheruser directory before transferring the $mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained and the $mosesdirmine/corpora_trained_for_another_location/$newusername/logs subdirectories to it and especially before erasing or overwriting anything.

Please note that you should copy just the corpora_trained and logs subdirectories to the $mosesdirotheruser directory (not the whole $mosesdirmine/corpora_trained_for_another_location/$newusername directory).
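A minimal sketch of that manual copy, assuming both installations live on the same computer (otherwise use removable media, as suggested above); the trailing /. makes cp merge the copied contents into the existing target directories:

#Back up $mosesdirotheruser first, as recommended above

cp -r $mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained/. $mosesdirotheruser/corpora_trained/

cp -r $mosesdirmine/corpora_trained_for_another_location/$newusername/logs/. $mosesdirotheruser/logs/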

If the $mosesdirotheruser directory that is going to receive the new corpora_trained and logs subdirectories already contains some trainings, when you merge them there you will be alerted that subdirectories with the same name will be overwritten by the new ones. Even though we stress once more that it is much safer to make a backup of the contents of $mosesdirotheruser before attempting this operation, and highly recommend that you do so (any error might destroy the trainings already present there!), you should accept that overwriting. In fact, given the structure of those directories, the files already present should not disappear. But this is something that you do at your own risk.

O. Windows add-ins

In order to prepare corpora so that they can be used to train Moses, and in order to convert Moses output so that it can be used in translation memories, 2 MS Windows programs are provided in a separate Windows-add-ins.zip package (http://moses-for-mere-mortals.googlecode.com/files/Windows-add-ins.zip):

- Extract_TMX_Corpus - enables the creation of Moses corpora files from TMX (translation memories) files;

- Moses2TMX - enables the creation of TMX files from a Moses translation file and the corresponding original document in the source language.

Please consult their Readme files to learn how to use them. Together, these 2 programs create a synergy between machine translation and translation memories.

P. Improving quality and speed

These scripts, especially the train-moses-irstlm-randlm and translate-moses-irstlm-randlm scripts, allow you to control more than 80 parameters that influence either quality or speed and that relate both to Moses and to the packages it uses. Please refer to the comments that precede the parameters, especially those of the train-moses-irstlm-randlm and translate-moses-irstlm-randlm scripts, in order to learn more about them; they were often extracted from the manuals of Moses and of the accompanying packages. The parameters are organized in what seems a logical way, and special care was taken to flag simultaneous changes that should be made to several parameters that work together.

According to the Moses manual, you should first try the (less numerous) parameters of the translate-moses-irstlm-randlm script. If you want to reset the parameters that you changed to their default values, you can consult the Appendix of the present document to get those values.

Q. Deleting trained corpora

Q.1. You want to erase all the trainings that you have done

That's really easy. Just delete the $mosesdir/corpora_trained and $mosesdir/logs directories. The next time you use the train-moses-irstlm-randlm script, it will re-create these 2 directories.
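In shell terms (be careful: this permanently deletes all your trainings):

rm -rf $mosesdir/corpora_trained $mosesdir/logs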

Q.2. You want to erase just some of all the trainings that you have done

There are 2 ways to delete corpora. The first one, though more accurate, requires you to have 2 Moses installations (you can have as many Moses installations as you want, each one in a $mosesdir with a different name). The second one is much riskier and not as effective, but it does not require more than one Moses installation. In both cases, we strongly recommend that you carefully back up the corpora_trained and logs subdirectories of the $mosesdir that will be changed. As you know, if you accidentally change the $mosesdir/corpora_trained or $mosesdir/logs directories, you can lose one, several or even all of the trainings you have done.

Q.2.1. Scenario 1: More than one Moses installation available

This is by far the least risky and most efficient method of deleting unwanted trainings.

Tip: If you do not have more than one Moses installation (that is, more than one $mosesdir), you can easily create a new one by running the create-moses-irstlm-randlm script and setting its $mosesdir parameter to a value different from the $mosesdir value that was used to create your present Moses installation (after its execution, you will have 2 different Moses installations).

1) Prepare the $mosesdir where you want to delete a trained corpus (let's call it $mosesdirstart) to be transferred to another location, by using the transfer-training-to-another-location-moses-irstlm-randlm script.

2) As you know (see section N.1), this script processes all the trained corpora of $mosesdirstart, that is, both the ones you want to delete and the ones you do not want to delete, and creates 2 new directories: $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs. Delete in these 2 directories, respectively, the subdirectories and the log files that correspond to the corpora that you want to delete.

3) Let's call the $mosesdir that will receive the trained corpora that you do not want to delete $mosesdirfinal. Just to play it safe, back up its $mosesdirfinal/corpora_trained and $mosesdirfinal/logs subdirectories.

4) Now you just have to move the $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs directories to, respectively, $mosesdirfinal/corpora_trained and $mosesdirfinal/logs (see the sketch after this list).

5) In order to verify that everything went well, make a small translation with one of the trained corpora that were initially present in $mosesdirfinal, as well as a translation with one of the corpora that you have just transferred there.

6) If no problems were detected in the previous step, delete $mosesdirstart.
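Step 4 could look like this in the shell (a sketch only; the trailing /. makes cp merge the copied contents into the directories already present in $mosesdirfinal, and you should have backed everything up first):

cp -r $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained/. $mosesdirfinal/corpora_trained/

cp -r $mosesdirstart/corpora_trained_for_another_location/$newusername/logs/. $mosesdirfinal/logs/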

Q.2.2. Scenario 2: Single Moses installation available

A more convoluted, riskier and less efficient way is also available.

The log file of the training that you want to suppress (located in the $mosesdir/logs subdirectory) contains, at its very end, a list of the files used in that training. You can erase the files that use the most space and that are surely not required by any other training if you erase the files, and only the files, listed in that log file that are located in the following subdirectories of $mosesdir/corpora_trained:

1) evaluation

2) memmaps

3) model

4) tuning

This might be easier than it seems. The long names of the directories, necessary so that a training does not overwrite another training and so that you can reuse previous steps already done, do not need to be fully inspected: as soon as you find the correct first subdirectory of any of the above directories, you can erase it without checking any further.

Other files, located in other $mosesdir/corpora_trained subdirectories, even though used by the corpus that you want to delete, might also be used by other corpora (since these scripts reuse the steps already done by previous trainings). The best advice, as far as these subdirectories are concerned, is probably not to touch them.
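If you want a starting point for locating those files, something like the following might help (a sketch only: the log file name is invented, we assume the list at the end of the log contains one file path per line, and you should always inspect the output before deleting anything):

#List the files of one training that fall under the 4 subdirectories above

tail -n 100 $mosesdir/logs/my-training.log | grep -E 'corpora_trained/(evaluation|memmaps|model|tuning)/'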

R. New features

Relative to Moses-for-Mere-Mortals-0.64, the following main new features have been added:

1) Removes control characters from the input files (these can crash a training);

2) Extracts from the corpus files 2 test files by pseudorandomly selecting non-consecutive segments that are erased from the corpus files;

3) A new training does not interfere with the files of a previous training;

4) A new training reuses as much as possible the files created in previous trainings;

5) Inversion of corpora (e.g., from en-pt to pt-en) is detected, allowing a much quicker training than that of the original language pair;

6) Can limit the duration of tuning;

7) Gets the BLEU and NIST scores of a translation (either for the whole document or for each of its segments);

8) Transfers your trainings to someone else or to another Moses installation on the same computer;

9) All the mkcls, GIZA and MGIZA parameters can now be controlled through parameters of the train-moses-irstlm-randlm script;

10) Selected parameters of the Moses scripts and the Moses decoder can now be controlled through the train-moses-irstlm-randlm and translate-moses-irstlm-randlm scripts;

11) Installs RandLM;

12) Installs MGIZA;

13) Implements distributed training with IRSTLM (so as to better manage memory);

14) New make-test-moses-irstlm-randlm, score-moses-irstlm-randlm and transfer-training-to-another-location-moses-irstlm-randlm scripts;

15) Bigger demo corpus.

S. How to contribute

You can contribute to the improvement of this work either by contacting [email protected] or by participating in the discussion group linked to this site (http://groups.google.com/group/mosesformeremortals).

Comments, criticisms and further scripts or documentation that will make the process of using Moses more user-friendly are gladly welcome.

If we accept your work, we will fully acknowledge you as its author (and only you), and we propose that at the very beginning of it you write:

#copyright {year}, {your name}

#licenced according to the {name of the licence} licence

If you propose a significant change to an existing script, the names of all of its authors will be mentioned in it and the licence will have to be agreed upon.

T. Thanks

Special thanks:

Hilário Leal Fontes, who made very helpful suggestions about the functionality of several scripts and made comprehensive tests. He is also the author of the nonbreaking_prefix.pt script (for the Portuguese language). He has compiled the corpora that were used to train Moses and to test these scripts, including 2 very large corpora with 6.6 and 12 million segments. He has also revised the Help/Short Tutorial file.

Maria José Machado, whose suggestions and research have significantly influenced the score-moses-irstlm-randlm script. She helped in the evaluation of Moses output in general and organised, together with Hilário, a comparative evaluation, made by professional translators, of the qualitative results of Google, Moses and a rule-based MT engine. She suggested a deep restructuring of the present Help/Short Tutorial file and is a co-author of it.

Manuel Tomas Carrasco Benitez, whose Xdossier application was used to create a pack of the Moses-for-Mere-Mortals files.

Authors of the http://www.dlsi.ua.es/~mlf/fosmt-moses.html (Mikel Forcada and Francis Tyers) and the http://www.statmt.org/moses_steps.html pages. These pages have helped me a lot in the first steps with Moses.

Authors of the documentation of Moses, giza-pp, MGIZA, IRSTLM and RandLM; some of the comments of the present scripts describing the various parameters include extracts of them.

European Commission's Joint Research Center and Directorate-General for Translation for the DGT-TM Acquis - freely available on the JRC website and providing aligned corpora of about 1 million segments of Community law texts in 22 languages - which was used in the demonstration corpus. Please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.

U. Author

João Luís Amorim de Castro Rosas

[email protected]

The author wishes to stress that the very, very long (unimaginable) working hours and the numerous extremely relevant suggestions of Hilário Leal Fontes and Maria José Machado, who tested this software in an actual translation environment, were an immense contribution and also created a very pleasurable working environment (despite the stress we all suffered :-) ). These scripts would not be the same, and would in fact be much worse, without their help, which made them reflect the practical problems of professional translators.

APPENDIX: default parameters of each of the scripts

NOTE: the lines starting with the symbol # are comments that explain the role of the parameters; the parameters are indicated in bold. The vital parameters of each script (those that you probably will want to change if you train your own corpora) are indicated in bold red.

1) create-moses-irstlm-randlm script:

#Full path of the base directory location of your Moses system

mosesdir=$HOME/moses-irstlm-randlm

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# !!! Please set $mosesnumprocessors to the number of processors of your computer !!!

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#Number of processors in your computer

mosesnumprocessors=1

#Install small demo corpus: 1 = Install; Any other value = Do not install (!!! this will install a very small corpus that can be used to see what the scripts and Moses can do; if dodemocorpus is set to 1, this series of scripts will be able to use the demo corpus without you having to change their settings !!!)

dodemocorpus=1

#Remove the downloaded compressed packages and some directories no longer needed once the installation is done; 1 = remove the downloaded packages; any other value = do not remove those packages

removedownloadedpackges=1

2) make-test-moses-irstlm-randlm script:

#Base path of Moses installation

mosesdir=$HOME/moses-irstlm-randlm

#Source language abbreviation

lang1=pt

#Target language abbreviation

lang2=en

#Number of sectors in which each input file will be cut

totalnumsectors=100

#Number of segments pseudorandomly searched in each sector

numsegs=10

#Name of the source language file used for creating one of the test files (!!! omit the path; the name should not include spaces !!!)

basefilename=200000

3) train-moses-irstlm-randlm script:

#Full path of the base directory location of your Moses system

mosesdir=$HOME/moses-irstlm-randlm

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#NOTE 1: The corpus that you want to train, together with the respective tuning files (if different), the testing files (if different), the file used for recasing, and the file used to build the language model (if different) should be placed in $mosesdir/corpora_for_training !!!

#NOTE 2: After the script is executed, you will find a summary of what has been done (the corpus summary file) in $mosesdir/logs

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#====================1. LANGUAGES ========================================

#Abbreviation of language 1 (source language)

lang1=pt

#Abbreviation of language 2 (target language)

lang2=en

#====================2. FILES =============================================

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# !!! The names of the files should not include spaces !!!

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#Basename of the corpus placed in $mosesdir/corpora_for_training (the example that follows refers to the 2 files 200000.for_train.en and 200000.for_train.pt, whose basename is 200000.for_train)

corpusbasename=200000.for_train

#Basename of the file used to build the language model (LM), placed in $mosesdir/corpora_for_training (!!! this is a file in the target language !!!)

lmbasename=300000

#Basename of the tuning corpus, placed in $mosesdir/corpora_for_training

tuningbasename=800

#Basename of the test set files (used for testing the trained corpus), placed in $mosesdir/corpora_for_training

testbasename=200000.for_test

#Basename of the recaser training file, placed in $mosesdir/corpora_for_training

recaserbasename=300000

#===================== 3. TRAINING STEPS ===================================

#---------------------------------------------------------------------------------------------------------------------------

#Reuse all relevant files that have already been created in previous trainings: 1= Do ; Any other value=Don't

reuse=1

#---------------------------------------------------------------------------------------------------------------------------

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#NOTE 1: If in doubt, leave the settings that follow as they are; you will do a full training with memory mapping, tuning, a training test and scoring of the training test of the demo corpus; the results will appear in $mosesdir/corpora_trained and a log file will be available in $mosesdir/logs.

#NOTE 2: You can also proceed step by step (e.g., first doing just LM building and corpus training and then testing), so as to better control the whole process.

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#Do parallel corpus training: 1= Do ; Any other value=Don't !!!

paralleltraining=1

#Number of the first training step (possible values: 1-9); choose 1 for a completely new corpus

firsttrainingstep=1

#Number of the last training step (possible values: 1-9); choose 9 for a completely new corpus

lasttrainingstep=9

#Do memory mapping: 1 = Do ; Any other value = Don't

memmapping=1

#Do tuning: 1= Do ; Any other value=Don't; can lead, but does not always lead, to better results; takes much more time

tuning=1

#Do a test (with scoring) of the training: 1 = Do ; Any other value = Don't

runtrainingtest=1

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# If you are new to Moses, stop here for the time being

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#=================== 4. LANGUAGE MODEL PARAMETERS ======================

# Use IRSTLM (1) or RandLM (5)

lngmdl=1

#Order of ngrams - the higher the better, but more memory required (choose between 3 and 9; good value: 5)

Gram=5

#---------------------------*** 4.1. IRSTLM PARAMETERS ***------------------------------------------------

# Distributed language model: 1= Yes; Any other value = No (splits the file used to build the language model into parts, processes each part separately and finally merges the parts)

distributed=1

# Number of parts to split dictionary into balanced n-gram prefix lists (in the creation of a distributed language model); default: 5; !!! Only used if distributed = 1 !!!

dictnumparts=20

# Smoothing possible values: witten-bell (default); kneser-ney, improved-kneser-ney

s='witten-bell'

# Quantize LM (IRSTLM user manual, p. 4: "Reduces memory consumption at the cost of some loss of performance") 1 = Do ; Any other value = Don't. May induce some accuracy loss. Reduces the size of the LM.

quantize=0

# Memory-mapping of the LM. 1 = Do; Any other value = Don't. Avoids the creation of the binary LM directly in RAM (allows bigger LM at the cost of lower speed; often necessary when LM file is very big) !!!

lmmemmapping=1

#------------------------------------*** 4.2. RandLM PARAMETERS ***----------------------------------------

# The format of the input data. The following formats are supported: for a CountRandLM, "corpus" (tokenised text corpora, one sentence per line); for a BackoffRandLM, 'arpa' (an ARPA backoff language model)

inputtype=corpus

# The false positive rate of the randomised data structure on an inverse log scale so '-falsepos 8' produces a false positive rate of 1/2^8

falsepos=8

# The quantisation range used by the model. For a CountRandLM, quantisation is performed by taking a logarithm. The base of the logarithm is set as 2^{1/'values'}. For a BackoffRandLM, a binning quantisation algorithm is used. The size of the codebook is set as 2^{'values'}

values=8

#====================== 5. TRAINING PARAMETERS ===========================

#------------------------------------- *** 5.1. TRAINING STEP 1 ***---------------------------------------------

#********** mkcls options

#Number of mkcls iterations (default: 2)

nummkclsiterations=2

#Number of word classes

numclasses=50

#--------------------------------------*** 5.2. TRAINING STEP 2 ***---------------------------------------------

#.................................................. 5.2.1. MGIZA parameters .......................................................................

#Number of processors of your computer that will be used by MGIZA (if you use all the processors available, the training will be considerably faster)

mgizanumprocessors=1

#........ 5.2.2. GIZA parameters .......................................................................

#maximum sentence length; !!! never exceed 101 !!!

ml=101

#No. of iterations:

#-------------------

#number of iterations for Model 1

model1iterations=5

#number of iterations for Model 2

model2iterations=0

#number of iterations for HMM (substitutes model 2)

hmmiterations=5

#number of iterations for Model 3

model3iterations=3

#number of iterations for Model 4

model4iterations=3

#number of iterations for Model 5

model5iterations=0

#number of iterations for Model 6

model6iterations=0

#

#parameters for various heuristics in GIZA++ for efficient training:

#------------------------------------------------------------------

#Counts increment cutoff threshold

countincreasecutoff=1e-06

#Counts increment cutoff threshold for alignments in training of fertility models

countincreasecutoffal=1e-05

#minimal count increase

mincountincrease=1e-07

#relative cutoff probability for alignment-centers in pegging

peggedcutoff=0.03

#Probability cutoff threshold for lexicon probabilities

probcutoff=1e-07

#probability smoothing (floor) value

probsmooth=1e-07

#parameters for describing the type and amount of output:

#-----------------------------------------------------------

#0: detailed alignment format; 1: compact alignment format

compactalignmentformat=0

#dump frequency of Model 1

model1dumpfrequency=0

#dump frequency of Model 2

model2dumpfrequency=0

#dump frequency of HMM

hmmdumpfrequency=0

#output: dump of transfer from Model 2 to 3

transferdumpfrequency=0

#dump frequency of Model 3/4/5

model345dumpfrequency=0

#for printing the n best alignments

nbestalignments=0

#1: do not write any files

nodumps=1

#1: write alignment files only

onlyaldumps=1

#0: not verbose; 1: verbose

verbose=0

#number of sentence for which a lot of information should be printed (negative: no output)

verbosesentence=-10

#smoothing parameters:

#---------------------

#f-b-trn: smoothing factor for HMM alignment model #can be ignored by -emSmoothHMM

emalsmooth=0.2

#smoothing parameter for IBM-2/3 (interpolation with constant)

model23smoothfactor=0

#smoothing parameter for alignment probabilities in Model 4

model4smoothfactor=0.4

#smoothing parameter for distortion probabilities in Model 5 (linear interpolation with constant)

model5smoothfactor=0.1

#smoothing for fertility parameters (good value: 64): weight for wordlength-dependent fertility parameters

nsmooth=4

#smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters

nsmoothgeneral=0

#parameters modifying the models:

#--------------------------------

#1 = only 3-dimensional alignment table for IBM-2 and IBM-3

compactadtable=1

#0 = IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of empty word is deficient; 2: distortion model of empty word is deficient (differently); setting this parameter also helps to avoid that, during IBM-3 and IBM-4 training, too many words are aligned with the empty word

deficientdistortionforemptyword=0

#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E

depm4=76

#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E

depm5=68

#lextrain: dependencies in the HMM alignment model. &1: sentence length; &2: previous class; &4: previous position; &8: French position; &16: French class)

emalignmentdependencies=2

#f-b-trn: probability for empty word

emprobforempty=0.4

#parameters modifying the EM-algorithm:

#--------------------------------------

#fixed value for parameter p_0 in IBM-5 (if negative then it is determined in training)

m5p0=-1

manlexfactor1=0

manlexfactor2=0

manlexmaxmultiplicity=20

#maximum fertility for fertility models

maxfertility=10

#fixed value for parameter p_0 in IBM-3/4 (if negative then it is determined in training)

p0=0.999

#0: no pegging; 1: do pegging

pegging=0

#-------------- *** 5.3. TRAINING SCRIPT PARAMETERS ***----------------------------------------------

#Heuristic used for word alignment; possible values: intersect (intersection seems to be a synonym), union, grow, grow-final, grow-diag, grow-diag-final-and (default value), srctotgt, tgttosrc (Moses manual, p. 72, 144)

alignment=grow-diag-final-and

#Reordering model; possible values: msd-bidirectional-fe (default), msd-bidirectional-f, msd-fe, msd-f, monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe, monotonicity-f (Moses manual, p. 77)

reordering=msd-bidirectional-fe

#Minimum length of the sentences (used by clean)

MinLen=1

#Maximum length of the sentences (used by clean)

MaxLen=60

#Maximum length of phrases entered into the phrase table (max: 7; choose a lower value if phrase table size is an issue; good value for most purposes: 3)

MaxPhraseLength=5

#-------------- *** 5.4. DECODER PARAMETERS ***--------------------------------------------------------

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# !!! Only used in the training evaluation, and only if tuning = 0 !!!

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#***** QUALITY TUNING:

# Weights for phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good translations of each other

weight_t=1

# Weights for language model (good values: 0.1-1; default: 1); ensures that output is fluent in target language

weight_l=1

# Weights for reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence

weight_d=1

# Weights for word penalty (good values: -3 to 3; default: 0; negative values favour longer output; positive values favour shorter output); ensures translations do not get too long or too short

weight_w=0
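For instance, if your translations tend to come out too long, you could try something like the following (purely illustrative values, chosen within the ranges suggested above):

#Favour shorter output

weight_w=1

#Rely a little less on the reordering model

weight_d=0.5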

#------------------------------------------

# U