Moses-support Digest, Vol 85, Issue 43

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. average speeds (Read, James C)
2. EMS set up with mgiza and KenLM (Daniel Valenzuela)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Nov 2013 06:50:16 +0000
From: "Read, James C" <jcread@essex.ac.uk>
Subject: [Moses-support] average speeds
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<F00840E41983C645928E21E3C35F4EB1012CF4C66C@mbx1-node2.essex.ac.uk>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

What kind of speeds are other people getting out of Moses? I followed the advice here: http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc8

With a stack size of -s 1, I'm getting about 1,000 sentences in 10 minutes on 32 processors, i.e. roughly 3 sentences per minute per processor.
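
For reference, this is roughly the kind of invocation I mean (file names are placeholders, and I am assuming the multi-threaded moses binary):

    moses -f model/moses.ini -s 1 -threads 32 < test.input.tok > test.output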

Is this 'normal'?

James



------------------------------

Message: 2
Date: Tue, 26 Nov 2013 13:08:51 +0100 (CET)
From: Daniel Valenzuela <daniel@valenzuela.de>
Subject: [Moses-support] EMS set up with mgiza and KenLM
To: moses-support@mit.edu
Message-ID:
<2059270332.126362.1385467731845.open-xchange@communicator.strato.de>
Content-Type: text/plain; charset="utf-8"

Dear all,

after various manual set-ups, I wanted to try the EMS. After trying several
experiment settings, I wanted to run it with MGIZA and KenLM, but I cannot
get it to work (I tried again with a smaller corpus, with the same result, and
tried to continue the experiment with different fixes, with no success).

The log tells me:
step TUNING:tune crashed

Further inspection of TUNE_tune.1.STDERR in steps/1/ told me that IRSTLM is
messing with my project against my will (or at least that is what I thought):

line=IRSTLM name=LM0 factor=0 path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1 order=3
Exception: Error: 4 number of threads specified but IRST LM is not threadsafe.
Exit code: 1
Failed to run moses with the config /home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1 at /home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
cp: cannot stat '/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini': No such file or directory


Looking into the tuning folder to see what happened, I found that
moses.filtered.ini.1 has the language model set to IRSTLM, while filtered.1/moses.ini
has it set to KenLM, which is what I hoped to get.
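
To illustrate, the language model feature line appears to be the relevant difference between the two files; as far as I can tell it looks roughly like this (same path and order in both, the KENLM line being my reading of filtered.1/moses.ini):

    moses.filtered.ini.1:
        IRSTLM name=LM0 factor=0 path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1 order=3
    filtered.1/moses.ini:
        KENLM name=LM0 factor=0 path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1 order=3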

I have attached the files mentioned above, and the following is the config file of the
experiment:


################################################
### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
################################################


[GENERAL]

home-dir = /home/moses

working-dir = $home-dir/project_test_mgiza/experiment
moses-src-dir = $home-dir/mosesdecoder
moses-script-dir = $moses-src-dir/scripts
moses-bin-dir = $moses-src-dir/bin
external-bin-dir = $moses-src-dir/BINDIR
data-dir = $home-dir/project_test_mgiza/experiment/corpus
train-dir = $data-dir/training
dev-dir = $data-dir/dev
#irstlm-dir = $home-dir/irstlm/bin


ttable-binarizer = $moses-bin-dir/processPhraseTable
decoder = $moses-bin-dir/moses

input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l $input-extension -threads 4"
output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l $output-extension"
input-truecaser = $moses-script-dir/recaser/truecase.perl
output-truecaser = $moses-script-dir/recaser/truecase.perl
detruecaser = $moses-script-dir/recaser/detruecase.perl


input-extension = de
output-extension = en
pair-extension = de-en

#################################################################
# PARALLEL CORPUS PREPARATION:
# create a tokenized, sentence-aligned corpus, ready for training

[CORPUS]

max-sentence-length = 80

[CORPUS:project-syndicate]
raw-stem = $train-dir/news-commentary-v8.$pair-extension

[LM]

### tool to be used for language model training
# for instance: ngram-count (SRILM), train-lm-on-disk.perl (Edinburgh)
#
#lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores 4 -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"
#settings = "-s msb -p 0"
#order = 3
#type = 8
#lm-binarizer = $moses-bin-dir/build_binary

# path to lmplz binary
lmplz = $moses-bin-dir/lmplz
# order of the language model
order = 3
# additional parameters to lmplz (check lmplz help message)
settings = "-T $working-dir/tmp -S 10G"
# this tells EMS to use lmplz and tells EMS where lmplz is located
lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"
lm-binarizer = $moses-bin-dir/build_binary
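# For reference, as far as I understand it this roughly corresponds to running
# lmplz and build_binary by hand (output file names here are only illustrative):
#   $moses-bin-dir/lmplz -o 3 -T $working-dir/tmp -S 10G \
#       < corpus.truecased.$output-extension > lm.arpa
#   $moses-bin-dir/build_binary lm.arpa lm.binlm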



[LM:project-syndicate]
raw-corpus = $train-dir/news-commentary-v8.$pair-extension.$output-extension


#################################################################
# TRANSLATION MODEL TRAINING

[TRAINING]


### training script to be used: either a legacy script or
# current moses training script (default)
#
#script = $moses-script-dir/training/train-model.perl


### general options
#
script = $moses-script-dir/training/train-model.perl
training-options = "-mgiza -mgiza-cpus 4 -cores 4 \
-parallel -sort-buffer-size 10G -sort-batch-size 253 \
-sort-compress gzip -sort-parallel 10"
parallel = yes

### symmetrization method to obtain word alignments from giza output
# (commonly used: grow-diag-final-and)
#
#alignment-symmetrization-method = berkeley
alignment-symmetrization-method = grow-diag-final-and

### lexicalized reordering: specify orientation type
# (default: only distance-based reordering model)
#
lexicalized-reordering = msd-bidirectional-fe
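
# For reference, EMS passes the options above straight through to train-model.perl;
# a hand-run call would look roughly like this (corpus and model paths omitted):
#   $script ... -mgiza -mgiza-cpus 4 -cores 4 -parallel \
#     -sort-buffer-size 10G -sort-batch-size 253 -sort-compress gzip -sort-parallel 10 \
#     -alignment grow-diag-final-and -reordering msd-bidirectional-fe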

### if word alignment (giza symmetrization) should be skipped,
# point to word alignment files
#
#word-alignment =

### if phrase extraction should be skipped,
# point to stem for extract files
#
#extracted-phrases =

### if phrase table training should be skipped,
# point to phrase translation table
#
#phrase-translation-table =

### if reordering table training should be skipped,
# point to reordering table
#
#reordering-table =

### if training should be skipped,
# point to a configuration file that contains
# pointers to all relevant model files
#
#config =

### TUNING: finding good weights for model components

[TUNING]

### instead of tuning with this setting, old weights may be recycled

### tuning script to be used
#
tuning-script = $moses-script-dir/training/mert-moses.pl
tuning-settings = "-mertdir $moses-bin-dir -threads 4"

### specify the corpus used for tuning
# it should contain 100s if not 1000s of sentences
#
raw-input = $dev-dir/news-test2008.$input-extension

raw-reference = $dev-dir/news-test2008.$output-extension

### size of n-best list used (typically 100)
#
nbest = 100

### ranges for weights for random initialization
# if not specified, the tuning script will use generic ranges
# it is not clear if this matters
#
# lambda =

### additional flags for the decoder
#
decoder-settings = "-threads 4"
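
# For reference, these settings should end up in a mert-moses.pl call roughly of
# the form (input/reference file names are only illustrative):
#   $tuning-script tuning/input tuning/reference $decoder tuning/moses.filtered.ini.1 \
#     --mertdir=$moses-bin-dir --decoder-flags="-threads 4"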

### if tuning should be skipped, specify this here
# and also point to a configuration file that contains
# pointers to all relevant model files
#
#config =


#######################################################
## TRUECASER: train model to truecase corpora and input

[TRUECASER]

### script to train truecaser models
#
trainer = $moses-script-dir/recaser/train-truecaser.perl

### training data
# raw input still needs to be tokenized;
# alternatively, already tokenized input may be specified
#
raw-stem = CORPUS:raw-stem

### trained model
#
#truecase-model =


##################################
## EVALUATION: score system output

[EVALUATION]

### prepare system output for scoring
# this may include detokenization and wrapping output in sgm
# (needed for nist-bleu, ter, meteor)
#
detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"

decoder-settings = "-threads 4"

### should output be scored case-sensitive (default: no)?
#
# case-sensitive = yes

### BLEU
#

multi-bleu = "$moses-script-dir/generic/multi-bleu.perl -lc"
# ibm-bleu =
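
# For reference, multi-bleu.perl can also be run by hand on tokenized output,
# e.g. (file names are only illustrative):
#   $moses-script-dir/generic/multi-bleu.perl -lc reference.tok.en < system-output.tok.en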

### TER: translation error rate (BBN metric) based on edit distance
#
# ter = $edinburgh-script-dir/tercom_v6a.pl

### METEOR: gives credit to stem / WordNet synonym matches
#
# meteor =

[EVALUATION:newstest2010]
raw-input = $dev-dir/newstest2011.$input-extension
raw-reference = $dev-dir/newstest2011.$output-extension


[REPORTING]

### what to do with result (default: store in file evaluation/report)
#
# email = pkoehn@inf.ed.ac.uk



____________________


I hope somebody can help or suggest what I should do.

Thank you and kind regards
Daniel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20131126/61b87100/attachment.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: issuefiles.tar.gz
Type: application/gzip
Size: 4207 bytes
Desc: not available
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20131126/61b87100/attachment.bin

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 85, Issue 43
*********************************************
