Moses-support Digest, Vol 85, Issue 46

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: EMS set up with mgiza and KenLM (Daniel Valenzuela)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Nov 2013 16:31:37 +0100 (CET)
From: Daniel Valenzuela <daniel@valenzuela.de>
Subject: Re: [Moses-support] EMS set up with mgiza and KenLM
To: moses-support@mit.edu
Message-ID:
<1798908305.161396.1385479897665.open-xchange@communicator.strato.de>
Content-Type: text/plain; charset="utf-8"

Yes, I had already added type = 8 in one of my later workarounds.

To be sure, I cleaned up and continued with:
rm -r tuning/
rm steps/1/TUNING*

.../experiment.perl -continue 1 -exec

The output was the same as before.

Then I cleaned up even more thoroughly with:
rm -r tuning/
rm steps/1/TUNING*
rm -r evaluation/newstest2010.filtered.1/
(nothing else in there matches *filtered.*)
.../experiment.perl -continue 1 -exec
The output is the same, except that evaluation/newstest2010.filtered.1/ is now
missing.

But I still get a crash at the same TUNING:tune step.
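
For reference, the cleanup can be double-checked before anything is deleted. The following is only a sketch: it assumes the three glob patterns from the quoted reply further down (tuning/filtered.?, evaluation/*.filtered.?, steps/?/TUNING_tune.*) with the run number substituted for "?". The function name stale_tuning_paths is made up for illustration and is not part of EMS.

```python
import glob
import os


def stale_tuning_paths(experiment_dir, run):
    """List the paths matching the cleanup patterns for one EMS run.

    This is a dry run: nothing is deleted, the caller decides what to
    do with the returned list. The patterns are an assumption taken
    from the advice quoted in this thread.
    """
    patterns = [
        os.path.join("tuning", f"filtered.{run}"),
        os.path.join("evaluation", f"*.filtered.{run}"),
        os.path.join("steps", str(run), "TUNING_tune.*"),
    ]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(experiment_dir, pattern)))
    return sorted(found)
```

Calling stale_tuning_paths("/home/moses/project_test_mgiza/experiment", 1) and printing the result shows what is still left over from the earlier run. An empty list would suggest the cleanup itself is complete and the problem lies elsewhere.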

My [LM] section now looks like this:
[LM]

lmplz = $moses-bin-dir/lmplz
order = 3
settings = "-T $working-dir/tmp -S 10G"
lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"
lm-binarizer = $moses-bin-dir/build_binary
type = 8
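
To rule out that the setting is still commented out (as "#type = 8" was in the config quoted further down), the [LM] section can be checked mechanically. This is a minimal sketch, assuming the EMS config uses plain "key = value" lines under "[SECTION]" headers with "#" for comments. The helper lm_type is hypothetical and does not implement EMS's real parser.

```python
def lm_type(config_text):
    """Return the value of 'type' in the [LM] section, or None if unset.

    Commented-out lines are ignored, and sub-sections such as
    [LM:project-syndicate] are not treated as the [LM] section itself.
    """
    in_lm = False
    value = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out settings
        if line.startswith("["):
            in_lm = (line == "[LM]")
            continue
        if in_lm and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "type":
                value = val.strip()
    return value
```

For the section shown above, lm_type(open(path).read()) should return "8", where path is wherever the experiment config is stored. A result of None would mean the generated moses.ini falls back to IRSTLM, as described in the quoted reply.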

The crash is still:
line=IRSTLM name=LM0 factor=0
path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1
order=3
Exception: Error: 4 number of threads specified but IRST LM is not threadsafe.
Exit code: 1
Failed to run moses with the config
/home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1 at
/home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
cp: cannot stat
'/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini': No such file
or directory
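
Since the log line above names the LM implementation as the first token ("IRSTLM name=LM0 ..."), the same check can be run on any moses.ini to see which file still refers to IRSTLM. The sketch below assumes that feature lines have the form "NAME key=value ..." as in that log line, and that KENLM, IRSTLM and SRILM are the LM feature names of interest. The helper lm_features is made up for this example.

```python
def lm_features(ini_text):
    """Return (implementation, name) pairs for LM feature lines.

    Assumes each feature line starts with the implementation name
    followed by whitespace-separated key=value arguments.
    """
    known = ("KENLM", "IRSTLM", "SRILM")  # assumed set of LM feature names
    features = []
    for raw in ini_text.splitlines():
        fields = raw.split()
        if not fields or fields[0] not in known:
            continue
        # Collect the key=value arguments that follow the feature name.
        args = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        features.append((fields[0], args.get("name")))
    return features
```

Running this on both tuning/moses.filtered.ini.1 and tuning/filtered.1/moses.ini would show at a glance which of the two still lists IRSTLM for LM0.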

Thank you

> Message: 1
> Date: Tue, 26 Nov 2013 13:03:03 +0000
> From: Hieu Hoang <hieuhoang@gmail.com>
> Subject: Re: [Moses-support] EMS set up with mgiza and KenLM
> To: moses-support@mit.edu
> Message-ID: <52949C07.3050609@gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> In the [LM] section, you have to put
> type = 8
> otherwise the moses.ini will be created to use IRSTLM.
>
> You have to delete the filtering directory
> tuning/filtered.?
> evaluation/*.filtered.?
> and delete the tuning sh file
> steps/?/TUNING_tune.*
>
> then continue the experiment
> .../experiment.perl -exec -continue=?
>
> On 26/11/2013 12:08, Daniel Valenzuela wrote:
> > Dear all,
> > after various manual setups, I wanted to try the EMS. After trying
> > several experiment settings I wanted to run it with multi-giza and
> > KenLM, but I cannot get it to work (I tried again with a smaller
> > corpus, same result). I tried to continue the experiment with different
> > fixes, without success.
> > The log tells me:
> > step TUNING:tune crashed
> > Further inspection of TUNING_tune.1.STDERR in steps/1/ showed that IRSTLM
> > is being used in my project, against my intent (at least so I thought):
> > line=IRSTLM name=LM0 factor=0
> > path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1
> > order=3
> > Exception: Error: 4 number of threads specified but IRST LM is not
> > threadsafe.
> > Exit code: 1
> > Failed to run moses with the config
> > /home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1
> > at /home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
> > cp: cannot stat
> > '/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini': No
> > such file or directory
> > Looking into the tuning folder, I found that
> > moses.filtered.ini.1 has IRSTLM set for Distortion, whereas
> > filtered.1/moses.ini has KenLM set for Distortion, which is what
> > I hoped to get.
> > I have attached the files mentioned above; the following is the config
> > file of the experiment:
> > ################################################
> > ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> > ################################################
> >
> >
> > [GENERAL]
> >
> > home-dir = /home/moses
> >
> > working-dir = $home-dir/project_test_mgiza/experiment
> > moses-src-dir = $home-dir/mosesdecoder
> > moses-script-dir = $moses-src-dir/scripts
> > moses-bin-dir = $moses-src-dir/bin
> > external-bin-dir = $moses-src-dir/BINDIR
> > data-dir = $home-dir/project_test_mgiza/experiment/corpus
> > train-dir = $data-dir/training
> > dev-dir = $data-dir/dev
> > #irstlm-dir = $home-dir/irstlm/bin
> >
> >
> > ttable-binarizer = $moses-bin-dir/processPhraseTable
> > decoder = $moses-bin-dir/moses
> >
> > input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> > $input-extension -threads 4"
> > output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> > $output-extension"
> > input-truecaser = $moses-script-dir/recaser/truecase.perl
> > output-truecaser = $moses-script-dir/recaser/truecase.perl
> > detruecaser = $moses-script-dir/recaser/detruecase.perl
> >
> >
> > input-extension = de
> > output-extension = en
> > pair-extension = de-en
> >
> > #################################################################
> > # PARALLEL CORPUS PREPARATION:
> > # create a tokenized, sentence-aligned corpus, ready for training
> >
> > [CORPUS]
> >
> > max-sentence-length = 80
> >
> > [CORPUS:project-syndicate]
> > raw-stem = $train-dir/news-commentary-v8.$pair-extension
> >
> > [LM]
> >
> > ### tool to be used for language model training
> > # for instance: ngram-count (SRILM), train-lm-on-disk.perl (Edinburgh)
> > #
> > #lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores 4
> > -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"
> > #settings = "-s msb -p 0"
> > #order = 3
> > #type = 8
> > #lm-binarizer = $moses-bin-dir/build_binary
> >
> > # path to lmplz binary
> > lmplz = $moses-bin-dir/lmplz
> > # order of the language model
> > order = 3
> > # additional parameters to lmplz (check lmplz help message)
> > settings = "-T $working-dir/tmp -S 10G"
> > # this tells EMS to use lmplz and tells EMS where lmplz is located
> > lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz
> > $lmplz"
> > lm-binarizer = $moses-bin-dir/build_binary
> >
> >
> >
> > [LM:project-syndicate]
> > raw-corpus =
> > $train-dir/news-commentary-v8.$pair-extension.$output-extension
> >
> >
> > #################################################################
> > # TRANSLATION MODEL TRAINING
> >
> > [TRAINING]
> >
> >
> > ### training script to be used: either a legacy script or
> > # current moses training script (default)
> > #
> > #script = $moses-script-dir/training/train-model.perl
> >
> >
> > ### general options
> > #
> > script = $moses-script-dir/training/train-model.perl
> > training-options = "-mgiza -mgiza-cpus 4 -cores 4 \
> > -parallel -sort-buffer-size 10G -sort-batch-size 253 \
> > -sort-compress gzip -sort-parallel 10"
> > parallel = yes
> >
> > ### symmetrization method to obtain word alignments from giza output
> > # (commonly used: grow-diag-final-and)
> > #
> > #alignment-symmetrization-method = berkeley
> > alignment-symmetrization-method = grow-diag-final-and
> >
> > ### lexicalized reordering: specify orientation type
> > # (default: only distance-based reordering model)
> > #
> > lexicalized-reordering = msd-bidirectional-fe
> >
> > ### if word alignment (giza symmetrization) should be skipped,
> > # point to word alignment files
> > #
> > #word-alignment =
> >
> > ### if phrase extraction should be skipped,
> > # point to stem for extract files
> > #
> > #extracted-phrases =
> >
> > ### if phrase table training should be skipped,
> > # point to phrase translation table
> > #
> > #phrase-translation-table =
> >
> > ### if reordering table training should be skipped,
> > # point to reordering table
> > #
> > #reordering-table =
> >
> > ### if training should be skipped,
> > # point to a configuration file that contains
> > # pointers to all relevant model files
> > #
> > #config =
> >
> > ### TUNING: finding good weights for model components
> >
> > [TUNING]
> >
> > ### instead of tuning with this setting, old weights may be recycled
> >
> > ### tuning script to be used
> > #
> > tuning-script = $moses-script-dir/training/mert-moses.pl
> > tuning-settings = "-mertdir $moses-bin-dir -threads 4"
> >
> > ### specify the corpus used for tuning
> > # it should contain 100s if not 1000s of sentences
> > #
> > raw-input = $dev-dir/news-test2008.$input-extension
> >
> > raw-reference = $dev-dir/news-test2008.$output-extension
> >
> > ### size of n-best list used (typically 100)
> > #
> > nbest = 100
> >
> > ### ranges for weights for random initialization
> > # if not specified, the tuning script will use generic ranges
> > # it is not clear, if this matters
> > #
> > # lambda =
> >
> > ### additional flags for the decoder
> > #
> > decoder-settings = "-threads 4"
> >
> > ### if tuning should be skipped, specify this here
> > # and also point to a configuration file that contains
> > # pointers to all relevant model files
> > #
> > #config =
> >
> >
> > #######################################################
> > ## TRUECASER: train model to truecase corpora and input
> >
> > [TRUECASER]
> >
> > ### script to train truecaser models
> > #
> > trainer = $moses-script-dir/recaser/train-truecaser.perl
> >
> > ### training data
> > # raw input still needs to be tokenized,
> > # also tokenized input may be specified
> > #
> > raw-stem = CORPUS:raw-stem
> >
> > ### trained model
> > #
> > #truecase-model =
> >
> >
> > ##################################
> > ## EVALUATION: score system output
> >
> > [EVALUATION]
> >
> > ### prepare system output for scoring
> > # this may include detokenization and wrapping output in sgm
> > # (needed for nist-bleu, ter, meteor)
> > #
> > detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l
> > $output-extension"
> >
> > decoder-settings = "-threads 4"
> >
> > ### should output be scored case-sensitive (default: no)?
> > #
> > # case-sensitive = yes
> >
> > ### BLEU
> > #
> >
> > multi-bleu = "$moses-script-dir/generic/multi-bleu.perl -lc"
> > # ibm-bleu =
> >
> > ### TER: translation error rate (BBN metric) based on edit distance
> > #
> > # ter = $edinburgh-script-dir/tercom_v6a.pl
> >
> > ### METEOR: gives credit to stem / WordNet synonym matches
> > #
> > # meteor =
> >
> > [EVALUATION:newstest2010]
> > raw-input = $dev-dir/newstest2011.$input-extension
> > raw-reference = $dev-dir/newstest2011.$output-extension
> >
> >
> > [REPORTING]
> >
> > ### what to do with result (default: store in file evaluation/report)
> > #
> > # email = pkoehn@inf.ed.ac.uk
> > ____________________
> > I hope somebody can help or suggest what I should do.
> > Thank you and kind regards
> > Daniel
> >
> >
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> ***

------------------------------



End of Moses-support Digest, Vol 85, Issue 46
*********************************************
