Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: EMS set up with mgiza and KenLM (Hieu Hoang)
2. Re: Estimating probabilities with KenLM (Prasanth K)
----------------------------------------------------------------------
Message: 1
Date: Tue, 26 Nov 2013 13:03:03 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] EMS set up with mgiza and KenLM
To: moses-support@mit.edu
Message-ID: <52949C07.3050609@gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
In the [LM] section, you have to put
   type = 8
otherwise the moses.ini will be created to use IRSTLM.

You also have to delete the filtering directories
   tuning/filtered.?
   evaluation/*.filtered.?
and the tuning step files
   steps/?/TUNING_tune.*
then continue the experiment:
   .../experiment.perl -exec -continue=?
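For example, the relevant lines of the [LM] section from the config quoted
below would then look roughly like this (just a sketch; everything else
stays as in your config):

   [LM]
   # 8 selects KenLM; without this, EMS writes an IRSTLM entry into moses.ini
   type = 8
   order = 3
   lmplz = $moses-bin-dir/lmplz
   lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"
   lm-binarizer = $moses-bin-dir/build_binary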
On 26/11/2013 12:08, Daniel Valenzuela wrote:
> Dear all,
> after various manual setups, I wanted to try the EMS. After trying
> several experiment settings I wanted to run it with mgiza and
> KenLM, but I cannot get it to work (I tried it again with a smaller
> corpus, same result). I tried to continue the experiment with different
> fixes - no success.
> The log tells me:
> step TUNING:tune crashed
> Further inspection of TUNE_tune.1.STDERR in steps/1/ told me that IRSTLM
> is interfering with my project, "against" my will (at least I thought so):
> line=IRSTLM name=LM0 factor=0
> path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1
> order=3
> Exception: Error: 4 number of threads specified but IRST LM is not
> threadsafe.
> Exit code: 1
> Failed to run moses with the config
> /home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1
> at /home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
> cp: cannot stat
> '/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini': No
> such file or directory
> Looking at what happened in the tuning folder, I found that
> moses.filtered.ini.1 has IRSTLM set for Distortion, but
> filtered.1/moses.ini has KenLM set for Distortion, which is what
> I hoped to get.
> I have attached the files mentioned above; the following is the config
> file of the experiment:
> ################################################
> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> ################################################
>
>
> [GENERAL]
>
> home-dir = /home/moses
>
> working-dir = $home-dir/project_test_mgiza/experiment
> moses-src-dir = $home-dir/mosesdecoder
> moses-script-dir = $moses-src-dir/scripts
> moses-bin-dir = $moses-src-dir/bin
> external-bin-dir = $moses-src-dir/BINDIR
> data-dir = $home-dir/project_test_mgiza/experiment/corpus
> train-dir = $data-dir/training
> dev-dir = $data-dir/dev
> #irstlm-dir = $home-dir/irstlm/bin
>
>
> ttable-binarizer = $moses-bin-dir/processPhraseTable
> decoder = $moses-bin-dir/moses
>
> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> $input-extension -threads 4"
> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> $output-extension"
> input-truecaser = $moses-script-dir/recaser/truecase.perl
> output-truecaser = $moses-script-dir/recaser/truecase.perl
> detruecaser = $moses-script-dir/recaser/detruecase.perl
>
>
> input-extension = de
> output-extension = en
> pair-extension = de-en
>
> #################################################################
> # PARALLEL CORPUS PREPARATION:
> # create a tokenized, sentence-aligned corpus, ready for training
>
> [CORPUS]
>
> max-sentence-length = 80
>
> [CORPUS:project-syndicate]
> raw-stem = $train-dir/news-commentary-v8.$pair-extension
>
> [LM]
>
> ### tool to be used for language model training
> # for instance: ngram-count (SRILM), train-lm-on-disk.perl (Edinburgh)
> #
> #lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores 4
> -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"
> #settings = "-s msb -p 0"
> #order = 3
> #type = 8
> #lm-binarizer = $moses-bin-dir/build_binary
>
> # path to lmplz binary
> lmplz = $moses-bin-dir/lmplz
> # order of the language model
> order = 3
> # additional parameters to lmplz (check lmplz help message)
> settings = "-T $working-dir/tmp -S 10G"
> # this tells EMS to use lmplz and tells EMS where lmplz is located
> lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz
> $lmplz"
> lm-binarizer = $moses-bin-dir/build_binary
>
>
>
> [LM:project-syndicate]
> raw-corpus =
> $train-dir/news-commentary-v8.$pair-extension.$output-extension
>
>
> #################################################################
> # TRANSLATION MODEL TRAINING
>
> [TRAINING]
>
>
> ### training script to be used: either a legacy script or
> # current moses training script (default)
> #
> #script = $moses-script-dir/training/train-model.perl
>
>
> ### general options
> #
> script = $moses-script-dir/training/train-model.perl
> training-options = "-mgiza -mgiza-cpus 4 -cores 4 \
> -parallel -sort-buffer-size 10G -sort-batch-size 253 \
> -sort-compress gzip -sort-parallel 10"
> parallel = yes
>
> ### symmetrization method to obtain word alignments from giza output
> # (commonly used: grow-diag-final-and)
> #
> #alignment-symmetrization-method = berkeley
> alignment-symmetrization-method = grow-diag-final-and
>
> ### lexicalized reordering: specify orientation type
> # (default: only distance-based reordering model)
> #
> lexicalized-reordering = msd-bidirectional-fe
>
> ### if word alignment (giza symmetrization) should be skipped,
> # point to word alignment files
> #
> #word-alignment =
>
> ### if phrase extraction should be skipped,
> # point to stem for extract files
> #
> #extracted-phrases =
>
> ### if phrase table training should be skipped,
> # point to phrase translation table
> #
> #phrase-translation-table =
>
> ### if reordering table training should be skipped,
> # point to reordering table
> #
> #reordering-table =
>
> ### if training should be skipped,
> # point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config =
>
> ### TUNING: finding good weights for model components
>
> [TUNING]
>
> ### instead of tuning with this setting, old weights may be recycled
>
> ### tuning script to be used
> #
> tuning-script = $moses-script-dir/training/mert-moses.pl
> tuning-settings = "-mertdir $moses-bin-dir -threads 4"
>
> ### specify the corpus used for tuning
> # it should contain 100s if not 1000s of sentences
> #
> raw-input = $dev-dir/news-test2008.$input-extension
>
> raw-reference = $dev-dir/news-test2008.$output-extension
>
> ### size of n-best list used (typically 100)
> #
> nbest = 100
>
> ### ranges for weights for random initialization
> # if not specified, the tuning script will use generic ranges
> # it is not clear if this matters
> #
> # lambda =
>
> ### additional flags for the decoder
> #
> decoder-settings = "-threads 4"
>
> ### if tuning should be skipped, specify this here
> # and also point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config =
>
>
> #######################################################
> ## TRUECASER: train model to truecase corpora and input
>
> [TRUECASER]
>
> ### script to train truecaser models
> #
> trainer = $moses-script-dir/recaser/train-truecaser.perl
>
> ### training data
> # raw input still needs to be tokenized,
> # also tokenized input may be specified
> #
> raw-stem = CORPUS:raw-stem
>
> ### trained model
> #
> #truecase-model =
>
>
> ##################################
> ## EVALUATION: score system output
>
> [EVALUATION]
>
> ### prepare system output for scoring
> # this may include detokenization and wrapping output in sgm
> # (needed for nist-bleu, ter, meteor)
> #
> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l
> $output-extension"
>
> decoder-settings = "-threads 4"
>
> ### should output be scored case-sensitive (default: no)?
> #
> # case-sensitive = yes
>
> ### BLEU
> #
>
> multi-bleu = "$moses-script-dir/generic/multi-bleu.perl -lc"
> # ibm-bleu =
>
> ### TER: translation error rate (BBN metric) based on edit distance
> #
> # ter = $edinburgh-script-dir/tercom_v6a.pl
>
> ### METEOR: gives credit to stem / WordNet synonym matches
> #
> # meteor =
>
> [EVALUATION:newstest2010]
> raw-input = $dev-dir/newstest2011.$input-extension
> raw-reference = $dev-dir/newstest2011.$output-extension
>
>
> [REPORTING]
>
> ### what to do with result (default: store in file evaluation/report)
> #
> # email = pkoehn@inf.ed.ac.uk
> ____________________
> I hope somebody can help or suggest what to do.
> Thank you and kind regards,
> Daniel
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20131126/e8172036/attachment-0001.htm
------------------------------
Message: 2
Date: Tue, 26 Nov 2013 15:57:28 +0100
From: Prasanth K <prasanthk.ms09@gmail.com>
Subject: Re: [Moses-support] Estimating probabilities with KenLM
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CA+n+9-jOYkjoksjVLwptAhOXC0LPwfv_cNw32f-wiEN-UsVH+g@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
OK, I have managed to re-create this error (no reason why it shouldn't come
back; I knew exactly what I told Moses to do). So, the exact command run to
create the language model, taken from the logs, is as follows:
scripts/generic/trainlm-lmplz.perl -lmplz bin/lmplz -order 5 -T
europarl.en-sv/phrase-based-dup/tmp
-S 10G -text europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 -lm
europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
Of course, all paths in the above command were absolute; I just shortened
them for readability. When this is run, the EMS log file
LM_europarl_train.id.STDERR contains the following:
EXECUTING bin/lmplz --order 5 -T europarl.en-sv/phrase-based-dup/tmp -S 10G
< europarl.en-sv/phrase-based-dup/lm/europarl.lowercased.1 >
europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Function not implemented
This does not cause the language model step to crash; instead it creates an
empty language model (0 lines). Below is the log file
LM_europarl_binarize.id.STDERR:
Reading europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
End of file Byte: 0 File: europarl.en-sv/phrase-based-dup/lm/europarl.lm.1
ERROR
Clearly, something is wrong with my installation of KenLM (decoding with
KenLM works just fine; I have confirmed that now), which makes the
estimation go wrong. The question is where to start fixing this?
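Maybe one way to narrow it down is to run lmplz by hand on a small file,
first without -T and -S and then adding them back one at a time, to see
which option triggers the "Function not implemented" (paths below are just
placeholders):

   bin/lmplz --order 5 < small.lowercased.en > small.arpa
   bin/lmplz --order 5 -T /some/tmpdir < small.lowercased.en > small.arpa
   bin/lmplz --order 5 -T /some/tmpdir -S 10G < small.lowercased.en > small.arpa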
Thanks.
- Regards,
Prasanth
On Tue, Nov 26, 2013 at 1:56 PM, Hieu Hoang <hieuhoang@gmail.com> wrote:
> OK, I can't reproduce your error:
>     Function not implemented
> You should find out exactly how lmplz is being run; it may be that you
> have a slightly older version that doesn't know all the arguments you've
> given it.
>
>
> On 26/11/2013 06:47, Prasanth K wrote:
>
> Hello Hieu,
>
> My first attempt was to specify the absolute amount of memory (10G), but
> that gave an error saying "function not implemented". Later, when I tried
> specifying the relative size (80%), I got a similar parse error to the one
> you have given above. Strange that it should
>
> @Kenneth, thanks for the code to estimate physical memory. I am going to
> give it a shot and let you know how it goes.
>
> - Regards,
> Prasanth
>
>
> On Mon, Nov 25, 2013 at 9:20 PM, Hieu Hoang <hieuhoang@gmail.com> wrote:
>
>> Prasanth - what is the exact lmplz command that was run by the EMS?
>>
>>
>> This works
>> .../lmplz --order 5 --text lm/europarl.lowercased.1 --arpa
>> lm/europarl.lmplz -T /tmp -S 1G
>> This doesn't
>> .../lmplz --order 5 --text lm/europarl.lowercased.1 --arpa
>> lm/europarl.lmplz -T /tmp -S 80%
>> It gives the error:
>> util/usage.cc:220 in uint64_t util::<anonymous
>> namespace>::ParseNum(const std::string &) [Num = double] threw
>> SizeParseError because `!mem'.
>> Failed to parse 80% into a memory size because % was specified but the
>> physical memory size could not be determined.
>>
>> However, it worked even with the source code from 4 days ago.
>>
>>
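>> Roughly, the "%" case can only be resolved if the total physical memory
>> is known, which is exactly what fails here. As an illustrative sketch
>> (not the actual util/usage.cc code), the parsing has to do something
>> like this:
>>
>>   #include <stdint.h>
>>   #include <cstdlib>
>>   #include <stdexcept>
>>   #include <string>
>>
>>   // Hypothetical helper: convert "1G" or "80%" to a byte count.
>>   uint64_t ParseMemorySketch(const std::string &arg, uint64_t physical) {
>>     if (!arg.empty() && *arg.rbegin() == '%') {
>>       // A percentage is relative to total RAM, so it cannot be resolved
>>       // when the physical memory size could not be determined (== 0).
>>       if (!physical)
>>         throw std::runtime_error("cannot resolve % without physical memory size");
>>       return static_cast<uint64_t>(std::atof(arg.c_str()) / 100.0 * physical);
>>     }
>>     // Absolute sizes such as "1G" work regardless of physical memory.
>>     char *end = NULL;
>>     double value = std::strtod(arg.c_str(), &end);
>>     uint64_t unit = 1;
>>     if (*end == 'K') unit = 1ULL << 10;
>>     else if (*end == 'M') unit = 1ULL << 20;
>>     else if (*end == 'G') unit = 1ULL << 30;
>>     return static_cast<uint64_t>(value * unit);
>>   }
>>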
>> On 25/11/2013 19:07, Kenneth Heafield wrote:
>> > Hi,
>> >
>> > I've taken a shot in the dark based on physmem.c to support
>> physical
>> > memory estimation on BSD and OS X. Please clone
>> >
>> > github.com/kpu/kenlm
>> >
>> > and compile with
>> >
>> > ./bjam
>> >
>> > If that fails, please let Hieu and me know (maybe Hieu can help since he
>> > has OS X). If it doesn't fail, run
>> >
>> > bin/lmplz
>> >
>> > with no arguments. The help message will include a line such as
>> >
>> > "This machine has 135224176640 bytes of memory."
>> >
>> > or
>> >
>> > "Unable to determine the amount of memory on this machine."
>> >
>> > If it works, then I'll push it to Moses. I'm trying not to break Moses
>> > master for OS X.
>> >
>> > Kenneth
>> >
>> > On 11/24/13 22:40, Prasanth K wrote:
>> >> Hi Kenneth,
>> >>
>> >> Thanks for the clarification w.r.t. calculating the memory size. But I
>> >> am running these on a Mac (10.9 Mavericks). Do you think I should still
>> >> port the lmplz code to Mac for the estimation of probabilities?
>> >>
>> >> One thing, though: I did change the default clang compiler that comes
>> >> with this new Mac to gcc-4.8 (not sure whether that changes anything in
>> >> this context).
>> >>
>> >> - Prasanth
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Nov 22, 2013 at 6:50 PM, Kenneth Heafield <moses@kheafield.com
>> >> <mailto:moses@kheafield.com>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> What OS are you on? Cygwin? Apparently every OS reports
>> >> memory size
>> >> in a different way:
>> >>
>> >>
>> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/physmem.c;h=2629936146e3042f927523322f18aca76996cd7f;hb=HEAD
>> >>
>> >> The good news is that the above code is LGPLv2:
>> >>
>> >>
>> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=modules/physmem;h=9644522e0493a85a9fb4ae7c4449741c2c1500ea;hb=HEAD
>> >>
>> >> But currently I'm just using this short function that will fail
>> on some
>> >> platforms:
>> >>
>> >> uint64_t GuessPhysicalMemory() {
>> >> #if defined(_WIN32) || defined(_WIN64)
>> >>   return 0;
>> >> #elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
>> >>   long pages = sysconf(_SC_PHYS_PAGES);
>> >>   if (pages == -1) return 0;
>> >>   long page_size = sysconf(_SC_PAGESIZE);
>> >>   if (page_size == -1) return 0;
>> >>   return static_cast<uint64_t>(pages) * static_cast<uint64_t>(page_size);
>> >> #else
>> >>   return 0;
>> >> #endif
>> >> }