Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Working with big models (Marcin Junczys-Dowmunt)
2. SRILM error in Moses, when we're not using SRILM (Lane Schwartz)
3. EMS question (Hieu Hoang)
----------------------------------------------------------------------
Message: 1
Date: Tue, 21 Apr 2015 20:02:02 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Working with big models
To: moses-support@mit.edu
Message-ID: <5536909A.6030308@amu.edu.pl>
Content-Type: text/plain; charset="windows-1252"
Hi,
> @Marcin, the bigger-than-usual reordering table is due to our
> allowance for high distortion. 2.4 GB is after cleaning it up; the
> original size included loads of rubbish sentence pairs.
Where do you have that distortion set?
>
> BTW, the compactization finished in under 4 hours. I guess by the 3rd
> hour I was starting to doubt whether the server could handle that amount.
The binarization is not that heavy on the server. It just takes a while.
As long as there is progress you are fine.
>
> But the phrase table size didn't go down as much as I expected; it's
> still 1.1 GB, which might take forever to load when decoding. Will the
> .minphr file be faster to load (it looks binarized, I think) than the
> normal .gz phrase table? If not, we're still looking at >18 hours of
> loading time on the server.
Try it :) It should not take more than a couple of seconds.
>
> But the reordering table went down from 6.7 GB to 420 MB.
Weird. I am a little suspicious of your text tables, as the size
distributions seem so unusual. But if it works for you, then all right.
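For reference, the compact tables are typically wired into moses.ini roughly like this (a sketch only: paths and feature names are placeholders, the reordering type is assumed, and the path is given without the .minphr/.minlexr extension, which Moses resolves itself). Because the compact formats are memory-mapped, loading should take seconds rather than hours:

```ini
[feature]
PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/work/model/phrase-table input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/work/model/reordering-table
```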
>
> What exactly is the process for dealing with models >4 GB? The
> standard Moses tutorial's "rites of passage" and processes would fail
> at every step when working with a non-binarized LM, a non-compacted
> phrase table/lexical table, and non-threaded
> processing/training/decoding.
>
> Is there a guide to dealing with big models? How big can a model grow,
> and how much server clock speed/RAM does it need in proportion?
>
I have a 128 GB server on which I build and use models from 150 M
parallel sentences, and LMs from hundreds of GB of monolingual text, and
I am doing just fine. Unbinarized models are not meant for deployment on
any machine, whatever its size. Treat the text models as intermediate
representations and the binarized models as the final deployment models.
You are fine in terms of RAM if your binarized models fit into RAM, plus
a couple of GB for computations.
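To make this concrete, the pipeline discussed in this thread could be sketched as below. This is a sketch only: the binary names assume a standard Moses/KenLM build, all file names are placeholders, and -nscores must match the number of scores in your phrase table (4 is an assumption here).

```shell
# Compact the text phrase table -> phrase-table.minphr (memory-mapped at load time)
processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 10

# Compact the lexicalized reordering table -> reordering-table.minlexr
processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 10

# Binarize the ARPA LM with KenLM; "trie" is the smaller of the two formats
build_binary trie language-model.arpa.gz language-model.binary
```

With the binarized files in place, point moses.ini at the path prefixes (without the .minphr/.minlexr extensions), and loading becomes a memory-map rather than a parse of the full text table.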
>
> Regards,
> Liling
>
>
> On Tue, Apr 21, 2015 at 6:39 PM, liling tan <alvations@gmail.com
> <mailto:alvations@gmail.com>> wrote:
>
> Dear Moses devs/users,
>
> *How should one work with big models?*
>
> Originally, I had 4.5 million parallel sentences and ~13 million
> sentences of monolingual data for the source and target languages.
>
> After cleaning with
> https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
> and
> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl,
> I got 2.6 million parallel sentences.
>
>
> And after training a phrase-based model with reordering, I get:
>
> 9.9GB of phrase-table.gz
> 3.2GB of reordering-table.gz
> ~45GB of language-model.arpa.gz
>
>
> After binarizing the language model, I got:
>
> ~75GB of language-model.binary
>
> We ran moses-mert.pl and it completed the tuning in 3-4 days in both
> directions on the dev set (3,000 sentences), after filtering:
>
>
> 364M phrase-table.gz
> 1.8GB reordering-table.gz
>
>
> On the test set we did the filtering too, but when decoding it
> took 18 hours to load only 50% of the phrase table:
>
> 1.5GB phrase-table.gz
> 6.7GB reordering-table.gz
>
>
> So we decided to compact the phrase table.
>
> For the phrase table and reordering table, we used
> processPhraseTableMin and processLexicalTableMin, and I'm still
> waiting for the minimized phrase table. It has been running for
> 3 hours with 10 threads each on 2.5 GHz cores.
>
> *Does anyone have a rough idea how small the phrase table and lexical
> table will get?*
>
> *With that kind of model, how much RAM would be necessary? And how
> long would it take to load the model into RAM?
>
> Any other tips/hints on working with big models efficiently?*
>
> *Is it even possible for us to use models of such a size on our
> small server (24 cores, 2.5 GHz, 128 GB RAM)? If not, how big should
> our server get?*
>
> Regards,
> Liling
>
>
>
>
------------------------------
Message: 2
Date: Tue, 21 Apr 2015 13:55:25 -0500
From: Lane Schwartz <dowobeha@gmail.com>
Subject: [Moses-support] SRILM error in Moses, when we're not using
SRILM
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CABv3vZnBLEBmO8diSh46mYuch-b9xnuFfGoOfNh-V8kkJM6_bw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
One of my students has encountered the following error. He's using
EMS, and as far as I can tell, it's all configured to use KenLM. Any
ideas on what might be going wrong?
Using SCRIPTS_ROOTDIR: /opt/moses/scripts
Asking moses for feature names and values from
/home/massung1/moses-fi-en/tuning/moses.filtered.ini.1
Executing: /opt/moses/bin/moses -threads 12 -v 0 -config
/home/massung1/moses-fi-en/tuning/moses.filtered.ini.1 -show-weights >
./features.list
Executing: /opt/moses/bin/moses -threads 12 -v 0 -config
/home/massung1/moses-fi-en/tuning/moses.filtered.ini.1 -show-weights >
./features.list
Initializing LexicalReordering..
Exception: moses/FF/Factory.cpp:321 in void
Moses::FeatureRegistry::Construct(const string&, const string&) threw
UnknownFeatureException because `i == registry_.end()'.
Feature name SRILM is not registered.
Exit code: 1
Failed to run moses with the config
/home/massung1/moses-fi-en/tuning/moses.filtered.ini.1 at
/opt/moses/scripts/training/mert-moses.pl line 1354.
cp: cannot stat '/home/massung1/moses-fi-en/tuning/tmp.1/moses.ini':
No such file or directory
Running moses in /home/massung1/moses-fi-en/ using config.toy
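For comparison, a default Moses build registers the KENLM feature but not SRILM, so an error like the one above usually means the (filtered) ini still contains an SRILM feature line. A KenLM LM entry looks roughly like this (a sketch; the path, feature name, and order are placeholders):

```ini
[feature]
# Works with a default Moses build (KenLM is compiled in):
KENLM name=LM0 factor=0 path=/path/to/language-model.binary order=5

# Would require Moses compiled with SRILM support, otherwise
# "Feature name SRILM is not registered":
# SRILM name=LM0 factor=0 path=/path/to/language-model order=5

[weight]
LM0= 0.5
```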
Thanks,
Lane
------------------------------
Message: 3
Date: Wed, 22 Apr 2015 17:07:50 +0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: [Moses-support] EMS question
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbi+i4fvrEmCjN3jLsuWLx5Mk_UbkU_FWp0rueZWCSBvzA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Does anyone know how I can specify that the *tokenized* data for LM
training should be taken from the target side of the parallel corpus?
E.g. something like
------------------
[CORPUS]
[CORPUS:nc]
.....
[LM]
[LM:nc]
tokenized-corpus = [CORPUS:nc].$output-extension
-------------------
My tokenizer is rather slow, so I don't want the data tokenized twice:
once for the parallel data and again for the LM data.
Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 102, Issue 45
**********************************************