Moses-support Digest, Vol 102, Issue 44

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Working with big models (liling tan)
2. Re: Working with big models (Marcin Junczys-Dowmunt)
3. Re: Working with big models (liling tan)


----------------------------------------------------------------------

Message: 1
Date: Tue, 21 Apr 2015 18:39:51 +0200
From: liling tan <alvations@gmail.com>
Subject: [Moses-support] Working with big models
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJ+7fGDQyhPXxfZMBgOp6N7cagj1U_JLunrFqiSnBT+8Tg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Moses devs/users,

*How should one work with big models?*

Originally, I had 4.5 million parallel sentences and ~13 million sentences of
monolingual data for the source and target languages.

After cleaning with
https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
and
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl
, I got 2.6 million parallel sentences.
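
For reference, the cleaning was along these lines (language codes, paths and
the 1/80 length limits below are placeholders; gacha_filter.py's exact
arguments are documented in the script itself):

    # gacha_filter.py (ratio-based cleaning) was run first; see its docstring.
    # Then the standard Moses length-based cleaning:
    perl mosesdecoder/scripts/training/clean-corpus-n.perl \
        corpus.tok src trg corpus.clean 1 80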


And after training a phrase-based model with reordering, I get:

9.9GB of phrase-table.gz
3.2GB of reordering-table.gz
~45GB of language-model.arpa.gz
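
The training itself followed the standard baseline recipe, roughly like this
(language codes, paths, LM order and the reordering configuration below are
placeholders for our actual settings):

    perl mosesdecoder/scripts/training/train-model.perl -root-dir train \
        -corpus corpus.clean -f src -e trg \
        -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
        -lm 0:5:/path/to/language-model.binary:8 \
        -external-bin-dir mosesdecoder/tools >& training.out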


I binarized the language model and ended up with:

~75GB of language-model.binary
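
The binarization was just KenLM's build_binary with the trie data structure,
something like the following (decompress the ARPA first if your build cannot
read .gz; quantization flags such as -q/-b can shrink it further, see
build_binary's usage output):

    mosesdecoder/bin/build_binary trie language-model.arpa.gz language-model.binary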

We ran mert-moses.pl and it completed the tuning in 3-4 days for both
directions on the dev set (3,000 sentences), after filtering the tables down to:


364M phrase-table.gz
1.8GB reordering-table.gz
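
The filtering and tuning were roughly the standard commands (paths are
placeholders; the thread count just matches our machine):

    perl mosesdecoder/scripts/training/filter-model-given-input.pl \
        filtered-dev train/model/moses.ini dev.src
    perl mosesdecoder/scripts/training/mert-moses.pl dev.src dev.ref \
        mosesdecoder/bin/moses filtered-dev/moses.ini \
        --mertdir mosesdecoder/bin --decoder-flags "-threads 10"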


On the test set, we did the filtering too, but when decoding it took 18
hours to load just 50% of the phrase table:

1.5GB phrase-table.gz
6.7GB reordering-table.gz


So we decided to compact the phrase table.

For the phrase table and the reordering table, we used processPhraseTableMin
and processLexicalTableMin, and I'm still waiting for the minimized tables.
Each has been running for 3 hours on 10 threads on 2.5GHz cores.
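
The exact invocations were along these lines (-nscores should match the
number of scores in your phrase table; the output names are ours):

    mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz \
        -out phrase-table -nscores 4 -threads 10
    mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz \
        -out reordering-table -threads 10

These should produce phrase-table.minphr and reordering-table.minlexr.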

*Does anyone have a rough idea of how small the phrase table and lexical
table will get?*



*With a model of that size, how much RAM would be necessary? And how long
would it take to load the model into RAM? Any other tips/hints on working
with big models efficiently?*

*Is it even possible for us to use models of this size on our small server
(24 cores, 2.5GHz, 128GB RAM)? If not, how big should our server be?*

Regards,
Liling

------------------------------

Message: 2
Date: Tue, 21 Apr 2015 19:12:32 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Working with big models
To: moses-support@mit.edu
Message-ID: <55368500.8040306@amu.edu.pl>
Content-Type: text/plain; charset="windows-1252"

Hi,
2-4M sentences is not that big :)

As for the compact phrase table, the binarized version will be roughly
half the size of your gzipped text phrase table, and the lexical table
should be smaller. However, how come your gzipped reordering table is
bigger than your phrase table? That's unusual.

Also, 128 GB of RAM is plenty.

Best,
Marcin



------------------------------

Message: 3
Date: Tue, 21 Apr 2015 19:48:14 +0200
From: liling tan <alvations@gmail.com>
Subject: Re: [Moses-support] Working with big models
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJJSLg+0pvyeUYgdPEyLud-WWrwpetYC01MSGc5mGqMWBw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Moses devs/users,


@Marcin, the bigger-than-usual reordering table is due to our allowing a
high distortion limit. The 2.4M is the figure after cleaning; the original
corpus contained loads of rubbish sentence pairs.

BTW, the compaction finished in under 4 hours. I guess by the 3rd hour I was
starting to doubt whether the server could handle that amount of data.

But the phrase table size didn't go down as much as I expected; it's still
1.1GB, which might take forever to load when decoding. Will the .minphr file
be faster to load (it looks binarized, I think) than the normal .gz phrase
table? If not, we're still looking at >18 hours of loading time on the server.

But the reordering table went down from 6.7GB to 420MB.
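
For what it's worth, this is roughly how we plan to point moses.ini at the
compact tables; please correct me if the feature lines are wrong (num-features
and the reordering type must match our training configuration, and each
feature goes on a single line in moses.ini):

    PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/path/to/phrase-table input-factor=0 output-factor=0
    LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/path/to/reordering-table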

What exactly is the process for dealing with models >4GB? The standard Moses
tutorial, the "Moses rites of passage", and the usual processes would fail at
every step when using a non-binarized LM, non-compact phrase/lexical tables,
and non-threaded processing/training/decoding.
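
Concretely, the setup we are converging on is something like this (a sketch,
assuming the binarized LM and the compact tables above are already referenced
in moses.ini):

    # moses.ini points at language-model.binary (KENLM feature) and the
    # compact .minphr/.minlexr tables; decoding is run multi-threaded:
    mosesdecoder/bin/moses -f moses.ini -threads 10 < test.src > test.out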

Is there a guide on dealing with big models? How big can a model grow, and
what server clock speed/RAM is needed in proportion?


Regards,
Liling



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 102, Issue 44
**********************************************
