Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Adding a language model built on Google Web (Alla Rozovskaya)
2. Re: Working with big models (liling tan)
3. Re: Working with big models (Marcin Junczys-Dowmunt)
----------------------------------------------------------------------
Message: 1
Date: Sat, 25 Apr 2015 12:24:03 -0400
From: Alla Rozovskaya <sigaliyah@gmail.com>
Subject: [Moses-support] Adding a language model built on Google Web
To: moses-support@mit.edu
Message-ID:
<CA+iaor17pGygKzH8Kspg6yO_=pTWkYS+CPNEm--Vup=n=7kMNA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hello,
I have built an interpolated count-based LM on the Google Web N-gram corpus
using the SRILM toolkit, as described here:
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
Is it possible to use it in Moses? In particular, since this model consists of
count files plus a file specifying the interpolation weights, what is the right
way to specify the path in moses.ini?
Thank you,
Alla
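For reference, a single-file ARPA or KenLM-binarized language model is normally
declared in moses.ini along the lines sketched below. The path, order and weight
are placeholders, and the sketch assumes the interpolated count-based model has
first been collapsed into a single ARPA file; whether Moses can read the
count-files-plus-weights form directly is exactly the question above.

[feature]
KENLM name=LM0 factor=0 order=5 path=/path/to/lm.arpa

[weight]
LM0= 0.5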
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150425/aafb18f3/attachment-0001.htm
------------------------------
Message: 2
Date: Sat, 25 Apr 2015 21:05:44 +0200
From: liling tan <alvations@gmail.com>
Subject: Re: [Moses-support] Working with big models
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJKxztUjiNCoCk++z_5k-PXryiQu-k+CzdVG=KbMyUCtug@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Dear Moses devs/users,
I've automated the binarization of the phrase model and reordering model and
added multi-threading to the filter script in Moses
(https://github.com/moses-smt/mosesdecoder/pull/109). Loading the binarized
and filtered translation models works fine.
The issue now is huge language models.
I have a 38GB compressed ARPA language model built from 16GB of raw text. I
binarized it with "moses/bin/build_binary" and it grew to 71GB. It works
fine if I don't tune my system, but when running MERT tuning for 100
iterations against the 71GB model, tuning takes almost forever.
A Google search turned up KenLM's filter:
https://kheafield.com/code/kenlm/filter/
but I'm not sure how to make it work.
What should I do to the LM after binarization?
Are there any other steps for manipulating large language models to reduce the
computing load when tuning?
What is the usual way to tune with a large LM file?
@Marcin, how did you deal with the large LM file when tuning?
Regards,
Liling
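For reference, KenLM's filter is usually applied to the ARPA file rather than to
the binarized model, with the tuning-set text supplied on stdin; the filtered
ARPA is then binarized as usual. The sketch below reflects only my understanding
of the invocation (mode name, argument order) and uses placeholder file names,
so check the usage message printed by bin/filter before relying on it.

# keep only n-grams whose words occur in the tuning set, read on stdin
moses/bin/filter single lm.arpa.gz lm.tune.arpa < dev.target

# binarize the much smaller filtered LM for use during tuning
moses/bin/build_binary trie lm.tune.arpa lm.tune.kenlm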
On Tue, Apr 21, 2015 at 7:48 PM, liling tan <alvations@gmail.com> wrote:
> Dear Moses dev/users,
>
>
> @Marcin, the bigger-than-usual reordering table is due to our allowance
> for high distortion. The 2.4 figure is after cleaning; the original data
> contained loads of rubbish sentence pairs.
>
> BTW, the compacting finished in under 4 hours. By the third hour I was
> starting to doubt whether the server could handle that amount.
>
> But the phrase table didn't shrink as much as I expected; it's still 1.1GB,
> which might take forever to load when decoding. Will the .minphr file be
> faster to load (it looks binarized, I think) than the normal .gz phrase
> table? If not, we're still looking at >18hrs of loading time on the server.
>
> But the reordering table went down from 6.7GB to 420MB.
>
> What exactly is the process for dealing with models >4GB? The standard
> Moses tutorial (the "moses rights of passage") and its processes fail at
> every step when the LM is not binarized, the phrase table/lexical table are
> not compacted, and processing/training/decoding is not threaded.
>
> Is there a guide to dealing with big models? How big can a model grow, and
> what server clock speed/RAM is needed in proportion?
>
>
> Regards,
> Liling
>
>
> On Tue, Apr 21, 2015 at 6:39 PM, liling tan <alvations@gmail.com> wrote:
>
>> Dear Moses devs/users,
>>
>> How should one work with big models?
>>
>> Originally, I had 4.5 million parallel sentences and ~13 million
>> sentences of monolingual data for the source and target languages.
>>
>> After cleaning with
>> https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
>> and
>> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl
>> I got 2.6 million parallel sentences.
>>
>> After training a phrase-based model with reordering, I get:
>>
>> 9.9GB of phrase-table.gz
>> 3.2GB of reordering-table.gz
>> ~45GB of language-model.arpa.gz
>>
>> I binarized the language model and ended up with
>>
>> ~75GB of language-model.binary
>>
>> We ran moses-mert.pl and it completed tuning in 3-4 days for both
>> directions on the dev set (3000 sentences), after filtering to:
>>
>> 364M phrase-table.gz
>> 1.8GB reordering-table.gz
>>
>> On the test set we did the filtering too, but when decoding it took 18
>> hours to load only 50% of the phrase table:
>>
>> 1.5GB phrase-table.gz
>> 6.7GB reordering-table.gz
>>
>> So we decided to compact the phrase table.
>>
>> For the phrase table and reordering table, we used processPhraseTableMin
>> and processLexicalTableMin, and I'm still waiting for the minimized
>> phrase table. Each has been running for 3 hours on 10 threads on
>> 2.5GHz cores.
>>
>> Does anyone have a rough idea how small the phrase table and lexical
>> table will get?
>>
>> With models of that size, how much RAM would be necessary, and how long
>> would it take to load them into RAM? Any other tips/hints on working
>> with big models efficiently?
>>
>> Is it even possible for us to use models of such a size on our small
>> server (24 cores, 2.5GHz, 128GB RAM)? If not, how big should our server
>> get?
>>
>> Regards,
>> Liling
>>
>
>
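As a reference for the compacting step mentioned in the quoted thread, the
compact-table tools are typically invoked roughly as below; the file names,
number of scores and thread count are placeholders, so adjust them to your
model and check each tool's usage output.

# build the compact phrase table (writes phrase-table.minphr)
moses/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 10

# build the compact lexicalized reordering table (writes reordering-table.minlexr)
moses/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 10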
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150425/333c8bff/attachment-0001.htm
------------------------------
Message: 3
Date: Sat, 25 Apr 2015 21:20:02 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Working with big models
To: moses-support@mit.edu
Message-ID: <553BE8E2.7060702@amu.edu.pl>
Content-Type: text/plain; charset="windows-1252"
Hi,
Binarizing like this gives you a much smaller file:

build_binary trie -a 22 -b 8 -q 8 lm.arpa.gz lm.kenlm

This uses quantization; in theory that could cause a quality loss, but I have
never seen it happen. Remove "-b 8 -q 8" if you are worried about that; the
file will be larger, but still a lot smaller than what you have now. That's
about all I do. You said "100 MERT iterations" ... what do you mean by that?
Also, the LM uses memory mapping in shared memory, so running several Moses
instances in parallel does not use additional memory for the LM; the same
holds for the phrase table.
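A brief gloss of the options in that command, as I read build_binary's usage
output (the bit widths are just the values suggested above and can be tuned;
the input and output names are placeholders):

# trie  : build the trie data structure, smaller than the default probing model
# -a 22 : pack trie pointers into offset arrays of at most 22 bits
# -b 8  : quantize backoff weights to 8 bits
# -q 8  : quantize probabilities to 8 bits
build_binary trie -a 22 -b 8 -q 8 lm.arpa.gz lm.kenlm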
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150425/2d11e32d/attachment.htm
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 102, Issue 51
**********************************************