Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Too large language models - how to handle that?
(Holger Schwenk)
2. Re: Too large language models - how to handle that?
(Marcin Junczys-Dowmunt)
----------------------------------------------------------------------
Message: 1
Date: Tue, 25 Nov 2014 12:45:55 +0100
From: Holger Schwenk <Holger.Schwenk@lium.univ-lemans.fr>
Subject: Re: [Moses-support] Too large language models - how to
handle that?
To: moses-support@mit.edu
Message-ID: <54746BF3.9040801@lium.univ-lemans.fr>
Content-Type: text/plain; charset="utf-8"
Hello,
another option is to perform data selection and keep only the data
relevant to your task.
Usually this improves your performance, and as a nice side effect, your LM
is much smaller ;-)
Many people use the algorithm proposed by Moore and Lewis, which is
implemented in the freely available tool XenC (on GitHub); a minimal
sketch of the idea follows.
best,
Holger
On 11/25/2014 12:02 PM, Hoang Cuong wrote:
> Hi Raj, Tom and Marcin,
> I binarized the ARPA file last night, following your suggestions. In
> the end, it resulted in a binarized LM file of roughly *100GB* (@Marcin -
> it is not 20-30GB as you suggested; is this size okay?)
> Fortunately, the infrastructure at my university allows me to run
> experiments with that.
> Thanks a lot for your help.
> It is so great to play with such huge LMs :))
> Best,
>
>
> On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt
> <junczys@amu.edu.pl <mailto:junczys@amu.edu.pl>> wrote:
>
> The command
>
> moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm
>
> will build a compressed binarized model with quantization. You can run
>
> moses/bin/build_binary lm.arpa
>
> without any parameters to get size estimates for different
> parameter settings. I would guess you will get a binarized LM of
> roughly 20 to 30 GB, which is manageable (provided the size you gave
> us is that of an uncompressed text file). You can also use lmplz
> to build pruned models in the first place; these will be much
> smaller.
>
> On 2014-11-24 15:11, Tom Hoar wrote:
>
>> After binarizing such a large ARPA file with KenLM, you'll need
>> to configure your moses.ini file to "lazily load the model using
>> mmap." This means using lmodel-file code "9" instead of code "8."
>> More details here: https://kheafield.com/code/kenlm/moses/
>>
>> Performance improves significantly if you store the binarized
>> file on an SSD.
>>
>>
>>
>>
>> On 11/24/2014 07:00 PM, Raj Dabre wrote:
>>> Hey Hoang,
>>> You should binarize the ARPA file.
>>> The README of the LM tool (KenLM, IRSTLM, or SRILM) will tell
>>> you how.
>>> Regards.
>>>
>>> On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong
>>> <hoangcuong2011@gmail.com <mailto:hoangcuong2011@gmail.com>> wrote:
>>>
>>> Hi all,
>>> I have trained an (unpruned) 5-gram language model on a
>>> large corpus of 5 billion words, resulting in an ARPA-format
>>> file of roughly 300GB (is that a normal LM size for such a
>>> large monolingual corpus?). This is obviously too big for
>>> running an SMT system.
>>> I have read several papers whose systems use language models
>>> trained on similarly large monolingual corpora. Could you give me
>>> some advice on how to handle this, to make it feasible to run
>>> an SMT system?
>>> I appreciate your help a lot,
>>> Best,
>>> --
>>> Best Regards,
>>> Hoang Cuong
>>> SMTNerd
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Raj Dabre.
>>> Research Student,
>>> Graduate School of Informatics,
>>> Kyoto University.
>>> CSE MTech, IITB., 2011-2014
>>>
>>>
>>
>>
>
>
>
>
>
>
> --
> Best Regards,
> Hoang Cuong
> SMTNerd
>
------------------------------
Message: 2
Date: Tue, 25 Nov 2014 12:51:51 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Too large language models - how to handle
that?
To: Hoang Cuong <hoangcuong2011@gmail.com>
Cc: moses-support@mit.edu
Message-ID: <54746D57.2020105@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"
Hi Hoang,
Yes, that makes sense: binarizing a gzipped ARPA file with quantization
reduces it to roughly 30-40% of its size (0.33 × 300 GB ≈ 100 GB, which
matches what you got). I was assuming your file was an uncompressed
text file.
Happy to hear it works for you now.
Best,
Marcin
On 25.11.2014 at 12:02, Hoang Cuong wrote:
> Hi Raj, Tom and Marcin,
> I binarized the ARPA file last night, following your suggestions. In
> the end, it resulted in a binarized LM file of roughly *100GB* (@Marcin -
> it is not 20-30GB as you suggested; is this size okay?)
> Fortunately, the infrastructure at my university allows me to run
> experiments with that.
> Thanks a lot for your help.
> It is so great to play with such huge LMs :))
> Best,
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 97, Issue 78
*********************************************