Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: How to train a tree-based model? (Rico Sennrich)
2. how to test whether tcmalloc is used? (Li Xiang)
3. Format of binarized phrase tables (Raj Dabre)
4. Re: Too large language models - how to handle that?
(Jorg Tiedemann)
----------------------------------------------------------------------
Message: 1
Date: Tue, 25 Nov 2014 17:57:01 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] How to train a tree-based model?
To: moses-support@mit.edu
Message-ID: <loom.20141125T183408-810@post.gmane.org>
Content-Type: text/plain; charset=utf-8
Steven Huang <d98922047@...> writes:
>
> The question is:
> 1. Can I use all the 3 factors when training a tree-based model? If yes, what
should the parallel corpus look like? The XML format shown in the Moses
tutorial does not seem to accept any factor other than the surface form.
I've successfully tested a toy syntactic model with factors, but there has
been no systematic testing and I imagine many things won't work (e.g. having
different factors for the translation model and the language model). The
format in my corpus was like this:
<tree label="sent"><tree label="root">c|x</tree><tree
label="root">b|y</tree><tree label="root">b|y</tree></tree>
> 2. I want to use trees on both source and target side, is it correct to
add the following arguments to train-model.perl?
>
>
> --ghkm \
> --source-syntax \
> --target-syntax \
> --LeftBinarize \
the GHKM implementation currently assumes string-to-tree (or tree-to-string)
rules, but I think you can try the hierarchical extractor (just leave out
'--ghkm') with both source and target syntax.
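As a rough sketch (not from the original thread), a hierarchical extraction run with syntactic annotation on both sides might be invoked as follows; every path, the language pair, and the corpus name are placeholders:

```shell
# Hierarchical rule extraction with syntax on both source and target side.
# All paths, languages, and file names below are placeholders.
train-model.perl \
    --root-dir train \
    --corpus corpus --f en --e de \
    --hierarchical --glue-grammar \
    --source-syntax --target-syntax \
    --lm 0:5:/path/to/lm.kenlm \
    --external-bin-dir /path/to/giza-bin
```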
>
> 3. I noticed that after using the Stanford Parser to generate trees for the
parallel corpus, the resulting trees might be 1-to-many (or many-to-1) for a
particular sentence, e.g. the source-language sentence is parsed into a
single tree, while the target-language sentence is parsed into 2 trees. Will
this break the "parallel" property of the parallel corpus?
you'll need to ensure that you get one tree per sentence. Either do some
post-processing and merge the two trees into one by creating a virtual root
node, or throw out these sentence pairs.
hope this helps,
Rico
------------------------------
Message: 2
Date: Wed, 26 Nov 2014 11:05:32 +0800
From: Li Xiang <lixiang.ict@gmail.com>
Subject: [Moses-support] how to test whether tcmalloc is used?
To: moses-support <moses-support@mit.edu>
Message-ID: <7402ABE0-4D5A-4BD5-A214-E249ABA085E4@gmail.com>
Content-Type: text/plain; charset=us-ascii
I compiled Moses with tcmalloc. How can I test whether tcmalloc is actually being used, and evaluate its performance?
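Not an authoritative answer from the list, but one quick sanity check on Linux is to inspect the decoder binary's shared-library dependencies: if tcmalloc was linked dynamically, `ldd` will list it. A minimal sketch (the `moses/bin/moses` path is an assumption about your build layout):

```python
import subprocess

def uses_tcmalloc(ldd_output: str) -> bool:
    """True if any linked shared library name mentions tcmalloc."""
    return any("tcmalloc" in line for line in ldd_output.splitlines())

def binary_uses_tcmalloc(path: str) -> bool:
    """Run ldd on a binary (Linux only) and scan its dependencies."""
    out = subprocess.run(["ldd", path], capture_output=True, text=True)
    return uses_tcmalloc(out.stdout)

# Usage (the path is an assumption about your build layout):
# binary_uses_tcmalloc("moses/bin/moses")
```

Note that this only detects dynamic linking; a statically linked tcmalloc would not appear in `ldd` output and would have to be found by inspecting symbols or via gperftools' heap-profiler environment variables.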
------------------------------
Message: 3
Date: Wed, 26 Nov 2014 12:22:28 +0900
From: Raj Dabre <prajdabre@gmail.com>
Subject: [Moses-support] Format of binarized phrase tables
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAB3gfjBv0uOLLqemcgeYtCCsVnmyp6rk6rMCCDzQRgUBL6jj7A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hello All,
I know that Moses allows for binarization of a phrase table, which can then be
read on demand at decoding time.
We get 5 files named: phrase-table.binphr.*
I want to write my own routine in Java to read phrase pairs from these on
demand.
Can anyone guide me?
PS: An explanation of the same for binarized reordering tables would be
great too.
Thanks in advance.
--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20141126/4a0e2640/attachment-0001.htm
------------------------------
Message: 4
Date: Wed, 26 Nov 2014 08:40:05 +0100
From: Jorg Tiedemann <tiedeman@gmail.com>
Subject: Re: [Moses-support] Too large language models - how to
handle that?
To: Holger Schwenk <Holger.Schwenk@lium.univ-lemans.fr>
Cc: moses-support@mit.edu
Message-ID: <DF374466-7D43-4D2E-A646-72D4815E072A@gmail.com>
Content-Type: text/plain; charset="utf-8"
Could we add all these tricks of the trade to the Moses website, for example, at http://www.statmt.org/moses/?n=Moses.Optimize
(also for other topics?). I would really like that ...
Cheers,
Jörg
On Nov 25, 2014, at 12:45 PM, Holger Schwenk wrote:
> Hello,
>
> another option is to perform data selection to keep only the data relevant to your task.
> Usually you improve your performance and, as a nice side effect, your LM is much smaller ;-)
>
> Many people use the algorithm proposed by Moore and Lewis, which is implemented in the freely available tool XenC (on GitHub).
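For readers unfamiliar with it, the Moore-Lewis criterion scores each candidate sentence by the cross-entropy difference between an in-domain LM and a general-domain LM, and keeps the lowest-scoring sentences. A toy sketch with add-one-smoothed unigram models (XenC uses proper n-gram LMs; this only illustrates the scoring idea):

```python
import math
from collections import Counter

def unigram_logprobs(corpus):
    """Build an add-one-smoothed unigram log-probability function
    from a list of whitespace-tokenized sentences."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    def logprob(tok):
        return math.log((counts.get(tok, 0) + 1) / (total + vocab))
    return logprob

def moore_lewis_score(sentence, lp_in, lp_gen):
    """Cross-entropy difference H_in(s) - H_gen(s).
    Lower scores indicate sentences that look more in-domain."""
    toks = sentence.split()
    h_in = -sum(lp_in(t) for t in toks) / len(toks)
    h_gen = -sum(lp_gen(t) for t in toks) / len(toks)
    return h_in - h_gen
```

Ranking a candidate pool by this score and keeping the lowest-scoring fraction gives the selected subset.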
>
> best,
>
> Holger
>
> On 11/25/2014 12:02 PM, Hoang Cuong wrote:
>> Hi Raj, Tom and Marcin,
>> I binarized the ARPA file last night, following your suggestion. In the end, it resulted in a binarized LM file of roughly 100GB (@Marcin - it is not 20-30GB as you suggested; is this size okay?)
>> Fortunately, the infrastructure at my university allows me to run experiments with that.
>> Thanks a lot for your help.
>> It is so great to play with such huge LMs :))
>> Best,
>>
>>
>> On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:
>> The command
>>
>> moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm
>>
>> will build a compressed binarized model with quantization. You can run
>>
>> moses/bin/build_binary lm.arpa
>>
>> without any parameters to get size estimates for different parameter settings. I would guess you will get a binarized LM of roughly 20 to 30 GB, which is manageable (provided the size you gave us is that of an uncompressed text file). You can also use lmplz to build pruned models in the first place; these will be much smaller.
>>
>> On 2014-11-24 15:11, Tom Hoar wrote:
>>
>>> After binarizing such a large ARPA file with KenLM, you'll need to configure your moses.ini file to "lazily load the model using mmap." This involves using lmodel-file code "9" instead of code "8". More details here: https://kheafield.com/code/kenlm/moses/
>>>
>>> Performance improves significantly if you store the binarized file on an SSD.
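In the old-style moses.ini format referenced above, that is a one-character change in the [lmodel-file] section (the path, factor, and order below are placeholders):

```ini
[lmodel-file]
# fields: implementation factor order filename
# implementation 8 = KenLM loaded into memory, 9 = KenLM lazily mmapped
9 0 5 /path/to/lm.kenlm
```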
>>>
>>>
>>>
>>>
>>> On 11/24/2014 07:00 PM, Raj Dabre wrote:
>>>> Hey Hoang,
>>>> You should binarize the arpa file.
>>>> The readme of the LM tool (KenLM or IRSTLM or SRILM) will tell you how.
>>>> Regards.
>>>>
>>>> On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <hoangcuong2011@gmail.com> wrote:
>>>> Hi all,
>>>> I have trained an (unpruned) 5-gram language model on a large corpus of 5 billion words, resulting in an ARPA-format file of roughly 300GB (is that a normal LM size for such big monolingual data?). This is obviously too big for running an SMT system.
>>>> I have read several papers whose systems use language models trained on similar monolingual corpora. Could you give me some advice on how to handle this and make it feasible to run SMT systems?
>>>> I appreciate your help a lot,
>>>> Best,
>>>> --
>>>> Best Regards,
>>>> Hoang Cuong
>>>> SMTNerd
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Raj Dabre.
>>>> Research Student,
>>>> Graduate School of Informatics,
>>>> Kyoto University.
>>>> CSE MTech, IITB., 2011-2014
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Best Regards,
>> Hoang Cuong
>> SMTNerd
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20141126/d17cfe13/attachment.htm
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 97, Issue 80
*********************************************