Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Too large language models - how to handle that? (Hoang Cuong)
2. Re: Too large language models - how to handle that? (Raj Dabre)
3. How to train a tree-based model? (Steven Huang)
4. (no subject) (Daramola Olaife)
5. Re: Too large language models - how to handle that? (Tom Hoar)
----------------------------------------------------------------------
Message: 1
Date: Mon, 24 Nov 2014 11:07:12 +0100
From: Hoang Cuong <hoangcuong2011@gmail.com>
Subject: [Moses-support] Too large language models - how to handle
that?
To: moses-support@mit.edu
Message-ID:
<CAG1fz7c8XKDgMazrMh9gefu=ceML6cE=8dwWwpLdZi6EPYufvQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi all,
I have trained an (unpruned) 5-gram language model on a large corpus of 5
billion words, resulting in an ARPA-format file of roughly 300GB (is that a
normal LM size for such a large monolingual corpus?). This is obviously too
big to run an SMT system with.
I have read several papers whose systems use language models trained on
similarly large monolingual corpora. Could you give me some advice on how to
handle this, so that running an SMT system becomes feasible?
I appreciate your help a lot,
Best,
--
Best Regards,
Hoang Cuong
SMTNerd
------------------------------
Message: 2
Date: Mon, 24 Nov 2014 21:00:30 +0900
From: Raj Dabre <prajdabre@gmail.com>
Subject: Re: [Moses-support] Too large language models - how to handle
that?
To: Hoang Cuong <hoangcuong2011@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAB3gfjA=cBMOeD-0aV_mosXkDUVC6aWsCFYTyypkeZG7uOd3nw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hey Hoang,
You should binarize the ARPA file.
The README of the LM tool (KenLM, IRSTLM or SRILM) will tell you how.
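With KenLM, for example, the conversion might look roughly like this (just a
sketch: the model file names are placeholders, and the path to build_binary
depends on where KenLM/Moses was built):

    bin/build_binary trie model.arpa model.binary

The resulting binary can be memory-mapped by the decoder instead of being
parsed as text, and the trie structure also supports quantization (the -q and
-b options) to shrink the file further; see the KenLM documentation for
details.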
Regards.
On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <hoangcuong2011@gmail.com>
wrote:
> Hi all,
> I have trained an (unpruned) 5-gram language model on a large corpus of 5
> billion words, resulting in an ARPA-format file of roughly 300GB (is that a
> normal LM size for such a large monolingual corpus?). This is obviously too
> big to run an SMT system with.
> I have read several papers whose systems use language models trained on
> similarly large monolingual corpora. Could you give me some advice on how to
> handle this, so that running an SMT system becomes feasible?
> I appreciate your help a lot,
> Best,
> --
>
> Best Regards,
> Hoang Cuong
> SMTNerd
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
------------------------------
Message: 3
Date: Mon, 24 Nov 2014 20:41:45 +0800
From: Steven Huang <d98922047@ntu.edu.tw>
Subject: [Moses-support] How to train a tree-based model?
To: moses-support@mit.edu, ??? <farmer.tw@gmail.com>
Message-ID:
<CAG-iPUrbBXqLQYPiryve4XF0oj7iBSVE82aAvcCFasNkYZTDUw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
I am trying to do English-Chinese translation.
I've built a factored model successfully.
However, I am not quite clear about how to build a tree-based model after
reading the tutorial.
What I have in hand:
1. English-Chinese parallel corpus with 3 factors (surface, lemma and POS).
2. English-Chinese parallel corpus parsed with the Stanford Parser and
formatted as XML in the Moses format.
3. The training command for my factored model is shown below:
$MOSES_DIR/scripts/training/train-model.perl \
-mgiza -mgiza-cpus 20 \
--root-dir train \
--corpus $WORK_DIR/en-ch.clean \
--f en \
--e ch \
--alignment grow-diag-final-and \
--reordering msd-bidirectional-fe \
--lm 0:3:$LANG_MOD_DIR/en-ch-surface.arpa.ch:8 \
--lm 2:3:$LANG_MOD_DIR/en-ch-pos.arpa.ch:8 \
--translation-factors 1,2-1,2+0-0,2 \
--generation-factors 1,2-0+0,2-0 \
--reordering-factors 0,2-0,2 \
--decoding-steps t0,g0:t1,g1 \
--external-bin-dir $MOSES_DIR/tools > $WORK_DIR/training.out 2>&1
My questions are:
1. Can I use all 3 factors when training a tree-based model? If yes, what
should the parallel corpus look like? The XML format shown in the Moses
tutorial does not seem to accept any factor other than the surface form.
2. I want to use trees on both the source and target sides. Is it correct to
add the following arguments to train-model.perl (a combined sketch follows
this list)?
--ghkm \
--source-syntax \
--target-syntax \
--LeftBinarize \
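Purely as a sketch of how these flags would slot into the earlier command, not
a confirmed recipe: the root directory, the parsed-corpus name and the single
LM below are placeholders, the factored and reordering options are left out,
and whether this flag combination is actually valid is exactly what I am
asking.

$MOSES_DIR/scripts/training/train-model.perl \
    -mgiza -mgiza-cpus 20 \
    --root-dir train-syntax \
    --corpus $WORK_DIR/en-ch.parsed \
    --f en \
    --e ch \
    --alignment grow-diag-final-and \
    --ghkm \
    --source-syntax \
    --target-syntax \
    --LeftBinarize \
    --lm 0:3:$LANG_MOD_DIR/en-ch-surface.arpa.ch:8 \
    --external-bin-dir $MOSES_DIR/tools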
3. I noticed that after using the Stanford Parser to generate trees for the
parallel corpus, the resulting trees can be one-to-many (or many-to-one) for a
particular sentence pair, e.g. the source-language sentence is parsed into a
single tree while the target-language sentence is parsed into two trees. Will
this break the "parallel" property of the parallel corpus?
------------------------------
Message: 4
Date: Mon, 24 Nov 2014 13:50:36 +0100
From: Daramola Olaife <d3ripleo@gmail.com>
Subject: [Moses-support] (no subject)
To: moses-support@mit.edu
Message-ID:
<CAPxW3bHrGB77yojDyf5nXoygZcZRxGh_AofCTF0oJa3G9uuB0Q@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
After installing IRSTLM, I tried linking it to Moses with
./bjam --with-irstlm=/home/olaife/irstlm-5.80.06 -j8
but it gave me an error. The build log is attached.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log.gz
Type: application/x-gzip
Size: 2012 bytes
Desc: not available
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20141124/67bf13b1/attachment-0001.bin
------------------------------
Message: 5
Date: Mon, 24 Nov 2014 21:11:38 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Too large language models - how to handle
that?
To: moses-support@mit.edu
Message-ID: <54733C9A.9050000@precisiontranslationtools.com>
Content-Type: text/plain; charset="windows-1252"
After binarizing such a large ARPA file with KenLM, you'll need to
configure your moses.ini file to "lazily load the model using mmap."
This involves using lmodel-file code "9" vs code "8." More details here:
https://kheafield.com/code/kenlm/moses/
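As a rough sketch of that change (the path, factor and n-gram order below are
placeholders for your own setup; the fields are implementation code, factor,
order, path), the old-style lmodel-file entry would go from

[lmodel-file]
8 0 5 /path/to/lm.binary

to

[lmodel-file]
9 0 5 /path/to/lm.binary

so the decoder mmaps the binary file and pages it in on demand instead of
loading it fully into RAM.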
Performance improves significantly if you store the binarized file on an
SSD.
On 11/24/2014 07:00 PM, Raj Dabre wrote:
> Hey Hoang,
> You should binarize the ARPA file.
> The README of the LM tool (KenLM, IRSTLM or SRILM) will tell you how.
> Regards.
>
> On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <hoangcuong2011@gmail.com
> <mailto:hoangcuong2011@gmail.com>> wrote:
>
> Hi all,
> I have trained an (unpruned) 5-gram language model on a large
> corpus of 5 billion words, resulting in an ARPA-format file of
> roughly 300GB (is that a normal LM size for such a large monolingual
> corpus?). This is obviously too big to run an SMT system with.
> I have read several papers whose systems use language models
> trained on similarly large monolingual corpora. Could you give me
> some advice on how to handle this, so that running an SMT system
> becomes feasible?
> I appreciate your help a lot,
> Best,
> --
> Best Regards,
> Hoang Cuong
> SMTNerd
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
>
> --
> Raj Dabre.
> Research Student,
> Graduate School of Informatics,
> Kyoto University.
> CSE MTech, IITB., 2011-2014
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 97, Issue 73
*********************************************