Moses-support Digest, Vol 103, Issue 50

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. When to truecase (Lane Schwartz)
2. Re: How to tell EMS to concatenate training corpora
(Rico Sennrich)
3. Re: How to tell EMS to concatenate training corpora
(Lane Schwartz)
4. Re: Factored models and xml-input (Hieu Hoang)
5. Re: When to truecase (Philipp Koehn)


----------------------------------------------------------------------

Message: 1
Date: Wed, 20 May 2015 11:31:22 -0500
From: Lane Schwartz <dowobeha@gmail.com>
Subject: [Moses-support] When to truecase
To: "moses-support@mit.edu" <moses-support@mit.edu>
Cc: Philipp Koehn <phi@jhu.edu>
Message-ID:
<CABv3vZkDv8WviB=J=6U22i=B3HiUonjFhx9pF4SXDypho6Z7SA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Philipp (and others),

I'm wondering what people's experience is regarding when truecasing is
applied.

One option is to truecase the training data, then train your TM and LM
using that truecased data. Another option would be to lowercase the data,
train TM and LM on the lowercased data, and then perform truecasing after
decoding.
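[For concreteness, here is a toy sketch of the most-frequent-casing approach that truecasing models typically use, in the spirit of Moses' train-truecaser.perl / truecase.perl. This is not Moses' actual code; the function names and the tiny corpus are illustrative.]

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # Count surface forms per lowercased token; the most frequent
    # casing wins. Sentence-initial tokens are skipped because their
    # capitalization is positional, not lexical.
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent.split()):
            if i == 0:
                continue
            counts[tok.lower()][tok] += 1
    return {lc: forms.most_common(1)[0][0] for lc, forms in counts.items()}

def truecase(sentence, model):
    # Map each token to its most frequent observed casing; tokens not
    # seen in training are simply lowercased.
    return " ".join(model.get(tok.lower(), tok.lower())
                    for tok in sentence.split())

corpus = [
    "He met Philipp in Edinburgh",
    "Later Philipp visited Edinburgh again",
]
model = train_truecaser(corpus)
print(truecase("PHILIPP LIKES EDINBURGH", model))  # Philipp likes Edinburgh
```

Under option one you would run truecase() over the training data before building TM and LM; under option two you would lowercase everything for training and apply the model only to decoder output.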

I assume that the former gives better results, but the latter approach has
an advantage in terms of extensibility (namely if you get more data and
update your truecase model, you don't have to re-train all of your TMs and
LMs).

Does anyone have any insights they would care to share on this?

Thanks,
Lane
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150520/226f5554/attachment-0001.htm

------------------------------

Message: 2
Date: Wed, 20 May 2015 16:40:43 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] How to tell EMS to concatenate training
corpora
To: moses-support@mit.edu
Message-ID: <loom.20150520T182058-254@post.gmane.org>
Content-Type: text/plain; charset=us-ascii

Lane Schwartz <dowobeha@...> writes:

>
> I have a number of distinct monolingual corpora. I've been training them
as separate LMs. I now want to run a variant where they are all concatenated
together, and then trained as a single LM. The EMS walkthrough says this
should be possible
(http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc19), but doesn't
give the requisite syntax. What is the EMS syntax to do this?
>
> Thanks,
> Lane

Hi Lane,

I tried to solve the problem quickly on Monday, but that didn't turn out
too well (see the next few commits fixing bugs with it). I was also unhappy
that I couldn't have multiple CONCATENATED-LMs on the same corpus, or define
which corpora to concatenate. This implementation solves that. Assume you
have these two LMs defined:

[LM:parallelA]
raw-corpus = /some/path

[LM:parallelB]
raw-corpus = /some/path
order = 5

We can have a second LM trained on the data of parallelA, but with different
settings, like this:

[LM:parallelA2]

stripped-corpus = [LM:parallelA:stripped-corpus]
exclude-from-interpolation = true
order = 6

[this was actually possible before, but I've added the property
'exclude-from-interpolation', which tells INTERPOLATED-LM to skip this LM.]

If you want an LM on concatenated data, you can define it like this:

[LM:parallelAB]

concatenate-files = [LM:{parallelA,parallelB}:stripped-corpus]
exclude-from-interpolation = true

Finally, you can also use 'custom-training' to train a language model that
train-model.perl doesn't know about, such as NPLM. You'll also have to define
how the model should be added to the moses.ini:

[LM:parallelAB]

stripped-corpus = [LM:parallelAB:stripped-corpus]
custom-training = "my_training_script.sh -order 5 -some_setting 8"
config-feature-line = "NPLM path=/some/path order=5 some-setting=8"
config-weight-line = "NPLM0= 0.1"





------------------------------

Message: 3
Date: Wed, 20 May 2015 11:46:16 -0500
From: Lane Schwartz <dowobeha@gmail.com>
Subject: Re: [Moses-support] How to tell EMS to concatenate training
corpora
To: Rico Sennrich <rico.sennrich@gmx.ch>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CABv3vZnV2ALSGfCHRhO=QScy_-AjPxSLBbY+05GvT8ACN00wKg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks, Rico! For the moment, I ended up just manually cat'ing the files,
but this should be very useful in the future.
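[The manual concatenation amounts to something like the following; the file names here are illustrative stand-ins for EMS's stripped corpus files.]

```shell
# Stand-in for EMS's concatenate-files step: join the per-corpus
# stripped files into one training corpus for the combined LM.
printf 'line from corpus A\n' > parallelA.stripped
printf 'line from corpus B\n' > parallelB.stripped
cat parallelA.stripped parallelB.stripped > parallelAB.stripped
```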

On Wed, May 20, 2015 at 11:40 AM, Rico Sennrich <rico.sennrich@gmx.ch>
wrote:

> Lane Schwartz <dowobeha@...> writes:
>
> >
> > I have a number of distinct monolingual corpora. I've been training them
> as separate LMs. I now want to run a variant where they are all
> concatenated
> together, and then trained as a single LM. The EMS walkthrough says this
> should be possible
> (http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc19), but doesn't
> give the requisite syntax. What is the EMS syntax to do this?
> >
> > Thanks,
> > Lane
>
> Hi Lane,
>
> I tried to solve the problem quickly on Monday, but that didn't turn out
> too well (see the next few commits fixing bugs with it). I was also unhappy
> that I couldn't have multiple CONCATENATED-LMs on the same corpus, or
> define
> which corpora to concatenate. This implementation solves that. Assume you
> have these two LMs defined:
>
> [LM:parallelA]
> raw-corpus = /some/path
>
> [LM:parallelB]
> raw-corpus = /some/path
> order = 5
>
> we can have a second LM trained on the data of parallelA, but with
> different
> settings, like this:
>
> [LM:parallelA2]
>
> stripped-corpus = [LM:parallelA:stripped-corpus]
> exclude-from-interpolation = true
> order = 6
>
> [this was actually possible before, but I've added the property
> 'exclude-from-interpolation', which tells INTERPOLATED-LM to skip this LM.]
>
> If you want an LM on concatenated data, you can define it like this:
>
> [LM:parallelAB]
>
> concatenate-files = [LM:{parallelA,parallelB}:stripped-corpus]
> exclude-from-interpolation = true
>
> finally, you can also use 'custom-training' to train a language model that
> train-model.perl doesn't know about, like NPLM. You'll also have to define
> how the model should be added to the moses.ini:
>
> [LM:parallelAB]
>
> stripped-corpus = [LM:parallelAB:stripped-corpus]
> custom-training = "my_training_script.sh -order 5 -some_setting 8"
> config-feature-line = "NPLM path=/some/path order=5 some-setting=8"
> config-weight-line = "NPLM0= 0.1"
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150520/7dd3ebc2/attachment-0001.htm

------------------------------

Message: 4
Date: Wed, 20 May 2015 21:03:12 +0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Factored models and xml-input
To: Standa K <standa.kurik@gmail.com>, moses-support@mit.edu
Message-ID: <555CBE50.5030007@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

As far as I know, it works. Are you getting errors? Can I have a look at
your moses.ini, your input, and the exact command you executed?

On 20/05/2015 13:19, Standa K wrote:
> Hello,
>
> may I ask if there is a plan to add the support for xml-input to factored
> models in foreseeable future or is this not a priority at all?
>
> If it is not, could someone perhaps point me to the places in code which
> need to be modified? I tried finding these myself, but got lost pretty
> soon.
>
> Thank you.
>
> Best regards
> Standa Kurik
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu



------------------------------

Message: 5
Date: Wed, 20 May 2015 13:43:09 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] When to truecase
To: Lane Schwartz <dowobeha@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDDDN8a0L9e78JdkFzom=8=5WXqfzytaPN5a9Jtuy6JSig@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

see Section 2.2 in our WMT 2009 submission:
http://www.statmt.org/wmt09/pdf/WMT-0929.pdf

One practical reason to avoid recasing is the need
for a second large cased language model.

But there is of course also the practical issue of
having a unique truecasing scheme for each data
condition, and of handling headlines, all-caps emphasis,
etc.

It would be worthwhile to revisit this issue under
different data conditions / language pairs. Both
options are readily available in EMS.

Each of the two alternative methods could be
improved as well. See for instance:
http://www.aclweb.org/anthology/N06-1001

-phi



On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <dowobeha@gmail.com> wrote:

> Philipp (and others),
>
> I'm wondering what people's experience is regarding when truecasing is
> applied.
>
> One option is to truecase the training data, then train your TM and LM
> using that truecased data. Another option would be to lowercase the data,
> train TM and LM on the lowercased data, and then perform truecasing after
> decoding.
>
> I assume that the former gives better results, but the latter approach has
> an advantage in terms of extensibility (namely if you get more data and
> update your truecase model, you don't have to re-train all of your TMs and
> LMs).
>
> Does anyone have any insights they would care to share on this?
>
> Thanks,
> Lane
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150520/d5b370fc/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 103, Issue 50
**********************************************
