Moses-support Digest, Vol 103, Issue 57

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Factored models and xml-input (Marcin Junczys-Dowmunt)
2. Re: Factored models and xml-input (Standa K)
3. Re: Factored models and xml-input (Standa K)
4. Re: When to truecase (Ondrej Bojar)
5. Re: When to truecase (Matthias Huck)


----------------------------------------------------------------------

Message: 1
Date: Thu, 21 May 2015 21:58:46 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Factored models and xml-input
To: moses-support@mit.edu
Message-ID: <555E38F6.5040002@amu.edu.pl>
Content-Type: text/plain; charset="windows-1252"

Just for testing, what happens if you remove the second phrase table and
add a language model for factor 1? Usually this kind of setup fails for
me with xml-input, regardless of whether I add factors to the XML option
or not.

On 21.05.2015 at 08:18, Hieu Hoang wrote:
> It works for me. My input and ini files are attached.
>
> On 21/05/2015 10:05, Standa K wrote:
>> Yes, I tried that as well, it gives the same error.

------------------------------

Message: 2
Date: Fri, 22 May 2015 08:14:18 +0000 (UTC)
From: Standa K <standa.kurik@gmail.com>
Subject: Re: [Moses-support] Factored models and xml-input
To: moses-support@mit.edu
Message-ID: <loom.20150522T101245-718@post.gmane.org>
Content-Type: text/plain; charset=us-ascii

I think you're onto something here, Marcin. If I remove all my language
models and leave just the translation model, it works for me.

> Just for testing, what happens if you
> remove the second phrase table and add a language model for factor
> 1? Usually this kind of setup fails for me with xml-input,
> regardless of whether I add factors to the XML option or not.




------------------------------

Message: 3
Date: Fri, 22 May 2015 08:54:53 +0000 (UTC)
From: Standa K <standa.kurik@gmail.com>
Subject: Re: [Moses-support] Factored models and xml-input
To: moses-support@mit.edu
Message-ID: <loom.20150522T105409-320@post.gmane.org>
Content-Type: text/plain; charset=us-ascii

Update: It still works with any number of language models for
factor 0. As soon as I add a single language model for factor 1, it fails.
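
One way to narrow this down might be to check that every target token in
the XML options carries as many factors as the models expect. The sketch
below is only an illustration, not part of Moses: the function names are
hypothetical, and it assumes `|` as the factor delimiter, `||` between
alternative translations, and a translation="..." attribute on the XML tag.

```python
import re

FACTOR_SEP = "|"        # assumed: Moses' default factor delimiter
ALTERNATIVE_SEP = "||"  # assumed: separator between alternative translations

# Assumed markup style: <tag translation="...">source words</tag>
TRANSLATION_ATTR = re.compile(r'translation="([^"]*)"')


def tokens_with_wrong_factor_count(text, expected):
    """Return the tokens in `text` that do not have exactly `expected` factors."""
    return [tok for tok in text.split()
            if len(tok.split(FACTOR_SEP)) != expected]


def check_xml_options(line, expected):
    """Check every translation="..." value found in one input line."""
    bad = []
    for value in TRANSLATION_ATTR.findall(line):
        for alternative in value.split(ALTERNATIVE_SEP):
            bad.extend(tokens_with_wrong_factor_count(alternative, expected))
    return bad
```

An empty list for a line means that all of its XML options have the
expected number of factors, so the problem would have to lie elsewhere.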



------------------------------

Message: 4
Date: Fri, 22 May 2015 11:20:16 +0200 (CEST)
From: Ondrej Bojar <bojar@ufal.mff.cuni.cz>
Subject: Re: [Moses-support] When to truecase
To: Lane Schwartz <dowobeha@gmail.com>
Cc: moses-support@mit.edu, Philipp Koehn <phi@jhu.edu>
Message-ID: <457215236.821631.1432286416676.JavaMail.zimbra@ufal.mff.cuni.cz>
Content-Type: text/plain; charset=utf-8

Hi,

We also ran an experiment on truecasing; see Table 1 in http://www.statmt.org/wmt13/pdf/WMT08.pdf

What works best for us is relying on the casing as guessed by the lemmatizer. (Our lemmatizer recognizes names as separate lemmas and keeps the lemma upcased; we then cast that casing onto the token in the sentence.)

The Moses recaser was the worst option; it was actually better to lowercase only the source side of the parallel data, i.e. to have the main search also pick the casing.
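
For readers comparing these options, the basic idea of frequency-based truecasing can be sketched as follows. This is a simplified illustration with hypothetical function names, not the Moses truecaser and not our lemmatizer-based method: it learns the most frequent casing of each word from non-sentence-initial positions and applies it only to the first token of a sentence.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased word to its most frequent observed casing.

    Sentence-initial tokens are skipped because their capitalization
    says little about the word's usual form.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {lower: forms.most_common(1)[0][0]
            for lower, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the first token; unknown words are left unchanged."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


corpus = ["The cat saw Peter .", "Peter saw the cat .", "We like the cat ."]
model = train_truecaser(corpus)
print(truecase("The dog barked .", model))
print(truecase("Peter left .", model))
```

Headlines and all-caps emphasis are deliberately ignored in this sketch.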

Cheers, O.

----- Original Message -----
> From: "Lane Schwartz" <dowobeha@gmail.com>
> To: "Philipp Koehn" <phi@jhu.edu>
> Cc: moses-support@mit.edu
> Sent: Wednesday, 20 May, 2015 20:50:41
> Subject: Re: [Moses-support] When to truecase

> Got it. So then, how was casing handled in the "mbr/mp" column? Was all of
> the data lowercased, then models trained, then recasing applied after
> decoding? Or something else?
>
> On Wed, May 20, 2015 at 1:30 PM, Philipp Koehn <phi@jhu.edu> wrote:
>
>> Hi,
>>
>> no, the changes are made incrementally.
>>
>> So the recased "baseline" is the previous "mbr/mp" column.
>>
>> -phi
>>
>> On Wed, May 20, 2015 at 2:01 PM, Lane Schwartz <dowobeha@gmail.com> wrote:
>>
>>> Philipp,
>>>
>>> In Table 2 of the WMT 2009 paper, are the "baseline" and "truecased"
>>> columns directly comparable? In other words, do the two columns indicate
>>> identical conditions other than a single variable (how and/or when casing
>>> was handled)?
>>>
>>> In the baseline condition, how and when was casing handled?
>>>
>>> Thanks,
>>> Lane
>>>
>>>
>>> On Wed, May 20, 2015 at 12:43 PM, Philipp Koehn <phi@jhu.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> see Section 2.2 in our WMT 2009 submission:
>>>> http://www.statmt.org/wmt09/pdf/WMT-0929.pdf
>>>>
>>>> One practical reason to avoid recasing is the need
>>>> for a second large cased language model.
>>>>
>>>> But there is of course also the practical issue of
>>>> having a unique truecasing scheme for each data
>>>> condition, handling of headlines, all-caps emphasis,
>>>> etc.
>>>>
>>>> It would be worth revisiting this issue under
>>>> different data conditions / language pairs. Both
>>>> options are readily available in EMS.
>>>>
>>>> Each of the two alternative methods could be
>>>> improved as well. See for instance:
>>>> http://www.aclweb.org/anthology/N06-1001
>>>>
>>>> -phi
>>>>
>>>>
>>>> On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz <dowobeha@gmail.com>
>>>> wrote:
>>>>
>>>>> Philipp (and others),
>>>>>
>>>>> I'm wondering what people's experience is regarding when truecasing is
>>>>> applied.
>>>>>
>>>>> One option is to truecase the training data, then train your TM and LM
>>>>> using that truecased data. Another option would be to lowercase the data,
>>>>> train TM and LM on the lowercased data, and then perform truecasing after
>>>>> decoding.
>>>>>
>>>>> I assume that the former gives better results, but the latter approach
>>>>> has an advantage in terms of extensibility (namely if you get more data and
>>>>> update your truecase model, you don't have to re-train all of your TMs and
>>>>> LMs).
>>>>>
>>>>> Does anyone have any insights they would care to share on this?
>>>>>
>>>>> Thanks,
>>>>> Lane
>>>
>>> --
>>> When a place gets crowded enough to require ID's, social collapse is not
>>> far away. It is time to go elsewhere. The best thing about space travel
>>> is that it made it possible to go elsewhere.
>>> -- R.A. Heinlein, "Time Enough For Love"

--
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo


------------------------------

Message: 5
Date: Fri, 22 May 2015 13:24:24 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] When to truecase
To: Ondrej Bojar <bojar@ufal.mff.cuni.cz>
Cc: moses-support@mit.edu, Philipp Koehn <phi@jhu.edu>
Message-ID: <1432297464.30904.772.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"

Hi,

If your system output is lowercase, you could try SRILM's `disambig`
tool for predicting the correct casing in a postprocessing step.

http://www.speech.sri.com/projects/srilm/manpages/disambig.1.html
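
As a rough sketch of the preparation step, the map file that `disambig`
reads could be built from cased text along these lines. The function
names are hypothetical, and the line format (a lowercase word followed
by its candidate cased forms, without probabilities) is my reading of
the man page, so please check it against the documentation above.

```python
from collections import defaultdict


def build_case_map(cased_sentences):
    """Collect, for each lowercased word, the cased forms seen in the text."""
    variants = defaultdict(set)
    for sentence in cased_sentences:
        for token in sentence.split():
            variants[token.lower()].add(token)
    return variants


def format_map_lines(variants):
    """One line per lowercase word: the word, then its candidate casings."""
    return [" ".join([lower] + sorted(variants[lower]))
            for lower in sorted(variants)]


corpus = ["The cat saw Peter .", "Peter saw the cat ."]
for line in format_map_lines(build_case_map(corpus)):
    print(line)
```

The cased language model then decides which candidate fits best in
context.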

Cheers,
Matthias


On Fri, 2015-05-22 at 11:20 +0200, Ondrej Bojar wrote:
> Hi,
>
> We also ran an experiment on truecasing; see Table 1 in
> http://www.statmt.org/wmt13/pdf/WMT08.pdf
>
> What works best for us is relying on the casing as guessed by the
> lemmatizer. (Our lemmatizer recognizes names as separate lemmas and
> keeps the lemma upcased; we then cast that casing onto the token in
> the sentence.)
>
> The Moses recaser was the worst option; it was actually better to
> lowercase only the source side of the parallel data, i.e. to have
> the main search also pick the casing.
>
> Cheers, O.



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

End of Moses-support Digest, Vol 103, Issue 57
**********************************************
