Moses-support Digest, Vol 106, Issue 30

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Domain adaptation (Vincent Nguyen)
2. Thread-safe Lattice Decoding (James H. Cross III)
3. Re: Domain adaptation (Rico Sennrich)
4. Re: Domain adaptation (Vincent Nguyen)
5. Re: Domain adaptation (Barry Haddow)
6. Re: Domain adaptation (Matthias Huck)

----------------------------------------------------------------------

Message: 1
Date: Fri, 14 Aug 2015 17:22:35 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: [Moses-support] Domain adaptation
To: moses-support <moses-support@mit.edu>
Message-ID: <55CE07BB.1050202@neuf.fr>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi,

I can't find any kind of tutorial on domain adaptation to follow.
I read this in the documentation:
The language model should be trained on a corpus that is suitable to the
domain. If the translation model is trained on a parallel corpus, then
the language model should be trained on the output side of that corpus,
although using additional training data is often beneficial.

And in the training section of the EMS, there is a subsection with
domain-features=....

What is the best practice?

Let's say, for instance, that I would like to specialize my model in
finance translation, with a specific corpus.

Should I train the language model on finance data?
Should I include a parallel corpus in the translation model training?
Should I tune with financial data sets?

Please help me to understand.
Vincent

------------------------------

Message: 2
Date: Fri, 14 Aug 2015 10:16:02 -0700
From: "James H. Cross III" <james.henry.cross.iii@gmail.com>
Subject: [Moses-support] Thread-safe Lattice Decoding
To: moses-support@mit.edu
Message-ID:
<CACdKcAE3H7DNT6S2dmXJWjXLwRdTvfFFVJhBMnvF4pVA6hfp5A@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi:

The website documentation notes that lattice input may not work with
multi-threaded decoding. Is there a reason to believe this is not
likely to work? To the extent that each thread processes a single
input example (lattice instead of sentence), it seems like the
shared-resource issues would be no different than with sentence input.

If it is indeed not supported, can you give me some idea of what might
be necessary to extend it to this use case?

Thanks!
James

------------------------------

Message: 3
Date: Fri, 14 Aug 2015 19:52:47 +0100
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] Domain adaptation
To: moses-support@mit.edu
Message-ID: <55CE38FF.7010601@gmx.ch>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Vincent,

This section describes some domain adaptation methods that are
implemented in Moses: http://www.statmt.org/moses/?n=Advanced.Domain

It is incomplete (focusing on parallel data and the translation model),
and does not recommend best practices.

In general, my recommendation is to use in-domain data whenever possible
(for the language model, translation model, and held-out in-domain data
for tuning/testing). Out-of-domain data can help, but also hurt your
system: the effect depends on your domains and the amount of data you
have for each. Data selection, instance weighting, model interpolation
and domain features are different methods that give you the benefits of
out-of-domain data, but reduce its harmful effects, and are often better
than just concatenating all the data you have.
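
To make one of these methods concrete, here is a minimal sketch of model
interpolation. It is an illustration only and not Moses code: it assumes toy
add-one-smoothed unigram language models, and the function names, corpora,
and grid search are all hypothetical. The interpolation weight is chosen to
minimise perplexity on held-out in-domain text.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def perplexity(tokens, in_model, out_model, lam):
    """Perplexity of tokens under lam * in_model + (1 - lam) * out_model."""
    log_sum = sum(
        math.log(lam * in_model[w] + (1 - lam) * out_model[w]) for w in tokens
    )
    return math.exp(-log_sum / len(tokens))


def best_weight(heldout, in_model, out_model, steps=10):
    """Grid-search the weight in [0, 1] with the lowest held-out perplexity."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda lam: perplexity(heldout, in_model, out_model, lam))


# Toy corpora (hypothetical): a small in-domain set, a larger out-of-domain
# set, and held-out in-domain text used only to pick the weight.
in_domain = "the fund holds equities and bonds".split()
out_domain = "the cat sat on the mat and the dog slept".split()
heldout = "the fund holds bonds".split()
vocab = set(in_domain) | set(out_domain) | set(heldout)

p_in = unigram_model(in_domain, vocab)
p_out = unigram_model(out_domain, vocab)
lam = best_weight(heldout, p_in, p_out)
print(lam, perplexity(heldout, p_in, p_out, lam))
```

Because the grid includes 0 and 1, the selected mixture is never worse on the
held-out text than either model alone. Real systems interpolate n-gram
language models or phrase tables rather than unigram tables.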

best wishes,
Rico

------------------------------

Message: 4
Date: Fri, 14 Aug 2015 21:20:58 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] Domain adaptation
To: moses-support@mit.edu
Message-ID: <55CE3F9A.6000804@neuf.fr>
Content-Type: text/plain; charset=windows-1252; format=flowed

I had read this section; it deals with translation model combination,
but says little about the language model or tuning.

For instance: suppose I want to make sure that the specific expression
"titres" is translated as "equities" from French to English.

Do these two words specifically have to be in the monolingual corpus of
the language model, or in the parallel corpus?

If two "parallel expressions" appear in the tuning set but are present
neither in the parallel corpora nor in the monolingual LM data, can that
still trigger a good translation?

I am not sure I am being clear...

Thanks again for your help.

------------------------------

Message: 5
Date: Fri, 14 Aug 2015 20:37:50 +0100
From: Barry Haddow <bhaddow@inf.ed.ac.uk>
Subject: Re: [Moses-support] Domain adaptation
To: Vincent Nguyen <vnguyen@neuf.fr>, moses-support@mit.edu
Message-ID: <55CE438E.1030005@inf.ed.ac.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed

You could try this tutorial:

http://www.statmt.org/mtma15/uploads/mtma15-domain-adaptation.pdf

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

------------------------------

Message: 6
Date: Fri, 14 Aug 2015 21:14:20 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Domain adaptation
To: Barry Haddow <bhaddow@inf.ed.ac.uk>
Cc: moses-support@mit.edu
Message-ID: <1439583260.2143.17.camel@inf.ed.ac.uk>
Content-Type: text/plain; charset="UTF-8"

Hi,

I found this older tutorial to be very useful as well:

"Practical Domain Adaptation" by Marcello Federico and Nicola Bertoldi
http://www.mt-archive.info/10/AMTA-2012-Bertoldi-ppt.pdf
(The document formatting is unfortunately slightly messed up.)

SMT research survey wiki:
http://www.statmt.org/survey/Topic/DomainAdaptation

Cheers,
Matthias

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 106, Issue 30
**********************************************
