Moses-support Digest, Vol 93, Issue 29

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Philipp Koehn)
2. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Marcin Junczys-Dowmunt)


----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Jul 2014 11:53:30 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Barry Haddow
<bhaddow@staffmail.ed.ac.uk>
Message-ID:
<CAAFADDCPggnav_5DiySyBSMgm+FqM7DJhC3YpnY5A+ao+HWMUg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

this is how extract is called:
extract corpus.en corpus.fr align extract 5 --IncludeSentenceId

this is how score is called:
score extract lex.f2e phrase-table.half --GoodTuring --DomainIndicator
domains.5

phrase table looks fine to me

-phi
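
A minimal sketch of the domain lookup that the sentence ID is described as serving in this thread. It is not the code in score: the layout of the domain specification (one pair of "last sentence ID, domain name" per domain), the domain names, and the function names are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical domain specification: (last sentence ID in the domain, name).
# The real contents and format of a file such as domains.5 are not shown
# in this thread; these values are made up.
DOMAIN_SPEC = [
    (1000000, "domain-a"),
    (2000000, "domain-b"),
    (5000000, "domain-c"),
]


def domain_of_sentence(sentence_id, spec=DOMAIN_SPEC):
    """Return the first domain whose last sentence ID is >= sentence_id."""
    boundaries = [last_id for last_id, _ in spec]
    idx = bisect_left(boundaries, sentence_id)
    if idx == len(spec):
        raise ValueError(f"sentence id {sentence_id} is outside all domains")
    return spec[idx][1]


def domains_of_phrase_pair(sentence_ids, spec=DOMAIN_SPEC):
    """Collect the domains in which a phrase pair was extracted."""
    return sorted({domain_of_sentence(s, spec) for s in sentence_ids})


if __name__ == "__main__":
    # Sentence IDs taken from the extract lines quoted in this thread.
    print(domains_of_phrase_pair([1374618, 1374619, 4587318]))
```

Under these assumed boundaries, the IDs are only ever used to look up a domain and never enter a count.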


On Wed, Jul 23, 2014 at 11:42 AM, Marcin Junczys-Dowmunt
<junczys@amu.edu.pl> wrote:

> In a corpus whose sentences are sorted by release date, this could
> actually make sense :)
>
> On 23.07.2014 17:40, Barry Haddow wrote:
>
>> Because calculating translation probabilities from sentence ids is
>> unexpectedly beneficial?
>>
>> On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>>
>>>
>>> So, how come this is not damaging the Edinburgh system?
>>>
>>> On 23.07.2014 17:32, Hieu Hoang wrote:
>>>
>>>> ah ok.
>>>>
>>>> I thought it was just for debugging. I'm not gonna change it since it's
>>>> gonna involve months of debugging.
>>>>
>>>> Ideally, the extract format should be fixed like the phrase-table, with
>>>> the last column being key-value pairs. Also, the way the key-value pairs
>>>> are processed should be automatic, like in the decoder.
>>>>
>>>> marcin - sorry mate. you're on your own
>>>>
>>>> On 23/07/14 16:20, Philipp Koehn wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> the sentence ID is being used for the domain indicator features.
>>>>>
>>>>> If you run phrase-extract's score and specify a domain file,
>>>>> it then uses the sentence IDs to find out which domain the
>>>>> phrase pair was found in.
>>>>>
>>>>> This has been a standard feature in Edinburgh's phrase-based system
>>>>> for the last 1-2 years, so if you want to make changes, make
>>>>> sure that this functionality still works (see [1381-5] for an example
>>>>> with extract* files still in place).
>>>>>
>>>>> -phi
>>>>>
>>>>>
>>>>> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt
>>>>> <junczys@amu.edu.pl> wrote:
>>>>>
>>>>> Key-value format would actually be fine.
>>>>>
>>>>> On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>>>>
>>>>>> I was planning to use it for a custom feature function later.
>>>>>>
>>>>>> On 23.07.2014 13:11, Hieu Hoang wrote:
>>>>>>
>>>>>>> i can change it so that the sentence id is put into a
>>>>>>> key-value field in the last column.
>>>>>>>
>>>>>>> what is the sentence id used for? is it just for debugging
>>>>>>> purposes?
>>>>>>>
>>>>>>>
>>>>>>> On 23 July 2014 11:36, Marcin Junczys-Dowmunt
>>>>>>> <junczys@amu.edu.pl> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> I am using train-model.perl with
>>>>>>>
>>>>>>> --extract-options="--IncludeSentenceId"
>>>>>>>
>>>>>>> and it seems that the sentence id is somehow getting into the
>>>>>>> phrase table as a count and is later used to calculate the phrase
>>>>>>> translation weights. For instance, this extract (the last column
>>>>>>> is the Id):
>>>>>>>
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>>>>>>
>>>>>>> results in a phrase table entry like this:
>>>>>>>
>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>>>>>>
>>>>>>> The count is equal to the sum of the sentence ids, which of
>>>>>>> course makes the phrase probability useless.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> Moses-support@mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Hieu Hoang
>>>>>>> Research Associate
>>>>>>> University of Edinburgh
>>>>>>> http://www.hoang.co.uk/hieu
>>>>>>>
>>>>>>>

------------------------------

Message: 2
Date: Wed, 23 Jul 2014 17:56:00 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Barry Haddow
<bhaddow@staffmail.ed.ac.uk>
Message-ID: <53CFDB10.70601@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"

So, adding "--IgnoreSentenceId" to "score" might fix that without
messing up your stuff? I guess I can do that if you can't be bothered,
Hieu.
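
The arithmetic behind the original report can be checked with a short sketch. It uses the six sentence IDs from the extract lines quoted in this thread. The two counting functions are hypothetical stand-ins, not the code in score: one sums the last field as if it were a count, the other counts one occurrence per extract line, which is what the proposed (not yet existing) "--IgnoreSentenceId" would presumably amount to.

```python
# Sentence IDs from the last column of the six extract lines in the report.
SENTENCE_IDS = [1374618, 1374619, 1374620, 1374621, 1374622, 4587318]

PHRASE_PAIR = (
    "#c the compound or process ||| #c verbindung oder verfahren "
    "||| 0-0 2-1 3-2 4-3"
)
EXTRACT_LINES = [f"{PHRASE_PAIR} ||| {sid}" for sid in SENTENCE_IDS]


def last_field(line):
    """Return the last '|||'-separated field of an extract line."""
    return line.split(" ||| ")[-1].strip()


def count_treating_id_as_count(lines):
    """Hypothetical reading: the last field is parsed as a count and summed."""
    return sum(float(last_field(line)) for line in lines)


def count_ignoring_id(lines):
    """Hypothetical alternative: each extract line counts exactly once."""
    return float(len(lines))


summed = count_treating_id_as_count(EXTRACT_LINES)
plain = count_ignoring_id(EXTRACT_LINES)

if __name__ == "__main__":
    print(f"{summed:.6g}")  # 1.14604e+07, the value seen in the phrase table
    print(plain)            # 6.0
```

The summed value matches the 1.14604e+07 in the reported phrase table entry, while the per-line count matches the 6 that appears next to it.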


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 93, Issue 29
*********************************************
