Moses-support Digest, Vol 93, Issue 28

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Hieu Hoang)
2. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Barry Haddow)
3. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Marcin Junczys-Dowmunt)

----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Jul 2014 16:40:34 +0100
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAEKMkbgFRL9niC=_XPTN_w69J8LDbogY_puZ9R0KZ9srCccn5A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

it's likely we're using fractional counts, so there's an extra column

On 23 July 2014 16:34, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:

>
> So, how come this is not damaging the Edinburgh system?
>
> On 23.07.2014 17:32, Hieu Hoang wrote:
>
> ah ok.
>
> I thought it was just for debugging. I'm not gonna change it since it's
> gonna involve months of debugging.
>
> Ideally, the extract format should be fixed like the phrase-table, with
> the last column being key-value pairs. Also, the way the key-value pairs
> are processed should be automatic, like in the decoder.
>
> marcin - sorry mate. you're on your own
>
> On 23/07/14 16:20, Philipp Koehn wrote:
>
> Hi,
>
> the sentence ID is being used for the domain indicator features.
>
> If you run phrase-extract's score with a domain file specified,
> it uses the sentence IDs to find out which domain the
> phrase pair was found in.
>
> This has been a standard feature in Edinburgh's phrase-based system
> for the last 1-2 years, so if you want to make changes, make
> sure that this functionality still works (see [1381-5] for an example
> with extract* files still in place).
>
> -phi
>
>
> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
>> Key-value format would actually be fine.
>>
>> On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>
>> I was planning to use it for a custom feature function later.
>>
>> On 23.07.2014 13:11, Hieu Hoang wrote:
>>
>> i can change it so that the sentence id is put into a key-value field
>> in the last column.
>>
>> what is the sentence id used for? is it just for debugging purposes?
>>
>>
>> On 23 July 2014 11:36, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:
>>
>>> Hi,
>>> I am using train-model.perl with
>>>
>>> --extract-options="--IncludeSentenceId"
>>>
>>> and it seems that the sentence id is somehow getting into the phrase
>>> table as a count and later used for phrase translation weight
>>> calculation, for instance the extract (last column is the Id):
>>>
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>>
>>> results in a phrase table entry like this:
>>>
>>> #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>>
>>> The count is equal to the sum of the sentence ids, which of course makes
>>> the phrase probability useless.
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>>
>>
>> --
>> Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
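
The arithmetic behind the report quoted in this message can be checked with a short sketch. It rebuilds the six extract lines from the original post and accumulates the trailing column in two ways: as a fractional count (the behaviour the thread suspects) and as one occurrence per line. The function name `accumulate_counts` and its `last_column_is_count` flag are hypothetical, made up for this illustration; this is not the code of phrase-extract's score.

```python
from collections import defaultdict

# Values copied from the extract lines in the original report.
SOURCE = "#c the compound or process"
TARGET = "#c verbindung oder verfahren"
ALIGNMENT = "0-0 2-1 3-2 4-3"
SENTENCE_IDS = [1374618, 1374619, 1374620, 1374621, 1374622, 4587318]

# Rebuild the extract lines: source ||| target ||| alignment ||| sentence id
EXTRACT_LINES = [
    f"{SOURCE} ||| {TARGET} ||| {ALIGNMENT} ||| {sid}" for sid in SENTENCE_IDS
]


def accumulate_counts(lines, last_column_is_count):
    """Sum a weight per phrase pair over all extract lines.

    Hypothetical helper: if last_column_is_count is True, the trailing
    column is read as a fractional count; otherwise every line counts once.
    """
    totals = defaultdict(float)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        source, target, last = fields[0], fields[1], fields[-1]
        weight = float(last) if last_column_is_count else 1.0
        totals[(source, target)] += weight
    return dict(totals)


key = (SOURCE, TARGET)
as_fractional = accumulate_counts(EXTRACT_LINES, last_column_is_count=True)
as_occurrences = accumulate_counts(EXTRACT_LINES, last_column_is_count=False)

print(f"{as_fractional[key]:g}")   # sum of the sentence ids
print(f"{as_occurrences[key]:g}")  # number of extract lines
```

Under these assumptions the first figure is 11460418, which prints as 1.14604e+07 and matches the middle count in the phrase table entry, while the second is 6, the number of times the pair was actually extracted.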

------------------------------

Message: 2
Date: Wed, 23 Jul 2014 16:40:53 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>, Hieu Hoang
<hieuhoang@gmail.com>, Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <53CFD785.7080601@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Because calculating translation probabilities from sentence ids is
unexpectedly beneficial?

On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>
> So, how come this is not damaging the Edinburgh system?

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

------------------------------

Message: 3
Date: Wed, 23 Jul 2014 17:42:59 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>, Hieu Hoang
<hieuhoang@gmail.com>, Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <53CFD803.2080806@amu.edu.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

In a corpus with sentences sorted by release date this could
actually make sense :)

On 23.07.2014 17:40, Barry Haddow wrote:
> Because calculating translation probabilities from sentence ids is
> unexpectedly beneficial?

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 93, Issue 28
*********************************************
