Moses-support Digest, Vol 93, Issue 30

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Barry Haddow)
2. Re: Phrase extraction with --IncludeSentenceId messes up
phrase table counts (Hieu Hoang)

----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Jul 2014 17:06:03 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>, Philipp Koehn
<pkoehn@inf.ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <53CFDD6B.7050605@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi Marcin

It appears that there is an --IgnoreSentenceId argument already, added
by Maria during last year's MTM

> [gna]bhaddow: git blame ScoreFeature.cpp | grep Ignore
> bff12363 (maria nadejde 2013-09-13 12:45:46 +0200 42) if (args[i] == "--IgnoreSentenceId") {

cheers - Barry

On 23/07/14 16:56, Marcin Junczys-Dowmunt wrote:
> So, adding "--IgnoreSentenceId" to "score" might fix that without
> messing up your stuff? I guess I can do that if you can't be bothered,
> Hieu.
>
> On 23.07.2014 17:53, Philipp Koehn wrote:
>> Hi,
>>
>> this is how extract is called:
>> extract corpus.en corpus.fr align extract 5 --IncludeSentenceId
>>
>> this is how score is called:
>> score extract lex.f2e phrase-table.half --GoodTuring
>> --DomainIndicator domains.5
>>
>> phrase table looks fine to me
>>
>> -phi
>>
>>
>> On Wed, Jul 23, 2014 at 11:42 AM, Marcin Junczys-Dowmunt
>> <junczys@amu.edu.pl> wrote:
>>
>> In a corpus with sentences sorted by release date this
>> could actually make sense :)
>>
>> On 23.07.2014 17:40, Barry Haddow wrote:
>>
>> Because calculating translation probabilities from sentence
>> ids is unexpectedly beneficial?
>>
>> On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>>
>>
>> So, how come this is not damaging the Edinburgh system?
>>
>> On 23.07.2014 17:32, Hieu Hoang wrote:
>>
>> ah ok.
>>
>> I thought it was just for debugging. I'm not gonna
>> change it since it's gonna involve months of debugging.
>>
>> Ideally, the extract format should be fixed like the
>> phrase-table, with the last column being key-value
>> pairs. Also, the way the key-value pairs are processed
>> should be automatic like in the decoder.
>>
>> marcin - sorry mate. you're on your own
>>
>> On 23/07/14 16:20, Philipp Koehn wrote:
>>
>> Hi,
>>
>> the sentence ID is being used for the domain indicator features.
>>
>> If you run phrase-extract's score with a domain file specified,
>> it then uses the sentence IDs to find out which domain the
>> phrase pair was found in.
>>
>> This has been a standard feature in Edinburgh's phrase-based system
>> for the last 1-2 years, so if you want to make changes, make
>> sure that this functionality still works (see [1381-5] for an example
>> with extract* files still in place).
>>
>> -phi
>>
>>
>> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt
>> <junczys@amu.edu.pl> wrote:
>>
>> Key-value format would actually be fine.
>>
>> On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>
>> I was planning to use it for a custom
>> feature function later.
>>
>> On 23.07.2014 13:11, Hieu Hoang wrote:
>>
>> i can change it so that the sentence
>> id is put into a
>> key-value field in the last column.
>>
>> what is the sentence id used for? is
>> it just for debugging
>> purposes?
>>
>>
>> On 23 July 2014 11:36, Marcin Junczys-Dowmunt
>> <junczys@amu.edu.pl> wrote:
>>
>> Hi,
>> I am using train-model.perl with
>>
>> --extract-options="--IncludeSentenceId"
>>
>> and it seems that the sentence id is somehow getting into
>> the phrase table as a count and later used for phrase
>> translation weight calculation, for instance the extract
>> (last column is the Id):
>>
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>
>> results in a phrase table entry like this:
>>
>> #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>
>> The count is equal to the sum of sentence ids, which of
>> course makes the phrase probability useless.
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>
>> -- Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>
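The arithmetic in the original report can be checked directly. The sketch below is an illustration in Python, not the code of the score tool: it uses the six extract lines quoted above, and the helper names (naive_count, occurrence_count) are made up for this example. Summing the last field as if it were a count reproduces the 1.14604e+07 in the phrase table entry, while counting lines gives the expected 6.

```python
# Sentence ids from the six extract lines quoted in the report.
SENTENCE_IDS = (1374618, 1374619, 1374620, 1374621, 1374622, 4587318)

# Rebuild the extract lines; the last field is the sentence id.
EXTRACT_LINES = [
    "#c the compound or process ||| #c verbindung oder verfahren"
    " ||| 0-0 2-1 3-2 4-3 ||| %d" % sid
    for sid in SENTENCE_IDS
]


def naive_count(lines):
    """Sum the last field as if it were a count (the suspected misreading)."""
    return sum(float(line.split(" ||| ")[-1]) for line in lines)


def occurrence_count(lines):
    """Count one occurrence per extract line, ignoring the sentence id."""
    return len(lines)


if __name__ == "__main__":
    print("%g" % naive_count(EXTRACT_LINES))   # 1.14604e+07
    print(occurrence_count(EXTRACT_LINES))     # 6
```

The sum is 11460418, which prints as 1.14604e+07 with six significant digits, matching the middle count in the phrase table entry.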

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

------------------------------

Message: 2
Date: Wed, 23 Jul 2014 17:11:21 +0100
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Phrase extraction with
--IncludeSentenceId messes up phrase table counts
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Barry Haddow
<bhaddow@staffmail.ed.ac.uk>
Message-ID:
<CAEKMkbhp0SV7tkLg2-Dq8fF-OYTGvLDJ62T1ST9bM-s_E=qs+Q@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

i was doing it, but mine was a more holistic approach and it would have
broken compatibility.

so i can't be bothered

On 23 July 2014 16:56, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:

> So, adding "--IgnoreSentenceId" to "score" might fix that without
> messing up your stuff? I guess I can do that if you can't be bothered,
> Hieu.
>
> On 23.07.2014 17:53, Philipp Koehn wrote:
>
> Hi,
>
> this is how extract is called:
> extract corpus.en corpus.fr align extract 5 --IncludeSentenceId
>
> this is how score is called:
> score extract lex.f2e phrase-table.half --GoodTuring --DomainIndicator
> domains.5
>
> phrase table looks fine to me
>
> -phi
>
>
> On Wed, Jul 23, 2014 at 11:42 AM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
>> In a corpus with sentences sorted by release date this could
>> actually make sense :)
>>
>> On 23.07.2014 17:40, Barry Haddow wrote:
>>
>>> Because calculating translation probabilities from sentence ids is
>>> unexpectedly beneficial?
>>>
>>> On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>>>
>>>>
>>>> So, how come this is not damaging the Edinburgh system?
>>>>
>>>> On 23.07.2014 17:32, Hieu Hoang wrote:
>>>>
>>>>> ah ok.
>>>>>
>>>>> I thought it was just for debugging. I'm not gonna change it since
>>>>> it's gonna involve months of debugging.
>>>>>
>>>>> Ideally, the extract format should be fixed like the phrase-table,
>>>>> with the last column being key-value pairs. Also, the way the key-value
>>>>> pairs are processed should be automatic like in the decoder.
>>>>>
>>>>> marcin - sorry mate. you're on your own
>>>>>
>>>>> On 23/07/14 16:20, Philipp Koehn wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the sentence ID is being used for the domain indicator features.
>>>>>>
>>>>>> If you run phrase-extract's score with a domain file specified,
>>>>>> it then uses the sentence IDs to find out which domain the
>>>>>> phrase pair was found in.
>>>>>>
>>>>>> This has been a standard feature in Edinburgh's phrase-based system
>>>>>> for the last 1-2 years, so if you want to make changes, make
>>>>>> sure that this functionality still works (see [1381-5] for an example
>>>>>> with extract* files still in place).
>>>>>>
>>>>>> -phi
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt
>>>>>> <junczys@amu.edu.pl> wrote:
>>>>>>
>>>>>> Key-value format would actually be fine.
>>>>>>
>>>>>> On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>>>>>
>>>>>>> I was planning to use it for a custom feature function later.
>>>>>>>
>>>>>>> On 23.07.2014 13:11, Hieu Hoang wrote:
>>>>>>>
>>>>>>>> i can change it so that the sentence id is put into a
>>>>>>>> key-value field in the last column.
>>>>>>>>
>>>>>>>> what is the sentence id used for? is it just for debugging
>>>>>>>> purposes?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 23 July 2014 11:36, Marcin Junczys-Dowmunt
>>>>>>>> <junczys@amu.edu.pl> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am using train-model.perl with
>>>>>>>>
>>>>>>>> --extract-options="--IncludeSentenceId"
>>>>>>>>
>>>>>>>> and it seems that the sentence id is somehow getting into
>>>>>>>> the phrase table as a count and later used for phrase
>>>>>>>> translation weight calculation, for instance the extract
>>>>>>>> (last column is the Id):
>>>>>>>>
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>>>>>>>
>>>>>>>> results in a phrase table entry like this:
>>>>>>>>
>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>>>>>>>
>>>>>>>> The count is equal to the sum of sentence ids, which of
>>>>>>>> course makes the phrase probability useless.
>>>>>>>>
>>>>>>>> -- Hieu Hoang
>>>>>>>> Research Associate
>>>>>>>> University of Edinburgh
>>>>>>>> http://www.hoang.co.uk/hieu
>
>
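For readers unfamiliar with the domain indicator features mentioned in the quoted thread, here is a minimal sketch of the lookup Philipp describes, where a sentence ID is mapped to the domain the phrase pair was found in. It is written in Python for illustration only: the domain spec layout (last sentence ID of each domain plus a name), the boundaries, and the domain names are assumptions made for this example, not the actual format read by score.

```python
import bisect
from collections import Counter

# Hypothetical domain spec, sorted by boundary:
# (last sentence id belonging to the domain, domain name).
DOMAINS = [
    (1000000, "news"),
    (3000000, "patents-old"),
    (5000000, "patents-new"),
]


def domain_of_sentence(sentence_id, domains=DOMAINS):
    """Return the first domain whose last sentence id is >= sentence_id."""
    boundaries = [last_id for last_id, _ in domains]
    index = bisect.bisect_left(boundaries, sentence_id)
    if index == len(domains):
        raise ValueError("sentence id %d is beyond the last domain" % sentence_id)
    return domains[index][1]


def domain_counts(sentence_ids, domains=DOMAINS):
    """Count how often a phrase pair was seen in each domain."""
    return Counter(domain_of_sentence(sid, domains) for sid in sentence_ids)


if __name__ == "__main__":
    # Sentence ids of the phrase pair from the original report.
    ids = [1374618, 1374619, 1374620, 1374621, 1374622, 4587318]
    print(dict(domain_counts(ids)))
```

Under these assumed boundaries, the sentence ID acts only as a key into the domain table and never contributes to the phrase pair's count.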

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 93, Issue 30
*********************************************
