Moses-support Digest, Vol 99, Issue 33

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Legacy tokenizer.perl functionality. (Christian Hardmeier)
2. Re: Legacy tokenizer.perl functionality. (Tom Hoar)
3. Re: Legacy tokenizer.perl functionality. (Christian Hardmeier)
4. Re: Legacy tokenizer.perl functionality. (Hieu Hoang)
5. Re: Legacy tokenizer.perl functionality. (Christian Hardmeier)


----------------------------------------------------------------------

Message: 1
Date: Fri, 16 Jan 2015 11:36:19 +0100
From: Christian Hardmeier <ch@rax.ch>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: Tom Hoar <tahoar@precisiontranslationtools.com>,
moses-support@mit.edu
Message-ID: <527B8165-0038-411C-BB00-26B3829F534A@rax.ch>
Content-Type: text/plain; charset=us-ascii

I'd like to suggest that there should be a version number in the tokeniser that is incremented whenever the output changes, even if the change is minor and even if it's just a bugfix. Otherwise, when you pull a new version of Moses, you don't know if the output of tokenizer.perl is still compatible with your existing models. (Moving functionality from tokenizer.perl to normalize-punctuation.perl would count as a change from my point of view. I don't always use normalize-punctuation.perl.)
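
A minimal sketch of how such a compatibility check could look, written in Python rather than Perl for brevity. The constant TOKENIZER_OUTPUT_VERSION, the function names, and the idea of storing the version in a small JSON file next to the model are all assumptions made for illustration; none of them exist in tokenizer.perl.

```python
import json
import os
import tempfile

# Hypothetical constant: bumped on every change that alters tokenizer output,
# including bugfixes.
TOKENIZER_OUTPUT_VERSION = 3


def write_model_metadata(path, version=TOKENIZER_OUTPUT_VERSION):
    """Record the tokenizer output version that was used to train a model."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"tokenizer_output_version": version}, fh)


def is_compatible(path, current_version=TOKENIZER_OUTPUT_VERSION):
    """Return True if the recorded version matches the current tokenizer."""
    with open(path, encoding="utf-8") as fh:
        recorded = json.load(fh)["tokenizer_output_version"]
    return recorded == current_version
```

A training pipeline would call write_model_metadata() once, and a decoding pipeline would call is_compatible() before tokenizing new input for that model.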

/Christian

On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:

> It's probably a good idea to make this change. If you've done it
> already, please send me the updated scripts and I'll check it in. If
> not, I'll do it myself.
>
> There's hopefully a fast C++ tokenizer replacement coming soon.
> Highlighting these issues now is useful for understanding exactly how the
> tokenizer works/should work.
>
> On 15/01/15 01:52, Tom Hoar wrote:
>> This is a separate issue from the parallel "Tokenization problem" thread...
>>
>> tokenizer.perl has long had one line that transforms the grave accent (`)
>> to an apostrophe and another that transforms a double apostrophe ('') to a
>> single quote. I suspect these have been in the script since the
>> beginning. However, they "bit" me on a recent project. Easy
>> enough to work around.
>>
>> Still, I'm wondering. Do they still belong in the tokenizer.perl script?
>> Or should they be moved into one of the other scripts? The
>> normalize-punctuation.perl script seems to be a good candidate.
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
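
The separation Tom asks about can be sketched as follows, in Python for illustration only. The two rewrites become an optional step that the caller switches on, instead of being hard-wired into tokenization. The function names and the legacy_quotes flag are hypothetical, the toy whitespace tokenizer merely stands in for the real script, and the replacement for '' is left as a parameter because the thread does not quote the actual line from tokenizer.perl.

```python
def normalize_legacy_quotes(text, double_apostrophe_to="'"):
    """Apply the two legacy rewrites described in the thread.

    The replacement strings are assumptions based on that description,
    not copied from tokenizer.perl.
    """
    text = text.replace("`", "'")                    # grave accent -> apostrophe
    text = text.replace("''", double_apostrophe_to)  # '' -> one quote character
    return text


def tokenize(text, legacy_quotes=False):
    """Toy whitespace tokenizer with the legacy rewrites as an opt-in step."""
    if legacy_quotes:
        text = normalize_legacy_quotes(text)
    return text.split()
```

With this split, a project that needs grave accents preserved simply leaves legacy_quotes off, while existing pipelines can keep the old behaviour by turning it on.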




------------------------------

Message: 2
Date: Fri, 16 Jan 2015 17:51:15 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: moses-support@mit.edu
Message-ID: <54B8ED23.4060700@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

I agree with versioning. Could be added to the command line.

Also agree that this proposed change qualifies as a version change.

How do you propose managing the issue of output changes due to
command-line switches, like -no-escape?





------------------------------

Message: 3
Date: Fri, 16 Jan 2015 12:12:31 +0100
From: Christian Hardmeier <ch@rax.ch>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support support <moses-support@mit.edu>
Message-ID: <E5401A75-F73B-4F80-A58B-79AF69453853@rax.ch>
Content-Type: text/plain; charset=us-ascii


On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:

> I agree with versioning. Could be added to the command line.
>
> Also agree that this proposed change qualifies as a version change.
>
> How do you propose managing the issue of output changes due to
> command-line switches, like -no-escape?

Very good question. To be consistent, you'd probably have to increment the version number even if the change only applies when you use a certain command-line switch. But not if it doesn't affect the output, and maybe not if you just add a new command-line switch that is off by default. What do you think?
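
One possible way to handle switches, sketched in Python under stated assumptions: record the switches that were used together with the version number, so that a model is tied to both. The function name tokenizer_signature and the "v<N>" format are hypothetical; only the -no-escape switch comes from this thread.

```python
def tokenizer_signature(version, switches):
    """Build a stable string from the output version and the switches used."""
    # Sort and de-duplicate so that the order of switches on the command
    # line does not change the signature.
    parts = [f"v{version}"] + sorted(set(switches))
    return " ".join(parts)
```

For example, tokenizer_signature(3, ["-no-escape"]) yields "v3 -no-escape". A new switch that is off by default leaves the signatures of existing setups untouched, which matches the "maybe not" case above.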







------------------------------

Message: 4
Date: Fri, 16 Jan 2015 11:46:30 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Christian Hardmeier <ch@rax.ch>, Tom Hoar
<tahoar@precisiontranslationtools.com>
Cc: moses-support support <moses-support@mit.edu>
Message-ID: <54B8FA16.7040001@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

I think it's too difficult to police.

Another idea is to get the script to MD5 its own source code and the
nonbreaking-prefix files it uses.
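
A minimal sketch of that idea in Python, using the standard hashlib module. The function name md5_of_files is hypothetical, and which files to include is left to the caller.

```python
import hashlib


def md5_of_files(paths):
    """Return one MD5 hex digest covering the contents of all given files."""
    digest = hashlib.md5()
    for path in sorted(paths):  # sorted so the result is reproducible
        with open(path, "rb") as fh:
            # Read in chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

A script could call md5_of_files() on its own path plus the data files it loads and print the result on request. Note that any edit to those files changes the digest, including one that only touches a comment.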




------------------------------

Message: 5
Date: Fri, 16 Jan 2015 15:26:15 +0100
From: Christian Hardmeier <ch@rax.ch>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: Tom Hoar <tahoar@precisiontranslationtools.com>, moses-support
support <moses-support@mit.edu>
Message-ID: <8A5BEFA4-1D6F-4A5C-AE8C-9953FA53A648@rax.ch>
Content-Type: text/plain; charset=us-ascii


On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:

> i think it's too difficult to police.

You'd probably need a regression test that checks if the tokenised output is still the same so changes don't go unnoticed. But of course it's still some extra work.
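
Such a regression test could be sketched like this in Python. The function names are hypothetical, and the tokenizer command and the stored reference output are supplied by the caller, so nothing here assumes a particular Moses layout.

```python
import subprocess


def run_tokenizer(cmd, text):
    """Pipe text through a tokenizer command and return what it prints."""
    result = subprocess.run(cmd, input=text, capture_output=True,
                            encoding="utf-8", check=True)
    return result.stdout


def output_unchanged(cmd, input_text, expected_output):
    """Return True if the tokenizer still produces the stored reference output."""
    return run_tokenizer(cmd, input_text) == expected_output
```

The command might be something like ["perl", "tokenizer.perl", "-l", "en"], with the path adjusted to the local checkout. If output_unchanged() returns False after a commit, that is the signal to bump the version number.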

> Another idea is to get the script to MD5 its own source code and the nonbreaking-prefix files it uses.

That would definitely be better than nothing, even though it would raise false alarms from time to time.





------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 99, Issue 33
*********************************************
