Moses-support Digest, Vol 93, Issue 40

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Backticks tokenized as apostrophes (Judah Schvimer)
2. Re: Backticks tokenized as apostrophes (Philipp Koehn)
3. Re: Backticks tokenized as apostrophes (Judah Schvimer)
4. Re: Backticks tokenized as apostrophes (Philipp Koehn)


----------------------------------------------------------------------

Message: 1
Date: Wed, 30 Jul 2014 17:38:17 -0400
From: Judah Schvimer <judah.schvimer@mongodb.com>
Subject: [Moses-support] Backticks tokenized as apostrophes
To: moses-support <moses-support@mit.edu>
Message-ID:
<CALF9aB7XfT89Qu7TGDN7jJ+Ry72-Mv5V2=qVnTgoJHqFUb_F9w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

It seems that backticks (`) are being tokenized as apostrophes ('), so when
they are detokenized they show up as apostrophes and not backticks.
Additionally, backticks still seem to turn into apostrophes when I pass
"-no-escape". I think this is a bug in the tokenizer. Let me know if you
think I'm doing something wrong.

Thanks,
Judah

------------------------------

Message: 2
Date: Thu, 31 Jul 2014 08:52:25 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Judah Schvimer <judah.schvimer@mongodb.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDAaYsQWcmb9pLNDfjwQBnbBSwaS9LGTEu1zE3EATaePmA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

this is done deliberately:

# turn ` into '
$text =~ s/\`/\'/g;

# turn '' into "
$text =~ s/\'\'/ \" /g;

The motivation is to normalize corpora that use more ``creative'' ways
of quoting. If you need to keep backticks, you may want to remove these
lines from the tokenizer or add a switch to the script that optionally
turns this off.
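
To make the effect concrete, here is a rough Python sketch of the two
substitutions above. It is not the tokenizer's Perl code: the function name
normalize_quotes and the "normalize" switch are made up for illustration,
and the switch only shows what an optional off-switch could look like.

```python
def normalize_quotes(text: str, normalize: bool = True) -> str:
    """Re-create the two quote substitutions quoted above.

    `normalize` is a hypothetical switch (an assumption for this sketch);
    nothing in this thread says the tokenizer script offers such an option.
    """
    if not normalize:
        return text
    text = text.replace("`", "'")     # turn ` into '
    text = text.replace("''", ' " ')  # turn '' into " (padded with spaces)
    return text


if __name__ == "__main__":
    print(normalize_quotes("run `ls` here"))
    print(normalize_quotes("run `ls` here", normalize=False))
```

With normalization on, the backticks around "ls" come out as apostrophes, so
the detokenizer has no way of knowing they were ever backticks. With the
hypothetical switch off, the text passes through unchanged.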

-phi



------------------------------

Message: 3
Date: Thu, 31 Jul 2014 09:09:22 -0400
From: Judah Schvimer <judah.schvimer@mongodb.com>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CALF9aB5_YvKTd_6e+ics2F2zLcyctdk6ZmH7UgzW5iCTD417VQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks, that makes sense. One more question: if I use the -no-escape flag,
will that cause any problems for Moses, or are the special characters that
break Moses still escaped?

Judah



------------------------------

Message: 4
Date: Thu, 31 Jul 2014 09:32:58 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Judah Schvimer <judah.schvimer@mongodb.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDDCTtzDiS59j8CFtSAfitSz4n3Ym34w2ZCDZbeukiGC-A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

-no-escape turns this off:

if (!$NO_ESCAPING)
{
  $text =~ s/\&/\&amp;/g;   # escape escape
  $text =~ s/\|/\&#124;/g;  # factor separator
  $text =~ s/\</\&lt;/g;    # xml
  $text =~ s/\>/\&gt;/g;    # xml
  $text =~ s/\'/\&apos;/g;  # xml
  $text =~ s/\"/\&quot;/g;  # xml
  $text =~ s/\[/\&#91;/g;   # syntax non-terminal
  $text =~ s/\]/\&#93;/g;   # syntax non-terminal
}

Leaving "|" unescaped in particular will cause trouble, since Moses treats
it as the factor separator.

So you should not turn this off -- it is completely reversible by the
detokenizer anyway.
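
As a rough illustration of why the escaping is reversible, here is a Python
sketch of the same substitutions together with an inverse. The names ESCAPES,
escape and unescape are made up for this example, and the inverse is only an
assumption about what the detokenizer has to do, not its actual code.

```python
# (character, entity) pairs in the order the Perl above applies them.
# "&" has to go first so the "&" inside later entities is not escaped again.
ESCAPES = [
    ("&", "&amp;"),   # escape escape
    ("|", "&#124;"),  # factor separator
    ("<", "&lt;"),    # xml
    (">", "&gt;"),    # xml
    ("'", "&apos;"),  # xml
    ('"', "&quot;"),  # xml
    ("[", "&#91;"),   # syntax non-terminal
    ("]", "&#93;"),   # syntax non-terminal
]


def escape(text: str, no_escape: bool = False) -> str:
    """Apply the escapes unless no_escape is set (mirrors !$NO_ESCAPING)."""
    if no_escape:
        return text
    for char, entity in ESCAPES:
        text = text.replace(char, entity)
    return text


def unescape(text: str) -> str:
    """Undo escape(); entities are restored in reverse order, "&amp;" last."""
    for char, entity in reversed(ESCAPES):
        text = text.replace(entity, char)
    return text


if __name__ == "__main__":
    sample = "a|b & <c> [d]"
    print(escape(sample))
    print(unescape(escape(sample)) == sample)
```

Note that the backtick is not in this list, so in this sketch escaping (or
skipping it) has no effect on backticks; their conversion to apostrophes is
a separate step.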

-phi




------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 93, Issue 40
*********************************************
