Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Backticks tokenized as apostrophes (Judah Schvimer)
2. Re: Backticks tokenized as apostrophes (Philipp Koehn)
3. Re: Backticks tokenized as apostrophes (Judah Schvimer)
4. Re: Backticks tokenized as apostrophes (Philipp Koehn)
----------------------------------------------------------------------
Message: 1
Date: Wed, 30 Jul 2014 17:38:17 -0400
From: Judah Schvimer <judah.schvimer@mongodb.com>
Subject: [Moses-support] Backticks tokenized as apostrophes
To: moses-support <moses-support@mit.edu>
Message-ID:
<CALF9aB7XfT89Qu7TGDN7jJ+Ry72-Mv5V2=qVnTgoJHqFUb_F9w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
It seems that backticks (`) are being tokenized to apostrophes ('), so when
they get detokenized they show up as an apostrophe and not a backtick.
Additionally, "-no-escape" seems to turn backticks into apostrophes as
well. I think this is a bug in the tokenizer. Let me know if you think I'm
doing something wrong.
Thanks,
Judah
------------------------------
Message: 2
Date: Thu, 31 Jul 2014 08:52:25 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Judah Schvimer <judah.schvimer@mongodb.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDAaYsQWcmb9pLNDfjwQBnbBSwaS9LGTEu1zE3EATaePmA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
this is done deliberately:

    # turn ` into '
    $text =~ s/\`/\'/g;

    # turn '' into "
    $text =~ s/\'\'/ \" /g;

The motivation is to normalize corpora that used more ``creative'' ways
of quoting. You may want to remove these lines from the tokenizer or
create a switch for the script to optionally turn it off.
-phi
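
The two substitutions quoted in this reply, together with the suggested on/off switch, can be sketched in Python as follows. This is only an illustration of the behaviour, not code from the tokenizer: the function name `normalize_quotes` and the `enabled` parameter are hypothetical, and no such option is claimed to exist in the actual script.

```python
def normalize_quotes(text: str, enabled: bool = True) -> str:
    """Mimic the two quote-normalization rules quoted in the reply.

    `enabled` is a hypothetical switch standing in for the suggested
    option to turn the normalization off.
    """
    if not enabled:
        return text
    # turn ` into '
    text = text.replace("`", "'")
    # turn '' into " (padded with spaces, as in the quoted rule)
    text = text.replace("''", ' " ')
    return text


if __name__ == "__main__":
    print(normalize_quotes("run `ls` now"))                 # run 'ls' now
    print(normalize_quotes("run `ls` now", enabled=False))  # run `ls` now
```

Because the backtick is rewritten before detokenization ever sees it, the original character cannot be recovered afterwards; only skipping the rule keeps it intact.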
------------------------------
Message: 3
Date: Thu, 31 Jul 2014 09:09:22 -0400
From: Judah Schvimer <judah.schvimer@mongodb.com>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CALF9aB5_YvKTd_6e+ics2F2zLcyctdk6ZmH7UgzW5iCTD417VQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Thanks, that makes sense. One more question. If I use the -no-escape flag,
will that cause any problems for Moses, or does that still escape the
special characters that break Moses?
Judah
------------------------------
Message: 4
Date: Thu, 31 Jul 2014 09:32:58 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Judah Schvimer <judah.schvimer@mongodb.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDDCTtzDiS59j8CFtSAfitSz4n3Ym34w2ZCDZbeukiGC-A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
-no-escape turns off this:

    if (!$NO_ESCAPING)
    {
        $text =~ s/\&/\&amp;/g;   # escape escape
        $text =~ s/\|/\&#124;/g;  # factor separator
        $text =~ s/\</\&lt;/g;    # xml
        $text =~ s/\>/\&gt;/g;    # xml
        $text =~ s/\'/\&apos;/g;  # xml
        $text =~ s/\"/\&quot;/g;  # xml
        $text =~ s/\[/\&#91;/g;   # syntax non-terminal
        $text =~ s/\]/\&#93;/g;   # syntax non-terminal
    }

Not escaping the "|" in particular will cause trouble, because Moses reads
it as the factor separator.
So, you should not turn this off -- it is completely reversible by the
detokenizer anyway.
-phi
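
To illustrate why the escaping step described in this reply is reversible, here is a small Python sketch of an escape/unescape round trip. It assumes the replacements are the usual XML-style entities (`&amp;`, `&#124;`, `&lt;`, `&gt;`, `&apos;`, `&quot;`, `&#91;`, `&#93;`). The names `ESCAPES`, `escape_special`, and `unescape_special` are hypothetical; `unescape_special` only stands in for what the detokenizer does and is not its actual code.

```python
# Order matters: "&" is escaped first so that the entities produced by
# the later rules are not escaped a second time.
ESCAPES = [
    ("&", "&amp;"),    # escape escape
    ("|", "&#124;"),   # factor separator
    ("<", "&lt;"),     # xml
    (">", "&gt;"),     # xml
    ("'", "&apos;"),   # xml
    ('"', "&quot;"),   # xml
    ("[", "&#91;"),    # syntax non-terminal
    ("]", "&#93;"),    # syntax non-terminal
]


def escape_special(text: str) -> str:
    """Replace characters with special meaning by entities."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_special(text: str) -> str:
    """Undo escape_special; "&amp;" is restored last."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text


if __name__ == "__main__":
    sample = "a|b & <c> [d]"
    escaped = escape_special(sample)
    print(escaped)
    print(unescape_special(escaped) == sample)
```

Undoing the rules in reverse order is what keeps the round trip lossless: text that already contained a literal `&lt;` is escaped to `&amp;lt;` and comes back as `&lt;` rather than `<`.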
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 93, Issue 40
*********************************************