Moses-support Digest, Vol 94, Issue 9

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Backticks tokenized as apostrophes (Judah Schvimer)
2. Re: Backticks tokenized as apostrophes (Philipp Koehn)
3. Re: Fwd: Question on Phrase Extraction implementation (liling tan)


----------------------------------------------------------------------

Message: 1
Date: Tue, 5 Aug 2014 14:35:29 -0400
From: Judah Schvimer <judah.schvimer@mongodb.com>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CALF9aB7-rD0=mmnGjREMY3+JsNUxNV4teULzfQZGZMdTvjuxZQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

I've been playing around with this and I noticed that the protected flag
only "protects" the first match of a regex on each line. Is there any way to
fix this so that it protects every occurrence?

Thanks,
Judah
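
The behaviour described above (only the first match on a line being
protected) can be contrasted with an all-matches approach. The following is
a minimal Python sketch, not the tokenizer's actual Perl code: the names
protect_all, restore, and the THISISPROTECTED marker are hypothetical, and
the marker is assumed never to occur in real input.

```python
import re

# Hypothetical marker; assumed never to appear in real input text.
MARK = "THISISPROTECTED"


def protect_all(text, patterns):
    """Swap every match of any pattern for a numbered placeholder."""
    if not patterns:
        return text, []
    protected = []

    def stash(match):
        protected.append(match.group(0))
        return f"{MARK}{len(protected) - 1}Z"

    # One combined pass, so later patterns never see earlier placeholders.
    combined = "|".join(f"(?:{p})" for p in patterns)
    return re.sub(combined, stash, text), protected


def restore(text, protected):
    """Put the original strings back in place of their placeholders."""
    return re.sub(MARK + r"(\d+)Z", lambda m: protected[int(m.group(1))], text)


line = "see http://a.example/x|y and http://b.example/[z] now"
url = r"https?://\S+"

# Replacing with count=1 reproduces the "first match only" symptom.
first_only = re.sub(url, "URL", line, count=1)

# Replacing without a count covers every occurrence on the line.
masked, saved = protect_all(line, [url])
```

Any tokenization step would run on masked, and restore(masked, saved)
would bring the protected spans back unchanged afterwards.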


On Thu, Jul 31, 2014 at 9:32 AM, Philipp Koehn <pkoehn@inf.ed.ac.uk> wrote:

> Hi,
>
> -no-escape turns off this:
>
> if (!$NO_ESCAPING)
> {
>     $text =~ s/\&/\&amp;/g;   # escape escape
>     $text =~ s/\|/\&#124;/g;  # factor separator
>     $text =~ s/\</\&lt;/g;    # xml
>     $text =~ s/\>/\&gt;/g;    # xml
>     $text =~ s/\'/\&apos;/g;  # xml
>     $text =~ s/\"/\&quot;/g;  # xml
>     $text =~ s/\[/\&#91;/g;   # syntax non-terminal
>     $text =~ s/\]/\&#93;/g;   # syntax non-terminal
> }
>
> In particular, leaving "|" unescaped will cause trouble, because it is
> the factor separator.
>
> So you should not turn this off. The escaping is completely reversible by
> the detokenizer anyway.
>
> -phi
>
>
>
> On Thu, Jul 31, 2014 at 9:09 AM, Judah Schvimer <
> judah.schvimer@mongodb.com> wrote:
>
>> Thanks, that makes sense. One more question: if I use the -no-escape flag,
>> will that cause any problems for Moses, or does it still escape the
>> special characters that break Moses?
>>
>> Judah
>>
>>
>> On Thu, Jul 31, 2014 at 8:52 AM, Philipp Koehn <pkoehn@inf.ed.ac.uk>
>> wrote:
>>
>>> Hi,
>>>
>>> this is done deliberately:
>>>
>>> # turn ` into '
>>> $text =~ s/\`/\'/g;
>>>
>>> # turn '' into "
>>> $text =~ s/\'\'/ \" /g;
>>>
>>> The motivation is to normalize corpora that used more ``creative'' ways
>>> of quoting. You may want to remove these lines from the tokenizer or
>>> add a switch to the script to optionally turn this off.
>>>
>>> -phi
>>>
>>>
>>> On Wed, Jul 30, 2014 at 5:38 PM, Judah Schvimer <
>>> judah.schvimer@mongodb.com> wrote:
>>>
>>>> It seems that backticks (`) are being tokenized as apostrophes ('), so
>>>> when they get detokenized they show up as apostrophes, not backticks.
>>>> Even with "-no-escape", backticks are still turned into apostrophes.
>>>> I think this is a bug in the tokenizer. Let me know if you think I'm
>>>> doing something wrong.
>>>>
>>>> Thanks,
>>>> Judah
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>
>>
>
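
The two points made in the thread above (escaping is reversible, backtick
normalization is not) can be illustrated with a short Python port of the
quoted Perl substitutions. This is a sketch under assumptions: the function
names are hypothetical, and unescape only mirrors the quoted escape table
in reverse; it is not the detokenizer's actual code.

```python
# Same pairs, in the same order, as the quoted Perl block.
ESCAPES = [
    ("&", "&amp;"),   # must run first, or it would re-escape the entities below
    ("|", "&#124;"),  # factor separator
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),   # syntax non-terminal
    ("]", "&#93;"),   # syntax non-terminal
]


def escape(text):
    """Apply the escapes in the order the tokenizer snippet lists them."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape(text):
    """Undo escape(); reverse order, so "&amp;" is decoded last."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text


def normalize_quotes(text):
    """Port of the two quote rules: ` becomes ', then '' becomes " ."""
    text = text.replace("`", "'")
    return text.replace("''", ' " ')


sample = "run `ls` | grep 'a&b' <x> [NP] &lt;"
round_trip = unescape(escape(sample))
```

Here round_trip equals sample, while normalize_quotes maps both "`ls`" and
"'ls'" to the same string, so the original backticks cannot be recovered
afterwards.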

------------------------------

Message: 2
Date: Tue, 5 Aug 2014 16:07:07 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Backticks tokenized as apostrophes
To: Judah Schvimer <judah.schvimer@mongodb.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDD1QHf406PffgYiN17R0hKHUdRBuseH8LVHROA1nuQhqQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

The fix was checked in a few hours ago.

-phi



------------------------------

Message: 3
Date: Tue, 5 Aug 2014 23:46:32 +0200
From: liling tan <alvations@gmail.com>
Subject: Re: [Moses-support] Fwd: Question on Phrase Extraction
implementation
To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJ+beeMdP9pJXq7XnOO2JG8MQB43K-Ua5-yHVBeb_93RAg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Hieu,

Thanks for the note on Thrax and the original .cpp code.

Fredik fixed the bug. Apparently there was an error in the original
textbook print, which was addressed in the errata, but I didn't notice it.
http://stackoverflow.com/questions/25109001/phrase-extraction-algorithm-for-statistical-machine-translation

Regards,
Liling
P.S. Thank you very much.




On Mon, Aug 4, 2014 at 11:38 AM, Hieu Hoang <Hieu.Hoang@ed.ac.uk> wrote:

> hi liling
>
>
> On 3 August 2014 22:02, liling tan <alvations@gmail.com> wrote:
>
>>
>> Dear Moses community,
>>
>> I have reimplemented the phrase extraction algorithm presented on
>> page 133 of Philipp Koehn's SMT book for NLTK in
>> https://github.com/alvations/nltk/blob/develop/nltk/align/phrase_based.py
>>
>> However, there is a bug I can't track down: I am not getting
>> the desired output shown in the alignment table. See
>> http://stackoverflow.com/questions/25109001/phrase-extraction-algorithm-for-statistical-machine-translation
>> for more detail.
>>
>>
>> *Can anyone see what went wrong with my implementation?*
>>
>> *Are there other Python-based implementations of the same algorithm?*
>>
> I don't know of a Python implementation. There is a Java implementation by
> Jonathan Weese called Thrax.
> http://cs.jhu.edu/~jonny/thrax/
>
>>
>> *Where in the Moses toolkit can the phrase extraction function be
>> found? What is the input of that function?*
>>
> phrase-extract/extract-main.cpp
> void ExtractTask::extract(SentenceAlignment &sentence), lines 350-447.
> You may want to base your code on my cleaned-up implementation:
>
> https://github.com/hieuhoang/mosesdecoder/tree/hieu/contrib/other-builds/extract-mixed-syntax
>
>
>> Regards,
>> Liling
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
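
For readers following the phrase extraction question above, here is a
minimal Python sketch of extracting phrase pairs that are consistent with a
word alignment. It is not the NLTK code, the Moses extract-main.cpp code, or
a transcription of the textbook pseudocode; the function name, the 0-based
index convention, and the max_len parameter are assumptions made for
illustration.

```python
def extract_phrases(src_len, tgt_len, alignment, max_len=7):
    """Return consistent phrase pairs as ((s_start, s_end), (t_start, t_end)).

    alignment is a set of (source_index, target_index) pairs, 0-based;
    spans are inclusive at both ends.
    """
    tgt_aligned = {t for _, t in alignment}
    phrases = set()
    for s_start in range(src_len):
        for s_end in range(s_start, min(src_len, s_start + max_len)):
            # Target positions linked to any word in the source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # a phrase pair needs at least one alignment point
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word in the span may be aligned to a
            # source word outside the source span.
            if any(t_start <= t <= t_end and not s_start <= s <= s_end
                   for s, t in alignment):
                continue
            # Also emit variants that absorb adjacent unaligned target words.
            ts = t_start
            while True:
                te = t_end
                while True:
                    if te - ts + 1 <= max_len:
                        phrases.add(((s_start, s_end), (ts, te)))
                    te += 1
                    if te >= tgt_len or te in tgt_aligned:
                        break
                ts -= 1
                if ts < 0 or ts in tgt_aligned:
                    break
    return phrases


# Two source words, three target words; target word 1 is unaligned.
pairs = extract_phrases(2, 3, {(0, 0), (1, 2)})
```

In this toy example the unaligned target word can attach to either
neighbouring phrase, which is why the span (0, 1) and the span (1, 2) both
appear on the target side.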

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 94, Issue 9
********************************************
