Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Moses tokenizer treats combining diaeresis inconsistently
(Tom Hoar)
2. Re: Moses tokenizer treats combining diaeresis inconsistently
(Kenneth Heafield)
3. how to compile with nplm library (Xiaoqiang Feng)
4. Re: how to compile with nplm library (Nikolay Bogoychev)
----------------------------------------------------------------------
Message: 1
Date: Tue, 30 Dec 2014 11:29:42 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Moses tokenizer treats combining
diaeresis inconsistently
To: moses-support@mit.edu
Message-ID: <54A22A36.3000602@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Japanese is another language that suffers under standard Unicode NFKC,
because the normalization applies changes that cannot be reversed.
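A minimal Python sketch of the irreversibility point, using only the standard unicodedata module. It is not from the thread: the sample characters and the helper names nfkc and nfc are chosen here purely for illustration.

```python
import unicodedata

def nfkc(text):
    """Compatibility normalization (the form proposed in the thread)."""
    return unicodedata.normalize("NFKC", text)

def nfc(text):
    """Canonical composition only, a lighter form of normalization."""
    return unicodedata.normalize("NFC", text)

# Illustrative samples (an assumption of this sketch, not data from the thread).
samples = {
    "halfwidth katakana KA": "\uff76",
    "fullwidth Latin capital A": "\uff21",
    "square 'meter' symbol": "\u334d",
}

for label, ch in samples.items():
    folded = " ".join(f"U+{ord(c):04X}" for c in nfkc(ch))
    print(f"{label}: U+{ord(ch):04X} -> {folded}")

# Halfwidth and ordinary katakana KA collapse to the same code point,
# so the original spelling cannot be recovered afterwards.
assert nfkc("\uff76") == nfkc("\u30ab") == "\u30ab"

# NFC applies only canonical mappings and leaves these characters alone.
assert nfc("\uff76") == "\uff76"
```

Because several distinct inputs map to one output, no inverse mapping exists once NFKC has been applied; NFC does not have this effect on the samples above.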
On 12/30/2014 04:40 AM, John D Burger wrote:
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.
>
> - JB
>
> On Dec 29, 2014, at 16:05 , Kenneth Heafield <moses@kheafield.com> wrote:
>
>> Dear Moses,
>>
>> The attached file, taken from line 2345157 of
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
>> , tokenizes differently on different machines.
>>
>> I'm running tokenizer.perl from head (481a07dc) with this perl:
>>
>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>> x86_64-linux-thread-multi
>> (with 25 registered patches, see perl -V for more detail)
>>
>> perl -V is attached from newer machines.
>>
>> The input is "Jürgen" with a specific encoding:
>>
>> uconv -f utf-8 -x any-name jur
>>
>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
>> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>>
>> So the umlaut is encoded as a normal "u" character followed by a
>> combining diaeresis marker. This encoding is legal, but it differs from
>> the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
>> DIAERESIS}.
>>
>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS} as a single character and recognizing it as part of the
>> IsAlnum class. Tokenizing on these machines outputs
>>
>> Jürgen
>>
>> Newer machines are treating them separately, recognizing \N{COMBINING
>> DIAERESIS} as a separate character that is not part of IsAlnum. The
>> Moses tokenizer then treats it as something to split off, yielding this
>> tokenization:
>>
>> Ju ̈ rgen
>>
>> I thought it might be locale-related but IsAlnum is supposed to be
>> locale-agnostic. I couldn't come up with environment variables that
>> made the new machines tokenize as a single word.
>>
>> Maybe this is a perl bug, but the result is that two different machines
>> running the same perl script produce different tokenization :-(.
>>
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
>>
>> Kenneth
>>
>> <jur.gz><perl_V.txt>_______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
------------------------------
Message: 2
Date: Tue, 30 Dec 2014 00:37:01 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Moses tokenizer treats combining
diaeresis inconsistently
To: moses-support@mit.edu
Message-ID: <54A239FD.20804@kheafield.com>
Content-Type: text/plain; charset=utf-8
So to summarize:
The main issue is that the Moses tokenizer operates at the character
rather than grapheme level on some versions of perl, treating combining
characters (which are arguably parts of words in many cases) as
non-alphanumeric and splitting them off.
Older versions of perl appear to be operating at the grapheme level or
internally normalizing for purposes of evaluating IsAlnum, making the
tokenizer inconsistent across machines.
Some graphemes, such as those in Vietnamese, do not have a
single-character codepoint, so NFKC is insufficient to mask this issue.
Tom doesn't want NFKC for Japanese (which the Moses tokenizer doesn't
support at the moment). I still think it makes sense for the Latin
alphabet. Also, there are lighter forms of canonicalization.
For once, my favorite Unicode FAQ is relevant:
http://www.unicode.org/faq/char_combmark.html#17
Kenneth
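The summary above can be reproduced in a small Python sketch. This is not tokenizer.perl's actual logic: split_non_word is a deliberately simplified stand-in that only puts spaces around characters it does not consider part of a word, and is_word_char is a hypothetical helper that additionally accepts combining marks (general categories Mn, Mc, Me). The "q" plus diaeresis case is an illustrative pair chosen here because it has no precomposed code point.

```python
import unicodedata

def nfc(text):
    """Canonical composition, a lighter normalization than NFKC."""
    return unicodedata.normalize("NFC", text)

def is_word_char(ch):
    # Hypothetical helper: alphanumeric, or a combining mark (Mn, Mc, Me).
    return ch.isalnum() or unicodedata.category(ch).startswith("M")

def split_non_word(text, keep_marks):
    """Simplified stand-in for a tokenizer: isolate non-word characters."""
    out = []
    for ch in text:
        wordlike = is_word_char(ch) if keep_marks else ch.isalnum()
        out.append(ch if wordlike or ch.isspace() else f" {ch} ")
    # Collapse runs of whitespace introduced by the padding above.
    return " ".join("".join(out).split())

# "u" followed by COMBINING DIAERESIS, as in the example from the thread.
decomposed = "Ju\u0308rgen"

# The combining mark on its own is category Mn and is not alphanumeric.
print(unicodedata.category("\u0308"), "\u0308".isalnum())

# Code-point-level check splits the mark off; a mark-aware check does not.
print(split_non_word(decomposed, keep_marks=False))
print(split_non_word(decomposed, keep_marks=True))

# NFC composes this pair into one code point, so the naive check then works.
print(len(decomposed), len(nfc(decomposed)))

# A base+mark pair with no precomposed form survives NFC unchanged,
# so normalization alone does not cover every grapheme.
print(len(nfc("q\u0308")))
```

Under these assumptions, accepting combining marks as word characters gives the same tokenization whether or not the input was normalized first, which is the property the thread is after.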
------------------------------
Message: 3
Date: Tue, 30 Dec 2014 14:28:31 +0800
From: Xiaoqiang Feng <feng.x.q.2006@gmail.com>
Subject: [Moses-support] how to compile with nplm library
To: moses-support@mit.edu
Message-ID:
<CADHOrU7e8Nt7q1XUHnEBRab6GC5CU+c_MFwe8-Jtg=fTspgmyA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
nplm is a toolkit for neural probabilistic language models. It can be
used in Moses both as a language model and as a bilingual LM (the neural
network joint model, ACL 2014). Both parts have been added to
mosesdecoder on GitHub.
To use nplm in Moses, you have to compile Moses by linking against
libnplm.a (generated by nplm).
Here is the problem: how do I compile Moses with libnplm.a? Do I need to
modify the Jamroot file, and if so, how?
Thanks,
Xiaoqiang Feng
------------------------------
Message: 4
Date: Tue, 30 Dec 2014 06:36:34 +0000
From: Nikolay Bogoychev <nheart@gmail.com>
Subject: Re: [Moses-support] how to compile with nplm library
To: feng.x.q.2006@gmail.com
Cc: moses-support@mit.edu
Message-ID:
<CAJzPUEwd4ETLTg1XJ9wXfvFW7AHU1ArSC+BLDXrioU4Xz2M9tA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hey,
First you need to check out and compile this fork of nplm:
https://github.com/rsennrich/nplm
Then you need to compile Moses with the nplm switch:
./bjam --with-nplm=path/to/nplm
Then you can see how to use it here:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc31
------------------------------
End of Moses-support Digest, Vol 98, Issue 66
*********************************************