Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: how to compile with nplm library (Rico Sennrich)
2. Re: Moses-support Digest, Vol 98, Issue 65 (Ihab Ramadan)
3. Re: Moses-support Digest, Vol 98, Issue 65 (Tom Hoar)
----------------------------------------------------------------------
Message: 1
Date: Tue, 30 Dec 2014 19:10:33 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] how to compile with nplm library
To: moses-support@mit.edu
Message-ID: <loom.20141230T200747-46@post.gmane.org>
Content-Type: text/plain; charset=us-ascii
Xiaoqiang Feng <feng.x.q.2006@...> writes:
>
> Hi,
> nplm is a toolkit for neural probabilistic language models. It can be
> used in Moses as a language model and as a bilingual LM (the neural network
> joint model, ACL 2014). Both parts have been updated in the GitHub
> mosesdecoder repository.
Hi,
Basic usage instructions for the monolingual version (NPLM) and the joint
model (BilingualLM) are at:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel
best wishes,
Rico
------------------------------
Message: 2
Date: Wed, 31 Dec 2014 14:05:56 +0200
From: "Ihab Ramadan" <i.ramadan@saudisoft.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 98, Issue 65
To: <moses-support@mit.edu>
Message-ID: <000b01d024f2$232383c0$696a8b40$@saudisoft.com>
Content-Type: text/plain; charset="us-ascii"
Thanks Tom for your reply,
I think I found where the problem is. When I use the tokenizer.perl script
to tokenize a string, it generates the output you mentioned:
" keep your notification 's payload under 5 kb ."
But if I use the tokenizer.perl script to process a file, the output is
" keep your notification ' s payload under 5 kb ."
which adds a space between ' and s, and this causes translation problems.
Can you please tell me why this happens?
Thanks
-----Original Message-----
From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
On Behalf Of moses-support-request@mit.edu
Sent: Tuesday, December 30, 2014 5:56 AM
To: moses-support@mit.edu
Subject: Moses-support Digest, Vol 98, Issue 65
Today's Topics:
1. Moses tokenizer treats combining diaeresis inconsistently
(Kenneth Heafield)
2. Re: Moses tokenizer treats combining diaeresis inconsistently
(John D Burger)
3. Re: "'" in tokenization (Tom Hoar)
----------------------------------------------------------------------
Message: 1
Date: Mon, 29 Dec 2014 16:05:51 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: [Moses-support] Moses tokenizer treats combining diaeresis
inconsistently
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <54A1C22F.60404@kheafield.com>
Content-Type: text/plain; charset="utf-8"
Dear Moses,
The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.
I'm running tokenizer.perl from head (481a07dc) with this perl:
This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more
detail)
perl -V is attached from newer machines.
The input is "Jürgen" with a specific encoding:
uconv -f utf-8 -x any-name jur
\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
So the umlaut is encoded as a normal "u" character followed by a combining
diaeresis marker. This encoding is legal, but it differs from the
single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
DIAERESIS}.
Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
as a single character and recognizing it as part of the IsAlnum class.
Tokenizing on these machines outputs
Jürgen
Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum. The Moses
tokenizer then treats it as something to split off, yielding this
tokenization:
Ju ̈ rgen
I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic. I couldn't come up with environment variables that made
the new machines tokenize as a single word.
Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.
This is also a reason to turn Unicode normalization on. If the tokenizer
did NFKC at the beginning, then the problem would go away.
Kenneth
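The split described above can be reproduced outside Perl. The following is a minimal Python sketch, not the tokenizer.perl logic: split_non_alnum is a hypothetical stand-in that isolates every character that is neither alphanumeric nor whitespace, and Python's str.isalnum is assumed to play the role of Perl's IsAlnum. It suggests why the decomposed form splits at the combining mark while an NFKC-normalized form stays in one piece.

```python
import unicodedata


def split_non_alnum(text: str) -> str:
    """Hypothetical stand-in for a tokenizer rule: put spaces around every
    character that is neither alphanumeric nor whitespace."""
    pieces = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")
    # Collapse runs of whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


# "u" followed by COMBINING DIAERESIS (U+0308), as in the attached file.
decomposed = "Ju\u0308rgen"
# NFKC composes the pair into LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
composed = unicodedata.normalize("NFKC", decomposed)

# The combining mark is a nonspacing mark (category Mn), not a letter.
mark_category = unicodedata.category("\u0308")

print(ascii(split_non_alnum(decomposed)))
print(ascii(split_non_alnum(composed)))
```

Normalizing before tokenizing makes the result independent of how the input happened to encode the umlaut, at least for sequences that have a precomposed form.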
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jur.gz
Type: application/gzip
Size: 33 bytes
Desc: not available
Url :
http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
-------------- next part --------------
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
Platform:
osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
-Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
-Dprivlib=/usr/lib64/perl5/5.18.2
-Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dsitelib=/usr/local/lib64/perl5/5.18.2
-Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
-Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
-Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2
-Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles
-Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
-Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db
-Dusethreads -DDEBUGGING=none
-Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64
/usr/lib64 -Dnoextensions=ODBM_File'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O3 -march=native -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
ccversion='', gccversion='4.7.3', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
libpth=/usr/local/lib64 /lib64 /usr/lib64
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
-lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.19.so, so=so, useshrplib=true,
libperl=libperl.so.5.18.2
gnulibc_version='2.19'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
-Wl,--as-needed'
Characteristics of this binary (from libperl):
Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
USE_REENTRANT_API
Locally applied patches:
gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
gentoo/EUMM_delete_packlist - Don't install .packlist or
perllocal.pod for perl or vendor
gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
for modules installed from CPAN.
gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
directories by default.
gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
libperl soname
gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
force -fstack-protector on everyone.
gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
@INC directories.
gentoo/mod_paths - Add /etc/perl to @INC
gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
patchlevel.h
gentoo/aix_soname - aix gcc detection and shared library soname
support
gentoo/opensolars_headers - Add headers for opensolaris
gentoo/cleanup-paths - Cleanup PATH and shrpenv
gentoo/usr_local - Remove /usr/local paths
gentoo/hints_hpux - Fix hpux hints
gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC
to link
gentoo/interix - Fix interix hints
fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
'Port' option
debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
nonexisting site dirs if a parent is writable
fixes/memoize_storable_nstore - [rt.cpan.org #77790]
Memoize::Storable: respect 'nstore' option not respected
fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
gracefully with a failed command
fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look
up the list of local patches at run time
fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
untaint version, if needed, in Module::Metadata
fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
IPC_CREAT in IPC-SysV documentation
fixes/freemint -
Built under linux
Compiled at Oct 29 2014 20:59:02
@INC:
/etc/perl
/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/local/lib64/perl5/5.18.2
/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/vendor_perl/5.18.2
/usr/local/lib64/perl5
/usr/lib64/perl5/vendor_perl
/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/5.18.2
.
------------------------------
Message: 2
Date: Mon, 29 Dec 2014 16:40:42 -0500
From: John D Burger <john@mitre.org>
Subject: Re: [Moses-support] Moses tokenizer treats combining
diaeresis inconsistently
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <FEBAB28D-774D-425D-81E0-6192CD4FEDB4@mitre.org>
Content-Type: text/plain; charset=utf-8
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
If I understand the situation correctly, this would only fix this particular
example and a few others like it. There are many base+combining grapheme
clusters in Unicode text which cannot be normalized to a single pre-composed
character. Vietnamese comes to mind.
- JB
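The limitation raised here can be checked with a short Python sketch. The helper name composes_to_single is hypothetical, and the q plus diaeresis pair is chosen only as an example of a sequence for which Unicode defines no precomposed character; it is not taken from the thread.

```python
import unicodedata


def composes_to_single(sequence: str) -> bool:
    """Return True if NFKC turns the base+combining sequence into one code point."""
    return len(unicodedata.normalize("NFKC", sequence)) == 1


# "u" + COMBINING DIAERESIS has a precomposed form (U+00FC).
u_umlaut = composes_to_single("u\u0308")

# "q" + COMBINING DIAERESIS has no precomposed form, so the combining
# mark survives normalization as a separate code point.
q_umlaut = composes_to_single("q\u0308")

print(u_umlaut, q_umlaut)
```

For sequences of the second kind, a tokenizer would still need to treat combining marks as part of the preceding character rather than rely on normalization alone.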
------------------------------
Message: 3
Date: Tue, 30 Dec 2014 10:54:18 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] "'" in tokenization
To: moses-support@mit.edu
Message-ID: <54A221EA.8080805@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"
The escaping is necessary because Moses reserves these characters for other
uses. When corpora are consistently prepared, the escaping has no effect on
translation results. It looks like you have not prepared your corpora
consistently. Note my results ('s) are different from yours (' s):
user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
keep your notification 's payload under 5 kb .
Go back and double-check how you prepare your training corpus and your
translation jobs.
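To illustrate the escaping mentioned above, here is a small Python sketch. The replacement table is an assumption meant to mirror the entity escapes applied by tokenizer.perl; the script itself is the authoritative list. escape_reserved is a hypothetical helper and not part of Moses.

```python
# Assumed escape table (reserved character -> entity); verify against
# tokenizer.perl before relying on it.
ESCAPES = str.maketrans({
    "&": "&amp;",
    "|": "&#124;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
    "[": "&#91;",
    "]": "&#93;",
})


def escape_reserved(tokenized: str) -> str:
    """Escape reserved characters in one pass, so that the '&' introduced
    by an entity is never escaped a second time."""
    return tokenized.translate(ESCAPES)


print(escape_reserved("keep your notification 's payload under 5 kb ."))
```

Whatever the exact table, the point is that training data and translation input should go through the same steps, so that the same token appears on both sides.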
On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>
> Dears,
>
> When I tokenize files, the tokenizer replaces the apostrophes with
> "'", which makes sense, but on the other hand it breaks the
> meaning and the order of the words completely, for example:
>
> Sentence before tokenization :
>
> Src : keep your notification's payload under 5 kb.
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> Sentence after tokenization :
>
> Src: keep your notification ' s payload under 5 kb .
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> If I translate "keep" without using tokenization, Moses generates
> ?????? which is correct, but after using tokenization Moses generates
> ????????? which means that the alignment is broken.
>
> Am I doing something wrong?
>
> Am I missing something, or is this natural behavior when I use
> tokenization?
>
> Thanks
>
> Best Regards
>
> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
> Fax +20233032036
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 98, Issue 65
*********************************************
------------------------------
Message: 3
Date: Wed, 31 Dec 2014 20:06:02 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 98, Issue 65
To: moses-support@mit.edu
Message-ID: <54A3F4BA.3080207@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Three ideas come to mind.
1) Make sure you are setting the `-l en` argument properly. Each
language isolates punctuation differently.
2) Are you sure your file uses UTF-8 character encoding?
3) Maybe the version of Perl you're using is sensitive to other things
like locale settings. Try setting the environment variable LC_ALL=C in
your terminal.
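The second idea can be checked mechanically. Below is a minimal Python sketch under stated assumptions: diagnose_line is a hypothetical helper, and the set of apostrophe look-alikes reflects one possible cause (a typographic apostrophe in the file where the shell test used an ASCII one) that has not been confirmed in this thread.

```python
import unicodedata

# Characters that look like an ASCII apostrophe but are different code points.
# This set is an assumption for illustration, not an exhaustive list.
APOSTROPHE_LOOKALIKES = {"\u2018", "\u2019", "\u02bc", "\u0060", "\u00b4"}


def diagnose_line(raw: bytes):
    """Return (is_valid_utf8, names_of_lookalikes) for one raw input line."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, e.g. a Windows-1252 file.
        return False, []
    found = [unicodedata.name(ch) for ch in text if ch in APOSTROPHE_LOOKALIKES]
    return True, found


# ASCII apostrophe: valid UTF-8, nothing suspicious.
print(diagnose_line(b"keep your notification's payload"))
# Typographic apostrophe encoded as UTF-8: valid, but flagged.
print(diagnose_line("keep your notification\u2019s payload".encode("utf-8")))
# Windows-1252 byte 0x92: not valid UTF-8.
print(diagnose_line(b"keep your notification\x92s payload"))
```

Running such a check over the file that tokenizes differently would show whether its bytes match the string typed on the command line.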
Happy New Year!
>
> -----Original Message-----
> From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
> On Behalf Of moses-support-request@mit.edu
> Sent: Tuesday, December 30, 2014 5:56 AM
> To: moses-support@mit.edu
> Subject: Moses-support Digest, Vol 98, Issue 65
>
> Send Moses-support mailing list submissions to
> moses-support@mit.edu
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.mit.edu/mailman/listinfo/moses-support
> or, via email, send a message with subject or body 'help' to
> moses-support-request@mit.edu
>
> You can reach the person managing the list at
> moses-support-owner@mit.edu
>
> When replying, please edit your Subject line so it is more specific than
> "Re: Contents of Moses-support digest..."
>
>
> Today's Topics:
>
> 1. Moses tokenizer treats combining diaeresis inconsistently
> (Kenneth Heafield)
> 2. Re: Moses tokenizer treats combining diaeresis inconsistently
> (John D Burger)
> 3. Re: "'" in tokenization (Tom Hoar)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 29 Dec 2014 16:05:51 -0500
> From: Kenneth Heafield <moses@kheafield.com>
> Subject: [Moses-support] Moses tokenizer treats combining diaeresis
> inconsistently
> To: "moses-support@mit.edu" <moses-support@mit.edu>
> Message-ID: <54A1C22F.60404@kheafield.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Moses,
>
> The attached file, taken from line 2345157 of
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shu
> ffled.gz
> , tokenizes differently on different machines.
>
> I'm running tokenizer.perl from head (481a07dc) with this perl:
>
> This is perl 5, version 18, subversion 2 (v5.18.2) built for
> x86_64-linux-thread-multi (with 25 registered patches, see perl -V for more
> detail)
>
> perl -V is attached from newer machines.
>
> The input is "J?rgen" with a specific encoding:
>
> uconv -f utf-8 -x any-name jur
>
> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
> LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>
> So the umlaut is encoded as a normal "u" character followed by a combining
> diaeresis marker. This encoding is legal, but it differs from the
> single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
> DIAERESIS}.
>
> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}
> is a single character and recognizing it as part of the IsAlnum class.
> Tokenizing on these machines outputs
>
> J?rgen
>
> Newer machines are treating them separately, recognizing \N{COMBINING
> DIAERESIS} as a separate character that is not part of IsAlnum. The Moses
> tokenizer then treats it as something to split off, yielding this
> tokenization:
>
> Ju ? rgen
>
> I thought it might be locale-related but IsAlnum is supposed to be
> locale-agnostic. I couldn't come up with environment variables that made
> the new machines tokenize as a single word.
>
> Maybe this is a perl bug, but the result is that two different machines
> running the same perl script produce different tokenization :-(.
>
> This is also a reason to turn Unicode normalization on. If the tokenizer
> did NFKC at the beginning, then the problem would go away.
>
> Kenneth
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: jur.gz
> Type: application/gzip
> Size: 33 bytes
> Desc: not available
> Url :
> http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9c
> e44a08/attachment-0001.bin
> -------------- next part --------------
> Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
>
> Platform:
> osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
> uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64
> intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
> config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread
> -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe
> -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
> -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
> -Dprivlib=/usr/lib64/perl5/5.18.2
> -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> -Dsitelib=/usr/local/lib64/perl5/5.18.2
> -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2
> -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
> -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
> -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
> -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
> -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2
> -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles
> -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
> -Dperladmin=root@loca!
> lhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db
> -Dusethreads -DDEBUGGING=none
> -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0
> 5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64
> /usr/lib64 -Dnoextensions=ODBM_File'
> hint=recommended, useposix=true, d_sigaction=define
> useithreads=define, usemultiplicity=define
> useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
> use64bitint=define, use64bitall=define, uselongdouble=undef
> usemymalloc=n, bincompat5005=undef
> Compiler:
> cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
> -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
> optimize='-O3 -march=native -pipe',
> cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
> ccversion='', gccversion='4.7.3', gccosandvers=''
> intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
> d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
> ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
> lseeksize=8
> alignbytes=8, prototype=define
> Linker and Libraries:
> ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
> libpth=/usr/local/lib64 /lib64 /usr/lib64
> libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
> -lgdbm_compat
> perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
> libc=/lib/libc-2.19.so, so=so, useshrplib=true,
> libperl=libperl.so.5.18.2
> gnulibc_version='2.19'
> Dynamic Linking:
> dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
> cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1
> -Wl,--as-needed'
>
>
> Characteristics of this binary (from libperl):
> Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
> PERL_DONT_CREATE_GVSV
> PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
> PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
> PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
> USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
> USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
> USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
> USE_REENTRANT_API
> Locally applied patches:
> gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
> cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
> gentoo/EUMM_delete_packlist - Don't install .packlist or
> perllocal.pod for perl or vendor
> gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
> gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
> for modules installed from CPAN.
> gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
> directories by default.
> gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set
> libperl soname
> gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't
> force -fstack-protector on everyone.
> gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
> @INC directories.
> gentoo/mod_paths - Add /etc/perl to @INC
> gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in
> patchlevel.h
> gentoo/aix_soname - aix gcc detection and shared library soname
> support
> gentoo/opensolars_headers - Add headers for opensolaris
> gentoo/cleanup-paths - Cleanup PATH and shrpenv
> gentoo/usr_local - Remove /usr/local paths
> gentoo/hints_hpux - Fix hpux hints
> gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC
> to link
> gentoo/interix - Fix interix hints
> fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
> 'Port' option
> debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
> nonexisting site dirs if a parent is writable
> fixes/memoize_storable_nstore - [rt.cpan.org #77790]
> Memoize::Storable: respect 'nstore' option not respected
> fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope
> gracefully with a failed command
> fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look
> up the list of local patches at run time
> fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576]
> untaint version, if needed, in Module::Metadata
> fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of
> IPC_CREAT in IPC-SysV documentation
> fixes/freemint -
> Built under linux
> Compiled at Oct 29 2014 20:59:02
> @INC:
> /etc/perl
> /usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> /usr/local/lib64/perl5/5.18.2
> /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
> /usr/lib64/perl5/vendor_perl/5.18.2
> /usr/local/lib64/perl5
> /usr/lib64/perl5/vendor_perl
> /usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
> /usr/lib64/perl5/5.18.2
> .
>
> ------------------------------
>
> Message: 2
> Date: Mon, 29 Dec 2014 16:40:42 -0500
> From: John D Burger <john@mitre.org>
> Subject: Re: [Moses-support] Moses tokenizer treats combining
> diaeresis inconsistently
> To: "moses-support@mit.edu" <moses-support@mit.edu>
> Message-ID: <FEBAB28D-774D-425D-81E0-6192CD4FEDB4@mitre.org>
> Content-Type: text/plain; charset=utf-8
>
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
> If I understand the situation correctly, this would only fix this particular
> example and a few others like it. There are many base+combining grapheme
> clusters in Unicode text which cannot be normalized to a single pre-composed
> character. Vietnamese comes to mind.
>
> - JB
>
> On Dec 29, 2014, at 16:05 , Kenneth Heafield <moses@kheafield.com> wrote:
>
>> Dear Moses,
>>
>> The attached file, taken from line 2345157 of
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.
>> en.shuffled.gz , tokenizes differently on different machines.
>>
>> I'm running tokenizer.perl from head (481a07dc) with this perl:
>>
>> This is perl 5, version 18, subversion 2 (v5.18.2) built for
>> x86_64-linux-thread-multi (with 25 registered patches, see perl -V for
>> more detail)
>>
>> perl -V is attached from newer machines.
>>
>> The input is "J?rgen" with a specific encoding:
>>
>> uconv -f utf-8 -x any-name jur
>>
>> \N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN
>> SMALL LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
>>
>> So the umlaut is encoded as a normal "u" character followed by a
>> combining diaeresis marker. This encoding is legal, but it differs
>> from the single-character canonical encoding of \N{LATIN SMALL LETTER
>> U WITH DIAERESIS}.
>>
>> Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
>> DIAERESIS} is a single character and recognizing it as part of the
>> IsAlnum class. Tokenizing on these machines outputs
>>
>> J?rgen
>>
>> Newer machines are treating them separately, recognizing \N{COMBINING
>> DIAERESIS} as a separate character that is not part of IsAlnum. The
>> Moses tokenizer then treats it as something to split off, yielding
>> this
>> tokenization:
>>
>> Ju ? rgen
>>
>> I thought it might be locale-related but IsAlnum is supposed to be
>> locale-agnostic. I couldn't come up with environment variables that
>> made the new machines tokenize as a single word.
>>
>> Maybe this is a perl bug, but the result is that two different
>> machines running the same perl script produce different tokenization :-(.
>>
>> This is also a reason to turn Unicode normalization on. If the
>> tokenizer did NFKC at the beginning, then the problem would go away.
>>
>> Kenneth
>>
>> <jur.gz><perl_V.txt>_______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 30 Dec 2014 10:54:18 +0700
> From: Tom Hoar <tahoar@precisiontranslationtools.com>
> Subject: Re: [Moses-support] "'" in tokenization
> To: moses-support@mit.edu
> Message-ID: <54A221EA.8080805@precisiontranslationtools.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> The escaping is necessary because Moses reserves these characters for other
> uses. When corpora are consistently prepared, the escaping has no effect on
> translation results. It looks like you have not prepared your corpora
> consistently. Note my results ('s) are different from yours (' s):
>
> user@host:~$ echo "keep your notification's payload under 5 kb." | tokenizer.perl -l en
> Tokenizer Version 1.1
> Language: en
> Number of threads: 1
> keep your notification 's payload under 5 kb .
>
> Go back and double-check how you prepare your training corpus and your
> translation jobs.
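
As a rough illustration of the escaping discussed above, here is a minimal Python sketch. The character-to-entity table is an assumption written from memory of the tokenizer's escaping step, and the function names are hypothetical, so check it against the tokenizer.perl version actually in use. The point is only that the same escaping has to be applied to the training corpus and to the text sent for translation.

```python
# Assumed mapping of reserved characters to entities. "&" must come first
# so that ampersands introduced by later replacements are not escaped again.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(text):
    """Replace each reserved character with its entity, in table order."""
    for raw, escaped in MOSES_ESCAPES:
        text = text.replace(raw, escaped)
    return text


def unescape_special_chars(text):
    """Undo the escaping; the table is walked in reverse so "&amp;" goes last."""
    for raw, escaped in reversed(MOSES_ESCAPES):
        text = text.replace(escaped, raw)
    return text


tokenized = "keep your notification 's payload under 5 kb ."
escaped = escape_special_chars(tokenized)
print(escaped)
print(unescape_special_chars(escaped) == tokenized)
```

If only one side of the pipeline is escaped, the decoder sees tokens at translation time that never occurred in training, which is one way inconsistent preparation shows up.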
>
>
> On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>> Dears,
>>
>> When I tokenize files, the tokenizer replaces apostrophes with
>> "&apos;", which makes sense, but on the other hand it destroys the
>> meaning and the word order entirely, for example:
>>
>> Sentence before tokenization :
>>
>> Src : keep your notification's payload under 5 kb.
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> Sentence after tokenization :
>>
>> Src: keep your notification ' s payload under 5 kb .
>>
>> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>>
>> If I translate "keep" without tokenization, Moses generates
>> ?????? which is correct, but after tokenization Moses generates
>> ????????? which means that the alignment is broken.
>>
>> Am I doing something wrong?
>>
>> Am I missing something, or is this just natural behavior when I use
>> tokenization?
>>
>> Thanks
>>
>> Best Regards
>>
>> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt
>> Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
>>
>>
>
> ------------------------------
>
>
> End of Moses-support Digest, Vol 98, Issue 65
> *********************************************
>
>
------------------------------
End of Moses-support Digest, Vol 98, Issue 67
*********************************************