Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Moses tokenizer treats combining diaeresis inconsistently
(Kenneth Heafield)
2. Re: Moses tokenizer treats combining diaeresis inconsistently
(John D Burger)
3. Re: "'" in tokenization (Tom Hoar)
----------------------------------------------------------------------
Message: 1
Date: Mon, 29 Dec 2014 16:05:51 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: [Moses-support] Moses tokenizer treats combining diaeresis
inconsistently
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <54A1C22F.60404@kheafield.com>
Content-Type: text/plain; charset="utf-8"
Dear Moses,
The attached file, taken from line 2345157 of
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
, tokenizes differently on different machines.
I'm running tokenizer.perl from head (481a07dc) with this perl:
This is perl 5, version 18, subversion 2 (v5.18.2) built for
x86_64-linux-thread-multi
(with 25 registered patches, see perl -V for more detail)
The perl -V output from one of the newer machines is attached.
The input is "Jürgen" with a specific encoding:
uconv -f utf-8 -x any-name jur
\N{LATIN CAPITAL LETTER J}\N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS}\N{LATIN SMALL LETTER R}\N{LATIN SMALL LETTER G}\N{LATIN SMALL
LETTER E}\N{LATIN SMALL LETTER N}\N{<control-000A>}
So the umlaut is encoded as a normal "u" character followed by a
combining diaeresis marker. This encoding is legal, but it differs from
the single-character canonical encoding of \N{LATIN SMALL LETTER U WITH
DIAERESIS}.
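To make the difference between the two encodings concrete, here is a small Python sketch (Python rather than Perl, purely for illustration) that lists code point names much like the uconv call above. The variable and function names are illustrative assumptions and do not come from the tokenizer.

```python
import unicodedata

# Decomposed form, as in the attached file: "u" followed by COMBINING DIAERESIS.
decomposed = "Ju\u0308rgen"
# Precomposed form: a single LATIN SMALL LETTER U WITH DIAERESIS.
precomposed = "J\u00fcrgen"

def code_point_names(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

# Names of the code points that make up the umlaut in each form.
print(code_point_names(decomposed)[1:3])
print(code_point_names(precomposed)[1:2])

# The two strings render identically but differ in length and compare unequal.
print(len(decomposed), len(precomposed), decomposed == precomposed)
```

Any tool that works on code points rather than on rendered glyphs will see these as two different strings.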
Older machines are treating \N{LATIN SMALL LETTER U}\N{COMBINING
DIAERESIS} as a single character and recognizing it as part of the
IsAlnum class. Tokenizing on these machines outputs
Jürgen
Newer machines are treating them separately, recognizing \N{COMBINING
DIAERESIS} as a separate character that is not part of IsAlnum. The
Moses tokenizer then treats it as something to split off, yielding this
tokenization:
Ju ̈ rgen
I thought it might be locale-related but IsAlnum is supposed to be
locale-agnostic. I couldn't come up with environment variables that
made the new machines tokenize as a single word.
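The split is easy to reproduce outside Perl. The sketch below is not the tokenizer's code: `split_non_alnum` is a hypothetical stand-in for a rule that pads every character that is neither alphanumeric nor whitespace with spaces. Python's `str.isalnum` plays the role of IsAlnum here and, like the behavior reported for the newer machines, does not count the combining mark (general category Mn) as alphanumeric.

```python
import unicodedata

def split_non_alnum(text):
    """Pad every non-alphanumeric, non-space character with spaces.

    Hypothetical illustration of a tokenizer rule, not tokenizer.perl itself.
    """
    pieces = []
    for ch in text:
        if ch.isalnum() or ch.isspace():
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")
    # Collapse the runs of spaces introduced by the padding.
    return " ".join("".join(pieces).split())

combining = "\u0308"
print(unicodedata.category(combining))  # general category of the mark
print(combining.isalnum())              # whether it counts as alphanumeric

print(split_non_alnum("Ju\u0308rgen"))  # decomposed input is split apart
print(split_non_alnum("J\u00fcrgen"))   # precomposed input stays whole
```

Whether a given Perl build agrees with this classification is exactly the open question in this thread; the sketch only shows why a mark outside the alphanumeric class ends up as its own token.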
Maybe this is a perl bug, but the result is that two different machines
running the same perl script produce different tokenization :-(.
This is also a reason to turn Unicode normalization on. If the
tokenizer did NFKC at the beginning, then the problem would go away.
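As a rough illustration of that suggestion, the following Python sketch normalizes a line before anything else looks at it. `normalize_line` is a hypothetical helper name; the only library call is `unicodedata.normalize` from the standard library.

```python
import unicodedata

def normalize_line(line):
    """Apply NFKC so base + combining mark pairs are composed where possible."""
    return unicodedata.normalize("NFKC", line)

line = "Ju\u0308rgen\n"          # decomposed input, as in the attached file
normalized = normalize_line(line)

# After normalization the umlaut is a single precomposed code point ...
print(normalized == "J\u00fcrgen\n")
# ... and every character of the word is alphanumeric.
print(all(ch.isalnum() for ch in normalized.strip()))
```

Note that NFKC also rewrites compatibility characters (for example, it expands ligatures), so NFC would be the narrower choice if only composition is wanted.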
Kenneth
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jur.gz
Type: application/gzip
Size: 33 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20141229/9ce44a08/attachment-0001.bin
-------------- next part --------------
Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
Platform:
osname=linux, osvers=3.16.1, archname=x86_64-linux-thread-multi
uname='linux lister 3.16.1 #2 smp sun aug 31 21:04:00 edt 2014 x86_64 intel(r) core(tm) i5-2430m cpu @ 2.40ghz genuineintel gnulinux '
config_args='-des -Duseshrplib -Darchname=x86_64-linux-thread -Dcc=x86_64-pc-linux-gnu-gcc -Doptimize=-O3 -march=native -pipe -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib64/perl5/5.18.2 -Darchlib=/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dsitelib=/usr/local/lib64/perl5/5.18.2 -Dsitearch=/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi -Dvendorlib=/usr/lib64/perl5/vendor_perl/5.18.2 -Dvendorarch=/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.18.2 -Dlocincpth=/usr/include -Dglibpth=/lib64 /usr/lib64 -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -Dusethreads -DDEBUGGING=none -Dinc_version_list=5.18.0/x86_64-linux-thread-multi 5.18.0 5.18.1/x86_64-linux-thread-multi 5.18.1 -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Dnoextensions=ODBM_File'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=define, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='x86_64-pc-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O3 -march=native -pipe',
cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe'
ccversion='', gccversion='4.7.3', gccosandvers=''
intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='x86_64-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
libpth=/usr/local/lib64 /lib64 /usr/lib64
libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
libc=/lib/libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.18.2
gnulibc_version='2.19'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O3 -march=native -pipe -Wl,-O1 -Wl,--as-needed'
Characteristics of this binary (from libperl):
Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
PERL_DONT_CREATE_GVSV
PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
PERL_PRESERVE_IVUV PERL_SAWAMPERSAND USE_64_BIT_ALL
USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
USE_REENTRANT_API
Locally applied patches:
gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054 cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
gentoo/EUMM_delete_packlist - Don't install .packlist or perllocal.pod for perl or vendor
gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site directories by default.
gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 Set libperl soname
gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 Don't force -fstack-protector on everyone.
gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing @INC directories.
gentoo/mod_paths - Add /etc/perl to @INC
gentoo/patchlevel - List packaged patches for perl-5.18.2-r2(#2) in patchlevel.h
gentoo/aix_soname - aix gcc detection and shared library soname support
gentoo/opensolars_headers - Add headers for opensolaris
gentoo/cleanup-paths - Cleanup PATH and shrpenv
gentoo/usr_local - Remove /usr/local paths
gentoo/hints_hpux - Fix hpux hints
gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 darwin: Use $CC to link
gentoo/interix - Fix interix hints
fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP 'Port' option
debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
fixes/memoize_storable_nstore - [rt.cpan.org #77790] Memoize::Storable: respect 'nstore' option not respected
fixes/net_ftp_failed_command - [rt.cpan.org #37700] Net::FTP: cope gracefully with a failed command
fixes/perlbug-patchlist - [3541c11] [perl #118433] Make perlbug look up the list of local patches at run time
fixes/module_metadata_taint_fix - [bff978f] [rt.cpan.org #88576] untaint version, if needed, in Module::Metadata
fixes/IPC-SysV-spelling - [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation
fixes/freemint -
Built under linux
Compiled at Oct 29 2014 20:59:02
@INC:
/etc/perl
/usr/local/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/local/lib64/perl5/5.18.2
/usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/vendor_perl/5.18.2
/usr/local/lib64/perl5
/usr/lib64/perl5/vendor_perl
/usr/lib64/perl5/5.18.2/x86_64-linux-thread-multi
/usr/lib64/perl5/5.18.2
.
------------------------------
Message: 2
Date: Mon, 29 Dec 2014 16:40:42 -0500
From: John D Burger <john@mitre.org>
Subject: Re: [Moses-support] Moses tokenizer treats combining
diaeresis inconsistently
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <FEBAB28D-774D-425D-81E0-6192CD4FEDB4@mitre.org>
Content-Type: text/plain; charset=utf-8
> This is also a reason to turn Unicode normalization on. If the
> tokenizer did NFKC at the beginning, then the problem would go away.
If I understand the situation correctly, this would only fix this particular example and a few others like it. There are many base+combining grapheme clusters in Unicode text which cannot be normalized to a single pre-composed character. Vietnamese comes to mind.
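A quick Python sketch of this point. The example cluster is my own choice, not one from the thread: a Cyrillic "а" followed by a combining acute accent (a stress mark), for which Unicode has no precomposed character. The helper `is_word_char` is a hypothetical alternative rule that keeps combining marks attached to their base letter instead of relying on normalization.

```python
import unicodedata

def is_word_char(ch):
    """Treat letters, digits and combining marks (categories M*) as word-internal.

    Hypothetical rule for illustration; not what tokenizer.perl does.
    """
    return ch.isalnum() or unicodedata.category(ch).startswith("M")

# CYRILLIC SMALL LETTER A followed by COMBINING ACUTE ACCENT.
cluster = "\u0430\u0301"
normalized = unicodedata.normalize("NFKC", cluster)

# Normalization cannot compose this pair: it is still two code points.
print(len(normalized))
# The combining mark alone is still not alphanumeric ...
print(normalized[1].isalnum())
# ... but a mark-aware rule keeps the whole cluster inside the word.
print(all(is_word_char(ch) for ch in normalized))
```

So normalization and mark-aware character classes are complementary: the first removes avoidable differences between equivalent encodings, the second handles clusters that have no single-character form.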
- JB
------------------------------
Message: 3
Date: Tue, 30 Dec 2014 10:54:18 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] "'" in tokenization
To: moses-support@mit.edu
Message-ID: <54A221EA.8080805@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"
The escaping is necessary because Moses reserves these characters for
other uses. When corpora are consistently prepared, the escaping has no
effect on translation results. It looks like you have not prepared your
corpora consistently. Note my results ('s) are different from yours
(' s):
user@host:~$ echo "keep your notification's payload under 5 kb." |
tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
keep your notification 's payload under 5 kb .
Go back and double-check how you prepare your training corpus and your
translation jobs.
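For readers unsure what "escaping" means here, the following Python sketch mimics the idea. The replacement table reflects my understanding of the escaping step in tokenizer.perl and should be treated as an assumption to verify against your own copy of the script; `escape_moses` and `unescape_moses` are hypothetical helper names, not part of Moses.

```python
# Assumed replacement table (verify against your copy of tokenizer.perl).
# "&" comes first so that the entities produced later are not re-escaped.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]

def escape_moses(text):
    """Replace reserved characters with entities, in table order."""
    for raw, entity in MOSES_ESCAPES:
        text = text.replace(raw, entity)
    return text

def unescape_moses(text):
    """Undo escape_moses; '&amp;' is restored last to avoid double decoding."""
    for raw, entity in reversed(MOSES_ESCAPES):
        text = text.replace(entity, raw)
    return text

escaped = escape_moses("notification's payload")
print(escaped)
print(unescape_moses(escaped))
```

The point of the advice above is that whatever escaping and tokenization is applied to the training corpus must also be applied, identically, to the text sent for translation; otherwise the model sees tokens at translation time that it never saw in training.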
On 12/29/2014 09:26 PM, Ihab Ramadan wrote:
>
> Dears,
>
> When I tokenize files, it replaces the apostrophes with "'", which makes
> sense, but on the other hand it breaks the meaning and the order of the
> words entirely. For example:
>
> Sentence before tokenization :
>
> Src : keep your notification's payload under 5 kb.
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> Sentence after tokenization :
>
> Src: keep your notification ' s payload under 5 kb .
>
> Trg: ???? ????? ??????? ??? ?? 5 ????????.
>
> If I translate "keep" without tokenization, it generates
> ?????? which is correct, but after tokenization Moses generates
> ????????? which means that the alignment is broken.
>
> Am I doing something wrong?
>
> Am I missing something, or is this just the natural behavior when I use
> tokenization?
>
> Thanks
>
> Best Regards
>
------------------------------
End of Moses-support Digest, Vol 98, Issue 65
*********************************************