Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Unicode Issues when Using Compact Phrase Table, Binaries
vs. Own Build (Kenneth Heafield)
----------------------------------------------------------------------
Message: 1
Date: Mon, 30 Mar 2015 08:21:56 -0400
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Unicode Issues when Using Compact Phrase
Table, Binaries vs. Own Build
To: moses-support@mit.edu
Message-ID: <55193FE4.9060401@kheafield.com>
Content-Type: text/plain; charset=UTF-8
Sounds like a case of composed characters.
Try passing the input through this:
uconv -f utf8 -t utf8 -x Any-NFKC --callback skip --remove-signature
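
For anyone without uconv at hand, here is a minimal Python sketch of the same idea. It assumes the problem really is composed versus decomposed characters, which this thread has not confirmed. The helper names normalize_nfkc and normalize_lines are made up for illustration, and unlike the uconv call above the sketch does not skip invalid bytes or remove a byte-order mark.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return text in Unicode normalization form NFKC."""
    return unicodedata.normalize("NFKC", text)


def normalize_lines(lines):
    """Yield each input line normalized to NFKC (hypothetical helper)."""
    for line in lines:
        yield normalize_nfkc(line)


# The same word in two encodings that look identical on screen:
composed = "f\u00fcr"      # u-umlaut as one precomposed code point
decomposed = "fu\u0308r"   # plain u followed by a combining diaeresis

# The strings differ code point by code point, so a phrase table built
# from one form will not match input written in the other form.
assert composed != decomposed

# After NFKC normalization both forms are the same string.
assert normalize_nfkc(decomposed) == normalize_nfkc(composed) == composed
```

To use it as a filter, pass sys.stdin to normalize_lines and write the results to sys.stdout. The training data and the decoder input would need the same normalization for the phrase-table lookups to agree.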
On 03/30/2015 04:53 AM, "Венцислав Жечев (Ventsislav Zhechev)" wrote:
> Hi all,
>
> I'm having this really weird Unicode issue when using compact phrase
> tables that could be related to endianness somehow, but I've no idea how.
> I compiled the training tools from v3 on my Mac and built a few models
> using compact phrase (and reordering) tables and KenLM, including (for
> simplicity) a recasing model for DE (download it
> from https://autodesk.box.com/DE-Recaser). Things become strange when I
> try to use the models, though:
> 1. All works fine when I use the decoder binary I compiled myself on the
> Mac (10.10.2, self-built Boost 1.57)
> 2. Unicode input is not recognised when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/macosx-yosemite/ i.e.
> words like "für" or "ausführlich" are marked as UNK.
> 3. Unicode input is not recognised when I use a binary I compiled myself
> on Ubuntu 12.04.5 (self-built Boost 1.57)
> 4. All works fine when I use the binary
> from http://www.statmt.org/moses/RELEASE-3.0/binaries/linux-64bit/
>
> I tested the above with the queryPhraseTableMin tool (rather than the
> decoder) and got the same results, which is what makes me think this
> could be somehow related to binary incompatibility with the way the
> phrase table is compacted. Haven't investigated deeper than that, though.
>
>
> Any clues?
> One would say, just use the Linux binary then on Linux... However, I
> have a number of CentOS/RHEL 5 and 6 boxes, where the pre-compiled
> binary doesn't work, as the system glibc is too old. So there I need to
> compile Moses myself, but then Unicode isn't recognised...
>
>
>
> Cheers,
>
> Ventzi
>
> *Dr. Ventsislav Zhechev*
> Computational Linguist, Certified ScrumMaster®
> Platform Architecture and Technologies
> Localisation Services
>
> *MAIN* +41 32 723 91 22
> *FAX* +41 32 723 93 99
>
> _http://VentsislavZhechev.eu_
>
> *Autodesk, Inc.*
> Rue de Puits-Godet 6
> 2000 Neuchâtel, Switzerland
> _www.autodesk.com <http://www.autodesk.com/>_
>
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
------------------------------
End of Moses-support Digest, Vol 101, Issue 83
**********************************************