Moses-support Digest, Vol 97, Issue 32

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Encoding in MGIZA (Kenneth Heafield)
2. Re: Encoding in MGIZA (Hieu Hoang)


----------------------------------------------------------------------

Message: 1
Date: Fri, 14 Nov 2014 09:12:04 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Encoding in MGIZA
To: moses-support@mit.edu
Message-ID: <54660DB4.30801@kheafield.com>
Content-Type: text/plain; charset=ISO-8859-1

For what it's worth the server is running Python 3.2.2. Rico seems to
know what he's doing with Python much more than I do.

Kenneth

On 11/14/14 04:55, Hieu Hoang wrote:
> Ken - should we add encoding on open to all python scripts, rather than
> set the PYTHONIOENCODING env variable? That's basically what happens
> with the Perl scripts.
>
> What python/Linux version are you using? I don't see it on my version
> (Python 2.7.3, Ubuntu 12.04)
>
> Qin - Thanks. I've added you as admin for moses on github. We may change
> this if it doesn't suit you. mgiza is a sister project of moses
> https://github.com/moses-smt/mgiza
> So everyone who has commit access to moses also has access to mgiza,
> which is quite a lot!
>
> We monitor all commits to mgiza on the same mailing list as moses in
> case people mess around, eg.
>
> http://lists.inf.ed.ac.uk/pipermail/moses-commits/2014-November/001826.html
>
> On 14 November 2014 09:42, Gao Qin <pku.gaoqin@gmail.com> wrote:
>
> Good idea, I am not yet admin of the new repo, Hieu will add me and
> I can make the change then.
>
> --Q
>
> On Thu, Nov 13, 2014 at 8:54 AM, Kenneth Heafield <me@kheafield.com> wrote:
>
> Hi,
>
> MGIZA has some Python programs that process raw text:
> https://github.com/moses-smt/mgiza/tree/master/mgizapp/scripts .
>
> Since those scripts were released, Python messed up file encoding and
> made the default ascii. Should we just change every open call to have
> encoding = 'utf-8' ?
>
> Kenneth
>
>
>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


------------------------------

Message: 2
Date: Fri, 14 Nov 2014 14:38:40 +0000
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Encoding in MGIZA
To: Rico Sennrich <rico.sennrich@gmx.ch>, Qin Gao
<qigao@microsoft.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbjDnfdKEzsz9v8=kuo8_dKhtT6s-m3Lkn0hno5sFP9zhw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

ah. I've rolled back Ken's change 'cos I need it to work with Python 2.7.

I've set the env variable in train-model.perl just before the call to
merge_alignment.py. That should patch Ken's problem for now.

https://github.com/moses-smt/mosesdecoder/commit/acd3ac964a7df646e15e3c4210853e7b70bebcbf
But the better way is adding Rico's code to all python scripts
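
For readers following along, here is a minimal sketch of what the env-variable
workaround does, written in Python rather than Perl. The helper name
child_stdout_encoding is hypothetical and not part of Moses or MGIZA; it only
shows that a child Python process picks up PYTHONIOENCODING for its standard
streams. Note that this variable covers stdin/stdout/stderr only, not files
opened with open(), which is one reason the per-script fix is the more robust
option.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding="utf-8"):
    """Start a child Python with PYTHONIOENCODING set and return the
    encoding the child reports for its own stdout.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdout_encoding())
```

The parent process's own streams are unaffected; only the child started with
the modified environment sees the setting.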


On 14 November 2014 13:20, Rico Sennrich <rico.sennrich@gmx.ch> wrote:

> Hieu Hoang <Hieu.Hoang@...> writes:
>
> > Ken - should we add encoding on open to all python scripts, rather than
> > set the PYTHONIOENCODING env variable? That's basically what happens with
> > the Perl scripts.
> >
> > What python/Linux version are you using? I don't see it on my version
> > (Python 2.7.3, Ubuntu 12.04)
>
> Hi all,
>
> It's kinda tricky to have consistent encoding between Python 2.X and Python
> 3. The patch to merge_alignment.py will fail under 2.X. I suggest using
> io.open instead, which works with all versions from 2.6 up. And if any
> string processing is done, I suggest using 'from __future__ import
> unicode_literals' to ensure that all string literals are interpreted as
> unicode, and making sure that all input/output is UTF-8 (including
> stdin/stdout/stderr). I usually do this with the following code block:
>
> import sys
> import codecs
>
> # Python 2 only: wrap the standard streams so they read and write UTF-8.
> if sys.version_info < (3, 0, 0):
>     sys.stdin = codecs.getreader('UTF-8')(sys.stdin)
>     sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
>     sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)
>
> best,
> Rico
>
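
Rico's io.open suggestion can be sketched as follows. This is an illustrative
example, not code from the MGIZA scripts; the function name roundtrip and the
temporary file are assumptions made for the demonstration. The point is that
io.open accepts an explicit encoding argument on Python 2.6+ and on Python 3,
so the result does not depend on the locale's default encoding.

```python
import io
import os
import tempfile


def roundtrip(text):
    """Write unicode text to a temporary file as UTF-8 and read it back.

    Hypothetical helper for illustration only. io.open behaves the same
    on Python 2.6+ and Python 3, and always returns unicode text when
    opened in text mode.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Explicit encoding: never fall back to the platform default.
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        with io.open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    finally:
        os.remove(path)
```

On Python 2 the argument must be a unicode string, which is where
'from __future__ import unicode_literals' at the top of a script helps.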



--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------



End of Moses-support Digest, Vol 97, Issue 32
*********************************************
