Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. beautify.py now also reformats Perl files (Jeroen Vermeulen)
2. How to tell EMS to concatenate training corpora (Lane Schwartz)
3. Stripping carriage returns in FilePiece? (Jeroen Vermeulen)
4. Re: Stripping carriage returns in FilePiece? (Hieu Hoang)
5. Re: Stripping carriage returns in FilePiece? (Jeroen Vermeulen)
6. Re: How to tell EMS to concatenate training corpora
(Rico Sennrich)
----------------------------------------------------------------------
Message: 1
Date: Mon, 18 May 2015 00:52:12 +0700
From: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Subject: [Moses-support] beautify.py now also reformats Perl files
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <5558D54C.4040305@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8
Next time we run beautify.py to reformat the source code, it will also
reformat Perl source, using Perltidy. It looks like a very mature
program, and our tests passed normally with reformatted scripts.
I twiddled the style options until I found something that looked like it
would fit in with what we already have. We can tweak that further to
people's liking - Perltidy is highly customizable. But what we have is
pretty diverse, style-wise, so whatever style we come up with will cause
a lot of initial diff.
Jeroen
------------------------------
Message: 2
Date: Sun, 17 May 2015 13:58:07 -0500
From: Lane Schwartz <dowobeha@gmail.com>
Subject: [Moses-support] How to tell EMS to concatenate training
corpora
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CABv3vZnrpFxcNpSGS5QguWWytNx=6y4kq8fFWaxWGY9QincB4g@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I have a number of distinct monolingual corpora. I've been training them as
separate LMs. I now want to run a variant where they are all concatenated
together, and then trained as a single LM. The EMS walkthrough says this
should be possible (
http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc19), but doesn't
give the requisite syntax. What is the EMS syntax to do this?
Thanks,
Lane
------------------------------
Message: 3
Date: Mon, 18 May 2015 12:05:50 +0700
From: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Subject: [Moses-support] Stripping carriage returns in FilePiece?
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <5559732E.9080505@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8
Hi all,
I'm trying to fix a problem on Windows where lexical-reordering-score
breaks because of Windows-style line endings: "\r\n" instead of "\n".
These inevitably get in here and there when users produce files on Windows.
A simple solution I've been testing successfully is this: in
FilePiece::ReadLine() (and its sibling ReadLineEOF()), if the last
character before the \n is a \r (carriage return), then don't include
that character in the line that is returned. And of course there's a
parameter to disable this behaviour if desired.
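As a rough illustration of the idea described above (not the actual patch, and not the FilePiece API), here is a minimal sketch built on std::getline. The function name ReadLineStrippingCR and the strip_cr parameter are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper, NOT part of FilePiece: reads one '\n'-terminated line
// and, when strip_cr is true, drops a single trailing '\r' so that "\r\n"
// and "\n" line endings yield the same result.
// Returns false once the stream has no more lines.
bool ReadLineStrippingCR(std::istream &in, std::string &line,
                         bool strip_cr = true) {
  if (!std::getline(in, line)) return false;
  if (strip_cr && !line.empty() && line.back() == '\r') {
    line.pop_back();  // only the final '\r' goes; any others are kept
  }
  return true;
}
```

Passing strip_cr = false returns the line with its carriage return intact, which corresponds to the opt-out parameter mentioned above.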
This looks relatively safe to me, to the extent that calling ReadLine()
implies that what you're reading is a text file. It's not something
you'd want to do with a binary file.
However in principle there could be situations where you have a carriage
return at the end of a line in your file (on a non-Windows system), and
you want to keep it.
Can anyone think of such a situation? Any objections against merging my
patch?
Jeroen
------------------------------
Message: 4
Date: Mon, 18 May 2015 11:02:02 +0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Stripping carriage returns in FilePiece?
To: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAEKMkbgn4VZYbzPHMcJv6h-axBJnjQjUjCjhWW0xdJiHNxCvfw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I prefer that FilePiece output a faithful representation of the file. If you
need to clean your data, I think that should go into the cleaning or
normalization scripts.
Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu
On 18 May 2015 at 09:05, Jeroen Vermeulen <jtv@precisiontranslationtools.com
> wrote:
> Hi all,
>
> I'm trying to fix a problem on Windows where lexical-reordering-score
> breaks because of Windows-style line endings: "\r\n" instead of "\n".
> These inevitably get in here and there when users produce files on Windows.
>
> A simple solution I've been testing successfully is this: in
> FilePiece::ReadLine() (and its sibling ReadLineEOF()), if the last
> character before the \n is a \r (carriage return), then don't include
> that character in the line that is returned. And of course there's a
> parameter to disable this behaviour if desired.
>
> This looks relatively safe to me, to the extent that calling ReadLine()
> implies that what you're reading is a text file. It's not something
> you'd want to do with a binary file.
>
> However in principle there could be situations where you have a carriage
> return at the end of a line in your file (on a non-Windows system), and
> you want to keep it.
>
> Can anyone think of such a situation? Any objections against merging my
> patch?
>
>
> Jeroen
------------------------------
Message: 5
Date: Mon, 18 May 2015 14:41:36 +0700
From: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Subject: Re: [Moses-support] Stripping carriage returns in FilePiece?
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <555997B0.4060604@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8
On 18/05/15 14:02, Hieu Hoang wrote:
> I prefer that FilePiece output a faithful representation of the file. If
> you need to clean your data, I think that should go into the cleaning or
> normalization scripts.
That could go into a lot more places and end up being more brittle though.
Would it help if I made the default "do not strip carriage returns", and
made lexical-reordering-score request the conversion explicitly?
Bear in mind here that every time we fopen() a file without the "b" mode
flag, we're really saying we want the same conversion if the runtime
feels the need - as it would on Windows. When we call ReadLine(), at
least it knows we really want the file interpreted as text.
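To make the fopen() point concrete, here is a small sketch (an illustration only, not Moses code; the RoundTrip helper and the file name used with it are made up for this example). It writes bytes verbatim, then reads them back with a caller-supplied mode. With "rb" the bytes come back unchanged on any platform. With "r" the C runtime is allowed to translate line endings, which on Windows typically turns "\r\n" into "\n", while POSIX systems leave the data alone.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: writes `bytes` to `path` unchanged (binary mode),
// then reads the file back with the given fopen mode and returns whatever
// the C runtime delivered. Returns an empty string if the file can't be opened.
std::string RoundTrip(const char *path, const std::string &bytes,
                      const char *read_mode) {
  std::FILE *out = std::fopen(path, "wb");
  if (!out) return std::string();
  std::fwrite(bytes.data(), 1, bytes.size(), out);
  std::fclose(out);

  std::FILE *in = std::fopen(path, read_mode);
  if (!in) return std::string();
  std::string result;
  char buffer[256];
  std::size_t n;
  while ((n = std::fread(buffer, 1, sizeof buffer, in)) > 0) {
    result.append(buffer, n);
  }
  std::fclose(in);
  return result;
}
```

Calling RoundTrip("crlf_demo.txt", "a\r\nb\r\n", "rb") gives back the original bytes, whereas the same call with "r" may give "a\nb\n" depending on the platform.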
Jeroen
------------------------------
Message: 6
Date: Mon, 18 May 2015 09:56:41 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] How to tell EMS to concatenate training
corpora
To: moses-support@mit.edu
Message-ID: <loom.20150518T114158-112@post.gmane.org>
Content-Type: text/plain; charset=us-ascii
Lane Schwartz <dowobeha@...> writes:
>
> I have a number of distinct monolingual corpora. I've been training them
> as separate LMs. I now want to run a variant where they are all concatenated
> together, and then trained as a single LM. The EMS walkthrough says this
> should be possible
> (http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc19), but doesn't
> give the requisite syntax. What is the EMS syntax to do this?
>
> Thanks,
> Lane
Hi Lane,
I think the documentation refers to the ability to interpolate language
models - I don't think concatenation is currently supported. I've been
meaning to add this option for a while, and it shouldn't be too hard. I'll
come back to you when it's done.
best wishes,
Rico
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 103, Issue 41
**********************************************