Moses-support Digest, Vol 99, Issue 14

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Europarl monolingual pipeline (Kenneth Heafield)
2. Re: Europarl monolingual pipeline (Philipp Koehn)
3. Re: Europarl monolingual pipeline (Kenneth Heafield)
4. Off-topic - Internship/part-time opportunity in Toronto
(Wei JIANG [PT-COM])
5. Re: Trouble building Moses (Matt Munson)


----------------------------------------------------------------------

Message: 1
Date: Tue, 06 Jan 2015 17:20:05 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Europarl monolingual pipeline
To: Philipp Koehn <phi@jhu.edu>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <54AC5F95.5090402@kheafield.com>
Content-Type: text/plain; charset=utf-8

Hi,

It seems that the WMT release is missing data. For example, why does
en/ep-10-02.txt line 4 contain "21 September 2000" but this does not
appear in the WMT europarl-v7.en file from the WMT site?

Kenneth

On 01/06/15 14:24, Philipp Koehn wrote:
> Hi,
>
> the Perl script that was used to build this corpus is:
>
> #!/usr/bin/perl -w
>
> use strict;
> my ($l) = @ARGV;
>
> my $data = "/home/pkoehn/statmt/data/europarl-v7";
> my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
> my $preprocessor = "$tools/split-sentences.perl -q";
>
> die("ERROR: no data for language $l") unless -e "$data/txt/$l";
> open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]*
> $data/txt/$l/ep-[19]* | $preprocessor -l $l |");
> while(<SPLIT>) {
> next if /^\s*$/;
> next if /^</;
> print $_;
> }
> close(SPLIT);
>
>
> The sentence splitting code is in the tools package that comes
> with the Europarl source release.
>
> -phi
>
> On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <moses@kheafield.com> wrote:
>
> Dear Moses,
>
> Where does this data come from?
>
> http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>
> Specifically, if I wanted non-WMT languages, then I can download
> Europarl from http://www.statmt.org/europarl/ .
>
> There are some tools, like a Perl script to strip XML, but that also
> strips out <P> tags, which are meant to be preserved for
> split-sentences.perl. And I don't think split-sentences.perl was
> designed to run before stripping XML, but I could be wrong.
>
> Does one write a custom XML strip program to remove all the tags except
> <P>, then pass the result to split-sentences.perl?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
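
The filtering loop in the quoted script keeps every line that is neither
empty nor starts with "<", so markup lines are dropped after sentence
splitting rather than before. A rough shell equivalent of that loop, as a
sketch only: drop_markup_lines is a hypothetical helper name, and the sample
input is made up. In a real run the input would come from
split-sentences.perl, as in the script.

```shell
#!/bin/sh
# Rough shell equivalent of the while loop in the quoted Perl script:
# drop empty or whitespace-only lines and lines that start with "<".
# drop_markup_lines is a hypothetical name, not part of the Europarl tools.
drop_markup_lines() {
  grep -v -e '^[[:space:]]*$' -e '^<'
}

# Made-up sample input in the style of sentence-split Europarl text.
printf '%s\n' '<CHAPTER ID=1>' 'Resumption of the session' '<P>' '' \
  'I declare the session resumed.' | drop_markup_lines
```

Only the two text lines survive; the tag lines and the empty line are
discarded, which matches the two "next if" tests in the Perl loop.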


------------------------------

Message: 2
Date: Tue, 6 Jan 2015 14:22:33 -0800
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Europarl monolingual pipeline
To: Kenneth Heafield <moses@kheafield.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDDU9ODe9cr_UBag=o=qUU_-Zg3ojwO3PCkACdro2TGLoA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

This is done on purpose: Q4 2000 is used for test sets, so it is excluded
from the parallel and monolingual training corpora.

-phi
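
The exclusion is visible in the three glob patterns that the build script
passes to cat: none of them matches a name beginning with ep-00-1. A minimal
sketch of that selection, assuming file names of the form ep-YY-MM-DD.txt
(the names below are hypothetical examples, and in_training is an
illustrative helper, not part of any release):

```shell
#!/bin/sh
# Classify a file name the way the shell globs in the build script would.
# The three patterns are taken from the script; the file names are
# hypothetical and assume an ep-YY-MM-DD.txt naming scheme.
in_training() {
  case "$1" in
    ep-00-0*)         echo "included" ;;  # January to September 2000
    ep-0[123456789]*) echo "included" ;;  # 2001 to 2009
    ep-[19]*)         echo "included" ;;  # years starting with 1 or 9
    *)                echo "excluded" ;;  # e.g. ep-00-1*, October to December 2000
  esac
}

for f in ep-96-04-15.txt ep-00-09-21.txt ep-00-10-02.txt \
         ep-00-12-14.txt ep-01-01-15.txt ep-11-11-15.txt; do
  printf '%s %s\n' "$f" "$(in_training "$f")"
done
```

Under that naming assumption, only the two ep-00-1* names are reported as
excluded.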



------------------------------

Message: 3
Date: Tue, 06 Jan 2015 17:26:58 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Europarl monolingual pipeline
To: moses-support@mit.edu
Message-ID: <54AC6132.9070609@kheafield.com>
Content-Type: text/plain; charset=windows-1252

Hi again,

Sorry, never mind!

"We recommend using the last quarter of 2000 for testing (2000-10 until
2000-12) for consistency in reporting research results on this data."

Kenneth



------------------------------

Message: 4
Date: Tue, 06 Jan 2015 19:23:05 -0500
From: "Wei JIANG [PT-COM]" <jiangw@polytrans.com>
Subject: [Moses-support] Off-topic - Internship/part-time opportunity
in Toronto
To: moses-support@mit.edu
Message-ID: <54AC7C69.20909@polytrans.com>
Content-Type: text/plain; charset=utf-8; format=flowed

Hello Everyone,

I am aware that this could be considered off-topic, but just in case you
or anyone you know would be interested:

I am a project manager heading a team of multilingual linguists and
translators. We plan to build a p2p system that will enable real-time
collaboration on translation and localization projects on virtually all
major desktop and mobile platforms. We are looking for a software
developer/programmer for cross-platform software development, including
Windows/iOS/Android. Initially, this would be an internship or part-time
job in Toronto, Canada. New grads, postgrads, and senior students majoring
in software, NLP, CAT, or MT are welcome. Please contact me by email
off list.

Thanks,
Wei Jiang




------------------------------

Message: 5
Date: Wed, 07 Jan 2015 11:14:07 +0100
From: Matt Munson <munson@dh.uni-leipzig.de>
Subject: Re: [Moses-support] Trouble building Moses
To: Hieu Hoang <hieuhoang@gmail.com>, moses-support@mit.edu
Message-ID: <54AD06EF.1010607@dh.uni-leipzig.de>
Content-Type: text/plain; charset="windows-1252"

It looks like I have two different Boost versions installed: 1_49 in
/usr/include and 1_57 in /usr/local/include. How can I make sure that
the build uses the more recent version? I tried ./bjam
--with-boost=/usr/local/include/boost/ but it still gives me the same
multiple failures and the build fails at the end.

Best,

Matt
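
One way to confirm which Boost headers sit under each prefix is to read the
BOOST_LIB_VERSION macro from boost/version.hpp. A minimal sketch, assuming
the two prefixes mentioned above; boost_lib_version is a hypothetical helper
name, and the sketch only reports versions, it does not change what the
build picks up.

```shell
#!/bin/sh
# Print the BOOST_LIB_VERSION value (e.g. 1_57) from a Boost version.hpp.
# boost_lib_version is a hypothetical helper; the prefixes in the loop are
# the two locations mentioned in the message.
boost_lib_version() {
  [ -r "$1" ] || return 1
  sed -n 's/^#define BOOST_LIB_VERSION "\([^"]*\)".*/\1/p' "$1"
}

for prefix in /usr /usr/local; do
  hdr="$prefix/include/boost/version.hpp"
  if v=$(boost_lib_version "$hdr"); then
    echo "$prefix: Boost $v"
  else
    echo "$prefix: no readable Boost version.hpp"
  fi
done
```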

On 06.01.2015 17:59, Hieu Hoang wrote:
> It seems like there's an issue with one of the Boost header files.
>
> do you know what boost version you have?
>
> On 06/01/15 15:14, Matt Munson wrote:
>> When I run ./bjam, Moses does not actually build but, instead, tells
>> me that it fails essentially on every build step and then tells me
>> that it didn't build. See attached build.log.gz file. The server I
>> am installing this on is running Linux 3.2.0-4-amd64 (Debian).
>

--
Matthew Munson
Researcher
Alexander von Humboldt Chair of Digital Humanities
Universität Leipzig, Institut für Informatik
Augustusplatz 10, 04109 Leipzig
Deutschland


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 99, Issue 14
*********************************************
