Moses-support Digest, Vol 100, Issue 88

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: My phrase-table.tgz is 20-bytes long (Tom Hoar)
2. Re: My phrase-table.tgz is 20-bytes long (Marcin Junczys-Dowmunt)
3. Re: My phrase-table.tgz is 20-bytes long (????????? ???????)

----------------------------------------------------------------------

Message: 1
Date: Wed, 25 Feb 2015 17:37:47 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
To: moses-support@mit.edu
Message-ID: <54EDA5FB.5030207@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8; format=flowed

Alexander,

If your MGIZA word alignment .gz files are empty, the error is happening
in step 2. Errors there aren't trapped and the system continues running.
Therefore, the outputs of steps 3 (alignment file), 4 (lex files) & 5
(extract files) are all garbage. If the word alignment files are ok and
the extract files are missing, you probably ran out of hard drive space,
as Barry suggested.
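
A quick way to see which case applies is to scan the training directory for compressed outputs that decompress to nothing, and to look at the free space on the same filesystem. The following is only a minimal sketch, not part of Moses: the function names are made up for illustration, and it assumes the outputs of interest end in "gz" (as A3.final.gz, extract.*.sorted.gz and the phrase table do in this thread). A file of about 20 bytes is what gzip typically writes for empty input.

```python
import gzip
import shutil
from pathlib import Path


def gzip_status(path):
    """Classify a gzip file as 'ok', 'empty', or 'broken' (hypothetical helper)."""
    try:
        with gzip.open(path, "rb") as handle:
            # One decompressed byte is enough to tell empty from non-empty.
            return "ok" if handle.read(1) else "empty"
    except (OSError, EOFError):
        # Not a gzip stream, or cut off before any data could be read.
        return "broken"


def check_training_dir(train_dir):
    """Return (suspects, free_gib) for a training directory.

    suspects maps every empty or broken *gz file to its status;
    free_gib is the free space on the filesystem that holds train_dir.
    """
    root = Path(train_dir)
    suspects = {}
    for gz_path in sorted(root.rglob("*gz")):
        if not gz_path.is_file():
            continue
        status = gzip_status(gz_path)
        if status != "ok":
            suspects[gz_path] = status
    free_gib = shutil.disk_usage(root).free / 2**30
    return suspects, free_gib
```

Calling check_training_dir("train") from the working directory would list the suspect files. If the alignment files are fine and only the extract and phrase-table files show up as empty, low free space is the first thing to rule out.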

Running for 10 days on a 40-core configuration is a lot to manage. It
sounds like a large corpus. Have you run a successful training session
on a sample subset of your data? I would suggest extracting a random
sample of ~15,000 pairs and running your configuration with -mgiza-cpus 8
& -cores 8. It should take about 30 minutes and you shouldn't have any
disk space problems. Work out any bugs in your corpus prep and/or
runtime with this smaller subset, then scale up to your full-sized
corpus. With large corpora that run 10 days, you might need several
hundred gigabytes of available space for temp files in your final output
folder, i.e. not /tmp.
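
One possible way to draw such a sample while keeping the two sides aligned is sketched below. It is an illustration under assumptions, not a Moses script: the function name and the output file names are hypothetical, and it assumes the corpus is stored as two plain-text files with one sentence per line (for -corpus ~/corpus/ru-en.clean -f ru -e en that would be ru-en.clean.ru and ru-en.clean.en). It also refuses to run if the line counts differ, which is worth checking after deleting lines with sed.

```python
import random


def sample_parallel(src_path, tgt_path, out_src, out_tgt, k=15000, seed=0):
    """Copy the same k randomly chosen line numbers from both corpus files.

    Returns the number of sentence pairs written.
    """
    # First pass: count the lines and confirm that both sides line up.
    with open(src_path, "rb") as src, open(tgt_path, "rb") as tgt:
        n_src = sum(1 for _ in src)
        n_tgt = sum(1 for _ in tgt)
    if n_src != n_tgt:
        raise ValueError(f"line counts differ: {n_src} vs {n_tgt}")

    # Pick the line numbers to keep; a fixed seed makes the sample repeatable.
    keep = set(random.Random(seed).sample(range(n_src), min(k, n_src)))

    # Second pass: write the selected pairs in their original order.
    with open(src_path, "rb") as src, open(tgt_path, "rb") as tgt, \
            open(out_src, "wb") as o_src, open(out_tgt, "wb") as o_tgt:
        for i, (s_line, t_line) in enumerate(zip(src, tgt)):
            if i in keep:
                o_src.write(s_line)
                o_tgt.write(t_line)
    return len(keep)
```

For example, sample_parallel("ru-en.clean.ru", "ru-en.clean.en", "sample.ru", "sample.en") would produce a pair of files that can be passed to train-model.perl as -corpus sample. The files are read in binary mode so that odd bytes in the corpus cannot abort the sampling itself.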

On 02/25/2015 05:19 PM, Barry Haddow wrote:
> Hi Alexander,
>
> It looks like something went wrong at the extract stage. If you could
> make your training.out available then we can look for clues.
>
> Could the system have run out of disk space, either in the working
> directory or in /tmp? A lot of space is required to build the extract
> files and phrase tables.
>
> cheers - Barry
>
> On 25/02/15 05:32, ????????? ??????? wrote:
>> Ok, I've started from scratch. I'm fairly sure I prepared the corpus
>> this way:
>>
>> 1. Tokenized the initial corpora with tokenizer.perl and noted the
>> numbers of the lines that caused errors or warnings
>> 2. Deleted these lines from both files using sed
>> 3. Tokenized the files again. No errors
>> 5. Created a truecase model and truecased the files.
>> 6. Deleted overly long lines using clean-corpus-n.perl 1 50
>>
>> Started translation model creation process by:
>>
>> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
>> -mgiza -mgiza-cpus 40 -cores 40 -root-dir train -corpus
>> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
>> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
>> -external-bin-dir /opt/moses/mgiza >& training.out &
>>
>> After ten days of waiting I have a 20-byte phrase-table.tgz again!
>> What am I doing wrong?
>>
>> I have both ru-en and en-ru A3.final.gz files,
>> aligned-grow-diag-final.and, lex.e2f, lex.f2e of quite good size, but
>> an empty phrase-table, extract.*.sorted.gz and reordering table.
>>
>> I still have no idea what is going wrong or why :(
>>
>> 2015-02-14 21:54 GMT+07:00 Kenneth Heafield <moses@kheafield.com
>> <mailto:moses@kheafield.com>>:
>>
>> Sign my petition to add return code checking to train-model.perl.
>>
>> On 02/14/2015 09:33 AM, Tom Hoar wrote:
>> > An empty phrase-table.gz file is usually the result of an
>> > ill-prepared training corpus. Make sure you run the final corpus
>> > through clean-corpus-n.perl.
>> >
>> >
>> >
>> > On 02/14/2015 09:19 PM, ????????? ??????? wrote:
>> >> Hello, everybody!
>> >>
>> >> I have a problem with Moses. I created a big parallel corpus by
>> >> concatenating a bunch of existing corpora from
>> >> http://opus.lingfil.uu.se. After that I cleaned up the results (the
>> >> tokenizer script reported some errors, so I deleted the error-prone
>> >> rows from both sides).
>> >>
>> >> Then I started to train the translation model using mgiza with this
>> >> command:
>> >>
>> >> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
>> >> -mgiza -mgiza-cpus 20 -cores 20 -root-dir train -corpus
>> >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
>> >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
>> >> -external-bin-dir /opt/moses/mgiza >& training.out &
>> >>
>> >> After a week of work I have this at the end of training.out:
>> >> (7) learn reordering model @ Sun Feb 8 15:30:35 MSK 2015
>> >> (7.1) [no factors] learn reordering model @ Sun Feb 8 15:30:35 MSK 2015
>> >> (7.2) building tables @ Sun Feb 8 15:30:35 MSK 2015
>> >> Executing: /opt/moses/scripts/../bin/lexical-reordering-score
>> >> /home/adminadmin/working/train/model/extract.o.sorted.gz 0.5
>> >> /home/adminadmin/working/train/model/reordering-table. --model "wbe
>> >> msd wbe-msd-bidirectional-fe"
>> >> Lexical Reordering Scorer
>> >> scores lexical reordering models of several types (hierarchical,
>> >> phrase-based and word-based-extraction
>> >> (8) learn generation model @ Sun Feb 8 15:30:35 MSK 2015
>> >> no generation model requested, skipping step
>> >> (9) create moses.ini @ Sun Feb 8 15:30:35 MSK 2015
>> >>
>> >> There is a bunch of files in the ~/working/train folder. It looks like
>> >> everything is ok, except for one tiny problem: phrase-table.tgz has a
>> >> size of 20 bytes. And, of course, it's not usable at all!
>> >>
>> >> Can somebody help and point me in the right direction?
>> >>
>> >>
>> >> _______________________________________________
>> >> Moses-support mailing list
>> >> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>> >> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 2
Date: Wed, 25 Feb 2015 11:54:58 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support@mit.edu
Message-ID: <4789da3ad9a596f1f4e15f7fdf4e7c95@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"

Hi,

Running mgiza with 40 cores is a bad idea anyway; there is some heavy
locking going on. Try 8 to 16. It might be much faster.

------------------------------

Message: 3
Date: Wed, 25 Feb 2015 19:06:11 +0800
From: ????????? ??????? <deadyaga@gmail.com>
Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
To: moses-support@mit.edu
Message-ID:
<CAOAX5pn37=CVhV=fZ3+a2CBgESVov0nSbkRkWfzRv9B5-JapQg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

There are 23.5 million lines in the cleaned-up corpus.

Thanks for the advice. I'll try this.

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 100, Issue 88
**********************************************
