Moses-support Digest, Vol 110, Issue 10

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Training script documentation (Read, James C)
2. Re: Training script documentation (Philipp Koehn)
3. Re: System requirements for Moses (Ulrich Germann)


----------------------------------------------------------------------

Message: 1
Date: Thu, 3 Dec 2015 17:01:49 +0000
From: "Read, James C" <jcread@essex.ac.uk>
Subject: Re: [Moses-support] Training script documentation
To: Philipp Koehn <phi@jhu.edu>
Cc: Moses Support <moses-support@mit.edu>
Message-ID:
<DB5PR06MB1478C60DEB73F112B254D0B7850D0@DB5PR06MB1478.eurprd06.prod.outlook.com>

Content-Type: text/plain; charset="iso-8859-1"

If I just clean and escape-special-characters would that be the minimum requirement to get the training script to complete?


James


________________________________
From: phkoehn@gmail.com <phkoehn@gmail.com> on behalf of Philipp Koehn <phi@jhu.edu>
Sent: Wednesday, December 2, 2015 6:31 PM
To: Read, James C
Cc: Moses Support
Subject: Re: [Moses-support] Training script documentation

Hi,

the script expects tokenized data, and word alignment will fail if sentences are too long or if there is a length mismatch in a sentence pair (e.g., a 1-word sentence translated as a 70-word sentence). The cleaning script takes care of both cases. It also removes spurious spaces, which may throw some processing steps off. Also, the provided tokenizer deals with special characters like "|". If you do not use this tokenizer, you should run scripts/tokenizer/escape-special-chars.perl to escape them.
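
The filtering described above can be approximated as follows. This is a minimal sketch, not the actual Moses cleaning script: the function name `clean_pairs` and the default thresholds (1 to 80 tokens per sentence, a length ratio of at most 9) are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    min_len: int = 1,        # assumed lower bound on tokens per sentence
    max_len: int = 80,       # assumed upper bound on tokens per sentence
    max_ratio: float = 9.0,  # assumed maximum length ratio between the two sides
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs that are safe to pass to word alignment."""
    for src, tgt in pairs:
        # split() with no argument also collapses spurious whitespace.
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        n_src, n_tgt = len(src_tokens), len(tgt_tokens)

        # Drop empty or overly long sentences on either side.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs with a large length mismatch (e.g. 1 word vs. 70 words).
        if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
            continue

        yield " ".join(src_tokens), " ".join(tgt_tokens)
```

In practice the script shipped with Moses (scripts/training/clean-corpus-n.perl) is the tool to run; the sketch only shows which pairs get dropped and why.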

Truecasing is optional. Many do lowercasing.

It does not matter to the training script how you prepare the data, so you do not have to explicitly run these steps. You may already have tokenized data, so no need to run the tokenizer.

Whatever you specify with "-corpus" (full path!) should work, as long as the issues spelled out in the first paragraph above are addressed.
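
For data that does not go through the provided tokenizer, the escaping step mentioned above amounts to replacing characters that have a special meaning to Moses, such as "|". The sketch below is only an illustration: "|" is the one character named in this thread, and the remaining entries in the table are assumed to match what scripts/tokenizer/escape-special-chars.perl replaces, so they should be checked against the script itself. The function name `escape_special_chars` is hypothetical.

```python
# Replacement table. Only "|" is confirmed by the thread; the other
# entries are assumptions and should be verified against the Perl script.
# "&" must come first so that the entities produced by the later
# replacements are not escaped a second time.
_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(line: str) -> str:
    """Escape characters that would otherwise confuse later processing steps."""
    for raw, escaped in _ESCAPES:
        line = line.replace(raw, escaped)
    return line
```

Note that this sketch escapes "&" unconditionally, so text that has already been escaped once would be escaped again; it is meant for raw, unescaped input only.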

-phi

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <jcread@essex.ac.uk> wrote:

In the past I've never been able to get the training script to run to completion without rigorously following the instructions here http://www.statmt.org/moses/?n=moses.baseline



1) Tokenise

2) Train truecaser

3) Truecase

4) Clean


What if somebody wants to just tokenize and clean without truecasing or just clean without tokenizing? Why should the script bomb out? Is this something to do with formats required by early stages of the training process?


James


NOTE: This is not an open invitation to discuss why somebody would want to train models without tokenizing or truecasing. This is nothing more than a request for technical assistance.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support



------------------------------

Message: 2
Date: Thu, 3 Dec 2015 13:05:14 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Training script documentation
To: "Read, James C" <jcread@essex.ac.uk>
Cc: Moses Support <moses-support@mit.edu>
Message-ID:
<CAAFADDD8Pv04EPvGabUq2t5h-14OGLptXGeiYv7QSwsw8hB__A@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

yes, those are the only two things required to avoid crashes.

-phi



------------------------------

Message: 3
Date: Fri, 4 Dec 2015 00:36:41 +0000
From: Ulrich Germann <ulrich.germann@gmail.com>
Subject: Re: [Moses-support] System requirements for Moses
To: "Hegde, Sujay" <Sujay.Hegde@xerox.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Philipp Koehn
<phi@jhu.edu>, "MudaliarMudaliar, Preeti J"
<preeti.mudaliarmudaliar@xerox.com>
Message-ID:
<CAHQSRUoDFy2-fUq2H7YwiBS5pzdFCfRdg2xUsfbufEyZzWFXYA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi there,

Assuming that you are using phrase-based SMT, not syntax-based systems, my
recommendation would be to use suffix-array-based sampling phrase tables
and just skip the whole phrase table building process. (Disclaimer: I'm the
component's author, so I'm biased.) You can find details here:
https://docs.google.com/viewer?url=https://ufal.mff.cuni.cz/pbml/104/art-germann.pdf
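
Separately, for the SALM error quoted below ("Sentence 4152148 has more than 256 words"), one possible workaround is to drop over-long sentence pairs before indexing. This is a minimal sketch under stated assumptions: the function name `drop_long_pairs` is hypothetical, the 256-word limit is taken from the error message, and both sides of the parallel corpus are filtered together so that line numbers stay aligned.

```python
from typing import List, Sequence, Tuple


def drop_long_pairs(
    src_lines: Sequence[str],
    tgt_lines: Sequence[str],
    max_words: int = 256,  # limit taken from the SALM error message
) -> Tuple[List[str], List[str]]:
    """Remove sentence pairs in which either side exceeds max_words tokens."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")

    kept_src: List[str] = []
    kept_tgt: List[str] = []
    for src, tgt in zip(src_lines, tgt_lines):
        # Drop the pair as a whole so the two files stay line-aligned.
        if len(src.split()) > max_words or len(tgt.split()) > max_words:
            continue
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

A lower limit than 256 is the more cautious choice if the exact boundary is unclear. Removing pairs changes the corpus that the pruning statistics are computed from, so this is worth checking against how the phrase table itself was trained.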

Best regards - Ulrich Germann

On Thu, Dec 3, 2015 at 5:32 AM, Hegde, Sujay <Sujay.Hegde@xerox.com> wrote:

> Hi Philipp,
>
> Thanks a lot.
>
> Actually, it's a virtual machine.
>
> We have also compressed the models into .minphr and .minlexr, but we
> couldn't prune them: pruning failed with an error saying that some of the
> sentences in the corpus are too long.
>
> We tried pruning with SALM and got the following error:
>
> /mnt/hd1/git/salm/Bin/Linux/Index/IndexSA.O64 opensub.train.it
> Initialize vocabulary file: opensub.train.it.id_voc
> Loading existing vocabulary file: opensub.train.it.id_voc
> Total 100 word types loaded
> Max VocID=100
> Sentence 4152148 has more than 256 words. Can not handle such long
> sentence. Please cut it short first!
>
> Is there anything we could do about the above?
>
> Thanks and Regards,
> Sujay,
> Xerox Business Services, Bangalore, India
>
> From: phkoehn@gmail.com [mailto:phkoehn@gmail.com] On Behalf Of Philipp Koehn
> Sent: 03 December 2015 03:13
> To: Hegde, Sujay
> Cc: moses-support@mit.edu
> Subject: Re: [Moses-support] System requirements for Moses
>
> Hi,
>
> the machine you have is certainly sufficient even for large models.
>
> If you are running two language pairs in parallel and run into RAM
> problems, you may want to look into ways to compress the model files
> (phrase table, reordering table, language model) using either more
> efficient data structures (e.g., various KenLM options), or pruning the
> models.
>
> -phi
>
> On Tue, Dec 1, 2015 at 5:08 AM, Hegde, Sujay <Sujay.Hegde@xerox.com> wrote:
>
> Dear Moses Admin,
>
> We are using the Moses decoder in a commercial environment.
>
> We have a quad-core virtual machine with 132 GB RAM, a 1 TB disk, and
> CentOS.
>
> We have two language pairs installed, and when we run both models
> together, translation hangs (it takes a long time). It is fine when we
> run only one of the two.
>
> Are there any specific system requirements for Moses? Please let me know.
>
> Thanks and Regards,
> Sujay,
> Xerox Business Services, Bangalore, India
>


--
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 110, Issue 10
**********************************************
