Moses-support Digest, Vol 110, Issue 12

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: decoder question (Kenneth Heafield)
2. Re: System requiremnts for Moses (Philipp Koehn)
3. Re: Exit code: 127 ERROR: Can't generate symmetrized
alignment file (Philipp Koehn)
4. Re: continue partial translation (Philipp Koehn)

----------------------------------------------------------------------

Message: 1
Date: Fri, 4 Dec 2015 13:27:37 +0000
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] decoder question
To: moses-support@mit.edu
Message-ID: <566194C9.4010105@kheafield.com>
Content-Type: text/plain; charset=windows-1252

Indeed, you should split sentences into separate lines. Here's the script:

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl

Note that the script assumes you have placed <P> tags in the text to
force sentence boundaries. It will not assume that existing linebreaks
indicate sentence boundaries. If you don't put <P> tags in, it will
read the entire corpus into RAM then try to break it, which will
typically run out of memory.
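
A minimal sketch of the preprocessing step described above, in Python. It assumes that paragraphs in the input are separated by blank lines and that a `<P>` tag on a line of its own is an acceptable way to place the tags; both are assumptions, not details confirmed in this thread. The function name `mark_paragraphs` is hypothetical.

```python
def mark_paragraphs(text: str, tag: str = "<P>") -> str:
    """Join the lines of each paragraph and put a tag line between paragraphs.

    Assumption: blank lines separate paragraphs in the input. Line breaks
    inside a paragraph are not treated as sentence boundaries.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    if not paragraphs:
        return ""
    return f"\n{tag}\n".join(paragraphs) + "\n"
```

The output would then be fed to split-sentences.perl, and only the sentence-split result passed on to the tokenizer and the rest of the pipe.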

Kenneth

On 12/04/2015 01:18 PM, Vincent Nguyen wrote:
>
> Well, that is not exactly my question. I know Moses translates one "line" at a
> time, meaning a string ending with a line feed.
>
> My question is rather: if the line contains a PERIOD (tokenized as
> such) that separates it into 2 "sentences", how does Moses behave?
>
> Given my observations, I have the feeling that we really need to
> "sentence-tokenize" first, before word-tokenizing.
>
>
>
> On 04/12/2015 13:52, John D Burger wrote:
>> I think you're asking if Moses translates one sentence at a time. The answer is yes.
>>
>> - John Burger
>> MITRE
>>
>>> On Dec 4, 2015, at 04:43, Vincent Nguyen <vnguyen@neuf.fr> wrote:
>>>
>>> Actually I don't know if this is a decoder question or such.
>>>
>>> Here is my issue
>>>
>>> Let's say I have a text string with 2 sentences, with a period ending
>>> the first sentence, but no CR+LF, just a space before the second sentence.
>>>
>>> When I pass the full string to the pipe :
>>> tokenizer + truecaser + moses + detruecase + detokenizer
>>> the output is only one sentence, the period at the end of the first
>>> sentence has been eliminated, the sentence is nonsense (well not good at
>>> all)
>>>
>>> If I insert a CRLF just after the period of the first sentence and send
>>> the whole thing to the pipe, the output is correct.
>>>
>>> Am I missing something ?
>>>
>>> Should we only send string to moses segment by segment ?
>>>
>>> thanks,
>>> Vincent
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 2
Date: Fri, 4 Dec 2015 09:13:47 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] System requiremnts for Moses
To: "Hegde, Sujay" <Sujay.Hegde@xerox.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>,
"MudaliarMudaliar, Preeti J" <preeti.mudaliarmudaliar@xerox.com>
Message-ID:
<CAAFADDA9=VdkTtUzZAcxOjAuWik8xkrEmttOMm8Kf2d8sYozsw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

if you are using experiment.perl, then the config parameter for phrase length is

max-phrase-length = 5

Similarly, if you call train-model.perl directly, then the switch is

$MOSES/scripts/training/train-model.perl -max-phrase-length 5 [...]
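
To illustrate what a phrase-length limit of 5 means for the entries of a phrase table, here is a small Python sketch. It is not what train-model.perl does internally (the switch limits phrases at extraction time). It assumes the plain-text phrase table format with ` ||| `-separated fields, and it applies the limit to both the source and the target side, which is an assumption. The function names are hypothetical.

```python
def within_length(line: str, max_len: int = 5) -> bool:
    """True if both source and target phrase have at most max_len tokens."""
    fields = line.split("|||")
    if len(fields) < 2:
        # Not a phrase table entry; treat as not matching.
        return False
    source_tokens = fields[0].split()
    target_tokens = fields[1].split()
    return len(source_tokens) <= max_len and len(target_tokens) <= max_len


def filter_phrase_table(lines, max_len: int = 5):
    """Keep only the entries that respect the length limit."""
    return [line for line in lines if within_length(line, max_len)]
```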

You can check the size of your model by looking at the size of the
*minphr* *minlexr* and *binlm* files. These are binary data files that
are loaded into RAM (see files referenced in moses.ini).
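
A minimal Python sketch of that size check. It assumes, for simplicity, that all model files sit in one directory; in practice the paths are whatever moses.ini references. The function name `model_sizes` and the single-directory layout are hypothetical.

```python
import glob
import os


def model_sizes(model_dir, patterns=("*minphr*", "*minlexr*", "*binlm*")):
    """Return {file name: size in bytes} for files matching the patterns."""
    sizes = {}
    for pattern in patterns:
        for path in glob.glob(os.path.join(model_dir, pattern)):
            if os.path.isfile(path):
                sizes[os.path.basename(path)] = os.path.getsize(path)
    return sizes
```

Summing the values gives a rough lower bound on the RAM that the models need once loaded.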

The language model is typically the largest, because it is usually
trained with large amounts of additional monolingual data.

-phi

On Fri, Dec 4, 2015 at 12:22 AM, Hegde, Sujay <Sujay.Hegde@xerox.com> wrote:
> Hi Philipp,
>
> How do we limit phrase length during training? Is there a config parameter in the Moses training config file?
>
> Is the phrase table the biggest model or the language model? ----> We have 6-7 phrase tables that are combined in a log-linear fashion during decoding.
>
>
>
>
> Thanks and Regards,
> Sujay,
> Xerox Business Services, Bangalore, India
>
>
> -----Original Message-----
> From: phkoehn@gmail.com [mailto:phkoehn@gmail.com] On Behalf Of Philipp Koehn
> Sent: 03 December 2015 21:52
> To: Hegde, Sujay
> Cc: moses-support@mit.edu; MudaliarMudaliar, Preeti J
> Subject: Re: [Moses-support] System requiremnts for Moses
>
> Hi,
>
> having such long sentences should cause all kinds of problems with word alignment, so I am a bit puzzled that they still show up when pruning the phrase table.
>
> A good way to prune the phrase table is to limit the length of phrases (max 5 does no harm, even max 4 is not a big deal), and reduce low probability phrase pairs ($MOSES/scripts/training/threshold-filter.perl).
>
> Is the phrase table the biggest model or the language model? For the latter, there are several compression options.
>
> -phi
>
> On Thu, Dec 3, 2015 at 12:32 AM, Hegde, Sujay <Sujay.Hegde@xerox.com> wrote:
>> Hi Philipp,
>>
>> Thanks a lot.
>>
>> Actually it's a VIRTUAL machine.
>>
>> Also, we have compressed the models into .minphr and
>> .minlexr, but we couldn't prune them: while pruning we got an error
>> saying that some of the sentences in the corpus are too long to be pruned.
>>
>>
>>
>> We pruned using SALM and got the following error:
>>
>>
>>
>> /mnt/hd1/git/salm/Bin/Linux/Index/IndexSA.O64
>> opensub.train.it
>>
>> Initialize vocabulary file: opensub.train.it.id_voc
>>
>> Loading existing vocabulary file: opensub.train.it.id_voc
>>
>> Total 100 word types loaded
>>
>> Max VocID=100
>>
>> Sentence 4152148 has more than 256 words. Can not handle such long sentence.
>> Please cut it short first!
>>
>>
>>
>> Is there anything we could do about the above?
>>
>>
>>
>>
>>
>>
>>
>> Thanks and Regards,
>>
>> Sujay,
>>
>> Xerox Business Services, Bangalore, India
>>
>>
>>
>> From: phkoehn@gmail.com [mailto:phkoehn@gmail.com] On Behalf Of
>> Philipp Koehn
>> Sent: 03 December 2015 03:13
>> To: Hegde, Sujay
>> Cc: moses-support@mit.edu
>> Subject: Re: [Moses-support] System requiremnts for Moses
>>
>>
>>
>> Hi,
>>
>>
>>
>> the machine you have is certainly sufficient even for large models.
>>
>>
>>
>> If you are running two language pairs in parallel and run into RAM
>> problems, you may want to look into ways to compress the model files
>> (phrase table, reordering table, language model) using either more
>> efficient data structures (e.g., various KENLM options), or pruning the models.
>>
>>
>>
>> -phi
>>
>>
>>
>>
>>
>> On Tue, Dec 1, 2015 at 5:08 AM, Hegde, Sujay <Sujay.Hegde@xerox.com> wrote:
>>
>> Dear Moses Admin,
>>
>>
>>
>> We are using the Moses decoder in a commercial environment.
>>
>> We have a quad-core virtual machine running CentOS, with 132GB RAM and
>> 1TB disk.
>>
>> We have 2 language pairs installed, and when running
>> both models together, translation hangs (it takes a LONG time).
>>
>> It is fine when we run only one language model.
>>
>> Are there any specific system requirements for Moses?
>>
>> Please let me know.
>>
>>
>>
>> Thanks and Regards,
>>
>> Sujay,
>>
>> Xerox Business Services, Bangalore, India
>>
>>
>>
>>

------------------------------

Message: 3
Date: Fri, 4 Dec 2015 11:05:47 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Exit code: 127 ERROR: Can't generate
symmetrized alignment file
To: "Read, James C" <jcread@essex.ac.uk>
Cc: Moses Support <moses-support@mit.edu>
Message-ID:
<CAAFADDAvL8RcECK=P6eESPXFkn40Hm+xZBMVLyq=uXGr==y56w@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

when you compile Moses with bjam, all binaries are compiled.

You will need the bin and the script directory on any production machine.
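
Exit code 127 is what the shell reports when a command cannot be found, which matches the "symal: not found" line in the log below. A minimal Python sketch of a check to run before starting a long training job: it only looks for `symal`, the one binary named in this thread, under `bin` in the Moses root. The function name `missing_binaries` is hypothetical, and the list of names would need extending for a real setup.

```python
import os


def missing_binaries(moses_root, names=("symal",)):
    """Return the names that are not executable files under <moses_root>/bin."""
    bin_dir = os.path.join(moses_root, "bin")
    missing = []
    for name in names:
        path = os.path.join(bin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

An empty list means every named binary is in place; anything else points to a compilation step that did not finish.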

-phi

On Thu, Dec 3, 2015 at 11:49 AM, Read, James C <jcread@essex.ac.uk> wrote:
> Indeed.
>
>
> On this box I will not be making use of the decoder and most of the Moses
> software. All I need is for the training script to complete and to be able
> to use evaluation scripts to output bleu scores at the end. Will be using
> own custom translation software. Just need a phrase table to extract
> information from to make own custom translation tables.
>
>
> Which binaries need to be compiled to make the training script complete? Any
> suggestions for modifications to install scripts so I can install only the
> essential components would be most welcomed.
>
>
> James
>
>
>
> ________________________________
> From: phkoehn@gmail.com <phkoehn@gmail.com> on behalf of Philipp Koehn
> <phi@jhu.edu>
> Sent: Wednesday, December 2, 2015 9:22 PM
> To: Read, James C
> Cc: Moses Support
> Subject: Re: [Moses-support] Exit code: 127 ERROR: Can't generate
> symmetrized alignment file
>
> Hi,
>
> it looks like symal was not compiled - it should be in $MOSES/bin
>
> Can you check what went wrong during compilation?
>
> -phi
>
> On Wed, Dec 2, 2015 at 10:21 AM, Read, James C <jcread@essex.ac.uk> wrote:
>>
>> nohup nice
>> /media/bigdata/jcread/3rd_party_software/mosesdecoder/scripts/training/train-model.perl
>> -root-dir phrase_table -corpus
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/europarl-v7.it-en.1-0010.00001000
>> -f it -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe
>> -lm 0:3:/media/bigdata/jcread/llv/lm:8 -external-bin-dir
>> /media/bigdata/jcread/3rd_party_software/bin >& training.out &
>>
>>
>>
>> Runs well for a while and then bombs out with the following output and
>> error 127
>>
>>
>>
>> (3) generate word alignment @ Wed Dec 2 01:56:06 GMT 2015
>> Combining forward and inverted alignment from files:
>>
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/giza.it-en/it-en.A3.final.{bz2,gz}
>>
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/giza.en-it/en-it.A3.final.{bz2,gz}
>> Executing: mkdir -p
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/model
>> Executing:
>> /media/bigdata/jcread/3rd_party_software/mosesdecoder/scripts/training/giza2bal.pl
>> -d "gzip -cd
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/giza.en-it/en-it.A3.final.gz"
>> -i "gzip -cd
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/giza.it-en/it-en.A3.final.gz"
>> |/media/bigdata/jcread/3rd_party_software/mosesdecoder/scripts/../bin/symal
>> -alignment="grow" -diagonal="yes" -final="yes" -both="yes" >
>> /media/bigdata/jcread/llv/data/europarlv7/prealigned/tokenized_truecased_cleaned/1-0010/00001000/phrase_table/model/aligned.grow-diag-final-and
>> sh: 1:
>> /media/bigdata/jcread/3rd_party_software/mosesdecoder/scripts/../bin/symal:
>> not found
>> Exit code: 127
>> ERROR: Can't generate symmetrized alignment file
>>
>>
>> It seems this problem with the script has been encountered before:
>>
>>
>> http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
>>
>>
>> I'm not sure I understand the accepted solution.
>>
>>
>> "Use absolute paths to all the scripts, and make sure your parallel files
>> have the same names but the extension"
>>
>>
>> The command I issued uses only absolute paths. Is this referring to
>> modifications in the training script itself?
>>
>>
>> James
>>
>>
>>

------------------------------

Message: 4
Date: Fri, 4 Dec 2015 11:28:00 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] continue partial translation
To: He He <hhe.xiy@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDCbJvdP1V9MMr7yTTFh6DYSMivwsF0ko-aWktXTdMu9Fg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

interesting... this may be due to the maximum phrase length (the XML
specified translation is treated as a phrase translation), which is 20
by default.

You can tell the decoder otherwise with the switch -max-phrase-length.

I'd be interested to know, if this fixes the problem.
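
A minimal Python sketch for checking whether an XML-specified span runs past that limit. It counts whitespace-separated tokens in both the wrapped source span and the `translation` attribute, because the thread does not say which side the limit applies to. The regular expression only handles the simple `<tag translation="...">...</tag>` form shown in the quoted message, and the function names are hypothetical.

```python
import re

# Matches <tag translation="...">source span</tag>; a simplification that
# ignores any other attributes.
XML_SPAN = re.compile(r'<(\w+)\s+translation="([^"]*)"\s*>(.*?)</\1>', re.DOTALL)


def span_lengths(markup):
    """Return (source_tokens, translation_tokens) for each marked-up span."""
    return [
        (len(match.group(3).split()), len(match.group(2).split()))
        for match in XML_SPAN.finditer(markup)
    ]


def over_limit(markup, limit=20):
    """Return the spans whose source or translation exceeds the limit."""
    return [pair for pair in span_lengths(markup) if max(pair) > limit]
```

Applied to the prefix quoted below, the translation attribute alone comes to 27 tokens, which is above the default of 20.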

-phi

On Wed, Dec 2, 2015 at 9:30 PM, He He <hhe.xiy@gmail.com> wrote:
> Hi,
>
> Yes. The input to the decoder is "-v 0 -threads 4 -n-best-list - 10
> --print-alignment-info-in-n-best -xml-input exclusive".
>
> If I break the long translation into parts it works though.
>
> He
>
> On Wed, Dec 2, 2015 at 6:14 PM, Philipp Koehn <phi@jhu.edu> wrote:
>>
>> Hi,
>>
>> it's not clear to me what you are exactly specifying to the decoder,
>> but what you intend to do should work.
>>
>> Did you use the switch "-xml-input exclusive"?
>> What exactly do you specify as input?
>>
>> -phi
>>
>>
>>
>>
>>
>> On Tue, Nov 24, 2015 at 10:19 PM, He He <hhe.xiy@gmail.com> wrote:
>> > Hi there,
>> >
>> > I'm trying to do translation conditioned on some already translated
>> > prefix
>> > (essentially what -continue-partial-translation was supposed to do). I'm
>> > using -xml-input exclusive to pass in the prefix source and translation.
>> >
>> > However, when the prefix becomes long, this doesn't work, e.g.
>> > <p translation="Britain 's trade house E D & F Man said on the amount of
>> > money in eastern europe , sugar beet output both Ukraine and Russia in">
>> > ??
>> > ? ED & F ?? ? ? ?? ? , 96 / 97 ?? ? ?? ? ??? ?? ? , ????? ???? ? ?? ??
>> > ???</p> ?? ? ?? ? ?? ? ? , ??? ?? ? ??> 0 ||| ?? ? ED & F?? ? ED & F ??
>> > ? ?
>> > ?? ? / 96 , 97 ?? ? ?? ? ??? ?? ? ????? , ? ??? ? ?? ?? ??? substantial
>> > decline was expeted to be tough ||| LexicalReordering0= -4.48185
>> > -7.01678
>> > -1.48808 -4.3759 -6.89465-0.942918 Distortion0= -12 LM0= -227.918
>> > WordPenalty0= -40 PhrasePenalty0= 36 TransltionModel0= -7.25771 -34.5474
>> > -2.80336 -22.3651 ||| -3322.64"
>> >
>> > It just copies the source prefix. I suspect it's because many words now
>> > become UNK, since entries in the phrase table that overlap the prefix
>> > are ignored.
>> >
>> > Is there a way around this? Thanks a lot in advance!
>> >
>> > Best,
>> > He
>> >

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 110, Issue 12
**********************************************
