Moses-support Digest, Vol 82, Issue 47

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Error when attempting to translate: fails with
   "StrayFactorException" (Stefan Dumitrescu)


----------------------------------------------------------------------

Message: 1
Date: Wed, 28 Aug 2013 14:55:18 +0300
From: Stefan Dumitrescu <dumitrescu.stefan@gmail.com>
Subject: Re: [Moses-support] Error when attempting to translate: fails
	with "StrayFactorException"
To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <521DE526.8090204@gmail.com>
Content-Type: text/plain; charset="utf-8"

OK, so why doesn't it run on the latest test, where:

PhraseDictionaryMemory name=TranslationModel0 table-limit=20
num-features=4
path=/usr/local/trans/work/ted/m6/model/phrase-table.0,2-0.gz
input-factor=0,2 output-factor=0

Everything seems to be set up correctly.

Just for testing's sake, I have reduced the 5-factor input file to only
2 factors, 0 and 2. Now the translation is working!
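
(For reference, here is roughly how such a reduction can be done; just
a sketch with example file names, assuming factors are '|'-separated
within each space-separated token:)

# keep only factors 0 and 2 of a 5-factor input file (names are examples)
perl -ne 'print join(" ", map { my @f = split /\|/; "$f[0]|$f[2]" } split), "\n"' \
    < test.in.ro > test.in.0-2.ro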

That means I have to create an input file whose factors exactly match
those in the phrase table, for every factor combination I'm
thinking of using...
The thing is, I'm guessing this is not normal behavior (as I
successfully used factored translation before); I'm hoping this is an
accidental bug in Moses :)

Stefan

P.S. The translation worked in the sense that it did not crash with an
exception. However, it translated practically nothing, so something is
definitely fishy, but I fail to see what (if anything) I did wrong:

BEST TRANSLATION: "|"^DBLQ|UNK|UNK vino|veni^V2|UNK|UNK și|și^CR|UNK|UNK
dansează|dansa^V2|UNK|UNK cu|cu^S|UNK|UNK mine|mină^NSON|UNK|UNK
.|.^PERIOD|UNK|UNK "|"^DBLQ|UNK|UNK vă|tu^PPPA|UNK|UNK
mulțumesc|mulțumi^V3|UNK|UNK .|.^PERIOD|UNK|UNK [11111111111]
[total=-1805.431]
core=(-1100.000,-11.000,11.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,-1437.261)
Line 1597: Translation took 0.000 seconds total

I'm baffled that I now see 4 factors in the translation, and I have no
idea where they came from. The phrase table has only 2 factors, the same
as the input.

On 8/28/2013 2:12 PM, Hieu Hoang wrote:
>
>
>
> On 28 August 2013 12:00, Stefan Dumitrescu
> <dumitrescu.stefan@gmail.com> wrote:
>
> Hi Hieu,
>
> I have a single annotated training corpus from which I will build
> several models, single-factor and multiple-factor. I'm expecting
> that if I specify -translate-factors 0-0 I'll get a
> phrase-table.0-0.gz, and if in a later model I specify
> -translate-factors 0,1,4-0 I'll get factors 0, 1 and 4 in my phrase
> table, but using the same training data.
>
>
> Yes, having a single annotated training corpus is a good idea, and it
> should work. However, there might be a bug in the script or human
> error in how it was run.
>
> If the phrase-table entry is
> PhraseDictionaryMemory input-factor=0 output-factor=0 ...
> then the phrase table can only have 1 factor in both input & output.
> If it has more than 1, then something went wrong and you have to work
> backwards to debug it.
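>
> A quick way to check (just a sketch, not a Moses script; point it at
> your own table) is to count the '|' delimiters in the first source
> token of every rule; for a 1-factor table every line should report 0:
>
> # count factor delimiters in the first source token (path is an example)
> zcat model/phrase-table.0-0.gz | head -1000 \
>     | awk '{ n = gsub(/\|/, "|", $1); print n }' | sort | uniq -c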
>
>
> I have just trained a new model (to test the above) with the
> following script:
>
> /usr/local/trans/tools/moses/scripts/training/train-model.perl \
> --corpus /usr/local/trans/corpus/ted/train \
> --external-bin-dir=/usr/local/trans/tools/mgiza/bin \
> --parallel \
> --mgiza \
> --mgiza-cpus 8 \
> --f ro --e en \
> --lm 0:5:/usr/local/trans/corpus/tedlm/en.surface.5gram.kni.blm:8 \
> --root-dir /usr/local/trans/work/ted/m6 \
> --max-phrase-length 4 \
> --first-step $FIRSTSTEP \
> --translation-factors 0,2-0 \
> --alignment-factors 2-2 \
> --alignment grow-diag-final-and \
> --reordering-factors 2-2 \
> --reordering wbe-msd-bidirectional-fe
>
> It creates a PT like:
>
> sdumitrescu /usr/local/trans/work/ted/scripts > zcat
> ../m6/model/phrase-table.0,2-0.gz | head -2
> !|!^EXCL !|!^EXCL !|!^EXCL "|"^DBLQ ||| . " |||
> 0.000169635 2.32345e-08 1 0.262566 ||| 0-0 1-0 2-0 3-1 ||| 5895 1 1
> !|!^EXCL !|!^EXCL !|!^EXCL pe|pe^S ||| ! ! ! ||| 0.5 0.0202071
> 1 0.190109 ||| 0-0 2-0 0-1 1-1 1-2 ||| 2 1 1
>
> So far, so good. moses.ini looks like this (it automatically filled
> in the input factors, though I am not using factor #1 anywhere):
>
> [input-factors]
> 0
> 1
> 2
>
> # mapping steps
> [mapping]
> 0 T 0
>
> [distortion-limit]
> 6
>
> # feature functions
> [feature]
> UnknownWordPenalty
> WordPenalty
> PhrasePenalty
> PhraseDictionaryMemory name=TranslationModel0 table-limit=20
> num-features=4
> path=/usr/local/trans/work/ted/m6/model/phrase-table.0,2-0.gz
> input-factor=0,2 output-factor=0
> LexicalReordering name=LexicalReordering0 num-features=6
> type=wbe-msd-bidirectional-fe-allff input-factor=2 output-factor=2
> path=/usr/local/trans/work/ted/m6/model/reordering-table.2-2.wbe-msd-bidirectional-fe.gz
>
>
> Distortion
> KENLM lazyken=0 name=LM0 factor=0
> path=/usr/local/trans/corpus/tedlm/en.surface.5gram.kni.blm order=5
>
> # dense weights for feature functions
> [weight]
> UnknownWordPenalty0= 1
> WordPenalty0= -1
> PhrasePenalty0= 0.2
> TranslationModel0= 0.2 0.2 0.2 0.2
> LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
> Distortion0= 0.3
> LM0= 0.5
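>
> (For completeness, this is roughly how I run the decoder against this
> ini; a sketch only, assuming the moses binary is on PATH and using
> example input/output file names:)
>
> # decode with the generated config (file names are examples)
> moses -f /usr/local/trans/work/ted/m6/model/moses.ini \
>     < test.in.ro > test.out.en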
>
>
>
> Input is: (5 factors, same as training data)
> sdumitrescu /usr/local/trans/work/ted/scripts > cat
> ../../../corpus/ted/test.in.ro | head -1
> Robert|robert|robert^NP|NP|Np Gupta|gupta|gupta^NP|NP|Np
> ,|,|,^COMMA|COMMA|COMMA
> violonist|violonist|violonist^NSN|NSN|Nc-s-ny la|la|la^S|S|Spca
> Orchestra|orchestra|orchestra^NP|NP|Np
> Filarmonică|filarmonică|filarmonică^NP|NP|Np din|din|din^S|S|Spca
> Los|los|los^NP|NP|Np Angeles|angelege|angelege^V3|V3|Vmii3p
>
> Error looks like:
>
> Exception: moses/Word.cpp:109 in void
> Moses::Word::CreateFromString(Moses::FactorDirection, const
> std::vector<long unsigned int>&, const StringPiece&, bool) threw
> StrayFactorException because `fit'.
> You have configured 3 factors but the word
> Robert|robert|robert^NP|NP|Np contains factor delimiter | too many
> times.
>
> From all the experiments so far I can only deduce one thing: I
> have to create as many input files as different model types I
> build, just to match the factors in the phrase table? Wasn't the
> default behavior of Moses to allow any number of factors in the
> input file and pick the ones it needs at each
> translation/generation/reordering step?
>
> Over the past year I have tried a significant number of factored
> models (different combinations), and I just used a single input file
> that contained all the factors, as I am doing now, without any Moses
> exceptions. For the example above to work, I'm guessing I have to
> recreate the test.in.ro file with only factors 0 and 2?
>
> Thanks,
> Stefan
>
>
> On 8/28/2013 11:32 AM, Hieu Hoang wrote:
>> Do you want to have multiple factors in your phrase-table?
>>
>> The training command doesn't specify any factors. The ini file
>> says your phrase-table has only 1 factor for both input and
>> output. However, your translation rules contain 10 factors!
>>
>>
>> On 28/08/2013 08:59, Stefan Dumitrescu wrote:
>>> Hi Hieu,
>>>
>>> The training and test data are correctly processed: first
>>> tokenized (with Moses' script), then truecased, then annotated.
>>>
>>> I have trained a surface model on the unannotated (unfactored)
>>> data and everything runs smoothly. However, when I use an
>>> annotated corpus (correctly annotated, each token becomes 5
>>> factors) as well as an annotated input, I get this exception.
>>>
>>> I tried recompiling Moses with -max-factors 10; no change.
>>>
>>> I played with the -input-factors switch for the decoder; now I
>>> am getting this:
>>>
>>> .... (i cut the first part) ...
>>> line=KENLM lazyken=0 name=LM0 factor=0
>>> path=/usr/local/trans/corpus/tedlm/en.surface.5gram.kni.blm order=5
>>> FeatureFunction: LM0 start: 14 end: 14
>>> Loading table into memory...done.
>>> Start loading text SCFG phrase table. Moses format : [63.000] seconds
>>> Reading /usr/local/trans/work/ted/m4/model/phrase-table.0-0.gz
>>> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>>> ****************************************************************************************************
>>> Exception: bitset::set
>>>
>>> It is a bit frustrating because I have used factored models
>>> several times in the past year without any issues.
>>>
>>> For my model m1, I did not specify any -translation-factors in
>>> the training phase, and I got a phrase-table.gz which contained
>>> the five factors together, as in:
>>>
>>> sdumitrescu /usr/local/trans/work/ted/scripts > zcat
>>> ../m1/model/phrase-table.gz | head -2
>>> !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>> pe|pe|pe^S|S|Spca ||| !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>> !|!|!^EXCL|EXCL|EXCL ||| 0.5 0.018702 1
>>> 0.18952 ||| 0-0 1-0 0-1 2-2 ||| 2 1 1
>>> !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>> ||| !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>> !|!|!^EXCL|EXCL|EXCL "|"|"^DBLQ|DBLQ|DBLQ ||| 1 0.291644 0.5
>>> 0.000225623 ||| 0-0 1-1 2-1 1-2 2-3 ||| 1 2 1
>>>
>>> For model m3, for example, trained with:
>>>
>>> ...(cut)...
>>> --root-dir /usr/local/trans/work/ted/m3 \
>>> --max-phrase-length 4 \
>>> --first-step $FIRSTSTEP \
>>> --alignment-factors 2-2 \
>>> --alignment grow-diag-final-and \
>>> --reordering-factors 2-2 \
>>> --reordering wbe-msd-bidirectional-fe
>>>
>>> I'm getting a phrase-table.0-0.gz:
>>>
>>> sdumitrescu /usr/local/trans/work/ted/scripts > zcat
>>> ../m3/model/phrase-table.0-0.gz | head -2
>>> ! ! ! " ||| . " ||| 0.000169635 2.32345e-08 1 0.262566 ||| 0-0
>>> 1-0 2-0 3-1 ||| 5895 1 1
>>> ! ! ! pe ||| ! ! ! ||| 0.5 0.0202114 1 0.190109 ||| 0-0 2-0 0-1
>>> 1-1 1-2 ||| 2 1 1
>>>
>>> Neither one works with an annotated input file like:
>>>
>>> I|i|i^NN|NN|Nc actually|actually|actually^ADVE|ADVE|Rmp
>>> am|be|be^VERB1|VERB1|Vmip1s .|.|.^PERIOD|PERIOD|PERIOD
>>>
>>> I'm getting the StrayFactorException when not specifying
>>> any -input-factors (default 0), or "Exception: bitset::set" when
>>> setting anything else.
>>>
>>> Thanks for your help,
>>> Stefan
>>>
>>> On 8/27/2013 6:07 PM, Hieu Hoang wrote:
>>>> Did you escape your training and input data? There must not be |
>>>> characters in your data unless you are using factored models.
>>>>
>>>> The Moses tokenizer script does this, as does the dedicated escaping script:
>>>> scripts/tokenizer/tokenizer.perl
>>>> scripts/tokenizer/escape-special-chars.perl
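>>>>
>>>> (Both are used as stdin/stdout filters; a typical invocation might
>>>> look like this sketch, with example file names:)
>>>>
>>>> # tokenize and escape raw text (file names are examples)
>>>> scripts/tokenizer/tokenizer.perl -l ro < corpus.raw.ro > corpus.tok.ro
>>>> scripts/tokenizer/escape-special-chars.perl < corpus.tok.ro > corpus.esc.ro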
>>>>
>>>> On 26/08/2013 15:23, Stefan Dumitrescu wrote:
>>>>> Hi!
>>>>>
>>>>> I have the following error when attempting to translate:
>>>>>
>>>>> Exception: moses/Word.cpp:109 in void
>>>>> Moses::Word::CreateFromString(Moses::FactorDirection, const
>>>>> std::vector<long unsigned int>&, const StringPiece&, bool) threw
>>>>> StrayFactorException because `fit'.
>>>>> You have configured 1 factors but the word !|!|!^EXCL|EXCL|EXCL contains
>>>>> factor delimiter | too many times.
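>>>>>
>>>>> (A quick sanity check, sketched with an example file name: every
>>>>> token should have exactly as many '|'-separated fields as factors
>>>>> configured, so this should print a single number per file:)
>>>>>
>>>>> # print the distinct per-token factor counts of the first line
>>>>> head -1 train.ro | tr ' ' '\n' | awk -F'|' '{ print NF }' | sort -u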
>>>>>
>>>>> I have the following training script:
>>>>>
>>>>> /usr/local/trans/tools/moses/scripts/training/train-model.perl \
>>>>> --corpus /usr/local/trans/corpus/ted/train \
>>>>> --external-bin-dir=/usr/local/trans/tools/mgiza/bin \
>>>>> --parallel \
>>>>> --mgiza \
>>>>> --mgiza-cpus 8 \
>>>>> --f ro --e en \
>>>>> --lm 0:5:/usr/local/trans/corpus/tedlm/en.surface.5gram.kni.blm:8 \
>>>>> --root-dir /usr/local/trans/work/ted/m1 \
>>>>> --max-phrase-length 4 \
>>>>> --first-step $FIRSTSTEP \
>>>>> --translation-factors 0-0 \
>>>>> --alignment grow-diag-final-and \
>>>>> --reordering wbe-msd-bidirectional-fe
>>>>>
>>>>> The train files are factored (5 factors: word, lemma, lemma^postag1,
>>>>> postag1, postag2). The training process works without any errors; it
>>>>> generates a valid phrase table that looks like:
>>>>>
>>>>> zcat ../m1/model/phrase-table.gz | head -1
>>>>> !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>>>> pe|pe|pe^S|S|Spca ||| !|!|!^EXCL|EXCL|EXCL !|!|!^EXCL|EXCL|EXCL
>>>>> !|!|!^EXCL|EXCL|EXCL ||| 0.5 0.0187005 1 0.18952 ||| 0-0 1-0 0-1 2-2 |||
>>>>> 2 1 1
>>>>>
>>>>> I did not get this error a couple of months ago when working on another
>>>>> experiment. I'm guessing something changed in Moses and I am missing
>>>>> some required flag in my scripts? I am using scripts that have worked OK
>>>>> so far.
>>>>> I looked through the manual, and I tried using the -input-factors
>>>>> option, but I still receive the same error. What am I doing wrong? It is
>>>>> most likely something trivial, but I do appreciate your help with it.
>>>>>
>>>>> Thank you,
>>>>> Stefan
>>>>>
>>>>> (moses.ini below:)
>>>>> #########################
>>>>> ### MOSES CONFIG FILE ###
>>>>> #########################
>>>>>
>>>>> # input factors
>>>>> [input-factors]
>>>>> 0
>>>>>
>>>>> # mapping steps
>>>>> [mapping]
>>>>> 0 T 0
>>>>>
>>>>> [distortion-limit]
>>>>> 6
>>>>>
>>>>> # feature functions
>>>>> [feature]
>>>>> UnknownWordPenalty
>>>>> WordPenalty
>>>>> PhrasePenalty
>>>>> PhraseDictionaryMemory name=TranslationModel0 table-limit=20
>>>>> num-features=4 path=/usr/local/trans/work/ted/m1/model/phrase-table.gz
>>>>> input-factor=0 output-factor=0
>>>>> LexicalReordering name=LexicalReordering0 num-features=6
>>>>> type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
>>>>> path=/usr/local/trans/work/ted/m1/model/reordering-table.wbe-msd-bidirectional-fe.gz
>>>>> Distortion
>>>>> KENLM lazyken=0 name=LM0 factor=0
>>>>> path=/usr/local/trans/corpus/tedlm/en.surface.5gram.kni.blm order=5
>>>>>
>>>>> # dense weights for feature functions
>>>>> [weight]
>>>>> UnknownWordPenalty0= 1
>>>>> WordPenalty0= -1
>>>>> PhrasePenalty0= 0.2
>>>>> TranslationModel0= 0.2 0.2 0.2 0.2
>>>>> LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
>>>>> Distortion0= 0.3
>>>>> LM0= 0.5
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>
>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20130828/b0468ba9/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 82, Issue 47
*********************************************
