Moses-support Digest, Vol 97, Issue 49

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Regarding Factored Model (Mukund Roy)
2. normalization issue in tokenization of Kannada words in
baseline MT (shiva kumar)
3. Re: How should I properly change the moses.ini file for
tuning if I did not prepare an arpa file (and do we need an arpa
file)? (Barry Haddow)
4. Re: How should I properly change the moses.ini file for
tuning if I did not prepare an arpa file (and do we need an arpa
file)? (Daniel Seita)

----------------------------------------------------------------------

Message: 1
Date: Tue, 18 Nov 2014 17:00:36 +0530
From: Mukund Roy <mukundkumarroy@cdac.in>
Subject: [Moses-support] Regarding Factored Model
To: moses-support@mit.edu
Message-ID:
<CAF22fzg4H3r=FbHN7peSbxM38y1Lp5HsRpU=p8sYmgFNEtUuCA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Sir

I used the command below to build a factored model:

$MOSES_HOME/scripts/training/train-model.perl \
  -root-dir $WORKING_DIR/train \
  -corpus $WORKING_DIR/Train.true.clean -f $slang -e $tlang \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  --lm 2:3:$WORKING_DIR/lm/lm-corpus.blm.POS.$tlang:0 \
  --alignment-factors 0-0 --translation-factors 0-0,2 \
  --reordering-factors 0-0 --decoding-steps t0

I have a factored corpus with two factors: lemma and POS. The baseline
phrase-based model produced a BLEU score of around 27, but with the above
command for the factored model the BLEU score dropped to 3.5.

@Hoang: Sir, as you asked, I am attaching the ini file and sample inputs
and outputs of the baseline phrase-based model and the factored model.

Thanks & Regards
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Mail-moses.tar.gz
Type: application/x-gzip
Size: 3651 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20141118/efc6361b/attachment-0001.bin

------------------------------

Message: 2
Date: Tue, 18 Nov 2014 00:58:00 -0800
From: shiva kumar <shivadvg19@yahoo.com>
Subject: [Moses-support] normalization issue in tokenization of
Kannada words in baseline MT
To: moses-support@mit.edu
Message-ID:
<1416301080.63798.YahooMailBasic@web162302.mail.bf1.yahoo.com>
Content-Type: text/plain; charset=us-ascii

Hi,
I am working on a baseline SMT system with Moses for Kannada-English MT. In the tokenization step, the Unicode characters of the input Kannada words get their Unicode references added to them because of glyph substitution.

Because of this I am not able to get good translations. If I give the tokenized sentences as input to the decoder, I get correct translations.

How can I solve this problem?

I am using Ubuntu 12.04 and Moses.

regards,
ShivaKumar KM
Asst.Professor,
Amrita VishwaVidyaPeetham Mysore Campus
Bogadi 2nd stage
Mysore
9611913393

------------------------------

Message: 3
Date: Tue, 18 Nov 2014 15:25:45 +0000
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] How should I properly change the
moses.ini file for tuning if I did not prepare an arpa file (and do we
need an arpa file)?
To: Daniel Seita <takeshidanny@gmail.com>, Moses support
<Moses-support@mit.edu>
Message-ID: <546B64F9.9020703@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi Daniel

On 18/11/14 15:14, Daniel Seita wrote:
> Thanks for the response Barry. I'm still confused after reading your
> suggestions so perhaps you or someone can clarify when you have time?
>
> (1) I think that the /tuning/ step requires /both/ the arpa and the
> binarized files, right? While the /training/ step only requires the
> binarized version? I haven't reached the testing step yet.

The training step doesn't actually use the LM, it just inserts the path
into the moses.ini file. The tuning step can use either an arpa or a
binarised file (not both) but using a binarised file will take up less RAM.
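
Since either kind of file can be given to the tuning step, it can help to check which kind a given LM file actually is before pointing moses.ini at it. The following is a minimal sketch, assuming an ARPA file has a line reading exactly `\data\` within its first 20 lines; the function name `lm_kind` is hypothetical and anything without that marker is simply reported as not ARPA.

```shell
#!/bin/sh
# Guess whether a language model file is ARPA text or something else.
# Assumption: an ARPA file has a line that is exactly "\data\" near the top.
lm_kind() {
    file="$1"
    # Read gzipped files through gzip, plain files through cat.
    case "$file" in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac
    # -a: treat binary input as text; -x: whole-line match; -F: fixed string.
    if $reader "$file" 2>/dev/null | head -n 20 | grep -a -q -x -F '\data\'; then
        echo "arpa"
    else
        echo "binary-or-unknown"
    fi
}
```

For example, `lm_kind ~/lm/news-commentary-v8.fr-en.blm.en` would be expected to report `binary-or-unknown` for a binarised file.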

>
> (2) OK, so as you mention, the baseline instructions assume we use
> IRSTLM to create the arpa file, then use KenLM to binarize it. Under
> the "Language Model Training" section, there are six boxes that have
> command line instructions (the last one is querying the language
> model). I assume this means you /only/ want us to execute the commands
> in the first and fifth boxes?

Yes, you should only run the first and the fifth. The others are options
which imho confuse the reader.

>
> (3) Is it possible to get the entire training, tuning, and testing
> steps done /without/ an arpa file? This might help avoid my problems
> because I don't think I have a problem getting my binarized IRSTLM
> files. The instructions, as you say, do not explain how to configure
> Moses to do that (and we do this by changing the moses.ini file, right?).

You need a language model file for tuning and testing, but if you
directly build an IRSTLM binarised file, then you don't need an ARPA
file. You do need to make changes to moses.ini (as compared to the
baseline instructions) and at the moment I can't lay my hands on the
correct arguments.
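
One possible shape of that moses.ini change is sketched below. It is not confirmed anywhere in this thread: it assumes the decoder was compiled with IRSTLM support and that an `IRSTLM` feature line accepts the same `name`, `factor`, `path` and `order` keys as the `KENLM` line quoted further down, without the KenLM-specific `lazyken` key. The function name `kenlm_to_irstlm` is hypothetical.

```shell
#!/bin/sh
# Print a copy of a moses.ini in which KENLM feature lines name IRSTLM instead.
# Assumptions (unverified here): Moses was built with IRSTLM support, and the
# IRSTLM line takes the same keys as the KENLM line, minus "lazyken".
kenlm_to_irstlm() {
    # $1 = path to the input moses.ini; the result goes to stdout.
    # First drop the lazyken key on KENLM lines, then rename the feature.
    sed -e '/^KENLM /s/ lazyken=[0-9][0-9]*//' \
        -e 's/^KENLM /IRSTLM /' "$1"
}
```

Writing the result to a new file (for example `kenlm_to_irstlm train/model/moses.ini > moses.irstlm.ini`) keeps the generated moses.ini untouched in case the assumption about the key names turns out to be wrong.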

>
> I'm going to check the IRSTLM documentation because in the version I
> have (5.80.06) both "--text yes" and "--text" fail and create the
> exact error "DEBUG: warning too many arguments" that we see in the
> mailing list discussion that we both linked to. Also, running that
> perl script (to do "steps 1-5") to get the LM also fails (that command
> itself doesn't fail; it causes problems later in the sequence), and
> using the EMS fails on the tuning step, I assume because of the same
> issues above, but that's a story for another day.
>

That's all a bit strange. The "official" IRSTLM argument is "--text=yes"
so that should work. The other methods you mention should also work.

cheers - Barry

> Thanks,
> Daniel
>
>
> On Tue, Nov 18, 2014 at 1:30 AM, Barry Haddow
> <bhaddow@staffmail.ed.ac.uk> wrote:
>
> Hi Daniel
>
> I looked at the baseline system instructions, and they are a bit
> confusing around the LM building. They explain how to use IRSTLM
> to binarise a language model, but do not say how to configure
> Moses to load an IRSTLM-binarised model.
>
> In fact, when I wrote the original baseline system manual, I
> assumed that you would build an ARPA file with IRSTLM (since KENLM
> didn't do estimation then, and SRILM wasn't open-source), and then
> binarise with KENLM and use it at runtime.
>
> Now, however, KENLM does estimation, and creates ARPA files. This
> could be one solution to your problem:
> http://kheafield.com/code/kenlm/estimation/
>
> If you want to build an ARPA file with IRSTLM, then this is
> definitely possible, but as noted here
> http://comments.gmane.org/gmane.comp.nlp.moses.user/9924
> there is some uncertainty over the arguments. I assume this is a
> versioning issue, but the bottom line is that either "--text yes"
> or "--text" should work. When I originally wrote the baseline
> instructions, the arguments I gave worked with the version of
> IRSTLM I installed.
>
> Hope that helps,
>
> cheers
> Barry
>
> On 17/11/14 16:54, Daniel Seita wrote:
>
> Hello everyone,
>
> I am struggling to follow the baseline instructions. I am
> using a Mac OS X 10.9 with boost 1.57, irstlm 5.80.06, and the
> latest moses/mgiza version from github. I ran training
> successfully using this command
>
> nohup nice ~/mosesdecoder/scripts/training/train-model.perl
> -root-dir train -corpus
> ~/corpus/news-commentary-v8.fr-en.clean -f fr -e en -alignment
> grow-diag-final-and -reordering msd-bidirectional-fe -lm
> 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 -mgiza
> -mgiza-cpus 8 -external-bin-dir
> ~/mosesdecoder/word_align_tools/ >&training.out &
>
> Notice that I'm using mgiza (which is different from what's
> listed on the baseline), and that my word_align_tools contains
> the mgiza binaries and merge_align.py. Also notice that I'm
> using the "blm.en" language model file. This is what is listed
> on the baseline instructions, so I assumed this is correct.
> Unfortunately, tuning fails. I can successfully download the
> data and run scripts on it, but the major tuning command fails:
>
> nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl
> ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en
> ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir
> ~/mosesdecoder/bin/ &> mert.out --decoder-flags="-threads 8" &
>
> My ~/working/mert.out file says at the end:
>
> "This looks like an IRSTLM binary file. Did you forget to
> pass --text yes to compile-lm? Byte: 40"
>
> I'm confused because /the baseline instructions imply that we
> want an IRSTLM binary file/. I have attached my
> ~/working/train/model/moses.ini file that was generated from
> training, if it helps. I suspect the line to change is:
>
> KENLM lazyken=0 name=LM0 factor=0
> path=/Users/danielseita/lm/news-commentary-v8.fr-en.blm.en order=3
>
> However, changing KENLM to IRSTLM did not work, and I'm not
> sure what to do with "lazyken".
>
> The one other problem I think I might have is that I failed to
> create the "arpa" file according to the baseline, but I
> thought that was okay because we wouldn't need it.
> Specifically, I ran into the problem listed in this mailing list:
>
> http://comments.gmane.org/gmane.comp.nlp.moses.user/9924
>
> But following the suggestion of just using "text" or omitting
> "text" did not work. I'm using IRSTLM 5.80.06 instead of the
> 5.80.03 that's assumed in the baseline, so that might change
> stuff (installing 5.80.03 fails on my computer due to some
> esoteric errors that don't appear on Google searching). And in
> any case, I'm not sure I even need the arpa file because that
> seems to be /unbinarized/, so why would we want it? I followed
> the command under the section "/You can directly create an
> IRSTLM binary LM (for faster loading in Moses) by replacing
> the last command with the following:/" and used that /instead/
> of this command:
>
> ~/irstlm/bin/compile-lm \
> --text yes \
> news-commentary-v8.fr-en.lm.en.gz \
> news-commentary-v8.fr-en.arpa.en
>
> Because the above command did not work due to DEBUG: too many
> arguments.
>
> So to summarize...
>
> (1) I think I can fix my issue by figuring out how to fix the
> moses.ini file to refer to IRSTLM, but I'm confused about why
> I'd need to do that since the baseline instructions assume
> that we're using IRSTLM, right?
>
> (2) How can I get irstlm's compile-lm to work to create the
> .arpa file, because it seems like it's needed after all?
>
> I know this seems like a lot so if you can address even part
> of my questions that would be great.
>
> Thanks,
> Daniel Seita
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

------------------------------

Message: 4
Date: Tue, 18 Nov 2014 08:26:54 -0800
From: Daniel Seita <takeshidanny@gmail.com>
Subject: Re: [Moses-support] How should I properly change the
moses.ini file for tuning if I did not prepare an arpa file (and do we
need an arpa file)?
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Cc: Moses support <moses-support@mit.edu>
Message-ID:
<CAKUmyF7FYns6OtkQPDAq9En00GqguUuxnsBtkQPDuO12q1xpeA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Putting in the equal sign appeared to do the trick. So --text=yes works but
not --text yes.
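
For anyone hitting the same error, here is a small sketch of the working form of the call, using the file names from the command quoted earlier in the thread. It only prints the command so it can be inspected first; `arpa_cmd` and the `IRSTLM_BIN` variable are hypothetical names, and the default directory is just the one used in this thread.

```shell
#!/bin/sh
# Print (not run) the compile-lm call with the option syntax that worked here.
# IRSTLM_BIN is a hypothetical override for the IRSTLM bin directory.
arpa_cmd() {
    in="$1"    # LM file produced by the IRSTLM build step
    out="$2"   # ARPA file to create
    printf '%s --text=yes %s %s\n' \
        "${IRSTLM_BIN:-$HOME/irstlm/bin}/compile-lm" "$in" "$out"
}

arpa_cmd news-commentary-v8.fr-en.lm.en.gz news-commentary-v8.fr-en.arpa.en
```

The only difference from the baseline instructions is `--text=yes` in place of `--text yes`.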

(PS: Sorry for emailing this directly to you Barry, I meant to respond to
the whole mailing list so everyone could know.)

Thanks,
Daniel


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 97, Issue 49
*********************************************
