Moses-support Digest, Vol 97, Issue 54

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: placeholders for numbers - extract step (Vito Mandorino)
2. Re: Incremental training (Sandipan Dandapat)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 Nov 2014 15:41:45 +0100
From: Vito Mandorino <vito.mandorino@linguacustodia.com>
Subject: Re: [Moses-support] placeholders for numbers - extract step
To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CA+8mSmFLv5im=RV_VN5xBqxWvxs9r4XP2xHGX4B_QR5088-pug@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thank you Hieu, that worked very well. I am now tackling the decoding part
and I have two questions.

1) Sometimes, I get the following error message during decoding:

terminate called after throwing an instance of 'util::Exception'
what(): moses-cmd/IOWrapper.cpp:213 in std::map<long unsigned int,
const Moses::Factor*> MosesCmd::GetPlaceholders(const
Moses::Hypothesis&, Moses::FactorType) threw util::Exception because
`targetPos.size() != 1'.
Placeholder should be aligned to 1, and only 1, word
Aborted

I don't understand why. I checked the phrase-table and I didn't find
phrase pairs where the '@num@' token is aligned to 2 or more words.
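
One way to double-check the phrase table is a small script along these lines. This is only a minimal sketch and not part of Moses: it assumes the usual plain-text phrase-table layout (source ||| target ||| scores ||| alignment ||| ...) with zero-based src-tgt alignment points, and the function name misaligned_placeholders is made up for illustration. Since the failing condition in the message is `targetPos.size() != 1', the sketch also reports placeholders aligned to zero target words, not only to two or more.

```python
import io

PLACEHOLDER = "@num@"  # placeholder token used in this thread


def misaligned_placeholders(lines, placeholder=PLACEHOLDER):
    """Yield (line_number, source, target, n_links) for every phrase pair
    whose source-side placeholder is not aligned to exactly one target word.

    Assumed line layout: source ||| target ||| scores ||| alignment ||| ...
    """
    for line_no, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 4:
            continue  # no alignment field on this line
        source_tokens = fields[0].split()
        # Alignment points are "srcIndex-tgtIndex" pairs, zero-based.
        links = [tuple(map(int, point.split("-"))) for point in fields[3].split()]
        for src_idx, token in enumerate(source_tokens):
            if token != placeholder:
                continue
            targets = {tgt for src, tgt in links if src == src_idx}
            if len(targets) != 1:
                yield line_no, fields[0], fields[1], len(targets)


if __name__ == "__main__":
    # Two invented entries: the second leaves '@num@' unaligned.
    sample = io.StringIO(
        "@num@ euros ||| @num@ euros ||| 0.5 0.5 0.5 0.5 ||| 0-0 1-1 ||| 2 2 2\n"
        "@num@ euros ||| euros ||| 0.1 0.1 0.1 0.1 ||| 1-0 ||| 2 2 1\n"
    )
    for hit in misaligned_placeholders(sample):
        print(hit)
```

For a gzipped phrase table, the same function can be fed from gzip.open(path, "rt", encoding="utf-8") instead of the in-memory sample.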

2) This may be related to the first question. If I run the decoder to
translate the input using the suggested command

./moses -placeholder-factor 1 -xml-input exclusive

I get the '@num@' string in the output and not the expected number. I
do get the number if I use the option '-placeholder-factor 0'. The
model that I am using is a phrase-based, non-factored model.

Vito

2014-11-19 10:32 GMT+01:00 Hieu Hoang <Hieu.Hoang@ed.ac.uk>:

> hi vito
>
> On 18 November 2014 11:30, Vito Mandorino <
> vito.mandorino@linguacustodia.com> wrote:
>
>> Hello everyone,
>>
>> I am trying to use placeholders for numbers in phrase-based MT, according
>> to http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc75
>>
>> The above page says
>>
>> ---
>>
>> During extraction, add the following to the extract command
>> (phrase-based only for now):
>>
>> ./extract --Placeholders @num@ ....
>>
>> --
>>
>> Does this mean that I have to first run train-model.perl with
>> --last-step=4, then the line above and then again train-model.perl with
>> --first-step=6?
>>
> when you run train-model.perl, add the argument
> -extract-options '--Placeholders @num@'
> You can see it in this script that the EMS creates
>
> http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3
>
>>
>> If this is the case, which arguments and options should I pass to extract
>> for a baseline training? I think the syntax is something like
>>
> The script will then call extract with the following argument
> --Placeholders @num@
> You can see it in the STDERR file of the above script
>
> http://www.statmt.org/moses/RELEASE-2.1/models/cs-en/steps/3/TRAINING_extract-phrases.3.STDERR
>
>>
>> syntax: extract en de align extract max-length [orientation [ --model
>> [wbe|phrase|hier]-[msd|mslr|mono] ] | --OnlyOutputSpanInfo | --NoTTable |
>> --GZOutput | --IncludeSentenceId | --SentenceOffset n | --InstanceWeights
>> filename ]
>>
>> In particular I cannot figure out what should be passed as 'align' and
>> 'extract' arguments.
>>
>>
>> Regards,
>>
>> Vito
>>
>> --
>>
>> *M**. Vito MANDORINO -- Chief Scientist*
>>
>>
>> *The Translation Trustee*
>>
>> *1, Place Charles de Gaulle, **78180 Montigny-le-Bretonneux*
>>
>> *Tel : +33 1 30 44 04 23 Mobile : +33 6 84 65 68 89*
>>
>> *Email :* *vito.mandorino@linguacustodia.com
>> <massinissa.ahmim@linguacustodia.com>*
>>
>> *Website :* *www.linguacustodia.com <http://www.linguacustodia.com/> -
>> www.thetranslationtrustee.com <http://www.thetranslationtrustee.com/>*
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>

--
*M**. Vito MANDORINO -- Chief Scientist*

*The Translation Trustee*

*1, Place Charles de Gaulle, **78180 Montigny-le-Bretonneux*

*Tel : +33 1 30 44 04 23 Mobile : +33 6 84 65 68 89*

*Email :* *vito.mandorino@linguacustodia.com
<massinissa.ahmim@linguacustodia.com>*

*Website :* *www.linguacustodia.com <http://www.linguacustodia.com/> -
www.thetranslationtrustee.com <http://www.thetranslationtrustee.com/>*

------------------------------

Message: 2
Date: Wed, 19 Nov 2014 14:44:36 +0000
From: Sandipan Dandapat <sandipandandapat@gmail.com>
Subject: Re: [Moses-support] Incremental training
To: prajdabre <prajdabre@gmail.com>
Cc: moses-support <moses-support@mit.edu>, "i.ramadan@saudisoft.com"
<i.ramadan@saudisoft.com>
Message-ID:
<CAGr2oZTQ89ivXJv+KXWJ0c4sUzZ93vVhNaeAfpOQmrAVT3_xvw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Raj,
I also tried to use your scripts for incremental alignment. I copied your
Python script into the desired directory, but I am still receiving the same
error as posted by Ihab:
reading vocabulary files
Reading vocabulary file from:new_corpus/inc.fr.vcb
ERROR: TOKEN ID must be unique for each token, in line :
24 roi 2
TOKEN ID 24 has already been assigned to: roi

I took only 500 sentence pairs for full_train.sh and it worked fine, with
758 lines in the corpus/tgt_filename.vcb file.

I took only 10 sentences for the incremental align_new.sh step, which
generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
Is there any problem? Could you please help me with this?
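
To see which entries trigger the error, the vocabulary file can be scanned for repeated IDs or tokens. The following is only a rough sketch, assuming one "id token count" entry per line (as in the "24 roi 2" line above); the helper name find_vcb_conflicts is hypothetical and not part of MGIZA++ or of Raj's scripts.

```python
from collections import defaultdict


def find_vcb_conflicts(lines):
    """Return (duplicate_ids, duplicate_tokens) for a vocabulary listing
    with one 'id token count' entry per line (assumed layout)."""
    tokens_by_id = defaultdict(list)
    ids_by_token = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token_id, token = int(parts[0]), parts[1]
        tokens_by_id[token_id].append(token)
        ids_by_token[token].append(token_id)
    # Keep only IDs and tokens that occur on more than one line.
    duplicate_ids = {i: toks for i, toks in tokens_by_id.items() if len(toks) > 1}
    duplicate_tokens = {t: ids for t, ids in ids_by_token.items() if len(ids) > 1}
    return duplicate_ids, duplicate_tokens


if __name__ == "__main__":
    # Invented sample in which the entry for 'roi' appears twice.
    sample = ["2 le 10", "24 roi 2", "25 reine 1", "24 roi 2"]
    duplicate_ids, duplicate_tokens = find_vcb_conflicts(sample)
    print("IDs on more than one line:", duplicate_ids)
    print("Tokens on more than one line:", duplicate_tokens)
```

On the real file, the lines can be read with open("new_corpus/new_tgt_file.vcb", encoding="utf-8") and passed to the same function.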

Thanks and regards,
sandipan

On 4 November 2014 16:13, prajdabre <prajdabre@gmail.com> wrote:

> Dear Ihab.
> There is a python script that was there in the google drive folder in the
> first mail I sent you.
> Please replace the existing file with my copy.
>
> It has to work.
>
> Regards.
>
>
> Sent from Samsung Mobile
>
>
>
> -------- Original message --------
> From: Ihab Ramadan <i.ramadan@saudisoft.com>
> Date: 05/11/2014 00:54 (GMT+09:00)
> To: 'Raj Dabre' <prajdabre@gmail.com>
> Cc: moses-support@mit.edu
> Subject: RE: [Moses-support] Incremental training
>
>
> Dear Raj,
>
> Your point is clear, and I tried to follow the steps you mentioned, but I
> am now stuck at the align_new.sh script, which gives me this error:
>
> reading vocabulary files
>
> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>
> ERROR: TOKEN ID must be unique for each token, in line :
>
> 29107 q-1 4
>
> Do you have any idea what this error means?
>
>
>
> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
> *Sent:* Tuesday, November 4, 2014 12:06 PM
> *To:* i.ramadan@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
>
>
> Dear Ihab,
>
> Perhaps I should have mentioned much more clearly what my script does.
> Sorry for that.
>
> Let me start with this: There is no direct/easy way to generate the
> moses.ini file as you need.
>
> 1. Suppose you have 2 million lines of parallel corpora and you trained a
> SMT system for it. This naturally gives the phrase table, reordering table
> and moses.ini.
>
> 2. Suppose you got 500 k more lines of parallel corpora.... there are 2
> ways:
>
> a. Retrain 2.5 million lines from scratch (will take lots of time: ~
> 2-3 days on a regular machine)
>
> b. Train on only the 500k new lines using the alignment information of
> the original training data. (Faster: ~ 6-7 hours).
>
>
>
> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*
>
> 1. full_train.sh -------------- This trains on the original corpus of 2
> million lines. (Generate alignment files only for the original corpus)
>
> 2. align_new.sh -------------- This trains on the new corpus of 500 k
> lines. (Generate alignment files only for the new corpus using the
> alignments for 1)
>
>
>
> *Why this split ????* Because the basic training step of Moses does not
> preserve the alignment probability information. Only the alignments are
> saved. To continue training we need the probability information.
>
> You can pass flags to Moses to preserve this information (this flag is
> --giza-option). If you do this, then you will not need full_train.sh, but
> you will have to change the config files before using align_new.sh.
>
> *HOW TO GET UPDATED PHRASE TABLE:*
>
> 1. Append the forward alignments (fwd) generated by align_new.sh to the
> forward (fwd) alignments generated by full_train.sh.
> 2. Append the inverse alignments (inv) generated by align_new.sh to the
> inverse (inv) alignments generated by full_train.sh.
>
> 3. Run the moses training script with additional flags:
>
> - --first-step -- first step in the training process (default
> 1)--------------- This will be 4
> - --last-step -- last step in the training process (default
> 7)------------ This will remain 7
> - --giza-f2e -- <path to folder>/new_giza.fwd
> - --giza-e2f -- <path to folder>/new_giza.inv
>
> For example:
>
> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
> -corpus <your new corpus name> \
> -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
> -lm 0:3:<path to LM>:8 \
> --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
> -external-bin-dir <path to giza++ binaries>
>
> For more details on the training step read this:
> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>
> This assumes that you already have the alignments and continues with phrase
> extraction and reordering, then generates the new moses.ini file.
>
> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>
>
>
> If you are still unclear then please ask and I will try to help you as
> much as I can.
>
> Regards.
>
>
>
>
>
>
>
> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
> wrote:
>
> Dear Raj,
>
> That's great work, my friend.
>
> These files make the script work, but it takes a long time to finish, and it
> did not generate the model folder which contains the moses.ini file.
>
> Is this normal?
>
> I am now trying to run it again, as I suspect that the server was shut down
> before the training was completed, but I notice that it starts from the
> beginning and does not use the existing files already generated.
>
> Thanks, Raj, it is still great work.
>
>
>
>
>
> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
> *Sent:* Thursday, October 30, 2014 4:54 PM
>
>
> *To:* i.ramadan@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
>
>
> Ahh.... i totally forgot that part.
>
> Sorry.
>
> PFA.
>
> Just place them in the folder where the shell scripts full_train.sh and
> align_new.sh are.
>
> Hopefully it should run now.
>
> Please let me know if you succeed.
>
>
>
> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
> wrote:
>
> Dear Raj,
>
> It is a great solution
>
> I installed MGIZA++ successfully and I am using your scripts to run
> training
>
> I followed the steps you mentioned, but I got this error when running
> the full_train.sh script:
>
>
>
> bla bla bla
>
> .
>
> .
>
> .
>
> .
>
>
>
> Starting MGIZA
>
> Initializing Global Paras
>
> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>
> ERROR: Cannot open configuration file configgiza.fwd!
>
> Starting MGIZA
>
> Initializing Global Paras
>
> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>
> ERROR: Cannot open configuration file configgiza.rev!
>
>
>
>
>
> These two files do not exist.
>
> Should they be generated by the installation?
>
> How do I get them?
>
>
>
> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
> *Sent:* Sunday, October 26, 2014 6:21 PM
> *To:* i.ramadan@saudisoft.com
> *Cc:* moses-support@mit.edu
> *Subject:* Re: [Moses-support] Incremental training
>
>
>
> Hello Ihab,
>
> I would suggest using mgiza++.
> http://www.kyloo.net/software/doku.php/mgiza:overview
>
> It is very easy to use.
>
> I also wrote some scripts to make it easy for training.
> Visit the link below for my scripts.
>
> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>
> Usage:
>
> To train basic IBM models:
> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>
> To align 2 new files using previously trained models (aka continue
> training).
>
> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name>
> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base>
> <corpus_folder_base> <path_to_mgizapp_installation>
>
> There is also a python script which you had better replace in the scripts
> folder of mgiza++. I have modified it to work with my scripts.
>
> Hope this helps.
>
>
>
>
>
> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
> wrote:
>
> Dear All,
>
> I just need clear steps on how to do incremental training in Moses, as
> the explanation in the manual is not clear enough.
>
> Thanks
>
>
>
> Best Regards
>
> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/> -
> Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
> +20233032036
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
>
> --
>
> Raj Dabre.
> Research Student,
>
> Graduate School of Informatics,
> Kyoto University.
>
> CSE MTech, IITB., 2011-2014
>
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 97, Issue 54
*********************************************
