Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Multiple reference for tuning (Cheng Yong)
2. Re: Multiple reference for tuning (Lingxiao WANG)
3. Re: Multiple reference for tuning (Philipp Koehn)
4. Re: Incremental training (Ihab Ramadan)
----------------------------------------------------------------------
Message: 1
Date: Tue, 4 Nov 2014 10:15:21 +0000 (UTC)
From: Cheng Yong <chengyong3001@gmail.com>
Subject: [Moses-support] Multiple reference for tuning
To: moses-support@mit.edu
Message-ID: <loom.20141104T111132-571@post.gmane.org>
Content-Type: text/plain; charset=us-ascii
I want to use the Moses EMS to train a translation system.
For tuning I have multiple references, so I named the source set dev.zh and
the target sets dev.en0, dev.en1, and dev.en2.
I configured the tuning part of configure.basic like this:
#input-sgm = $wmt12-data/dev/newstest2010-src.$input-extension.sgm
raw-input = $wmt12-data/dev.$input-extension
#tokenized-input =
#factorized-input =
#input =
#
#reference-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
raw-reference = $wmt12-data/dev.$output-extension
#tokenized-reference =
#factorized-reference =
#reference =
But it doesn't work.
If I treat it as a single reference, it's OK.
How do I set up multiple tuning references?
------------------------------
Message: 2
Date: Tue, 4 Nov 2014 14:14:06 +0100
From: Lingxiao WANG <wanglingxiao0216@gmail.com>
Subject: Re: [Moses-support] Multiple reference for tuning
To: Cheng Yong <chengyong3001@gmail.com>
Cc: moses-support@mit.edu
Message-ID: <426869A1-4E5A-4785-A89F-CCE8F3A31974@gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
Try it like this:
raw-reference = "$wmt12-data/dev/reference1 $wmt12-data/dev/reference2 $wmt12-data/dev/reference3"
LX
------------------------------
Message: 3
Date: Tue, 4 Nov 2014 08:54:52 -0500
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Multiple reference for tuning
To: Cheng Yong <chengyong3001@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDAqnHO3fYYcQbTngPTa16rz9uSdQMUutuZwOES5T21fSg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hi,
you are naming the files correctly - you just have to add the line
multiref = yes
in the relevant [TUNING] or [EVALUATION:set] section.
-phi
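Before tuning, it can help to confirm that the numbered reference files are all present and line up with the source. The sketch below counts the references that share a prefix and checks that each has as many lines as the source file. It is only an illustration: the function name check_multiref is hypothetical, and dev.zh and dev.en0, dev.en1, dev.en2 are the file names from the question.

```shell
#!/usr/bin/env bash
# Sanity check for a multi-reference tuning set (hypothetical helper).
# Usage: check_multiref SOURCE REFERENCE_PREFIX
# With the prefix "dev.en" it looks for dev.en0, dev.en1, dev.en2, ...
check_multiref() {
  local src=$1 prefix=$2
  local src_lines ref_lines i=0 status=0
  # $(( ... )) strips the padding some wc implementations add.
  src_lines=$(( $(wc -l < "$src") ))
  # Walk the numbered references until the first missing index.
  while [ -f "${prefix}${i}" ]; do
    ref_lines=$(( $(wc -l < "${prefix}${i}") ))
    if [ "$ref_lines" -ne "$src_lines" ]; then
      echo "MISMATCH ${prefix}${i}: $ref_lines lines, source has $src_lines"
      status=1
    fi
    i=$((i + 1))
  done
  echo "references found: $i"
  # Succeed only if at least one reference exists and none mismatched.
  [ "$i" -gt 0 ] && [ "$status" -eq 0 ]
}
```

For the setup in the question this would be called as: check_multiref dev.zh dev.en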
------------------------------
Message: 4
Date: Tue, 4 Nov 2014 17:54:01 +0200
From: "Ihab Ramadan" <i.ramadan@saudisoft.com>
Subject: Re: [Moses-support] Incremental training
To: "'Raj Dabre'" <prajdabre@gmail.com>
Cc: moses-support@mit.edu
Message-ID: <00d501cff847$8eb70800$ac251800$@saudisoft.com>
Content-Type: text/plain; charset="utf-8"
Dear Raj,
Your point is clear, and I tried to follow the steps you mentioned, but I am now stuck at the align_new.sh script, which gives me this error:
reading vocabulary files
Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
ERROR: TOKEN ID must be unique for each token, in line :
29107 q-1 4
Do you have any idea what this error means?
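Read literally, the message says that some token ID in the .vcb file is not unique. One way to look for that, sketched below, is to scan the vocabulary file for repeated IDs or repeated tokens. This assumes each line has the whitespace-separated form "ID TOKEN COUNT", as in the line quoted in the error. The function name find_vcb_duplicates is hypothetical, and this is a diagnostic aid rather than a confirmed explanation of the error.

```shell
#!/usr/bin/env bash
# Report lines of a vocabulary file whose ID or token has been seen before.
# Assumes whitespace-separated "ID TOKEN COUNT" on every line (an assumption).
# Usage: find_vcb_duplicates FILE.vcb
find_vcb_duplicates() {
  awk '
    ($1 in id_line)    { printf "duplicate id %s: lines %d and %d\n", $1, id_line[$1], NR }
    ($2 in token_line) { printf "duplicate token %s: lines %d and %d\n", $2, token_line[$2], NR }
    # Remember the most recent line on which each ID and token appeared.
    { id_line[$1] = NR; token_line[$2] = NR }
  ' "$1"
}
```

Called as find_vcb_duplicates new_corpus/TraningTarget.txt.vcb, it prints nothing when every ID and token occurs once.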
From: Raj Dabre [mailto:prajdabre@gmail.com]
Sent: Tuesday, November 4, 2014 12:06 PM
To: i.ramadan@saudisoft.com
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Incremental training
Dear Ihab,
Perhaps I should have mentioned much more clearly what my script does. Sorry for that.
Let me start with this: There is no direct/easy way to generate the moses.ini file as you need.
1. Suppose you have 2 million lines of parallel corpora and you trained an SMT system on them. This naturally gives you the phrase table, the reordering table, and moses.ini.
2. Suppose you then get 500k more lines of parallel corpora. There are 2 ways to proceed:
a. Retrain on all 2.5 million lines from scratch (takes a lot of time: ~2-3 days on a regular machine).
b. Train on only the 500k new lines, using the alignment information of the original training data (faster: ~6-7 hours).
What my scripts do: THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.
1. full_train.sh: trains on the original corpus of 2 million lines (generates alignment files only for the original corpus).
2. align_new.sh: trains on the new corpus of 500k lines (generates alignment files only for the new corpus, using the alignments from 1).
Why this split? Because the basic training step of Moses does not preserve the alignment probability information; only the alignments are saved. To continue training we need the probability information.
You can pass a flag to Moses to preserve this information (the flag is --giza-option). If you do this, you will not need full_train.sh, but you will have to change the config files before using align_new.sh.
HOW TO GET UPDATED PHRASE TABLE:
1. Append the forward alignments (fwd) generated by align_new.sh to the forward (fwd) alignments generated by full_train.sh.
2. Append the inverse alignments (inv) generated by align_new.sh to the inverse (inv) alignments generated by full_train.sh.
3. Run the moses training script with additional flags:
* --first-step: first step in the training process (default 1). This will be 4.
* --last-step: last step in the training process (default 7). This will remain 7.
* --giza-f2e: <path to folder>/new_giza.fwd
* --giza-e2f: <path to folder>/new_giza.inv
For example:
~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
-corpus <your new corpus name> \
-f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:3:<path to LM>:8 \
--first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
-external-bin-dir <path to giza++ binaries>
For more details on the training step read this: http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
This assumes that you already have the alignments; it continues with phrase extraction and reordering and generates the new moses.ini file.
WARNING: Specify the filenames and paths properly OR IT WILL FAIL.
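Steps 1 and 2 above amount to concatenating files, with the old alignments first. A minimal sketch, assuming plain-text alignment files: the function name append_alignments and the input file names in the usage note are hypothetical, while new_giza.fwd and new_giza.inv are the names used in the command above.

```shell
#!/usr/bin/env bash
# Append new alignments to old ones, old first (hypothetical helper).
# Usage: append_alignments OLD NEW OUT
append_alignments() {
  local old=$1 new=$2 out=$3
  cat "$old" "$new" > "$out" || return 1
  # Confirm the output has as many newline-terminated lines as both inputs.
  local expected actual
  expected=$(( $(wc -l < "$old") + $(wc -l < "$new") ))
  actual=$(( $(wc -l < "$out") ))
  [ "$actual" -eq "$expected" ]
}
```

With hypothetical input names this would be run once per direction, for example append_alignments full_train.fwd align_new.fwd new_giza.fwd and then append_alignments full_train.inv align_new.inv new_giza.inv. Presumably the corpus passed to -corpus should list the old sentences first and the new ones after them, in the same order, so that the alignments line up with it.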
If you are still unclear then please ask and I will try to help you as much as I can.
Regards.
On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <i.ramadan@saudisoft.com> wrote:
Dear Raj,
That's great work, my friend.
These files make the script work, but it takes a long time to finish, and it did not generate the model folder that contains the moses.ini file.
Is this normal?
I am now trying to run it again, as I suspect the server was shut down before the training completed, but I notice that it starts from the beginning and does not use the files already generated.
Thanks, Raj, it is still great work.
From: Raj Dabre [mailto:prajdabre@gmail.com]
Sent: Thursday, October 30, 2014 4:54 PM
To: i.ramadan@saudisoft.com
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Incremental training
Ahh.... i totally forgot that part.
Sorry.
PFA.
Just place them in the folder where the shell scripts full_train.sh and align_new.sh are.
Hopefully it should run now.
Please let me know if you succeed.
On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <i.ramadan@saudisoft.com> wrote:
Dear Raj,
It is a great solution.
I installed MGIZA++ successfully and I am using your scripts to run the training.
I followed the steps you mentioned, but I got this error when running the full_train.sh script:
[...]
Starting MGIZA
Initializing Global Paras
DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
ERROR: Cannot open configuration file configgiza.fwd!
Starting MGIZA
Initializing Global Paras
DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
ERROR: Cannot open configuration file configgiza.rev!
These two files do not exist.
Should they be generated by the installation?
How do I get them?
From: Raj Dabre [mailto:prajdabre@gmail.com]
Sent: Sunday, October 26, 2014 6:21 PM
To: i.ramadan@saudisoft.com
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Incremental training
Hello Ihab,
I would suggest using mgiza++. http://www.kyloo.net/software/doku.php/mgiza:overview
It is very easy to use.
I also wrote some scripts to make it easy for training.
Visit the link below for my scripts.
https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
Usage:
To train basic IBM models:
bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
To align 2 new files using previously trained models (that is, to continue training):
bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
There is also a Python script that you should use to replace the one in the scripts folder of mgiza++. I have modified it to work with my scripts.
Hope this helps.
On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <i.ramadan@saudisoft.com> wrote:
Dear All,
I just need clear steps on how to do incremental training in Moses, as the explanation in the manual is not clear enough.
Thanks
Best Regards
Ihab Ramadan | Senior Developer | Saudisoft - Egypt (http://www.saudisoft.com/) | Tel +2 02 330 320 37 Ext 0 | Mob +201007570826 | Fax +20233032036
--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
------------------------------
End of Moses-support Digest, Vol 97, Issue 5
********************************************