Moses-support Digest, Vol 97, Issue 55

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. How much time will tokenizing take? (Asad A.Malik)
2. Re: Incremental training (Raj Dabre)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 Nov 2014 15:09:00 +0000 (UTC)
From: "Asad A.Malik" <asad_12204@yahoo.com>
Subject: [Moses-support] How much time will tokenizing take?
To: Moses-support <moses-support@mit.edu>
Message-ID:
<1745529079.2587804.1416409740988.JavaMail.yahoo@jws106138.mail.bf1.yahoo.com>

Content-Type: text/plain; charset="utf-8"

Hi All,

I am tokenizing my corpus and have entered the command, but it is taking too long. How much time should it take?
P.S. The corpus is the same, maybe around 1000 sentences, and it is in French.

Kind Regards,

Mr. Asad Abdul Malik

------------------------------

Message: 2
Date: Thu, 20 Nov 2014 00:18:04 +0900
From: Raj Dabre <prajdabre@gmail.com>
Subject: Re: [Moses-support] Incremental training
To: Sandipan Dandapat <sandipandandapat@gmail.com>
Cc: moses-support <moses-support@mit.edu>, "i.ramadan@saudisoft.com"
<i.ramadan@saudisoft.com>
Message-ID:
<CAB3gfjBBpWF7EJz9kEKAsMXrLHPjs0YB1-uVwc0_arPEdOnLgA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hey,

I am pretty sure that my script does not generate duplicate token IDs.

In fact, I used to get the same error until I modified the script.

If you want to avoid this error without using my script, then:

1. Open the original Python script: plain2snt-hasvcb.py
2. Find the line that increments the ID counter by 1 (the line is
nid = len(fvcb)+1;)
3. Change this line to: nid = len(fvcb)+2; (This is because the ID numbering
is offset by 1, so if you have 23 tokens the IDs run from 2 to 24. The
original script computes nid = 23 + 1 = 24, which is already taken, while
the modified line correctly gives 25.) The same change is needed in a second
place: nid = len(evcb)+2;

Do this and it will work.
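
If you want to check a .vcb file by hand, here is a rough sketch of the arithmetic above. It assumes every line has the form "<id> <token> <count>", as in the "24 roi 2" line from the error message. The function names (parse_vcb, find_duplicate_ids, next_free_id) are made up for illustration and are not part of plain2snt-hasvcb.py or MGIZA++. Instead of len(...)+2 it takes the largest existing ID plus 1, which gives the same number when the IDs run without gaps from 2.

```python
def parse_vcb(lines):
    """Parse '<id> <token> <count>' lines into (id, token, count) tuples.

    The three-column layout is an assumption based on the error output.
    """
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0]), parts[1], int(parts[2])))
    return entries


def find_duplicate_ids(entries):
    """Return {id: [tokens]} for every ID that appears on more than one line."""
    seen = {}
    for token_id, token, _count in entries:
        seen.setdefault(token_id, []).append(token)
    return {i: toks for i, toks in seen.items() if len(toks) > 1}


def next_free_id(entries, first_id=2):
    """Next unused ID: one past the largest existing ID (first_id if empty)."""
    if not entries:
        return first_id
    return max(e[0] for e in entries) + 1


if __name__ == "__main__":
    # 23 tokens with IDs 2..24, as in the example above.
    old_vocab = parse_vcb(["%d w%d 1" % (i, i) for i in range(2, 25)])
    print("len+1:", len(old_vocab) + 1)        # collides with the last ID
    print("len+2:", len(old_vocab) + 2)        # first free ID
    print("max+1:", next_free_id(old_vocab))   # same value, computed from IDs

    # A file where two tokens share an ID is reported like this.
    broken = parse_vcb(["24 roi 2", "24 reine 1", "25 x 1"])
    print(find_duplicate_ids(broken))
```

To run it on a real file, pass open("new_corpus/inc.fr.vcb", encoding="utf-8") to parse_vcb in place of the list; an empty result from find_duplicate_ids means no ID is reused.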

In any case, send me a zip file of your working directory (if it's small; you
are testing on small data, right?). I will see what the problem is.

On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
sandipandandapat@gmail.com> wrote:

> Dear Raj,
> I also tried to use your scripts for incremental alignment. I copied your
> Python script into the desired directory, but I am still receiving the same
> error that Ihab posted:
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 24 roi 2
> TOKEN ID 24 has already been assigned to: roi
>
> I took only 500 sentence pairs for full_train.sh and it worked fine, with
> 758 lines in the corpus/tgt_filename.vcb file.
>
> I took only 10 sentences for incremental alignment_new.sh, which generated
> the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
> Is there a problem? Can you please help me with this?
>
> Thanks and regards,
> sandipan
>
>
> On 4 November 2014 16:13, prajdabre <prajdabre@gmail.com> wrote:
>
>> Dear Ihab.
>> There is a python script that was there in the google drive folder in the
>> first mail I sent you.
>> Please replace the existing file with my copy.
>>
>> It has to work.
>>
>> Regards.
>>
>>
>> Sent from Samsung Mobile
>>
>>
>>
>> -------- Original message --------
>> From: Ihab Ramadan <i.ramadan@saudisoft.com>
>> Date: 05/11/2014 00:54 (GMT+09:00)
>> To: 'Raj Dabre' <prajdabre@gmail.com>
>> Cc: moses-support@mit.edu
>> Subject: RE: [Moses-support] Incremental training
>>
>>
>> Dear Raj,
>>
>> Your point is clear, and I tried to follow the steps you mentioned, but I am
>> now stuck at the align_new.sh script, which gives me this error:
>>
>> reading vocabulary files
>>
>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>
>> ERROR: TOKEN ID must be unique for each token, in line :
>>
>> 29107 q-1 4
>>
>> Do you have any idea what this error means?
>>
>>
>>
>> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>> *To:* i.ramadan@saudisoft.com
>> *Cc:* moses-support@mit.edu
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Dear Ihab,
>>
>> Perhaps I should have mentioned much more clearly what my script does.
>> Sorry for that.
>>
>> Let me start with this: There is no direct/easy way to generate the
>> moses.ini file as you need.
>>
>> 1. Suppose you have 2 million lines of parallel corpora and you trained an
>> SMT system on them. This naturally gives you the phrase table, reordering
>> table and moses.ini.
>>
>> 2. Suppose you then get 500k more lines of parallel corpora. There are 2
>> ways to proceed:
>>
>> a. Retrain on all 2.5 million lines from scratch (takes a lot of time:
>> ~2-3 days on a regular machine).
>>
>> b. Train on only the 500k new lines, using the alignment information
>> from the original training data (faster: ~6-7 hours).
>>
>>
>>
>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>> TABLES.*
>>
>> 1. full_train.sh: trains on the original corpus of 2 million lines
>> (generates alignment files only for the original corpus).
>>
>> 2. align_new.sh: trains on the new corpus of 500k lines (generates
>> alignment files only for the new corpus, using the alignments from 1).
>>
>>
>>
>> *Why this split?* Because the basic training step of Moses does not
>> preserve the alignment probability information. Only the alignments are
>> saved. To continue training we need the probability information.
>>
>> You can pass a flag to Moses to preserve this information (the flag is
>> --giza-option). If you do this, you will not need full_train.sh, but you
>> will have to change the config files before using align_new.sh.
>>
>> *HOW TO GET UPDATED PHRASE TABLE:*
>>
>> 1. Append the forward alignments (fwd) generated by align_new.sh to the
>> forward (fwd) alignments generated by full_train.sh.
>> 2. Append the inverse alignments (inv) generated by align_new.sh to the
>> inverse (inv) alignments generated by full_train.sh.
>>
>> 3. Run the Moses training script with additional flags:
>>
>> - --first-step -- first step in the training process (default 1). This
>> will be 4.
>> - --last-step -- last step in the training process (default 7). This
>> will remain 7.
>> - --giza-f2e -- <path to folder>/new_giza.fwd
>> - --giza-e2f -- <path to folder>/new_giza.inv
>> For example:
>>
>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>> -corpus <your new corpus name> \
>> -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>> -lm 0:3:<path to LM>:8 \
>> --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
>> -external-bin-dir <path to giza++ binaries>
>>
>> For more details on the training step read this:
>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>
>> This assumes that you already have the alignments, continues with phrase
>> extraction and reordering, and generates the new moses.ini file.
>>
>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>
>>
>>
>> If you are still unclear then please ask and I will try to help you as
>> much as I can.
>>
>> Regards.
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
>> wrote:
>>
>> Dear Raj,
>>
>> That's great work, my friend.
>>
>> These files make the script work, but it takes a long time to finish, and it
>> did not generate the model folder which contains the moses.ini file.
>>
>> Is this normal?
>>
>> I am now trying to run it again, as I suspect that the server was shut down
>> before the training completed, but I notice that it starts from the
>> beginning and does not use the existing generated files.
>>
>> Thanks Raj, it is still great work.
>>
>>
>>
>>
>>
>> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>
>>
>> *To:* i.ramadan@saudisoft.com
>> *Cc:* moses-support@mit.edu
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Ahh.... i totally forgot that part.
>>
>> Sorry.
>>
>> PFA.
>>
>> Just place them in the folder where the shell scripts full_train.sh and
>> align_new.sh are.
>>
>> Hopefully it should run now.
>>
>> Please let me know if you succeed.
>>
>>
>>
>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
>> wrote:
>>
>> Dear Raj,
>>
>> It is a great solution.
>>
>> I installed MGIZA++ successfully and I am using your scripts to run
>> training.
>>
>> I followed the steps you mentioned, but I faced this error when I was
>> running the full_train.sh script:
>>
>>
>>
>> bla bla bla
>>
>> .
>>
>> .
>>
>> .
>>
>> .
>>
>>
>>
>> Starting MGIZA
>>
>> Initializing Global Paras
>>
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>
>> ERROR: Cannot open configuration file configgiza.fwd!
>>
>> Starting MGIZA
>>
>> Initializing Global Paras
>>
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>
>> ERROR: Cannot open configuration file configgiza.rev!
>>
>>
>>
>>
>>
>> These two files do not exist.
>>
>> Should they be generated by the installation?
>>
>> How do I get them?
>>
>>
>>
>> *From:* Raj Dabre [mailto:prajdabre@gmail.com]
>> *Sent:* Sunday, October 26, 2014 6:21 PM
>> *To:* i.ramadan@saudisoft.com
>> *Cc:* moses-support@mit.edu
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Hello Ihab,
>>
>> I would suggest using mgiza++.
>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>
>> It is very easy to use.
>>
>> I also wrote some scripts to make it easy for training.
>> Visit the link below for my scripts.
>>
>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>
>> Usage:
>>
>> To train basic IBM models:
>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> To align 2 new files using previously trained models (aka continue
>> training).
>>
>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name>
>> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base>
>> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> There is also a python script which you had better replace in the scripts
>> folder of mgiza++. I have modified it to work with my scripts.
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <i.ramadan@saudisoft.com>
>> wrote:
>>
>> Dear All,
>>
>> I just need clear steps on how to do incremental training in Moses, as
>> the explanation in the manual is not clear enough.
>>
>> Thanks
>>
>>
>>
>> Best Regards
>>
>> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
>> +20233032036
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>
>> --
>>
>> Raj Dabre.
>> Research Student,
>>
>> Graduate School of Informatics,
>> Kyoto University.
>>
>> CSE MTech, IITB., 2011-2014
>>
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>

--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 97, Issue 55
*********************************************
