Moses-support Digest, Vol 115, Issue 7

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Data for building a factored model (Philipp Koehn)
2. Re: (no subject) (Sanjanashree Palanivel)


----------------------------------------------------------------------

Message: 1
Date: Thu, 5 May 2016 18:08:03 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Data for building a factored model
To: Sašo Kuntaric <saso.kuntaric@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDA709AyhG_+gNGzWfYMJO4y27xVykm8tMrkzJUHWCPaFg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

Life is easier with factored models if you use the experiment.perl set-up,
where you just have to specify the factor configuration and the scripts that
generate the factors.

These scripts take the tokenized text and replace each word with a factor
(e.g., its POS tag).

The POS LM is trained on such a corpus - each word is replaced by a
POS tag, and then the standard LM training process is run over it.
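
A minimal sketch of such a factor script (in Python, using NLTK's
pos_tag; the actual EMS factor scripts are Perl, but the idea is the
same): it reads tokenized sentences on stdin and writes one POS tag per
word.

    #!/usr/bin/env python
    # Sketch of a factor-generation script: tokenized text in,
    # POS-tag sequence out. Assumes NLTK with the averaged-perceptron
    # tagger data installed (nltk.download('averaged_perceptron_tagger')).
    import sys
    import nltk

    for line in sys.stdin:
        words = line.split()
        # pos_tag returns (word, tag) pairs; keep only the tags, so
        # each word is replaced by its POS factor.
        print(" ".join(tag for _, tag in nltk.pos_tag(words)))

Run over the target side of the corpus, the output of such a script is
exactly what the POS LM is trained on.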

See $MOSES/scripts/ems/example/config.factored for an example.

-phi

On Wed, May 4, 2016 at 3:30 PM, Sašo Kuntaric <saso.kuntaric@gmail.com> wrote:
> Hello again,
>
> I believe I can wrap my head around the theoretical part, but the English
> and German corpora in the Moses factored model tutorial
> (http://www.statmt.org/moses/?n=Moses.FactoredTutorial) look beautifully
> factored, so my question is how were the original corpora processed? Was a
> specific tagger used and was there any manual/script postprocessing done?
>
> And since I am already bugging everyone, how is the language model pos.lm
> created? Is it extracted from a file, created manually or in another way?
>
> Thank you in advance for all the replies.
>
> Best regards,
>
> Sašo
>
> 2016-05-02 19:45 GMT+02:00 Marwa Refaie <basmallah@hotmail.com>:
>>
>> The corpus for the translation model should be two parallel files, one
>> per language, with each word in the format word|pos|lemma. You can
>> prepare the files using WordNet, the Stanford tools, or any tagger and
>> stemmer that can deal with your language pair. Before feeding the files
>> to Moses you may have to adjust them with a Python script (write it
>> yourself; see the sketch below).
>>
>> For the language model you build a corpus of POS sequences, e.g.
>> Verb Noun Noun
>> Noun Det Adj
>> for the target language only, and then train a standard n-gram LM over it.
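>>
>> A minimal sketch of such a script (in Python; it assumes your tagger
>> already produced three token-aligned files per language, one sentence
>> per line, holding words, POS tags, and lemmas; the file names are
>> illustrative):
>>
>>     #!/usr/bin/env python
>>     # Sketch: merge token-aligned words, POS tags, and lemmas into
>>     # Moses factored format, word|pos|lemma, one sentence per line.
>>     import sys
>>
>>     words_f, pos_f, lemma_f = sys.argv[1:4]
>>     with open(words_f) as wf, open(pos_f) as pf, open(lemma_f) as lf:
>>         for w, p, l in zip(wf, pf, lf):
>>             tokens = zip(w.split(), p.split(), l.split())
>>             print(" ".join("|".join(t) for t in tokens))
>>
>> Run it once per language, e.g. python make_factored.py corpus.words
>> corpus.pos corpus.lemma > corpus.factored, and extract the POS column
>> of the target side on its own to train the n-gram LM.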
>>
>> Sent from my iPad
>>
>> > On May 2, 2016, at 10:11, Sašo Kuntaric <saso.kuntaric@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I am having some issues producing the corpora in the correct format for
>> > Moses to execute factored training.
>> >
>> > I am looking at the factored tutorial on the Moses website and I am
>> > wondering how to get such consistent corpora for two languages. What tools
>> > are being used, and can they be trained for specific languages (Slovenian in
>> > my case)? Are such tools available for download, or is such data produced
>> > with custom scripts?
>> >
>> > --
>> > Best regards,
>> >
>> > Sašo
>> > _______________________________________________
>> > Moses-support mailing list
>> > Moses-support@mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
>
> --
> lp,
>
> Sašo
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



------------------------------

Message: 2
Date: Fri, 6 May 2016 18:30:52 +0530
From: Sanjanashree Palanivel <sanjanashree@gmail.com>
Subject: Re: [Moses-support] (no subject)
To: Nadir Durrani <nadir.durrani@nu.edu.pk>, moses-support@mit.edu
Message-ID:
<CAAc_kp5DqccJkW+5EnDyCToRnRjSf3kD_K-p7HdPLCfqgaYqcg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Why do I get "Use of uninitialized value in string eq at
/home/mosesdecoder/scripts/Transliteration/clean.pl line 139, <$IN> line
1." while training the transliteration model? What is wrong?

On Fri, May 6, 2016 at 4:20 PM, Sanjanashree Palanivel <
sanjanashree@gmail.com> wrote:

> I installed mgiza and copied its binaries into the same folder as the
> GIZA++ binaries, together with the merge_alignment.py file, but I still
> get an error. In this case the error states:
>
> Training Transliteration Module - Start
> Fri May 6 16:16:04 IST 2016
> Creating Model
> Extracting 1-1 Alignments
> Cleaning the list for Miner
> Source is Latin
> will run Transliteration module
> Three preprocessing steps to do:
> 1) Delete Symbol 2) Delete Latin from non-Latin langauge 3)
> Character Frequency based filtering
> STARTING 1 and 2 ...
> Use of uninitialized value in string eq at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 139, <$IN> line 1.
> Use of uninitialized value $wrds[1] in numeric lt (<) at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 143, <$IN> line 1.
> Use of uninitialized value $retur in numeric eq (==) at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 61, <$IN> line 1.
> DONE 1 and 2
> STARTING 3) Preprocessing for Character filtering...
> Use of uninitialized value $keys[0] in hash element at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 197.
> Use of uninitialized value $bestsrcfreq in multiplication (*) at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 198.
> Use of uninitialized value $keys[0] in hash element at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 227.
> Use of uninitialized value $besttrgfreq in multiplication (*) at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/clean.pl
> line 228.
> DONE 3
> Extracting Transliteration Pairs
> Constructing Graph
> Computing Probs : iteration 1
> Computing Probs : iteration 2
> Computing Probs : iteration 3
> Computing Probs : iteration 4
> Computing Probs : iteration 5
> Computing Probs : iteration 6
> Computing Probs : iteration 7
> Computing Probs : iteration 8
> Computing Probs : iteration 9
> Computing Probs : iteration 10
> Finished...
> Selecting Transliteration Pairs with threshold 0.5
> Name "main::hash" used only once: possible typo at
> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/
> threshold.pl line 26.
> Preparing Corpus
> Align Corpus
> Using SCRIPTS_ROOTDIR: /home/sanjana/Documents/SMT/mosesdecoder/scripts
> Using multi-thread GIZA
> using gzip
> (1) preparing corpus @ Fri May 6 16:16:05 IST 2016
> Executing: mkdir -p
> /home/sanjana/Documents/SMT/Transliteration/training/prepared
> (1.0) selecting factors @ Fri May 6 16:16:05 IST 2016
> (1.1) running mkcls @ Fri May 6 16:16:05 IST 2016
> /home/sanjana/Documents/SMT/mosesdecoder/tools/mkcls -c50 -n2
> -p/home/sanjana/Documents/SMT/Transliteration/training/corpus.en
> -V/home/sanjana/Documents/SMT/Transliteration/training/prepared/en.vcb.classes
> opt
> Executing: /home/sanjana/Documents/SMT/mosesdecoder/tools/mkcls -c50 -n2
> -p/home/sanjana/Documents/SMT/Transliteration/training/corpus.en
> -V/home/sanjana/Documents/SMT/Transliteration/training/prepared/en.vcb.classes
> opt
> ERROR: Execution of: /home/sanjana/Documents/SMT/mosesdecoder/tools/mkcls
> -c50 -n2 -p/home/sanjana/Documents/SMT/Transliteration/training/corpus.en
> -V/home/sanjana/Documents/SMT/Transliteration/training/prepared/en.vcb.classes
> opt
>
>
> On Fri, May 6, 2016 at 4:09 PM, Nadir Durrani <nadir.durrani@nu.edu.pk>
> wrote:
>
>> You need to check if you have mgiza and its required components in the
>> external bin directory. Here's the git
>>
>> https://github.com/moses-smt/mgiza
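>>
>> A quick check sketch (in Python; the exact set of components can vary
>> by version, so this list is an assumption based on a typical mgiza
>> setup for Moses):
>>
>>     #!/usr/bin/env python
>>     # Sanity check: are the usual mgiza components present in the
>>     # external bin directory? Pass the directory as the first argument.
>>     import os
>>     import sys
>>
>>     bin_dir = sys.argv[1]
>>     for name in ["mgiza", "mkcls", "snt2cooc", "merge_alignment.py"]:
>>         path = os.path.join(bin_dir, name)
>>         print(("OK      " if os.path.exists(path) else "MISSING ") + path)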
>>
>> Have you ever trained a Moses SMT system? Here are the instructions.
>>
>> http://www.statmt.org/moses/?n=Development.GetStarted
>>
>>
>>
>> On Fri, May 6, 2016 at 11:36 AM, Sanjanashree Palanivel <
>> sanjanashree@gmail.com> wrote:
>>
>>> Dear Nadir,
>>>
>>> How should the input be given to train the transliteration module? Is
>>> a raw parallel corpus enough?
>>>
>>> When I try running this script:
>>>
>>> /home/sanjana/Documents/SMT/mosesdecoder/scripts/Transliteration/
>>>> train-transliteration-module.pl --corpus-f DATA/ICON15/H_train.en
>>>> --corpus-e DATA/ICON15/H_train.hi --alignment
>>>> /home/sanjana/Documents/SMT/ICON15/Health/BL/En_H/model/aligned.grow-diag-final-and
>>>> --moses-src-dir /home/sanjana/Documents/SMT/mosesdecoder --external-bin-dir
>>>> /home/sanjana/Documents/SMT/mosesdecoder/tools --input-extension en
>>>> --output-extension hi --srilm-dir
>>>> /home/sanjana/Documents/SMT/srilm-1.7.1/bin/i686-m64 --out-dir
>>>> /home/sanjana/Documents/SMT/Transliteration
>>>>
>>>
>>> But I guess GIZA is not running, because I do not find any
>>> GIZA-related folders.
>>>
>>> I understand that the transliteration scripts work fine, but why am I
>>> unable to train models?
>>>
>>> What mistake am I making?
>>>
>>> SRILM was installed correctly; when I checked it with ngram-count, it
>>> worked fine.
>>>
>>> Why has an error mentioning multi-threaded GIZA occurred (I didn't
>>> install mgiza)? Do I have to install mgiza?
>>>
>>> Please guide me; I do not understand why it is not working.
>>>
>>>
>>> On Fri, May 6, 2016 at 7:08 AM, Sanjanashree Palanivel <
>>> sanjanashree@gmail.com> wrote:
>>>
>>>> Dear Nadir,
>>>> Thanks a lot... I will work on what you have said and update
>>>> you on what happens.
>>>> On May 6, 2016 4:36 AM, "Nadir Durrani" <nadir.durrani@nu.edu.pk>
>>>> wrote:
>>>>
>>>>>
>>>>> I can only ensure that there is no bug in the scripts. You will need to
>>>>> debug and troubleshoot the problem. The files I sent you should be helpful.
>>>>> Here are the steps:
>>>>>
>>>>> Mining
>>>>>
>>>>> 1. Extract 1-1 alignments from parallel data; compare the "1-1.en-hi" file
>>>>> with mine (a rough sketch of this step follows the list)
>>>>> 2. Clean the list and make it ready for the miner; compare 1-1.en-hi.cleaned
>>>>> with mine
>>>>> 3. Run TMining to extract transliteration pairs;
>>>>> compare 1-1.en-hi.pair-probs with mine
>>>>> 4. Run threshold.pl to extract the transliteration corpus;
>>>>> compare 1-1.en-hi.mined-pairs with mine
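>>>>>
>>>>> A rough sketch of step 1 (in Python; the real script lives under
>>>>> $MOSES/scripts/Transliteration, and the file names here are
>>>>> illustrative). It keeps only word pairs whose source and target
>>>>> positions each carry exactly one alignment link:
>>>>>
>>>>>     #!/usr/bin/env python
>>>>>     # Sketch of 1-1 alignment extraction: read source corpus,
>>>>>     # target corpus, and "i-j" word alignments in parallel; print
>>>>>     # word pairs where both sides have exactly one link.
>>>>>     import sys
>>>>>     from collections import Counter
>>>>>
>>>>>     src_f, trg_f, aln_f = sys.argv[1:4]
>>>>>     with open(src_f) as sf, open(trg_f) as tf, open(aln_f) as af:
>>>>>         for s, t, a in zip(sf, tf, af):
>>>>>             src, trg = s.split(), t.split()
>>>>>             links = [tuple(map(int, p.split("-"))) for p in a.split()]
>>>>>             s_deg = Counter(i for i, _ in links)
>>>>>             t_deg = Counter(j for _, j in links)
>>>>>             for i, j in links:
>>>>>                 if s_deg[i] == 1 and t_deg[j] == 1:
>>>>>                     print(src[i], trg[j])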
>>>>>
>>>>> Transliteration Model
>>>>>
>>>>> 1. Running GIZA on the corpus: you should see the giza and
>>>>> giza-inverse folders inside training, plus model/aligned.grow-diag-final-and
>>>>> 2. Model training: you should see the following files inside the
>>>>> model folder:
>>>>>
>>>>> extract.inv.sorted.gz extract.sorted.gz lex.e2f lex.f2e moses.ini
>>>>> phrase-table.gz
>>>>>
>>>>> and targetLM.bin inside lm folder
>>>>>
>>>>> 3. Tune the system; the tuning folder should have the following files:
>>>>>
>>>>> filtered input moses.filtered.ini moses.ini moses.tuned.ini
>>>>> reference tmp
>>>>>
>>>>> moses.ini is the final file that is created. If you open it, you will
>>>>> see the BLEU scores for the tuning set (if it ran properly).
>>>>>
>>>>> Just make sure that your Moses is compiled correctly and works properly. If
>>>>> things still don't work, try pulling a new version and recompiling from
>>>>> scratch.
>>>>>
>>>>> Good luck
>>>>>
>>>>> Nadir
>>>>>
>>>>> On Thu, May 5, 2016 at 6:44 PM, Sanjanashree Palanivel <
>>>>> sanjanashree@gmail.com> wrote:
>>>>>
>>>>>> Dear Nadir,
>>>>>>
>>>>>> Thanks a lot... But why couldn't I train the transliteration
>>>>>> model or do anything regarding transliteration? What should I do to make it
>>>>>> work? Please help me with this.
>>>>>>
>>>>>> On Thu, May 5, 2016 at 8:18 PM, Nadir Durrani <
>>>>>> nadir.durrani@nu.edu.pk> wrote:
>>>>>>
>>>>>>> I just asked for the word-alignment :-)
>>>>>>>
>>>>>>> Anyway, I ran your script with my paths and it ran fine. I am
>>>>>>> attaching my Transliteration folder.
>>>>>>>
>>>>>>> As you can see in
>>>>>>>
>>>>>>> 1-1.en-hi.mined-pairs
>>>>>>>
>>>>>>> roughly 4000 transliteration pairs were mined. The threshold.pl
>>>>>>> script selects word pairs whose probability is lower than 0.5. The
>>>>>>> entire list with probabilities can be seen in
>>>>>>>
>>>>>>> 1-1.en-hi.pair-probs
>>>>>>>
>>>>>>> The lower the probability number, the better the transliteration.
>>>>>>>
>>>>>>> Looking at the transliteration module and the tuning run, you can see
>>>>>>> that the transliteration system is pretty good. Check out
>>>>>>>
>>>>>>> moses.ini
>>>>>>>
>>>>>>> in the tuning folder. The tuning BLEU is 91.48, which is great.
>>>>>>>
>>>>>>> Nadir
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, May 5, 2016 at 3:18 PM, Sanjanashree Palanivel <
>>>>>>> sanjanashree@gmail.com> wrote:
>>>>>>>
>>>>>>>> En_H.zip
>>>>>>>> <https://drive.google.com/file/d/0Bwi7uqU0aYEzQ1djLWlGMnQ3Q2c/view?usp=drive_web>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have attached the zip file.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and regards,
>>>>>>>>
>>>>>>>> Sanjanasri J.P
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks and regards,
>>>>>>
>>>>>> Sanjanasri J.P
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Thanks and regards,
>>>
>>> Sanjanasri J.P
>>>
>>
>>
>
>
> --
> Thanks and regards,
>
> Sanjanasri J.P
>



--
Thanks and regards,

Sanjanasri J.P
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160506/9361f146/attachment.html

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 115, Issue 7
*********************************************
