Moses-support Digest, Vol 97, Issue 87

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Unknown single words that are part of phrases (Raj Dabre)
2. Re: Devlin et al 2014 (Alexandra Birch)


----------------------------------------------------------------------

Message: 1
Date: Thu, 27 Nov 2014 01:18:11 +0900
From: Raj Dabre <prajdabre@gmail.com>
Subject: Re: [Moses-support] Unknown single words that are part of
phrases
To: "Vera Aleksic, Linguatec GmbH" <v.aleksic@linguatec.de>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAB3gfjBX_0eXodD94Lhuaz1gY-c6vf_TyJeW7dxXLvHS94hkRg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,

If I am not mistaken, this is most likely due to the grow-diag
symmetrization heuristic applied to the word-aligned data (both directions)
before phrase extraction. Single-word translations should usually exist,
but not always; it is worth searching the phrase table for them.
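
For example, a minimal check (a sketch only, assuming the usual gzipped
model/phrase-table.gz layout; adjust the path to your setup):

import gzip

# Hypothetical path; point this at your own phrase table.
with gzip.open("model/phrase-table.gz", "rt", encoding="utf-8") as table:
    for line in table:
        source = line.split(" ||| ", 1)[0].strip()
        if source == "Gitarre":  # the single word we expect to find
            print(line.rstrip())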

Regards.

On Thu, Nov 27, 2014 at 12:53 AM, Vera Aleksic, Linguatec GmbH <
v.aleksic@linguatec.de> wrote:

> Hi,
>
> I have observed many times that some words do not exist as single word
> translations in the phrase table, although they exist in the training
> corpus and in multiword phrases.
> An example:
> German-English translation for "Gitarre" is unknown, i.e. there is no
> single word entry for "Gitarre" in the phrase table, although some other
> phrases containing this word exist (see below).
> How is this possible?
> Thanks and best regards,
> Vera
>
>
> Gitarre , ||| guitar ; ||| 1 0.0284465 1 0.0654272 2.718 ||| ||| 1 1
> Gitarre darstellt , unter Beanspruchung ||| guitar using ||| 0.25 2.7351e-11 1 0.0625119 2.718 ||| ||| 4 1
> Gitarre darstellt , unter ||| guitar using ||| 0.25 1.18917e-05 1 0.0625119 2.718 ||| ||| 4 1
> Gitarre darstellt , ||| guitar using ||| 0.25 0.00569228 1 0.0625119 2.718 ||| ||| 4 1
> Gitarre darstellt ||| guitar using ||| 0.25 0.0400028 1 0.0625119 2.718 ||| ||| 4 1
> Kopfplatte einer Gitarre darstellt , ||| head of a guitar using ||| 0.5 4.23407e-08 1 0.00471281 2.718 ||| ||| 2 1
> Kopfplatte einer Gitarre darstellt ||| head of a guitar using ||| 0.5 2.97552e-07 1 0.00471281 2.718 ||| ||| 2 1
> eine elektrische Gitarre , ||| an electric guitar ; ||| 1 0.00107982 1 0.00163632 2.718 ||| ||| 1 1
> einer Gitarre darstellt , unter ||| of a guitar using ||| 0.333333 6.4754e-07 1 0.00471281 2.718 ||| ||| 3 1
> einer Gitarre darstellt , ||| of a guitar using ||| 0.333333 0.000309961 1 0.00471281 2.718 ||| ||| 3 1
> einer Gitarre darstellt ||| of a guitar using ||| 0.333333 0.00217827 1 0.00471281 2.718 ||| ||| 3 1
> elektrische Gitarre , ||| electric guitar ; ||| 1 0.005661 1 0.0142097 2.718 ||| ||| 1 1
> wie eine elektrische Gitarre , ||| as an electric guitar ; ||| 1 0.000177339 1 0.000809485 2.718 ||| ||| 1 1
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20141127/cf99deef/attachment-0001.htm

------------------------------

Message: 2
Date: Wed, 26 Nov 2014 16:27:44 +0000
From: Alexandra Birch <lexi.birch@gmail.com>
Subject: Re: [Moses-support] Devlin et al 2014
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CA+h82t4LfQdPj+zWq_HJO2Gtm2aOkHMbdvQy1LO-gfvsobfDXA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

OK,

Here are 1-4:

1. You would normally train the bilingual LM on the same corpus as the SMT
model, but it is not required.
2. Yes, but there are also other ways to make training faster which you
might want to explore.
3. Yes, it is important that the bilingual LM corpus matches the format
that the decoder will pass to it at decoding time; otherwise it will not
work well.
4. Yes, it can include the sentences that were filtered out by the training
scripts. You just need to have word alignments for them, and they do need
to be reasonably good translations of each other. So filter out the junk.
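
As a quick sanity check before training (a minimal sketch; the file names
below are hypothetical, so substitute your own corpus and alignment paths),
you can verify that both corpus sides and the alignment file line up:

import sys

# Hypothetical file names; replace with your own corpus and alignments.
files = ["corpus.en", "corpus.de", "aligned.grow-diag-final-and"]
counts = []
for name in files:
    with open(name, encoding="utf-8") as f:
        counts.append(sum(1 for _ in f))
if len(set(counts)) != 1:
    sys.exit("Line counts differ: %s" % dict(zip(files, counts)))
print("OK: %d parallel sentences with alignments" % counts[0])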

Lexi

On Wed, Nov 26, 2014 at 3:44 PM, Tom Hoar <
tahoar@precisiontranslationtools.com> wrote:

> Thanks again. It's very useful feedback. We're now preparing to move from
> v1.0 to 3.x. We skipped Moses 2.x. So, I'm not familiar with the new
> moses.ini syntax.
>
> Here are some more questions to help us get started playing with the
> extract_training.py options:
>
> 1. I'm assuming corpus.e and corpus.f are the same prepared corpus
> files as used in train-model.perl?
> 2. Is it possible for corpus.e and corpus.f to be different from the
> train-model.perl corpus, for example a smaller random sampling?
> 3. Are the corpus files tokenized, lower-cased and escaped the same
> way?
> 4. Do the corpus files also need to enforce the clean-corpus-n.perl max
> tokens (100) and ratio (9:1) limits for src & tgt? These address (M)GIZA++ limits
> and might not apply to BilingualLM. However, are there advantages to using
> the limits or disadvantages to overriding them? I.e., can these corpus files
> include lines that were filtered out by clean-corpus-n.perl?
> 5. What is the --align value? Is it the output of train-model.perl
> step 3, or a file with word alignments for each line of the corpus.e and
> corpus.f pair?
> 6. Re --prune-source-vocab & --prune-target-vocab, do these thresholds
> set the size of the vocabulary you reference in #4 below (i.e. 16K, 500K,
> etc)?
> 7. Re --source-context & --target-context, are these the BilingualLM
> equivalents to a typical LM's order or ngrams for each?
> 8. Re --tagged-corpus, is this for POS factored corpora?
>
> Thanks.
>
>
>
> On 11/26/2014 09:27 PM, Nikolay Bogoychev wrote:
>
> Hey, Tom
>
> 1) It's independent. You just add -with-oxlm and -with-nplm to the stack
> 2) Yes, they are both thread safe, you can run the decoder with however
> many threads you wish.
> 3) It doesn't create a separate binary. The compilation flag adds a new
> feature inside moses that is called BilingualNPLM and you have to add it to
> your moses.ini with a weight.
> 4) That depends on the vocabulary size used. With a 16k source and 16k
> target vocabulary, about 100 megabytes. With 500,000, about 1.5 gigabytes.
>
> Beware that the memory requirements during decoding are much larger,
> because of premultiplication. If you have memory issues, supply
> "premultiply=false" on the BilingualNPLM line in moses.ini, but this is
> likely to slow down decoding considerably.
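>
> For example (a sketch only; the paths are placeholders, and the other
> values follow the example line later in this thread):
>
> BilingualNPLM filepath=/path/to/train.model.nplm target_ngrams=4
> source_ngrams=9 source_vocab=/path/to/vocab.source
> target_vocab=/path/to/vocab.target premultiply=false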
>
>
> Cheers,
>
> Nick
>
> On Wed, Nov 26, 2014 at 2:09 PM, Tom Hoar <
> tahoar@precisiontranslationtools.com> wrote:
>
>> Thanks Nikolay! This is a great start. I have a few clarification
>> questions.
>>
>> 1) does this replace or run independently of traditional language models
>> like KenLM? I.e. when compiling, we can use -with-kenlm, -with-irstlm,
>> -with-randlm and -with-srilm together. Are -with-oxlm and -with-nplm added
>> to the stack or are they exclusive?
>>
>> 2) It looks like your branch of nplm is thread-safe. Is oxlm also
>> thread-safe?
>>
>> 3) You say, "To run it in moses as a feature function..." Does that mean
>> compiling with your above option(s) creates a new runtime binary "
>> BilingualNPLM" that replaces the moses binary, much like moseschart and
>> mosesserver? Or, does BilingualNPLM run in a separate process that the
>> Moses binary accesses during runtime?
>>
>> 4) How large do these LM files become? Are they comparable to traditional
>> ARPA files, larger or smaller? Also, are they binarized with mmap reads or
>> do they have to load into RAM?
>>
>> Thanks,
>> Tom
>>
>>
>>
>>
>>
>> On 11/26/2014 08:04 PM, Nikolay Bogoychev wrote:
>>
>> Hey,
>>
>> BilingualLM is implemented and as of last week resides within moses
>> master:
>> https://github.com/moses-smt/mosesdecoder/blob/master/moses/LM/BilingualLM.cpp
>>
>> To compile it you need a neural network backend. Currently there
>> are two supported: OxLM and NPLM. Adding a new backend is relatively easy;
>> you need to implement the interface as shown here:
>>
>> https://github.com/moses-smt/mosesdecoder/blob/master/moses/LM/bilingual-lm/BiLM_NPLM.h
>>
>> To compile with the oxlm backend you need to compile moses with the switch
>> -with-oxlm=/path/to/oxlm.
>> To compile with the nplm backend you need to compile moses with the switch
>> -with-nplm=/path/to/nplm (you need this fork of nplm:
>> https://github.com/rsennrich/nplm).
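>>
>> For example (a sketch, assuming the standard bjam build; the path is a
>> placeholder):
>>
>> ./bjam -with-nplm=/path/to/nplm -j8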
>>
>> Unfortunately documentation is not yet available, so here's a short
>> summary of how to train a model and use it with the nplm backend.
>> Use the extract_training script to prepare the aligned bilingual corpus:
>> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/bilingual-lm/extract_training.py
>>
>> You need the following options:
>>
>> "-e", "--target-language", type="string", dest="target_language")
>> //Mandatory, for example es
>> "-f", "--source-language", type="string", dest="source_language")
>> //Mandatory, for example en
>> "-c", "--corpus", type="string", dest="corpus_stem") // path/to/corpus In
>> the directory you have specified there should be files corpus.sourcelang
>> and corpus.targetlang
>> "-t", "--tagged-corpus", type="string", dest="tagged_stem") //Optional
>> for backoff to pos tag
>> "-a", "--align", type="string", dest="align_file") //Mandatory alignment
>> file
>> "-w", "--working-dir", type="string", dest="working_dir") //Output
>> directory of the model
>> "-n", "--target-context", type="int", dest="n") /
>> "-m", "--source-context", type="int", dest="m") //The actual context size
>> is 2*m + 1, this is the number of words on both left and right
>> "-s", "--prune-source-vocab", type="int", dest="sprune") //cutoff
>> vocabulary threshold
>> "-p", "--prune-target-vocab", type="int", dest="tprune") //cutoff
>> vocabulary threshold
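>>
>> A hypothetical invocation putting these together (language pair, paths
>> and pruning thresholds are placeholders; -m 4 gives the 2*4 + 1 = 9
>> source window used in the example configuration below):
>>
>> extract_training.py -e en -f de -c /path/to/corpus -a
>> /path/to/aligned.grow-diag-final-and -w working_dir -n 4 -m 4 -s 16000 -p 16000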
>>
>>
>> Then, use the training script to train the model:
>> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/bilingual-lm/train_nplm.py
>>
>> Example execution is:
>>
>> train_nplm.py -w de-en-500250source/ -r de-en150nopos-source750 -n 16
>> -d 0 --nplm-home=/home/abmayne/code/deepathon/nplm_one_layer/ -c
>> corpus.1.word -i 750 -o 750
>>
>> where -i and -o are the input and output embedding sizes,
>> -n is the total ngram size,
>> -d is the number of hidden layers,
>> -w and -c are the same as the extract_training options, and
>> -r is the output directory of the model.
>>
>> Consult the python script for a more detailed description of the options.
>>
>> After you have done that, you should have a trained bilingual neural
>> network language model in the output directory.
>>
>> To run it in moses as a feature function you need the following line:
>>
BilingualNPLM filepath=/mnt/gna0/nbogoych/new_nplm_german/de-en150nopos/train.10k.model.nplm.10
>> target_ngrams=4 source_ngrams=9
>> source_vocab=/mnt/gna0/nbogoych/new_nplm_german/de-enIWSLTnopos/vocab.source
>> target_vocab=/mnt/gna0/nbogoych/new_nplm_german/de-enIWSLTnopos/vocab.targe
>>
>> The source and target vocab files are located in the working directory
>> used to prepare the neural network language model.
>> target_ngrams doesn't include the predicted word (so target_ngrams = 4
>> means 1 word predicted and 4 target context words).
>> The total order of the model is target_ngrams + source_ngrams + 1 (here
>> 4 + 9 + 1 = 14).
>>
>> I will write proper documentation in the following weeks. If you have
>> any problems running it, please consult me.
>>
>> Cheers,
>>
>> Nick
>>
>>
>>> On Wed, Nov 26, 2014 at 11:53 AM, Tom Hoar <
>>> tahoar@precisiontranslationtools.com> wrote:
>>>
>>>> Hieu,
>>>>
>>>> Sorry I missed you in Vancouver. I just reviewed your slide deck from
>>>> the MosesCore TAUS Round Table in Vancouver
>>>> (taus-moses-industry-roundtable-2014-changes-in-moses-hieu-hoang-university-of-edinburgh).
>>>>
>>>>
>>>> In particular, I'm interested in the "Bilingual Language Models" that
>>>> "replicate Devlin et al, 2014". A search on statmt.org/moses doesn't
>>>> show any hits for "devlin". So, A) is the code finished? If so, B) are
>>>> there any instructions on how to enable/use this feature? If not, C) what
>>>> kind of help do you need to test the code for release?
>>>>
>>>> --
>>>>
>>>> Best regards,
>>>> Tom Hoar
>>>> Managing Director
>>>> *Precision Translation Tools Co., Ltd.*
>>>> Bangkok, Thailand
>>>> Web: www.precisiontranslationtools.com
>>>> Mobile: +66 87 345-1875
>>>> Skype: tahoar
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


--
------------------------------------------------------------------------------------------
School of Informatics
University of Edinburgh
Phone +44 (0)131 650-8286

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20141126/2fb589d4/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 97, Issue 87
*********************************************
