Moses-support Digest, Vol 112, Issue 35

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Using lmplz instead of SRI's ngram-count while training
transliteration model (jeremy)
2. Re: Using lmplz instead of SRI's ngram-count while training
transliteration model (Kenneth Heafield)


----------------------------------------------------------------------

Message: 1
Date: Thu, 18 Feb 2016 09:50:44 -0500
From: jeremy <jeremy@gwinnup.org>
Subject: [Moses-support] Using lmplz instead of SRI's ngram-count
while training transliteration model
To: <moses-support@mit.edu>
Message-ID: <1246f1899c514f136bb965ae74811690@gwinnup.org>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi,

Does anyone know of any gotchas when using lmplz instead of
ngram-count during transliteration model training? I'm trying it out
with the --discount_fallback option, and lmplz's default behavior should
match the -interpolate option for ngram-count (I think?)

Thanks!
-Jeremy


------------------------------

Message: 2
Date: Thu, 18 Feb 2016 15:13:07 +0000
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Using lmplz instead of SRI's ngram-count
while training transliteration model
To: moses-support@mit.edu
Message-ID: <56C5DF83.4090707@kheafield.com>
Content-Type: text/plain; charset=windows-1252

Hi,

There are a few differences, most of which I'd expect you're fine with.

- The discounts are different but you're using --discount_fallback so
you know that.

- Unknown word handling is different. If you want SRI's (IMHO broken)
behavior, pass --interpolate_unigrams 0 (though if your vocabulary is
small enough then it might actually behave like the default
--interpolate_unigrams 1, since SRI switches between the two based on
p(<unk>)).

- SRI's default prunes singletons of order 3 and above based on adjusted
count. lmplz doesn't prune by default but if you turn it on then
pruning is based on count (!= adjusted count).

To summarize, the following two are equivalent up to floating-point
rounding (SRI is less precise):

lmplz -o 5 --interpolate_unigrams 0 <text >text.arpa

ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 \
  -gt5min 1 -text text -lm text.arpa

I do not think they can be made equivalent in a --discount_fallback
scenario.

Kenneth
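
The ngram-count command above turns off SRI's default singleton pruning
by passing -gtNmin 1 for each order from 3 up to the model order. For a
different order, the same flags can be generated rather than typed by
hand. This is only a minimal sketch: gtmin_flags is a hypothetical
helper name, not part of SRILM or KenLM, and it assumes the model order
is passed as its single argument.

```shell
# Hypothetical helper (not part of SRILM or KenLM): print the -gtNmin
# flags that set the minimum count to 1 for orders 3..ORDER, as in the
# order-5 ngram-count command quoted in the message above.
gtmin_flags() {
  order=$1
  flags=""
  n=3
  while [ "$n" -le "$order" ]; do
    # Separate flags with a single space, without a leading space.
    flags="$flags${flags:+ }-gt${n}min 1"
    n=$((n + 1))
  done
  printf '%s\n' "$flags"
}

# Possible usage for an order-4 model (left unquoted on purpose so the
# flags split into separate words):
#   ngram-count -order 4 -interpolate -kndiscount -unk $(gtmin_flags 4) \
#     -text text -lm text.arpa
```

For orders below 3 the helper prints an empty line, since the default
singleton pruning described above only applies from order 3 upward.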



------------------------------


End of Moses-support Digest, Vol 112, Issue 35
**********************************************
