Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: unknown words in SRILM/Kenlm (Kenneth Heafield)
2. running the decoder for the first time (mohamed hasanien)
3. Using boost for prefix/suffix checks (Jeroen Vermeulen)
----------------------------------------------------------------------
Message: 1
Date: Thu, 05 Feb 2015 09:09:45 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] unknown words in SRILM/Kenlm
To: moses-support@mit.edu
Message-ID: <54D379A9.2040104@kheafield.com>
Content-Type: text/plain; charset=windows-1252
Hi,
Great question!
As described in Chen and Goodman, modified Kneser-Ney smoothing treats
<unk> as a count-0 unigram. All unigrams are interpolated with the
uniform distribution with weight backoff(empty string), so they get
backoff(empty string)/|vocabulary| mass just for being a word. That's
the only mass that <unk> gets. This is what KenLM does by default. As
a corollary, p(<unk>) is always smaller than p(word) for any seen word.
Footnote 7 on page 30 of my thesis mentions how SRILM does it:
http://kheafield.com/professional/thesis.pdf . Here it is in more gory
detail. I'm going from memory here because I currently work for a
for-profit.
1. First compute the probability of every word except <unk>, including
the aforementioned interpolation with the uniform distribution.
2. Sum those probabilities and subtract from 1 to attain p(<unk>). In
principle, this produces the same result as backoff(empty
string)/|vocabulary|. However the sum is very close to 1 and p(<unk>)
is small, so this method is numerically imprecise.
3. SRILM checks if it calculated p(<unk>) > 3*10^-6 (which is the
hard-coded value of epsilon). If so, which is only the case for very
tiny language models (otherwise |vocabulary| is big enough), it returns
p(<unk>).
4. If it calculated p(<unk>) < 3*10^-6, as it usually is, then it does
what the comments describe as "another hack". This "disables" unigram
interpolation. Interpolation with uniform has two terms: backoff(empty
string)/|vocabulary| + discounted probability where discounted
probability implicitly includes the 1-backoff(empty string) term. It
just never adds backoff(empty string)/|vocabulary| to each unigram, but
the discounted probabilities were still implicitly multiplied by
1-backoff(empty string) when they were discounted. In effect, compared
with Chen and Goodman, it steals backoff(empty string)/|vocabulary| from
every unigram.
5. SRILM again sums all the unigrams and takes 1 - their sum. Because
each of the |vocabulary| - 1 terms had backoff(empty
string)/|vocabulary| stolen from it, p(<unk>) is now higher by
(|vocabulary| - 1) * backoff(empty string) / |vocabulary|
and it already owned backoff(empty string)/|vocabulary| of the
probability space, so then it becomes
|vocabulary| * backoff(empty string)/|vocabulary|
= backoff(empty string). Therefore, SRILM's <unk> is larger than Chen
and Goodman say it should be, by a factor of |vocabulary|. This
explains the famous issue that p(<unk>) can be higher than the
probability of words in the vocabulary. With KenLM, you can emulate
this (IMHO broken) functionality by using --interpolate_unigrams 0.
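
As a rough numerical sketch of the arithmetic in the steps above (this is not code from SRILM or KenLM): the hypothetical helper UnkProbability below takes made-up discounted unigram probabilities for the seen words, assumed to already sum to 1 - backoff(empty string), and returns p(<unk>) with and without unigram interpolation.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical container for the two estimates of p(<unk>).
struct UnkEstimate {
  double interpolated;    // Chen and Goodman; the KenLM default per the text
  double uninterpolated;  // what remains when unigram interpolation is off
};

// discounted: discounted unigram probabilities of the seen words (no <unk>),
//             assumed to sum to 1 - backoff_empty.
// backoff_empty: backoff(empty string), the weight on the uniform distribution.
UnkEstimate UnkProbability(const std::vector<double>& discounted,
                           double backoff_empty) {
  const double sum =
      std::accumulate(discounted.begin(), discounted.end(), 0.0);
  // Check the stated assumption about the inputs.
  assert(std::fabs(sum + backoff_empty - 1.0) < 1e-9);

  // |vocabulary| counts the seen words plus <unk>.
  const double vocabulary = static_cast<double>(discounted.size() + 1);

  UnkEstimate result;
  // With interpolation, <unk> only receives its share of the uniform mass.
  result.interpolated = backoff_empty / vocabulary;
  // Without it, no word receives the uniform share, so <unk> keeps all of
  // the leftover mass: 1 - sum = backoff(empty string).
  result.uninterpolated = 1.0 - sum;
  return result;
}
```

With discounted probabilities {0.5, 0.3, 0.1} and backoff(empty string) = 0.1, |vocabulary| is 4, so the interpolated value is 0.025 and the uninterpolated value is 0.1: larger by a factor of |vocabulary|, and in this toy case equal to the probability of the rarest seen word.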
Kenneth
On 02/05/2015 08:14 AM, koormoosh wrote:
> Hi,
>
> I am trying to figure out how unknown words are being handled in
> SRILM/KenLM. I've searched inside the /lm/src directory but the grep
> matches are not helpful. I am interested in LM and doing some
> experiments with my own implementation of Kneser-Ney, so knowing how
> unknown words are handled is important to get roughly equal results with
> SRILM or KenLM. Any comments? A pointer to a class is appreciated the most.
>
> * please note that I am not looking for a solution to handle unknown
> words, as I already have a solution for it. I want to know exactly how
> unknown words are being handled in SRILM.
>
> thank you
> -Koormoosh
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
------------------------------
Message: 2
Date: Thu, 5 Feb 2015 15:32:33 +0000 (UTC)
From: mohamed hasanien <mhmd_hasnen@yahoo.com>
Subject: [Moses-support] running the decoder for the first time
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<1786793930.273353.1423150353582.JavaMail.yahoo@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"
Hi Moses,
I tried to run the Moses decoder using the sample models provided via this link:
http://www.statmt.org/moses/download/sample-models.tgz
using this command:
echo 'das ist ein kleines haus' | ~/mosesdecoder/bin/moses -f ~/ sample-models/phrase-model/moses.ini > out
and I get this error message:
Defined parameters (per moses.ini or switch):
        config: /root/ sample-models/phrase-model/moses.ini
line=WordPenaltyFeatureFunction: WordPenalty0 start: 0 end: 0
Exception: moses/ScoreComponentCollection.cpp:250 in void Moses::ScoreComponentCollection::Assign(const Moses::FeatureFunction*, const std::vector<float>&) threw util::Exception'.
Feature function WordPenalty0 specified 1 dense scores or weights. Actually has 0
mohammed hassanien Mohammed
Egyption Programmers Vice-captain
01000121556
Egyption Programmers Syndicate
------------------------------
Message: 3
Date: Thu, 05 Feb 2015 16:49:03 +0100
From: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Subject: [Moses-support] Using boost for prefix/suffix checks
To: moses-support@mit.edu
Message-ID: <54D390EF.6090501@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"
Here's a minor patch in case it's of use - but feel free to tell me to
shut up if it isn't.
Looking at some of the file-handling code I noticed that a lot of places
check a string for a particular prefix or suffix with this kind of pattern:
if (
(text.size() >= suffix.size()) &&
(text.substr(text.size()-suffix.size()) == suffix)) {
It's a bit hard to read, and could lead to strange crashes if you forget
the length check. For example, checking for...
filename.substr(filename.size()-3) == ".gz"
...would throw std::out_of_range (and usually terminate the program) if
filename was less than 3 characters long.
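
A minimal sketch of that failure mode, using only the standard library. The helper names UncheckedEndsWithGz and ThrowsOnShortName are hypothetical and exist only for this illustration; they are not taken from the Moses code.

```cpp
#include <stdexcept>
#include <string>

// Reproduces the unchecked pattern. For names shorter than 3 characters,
// filename.size() - 3 wraps around (size_t is unsigned), so the start
// position lies past the end of the string and substr() throws.
bool UncheckedEndsWithGz(const std::string& filename) {
  return filename.substr(filename.size() - 3) == ".gz";
}

// Returns true if the unchecked pattern throws for this input.
bool ThrowsOnShortName(const std::string& filename) {
  try {
    UncheckedEndsWithGz(filename);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

Calling ThrowsOnShortName("a") reports the exception, while a name such as "model.gz" passes through the unchecked version without trouble, which is why the bug is easy to miss.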
If anyone's interested, I'm attaching a patch that replaces all
prefix/suffix checks that I could find with BOOST's starts_with() and
ends_with(). It's a little safer and easier to follow, and doesn't make
you count the characters in a fixed-length suffix:
if (ends_with(text, suffix)) {
if (ends_with(filename, ".gz")) {
if (starts_with(item, "[") && ends_with(item, "]")) {
None of these cases looked particularly performance-sensitive, but I
checked just in case. If anything, the BOOST code looks more
optimizer-friendly. It compares characters in-place (so no need to copy
a substring) and seems to optimize for the known length of string
constants (so it knows at compile time that ".gz" is 3 characters long).
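
To illustrate the in-place comparison, here is a standard-library-only sketch of checks with the same intent. StartsWith and EndsWith are hypothetical stand-ins written for this example; they are not Boost's implementation, and the real calls are boost::algorithm::starts_with() and ends_with() from boost/algorithm/string/predicate.hpp.

```cpp
#include <string>

// Hypothetical stand-in for starts_with(): the length check is built in,
// and compare() inspects the characters in place without copying a substring.
bool StartsWith(const std::string& text, const std::string& prefix) {
  return text.size() >= prefix.size() &&
         text.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical stand-in for ends_with(), with the same built-in length check.
bool EndsWith(const std::string& text, const std::string& suffix) {
  return text.size() >= suffix.size() &&
         text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With these, EndsWith(filename, ".gz") is simply false for a one-character filename instead of throwing, and a bracket check reads as StartsWith(item, "[") && EndsWith(item, "]").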
I haven't done any manual testing, but the unit tests pass. Is that
considered a reasonable guarantee?
Jeroen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: moses-boost-starts_with-ends_with.diff
Type: text/x-patch
Size: 21164 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20150205/ae122197/attachment.bin
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 100, Issue 20
**********************************************