Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Computing Perplexity with KenLM (Python API) (liling tan)
2. Re: Computing Perplexity with KenLM (Python API) (Kenneth Heafield)
----------------------------------------------------------------------
Message: 1
Date: Mon, 8 May 2017 21:52:29 +0800
From: liling tan <alvations@gmail.com>
Subject: Re: [Moses-support] Computing Perplexity with KenLM (Python
API)
To: moses-support <moses-support@mit.edu>
Cc: Ilia Kurenkov <ilia.kurenkov@gmail.com>
Message-ID:
<CAKzPaJJ9ZTsxrFoeJQb4fqUSJkuvhngz-k2++qYFhZ1okThE+w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Kenneth,
Thanks for the formula!
Now, it's returning the usual perplexity values =)
Regards,
Liling
------------------------------
Message: 2
Date: Mon, 8 May 2017 10:15:52 +0100
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Computing Perplexity with KenLM (Python
API)
To: moses-support@mit.edu
Message-ID: <56f2dafa-e297-ba36-64e5-ee47b5f3462d@kheafield.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Hi Liling,
You can test that your program matches bin/query. None of these is
correct. You want math.pow(10.0, sum_inv_logs / n).
Kenneth
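Put together, a minimal sketch of this formula using the kenlm Python
module from the quoted message below (the something.arpa model name
comes from that message; note that full_scores also scores the
end-of-sentence token </s>, so it counts toward n):

import math
import kenlm

def perplexity(model, sentence):
    # full_scores yields one (log10 probability, ngram length, OOV flag)
    # tuple per token, including the end-of-sentence token </s>.
    log10_probs = [score for score, _, _ in model.full_scores(sentence)]
    n = len(log10_probs)
    sum_inv_logs = -sum(log10_probs)
    # Perplexity is 10 raised to the average negative log10 probability.
    return math.pow(10.0, sum_inv_logs / n)

m = kenlm.Model('something.arpa')
print(perplexity(m, 'The development of a forward-looking and '
                    'comprehensive European migration policy,'))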
---------------------------
On Mon, May 8, 2017 at 2:37 PM, liling tan <alvations@gmail.com> wrote:
> Dear Moses Community,
>
> Does anyone know how to compute sentence perplexity with a KenLM model?
>
> Let's say we build a model on this:
>
> $ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
> $ lmplz -o 5 < something.txt > something.arpa
>
>
> From the perplexity formula (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf):
>
> Applying the sum of negative logs to get the inner term and then taking
> the nth root (the formula is written out after this message), the
> perplexity numbers come out unusually small:
>
> >>> import math
> >>> import kenlm
> >>> m = kenlm.Model('something.arpa')
> # Sentence seen in data.
> >>> s = 'The development of a forward-looking and comprehensive European migration policy,'
> >>> list(m.full_scores(s))
> [(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False), (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False), (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False), (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False), (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False), (-1.7523181438446045, 1, False)]
> >>> n = len(s.split())
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.2536033936438895
>
>
> Trying again with a sentence not found in the data:
>
> # Sentence not seen in data.
> >>> s = 'The European developement of a forward-looking and comphrensive society is doh.'
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> sum_inv_logs
> 35.59524390101433
> >>> n = len(s.split())
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.383679905428275
>
>
> And trying again with totally out-of-domain data:
>
> >>> s = """On the evening of 5 May 2017, just before the French Presidential Election on 7 May, it was reported that nine gigabytes of Macron's campaign emails had been anonymously posted to Pastebin, a document-sharing site. In a statement on the same evening, Macron's political movement, En Marche!, said: "The En Marche! Movement has been the victim of a massive and co-ordinated hack this evening which has given rise to the diffusion on social media of various internal information""">>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))>>> sum_inv_logs282.61719834804535>>> n = len(list(m.full_scores(s)))>>> n79>>> math.pow(sum_inv_logs, 1.0/n)1.0740582373271952
>
>
>
> Although it is expected that the longer sentence has lower perplexity,
> it is strange that the values differ by less than 1.0, only in the
> decimal places.
>
> Is the above the right way to compute perplexity with KenLM? If not,
> does anyone know how to compute perplexity with KenLM through the
> Python API?
>
> Thanks in advance for the help!
>
> Regards,
> Liling
>
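For reference, the perplexity definition from the slides cited above,
written out (N counts the scored tokens, including the end-of-sentence
token):

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}
      = 10^{-\frac{1}{N} \sum_{i=1}^{N} \log_{10} P(w_i \mid w_1 \ldots w_{i-1})}

which is exactly math.pow(10.0, sum_inv_logs / n). The transcripts above
instead compute (-\sum_i \log_{10} P_i)^{1/N}, the Nth root of the summed
negative log probabilities, which is why every result comes out close to 1.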
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 127, Issue 11
**********************************************