Moses-support Digest, Vol 116, Issue 6

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Extract list of n-grams from Trie Language Model that
contains a certain word (Kenneth Heafield)
2. Re: Extract list of n-grams from Trie Language Model that
contains a certain word (Graeme Kidd)

----------------------------------------------------------------------

Message: 1
Date: Sat, 4 Jun 2016 21:18:22 +0100
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Extract list of n-grams from Trie
Language Model that contains a certain word
To: moses-support@mit.edu
Message-ID: <5753378E.6080409@kheafield.com>
Content-Type: text/plain; charset=windows-1252

There isn't a program that does exactly what you want, but you can
modify dump_trie since it iterates over the n-grams. In doing that
modification, you'll remove the existing condition (vocab filtering,
close to what you want) and therefore the vocab file.

Kenneth

On 06/04/16 12:11, Graeme Kidd wrote:
>
> Thanks, that's given me a good starting point. The next problem is
> that the dump_trie program expects a vocab file which isn't provided.
> Any idea how I could create one?
>
>
>
> Thanks again,
>
> Graeme
>
>
>
> *From:*Kenneth Heafield [mailto:moses@kheafield.com]
> *Sent:* 04 June 2016 08:00
> *To:* Graeme Kidd <graemekidd@gmail.com>; moses-support@mit.edu
> *Subject:* Re: [Moses-support] Extract list of n-grams from Trie
> Language Model that contains a certain word
>
>
>
> The trie file you have contains conditional probabilities and backoffs
> but not counts. If you're OK with that, check out/modify the dump_trie
> program in the bounded-noquant branch of github.com/kpu/kenlm
> <http://github.com/kpu/kenlm>. It can stream, but you will need to set
> ulimit -v to something above 6 TB even though physical usage will be
> fine.
>
> For counts, contact me off list.
>
> On June 4, 2016 1:42:40 AM GMT+01:00, Graeme Kidd
> <graemekidd@gmail.com> wrote:
>
> Hi,
>
>
>
> This is still all very new to me, so apologies if this is not the
> correct place to ask this question.
>
>
>
> I am wanting to take the English Trie Language Model (5.5TB)
> created from the Common Crawl data set:
>
> http://data.statmt.org/ngrams/lm/en.trie
>
>
>
> Then extract all n-grams that contain a certain word. This needs
> to be done for a list of 100 words. For example if I was looking
> for all n-grams that contained the word "discombobulated" I would
> want an output file containing the n-gram that contains that word
> and the number of times that n-gram occurs:
>
> word1 discombobulated 25
>
> word1 discombobulated word3 40
>
>
>
> Due to the size of the file, this is something I am keen to get
> right first time. For this reason is someone able to give me an
> example of how this can be done and would this kind of query be
> possible with 64GB of RAM?
>
>
>
> Thanks,
>
> Graeme
>
> ------------------------------------------------------------------------
>
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 2
Date: Sun, 5 Jun 2016 08:08:36 +0100
From: "Graeme Kidd" <graemekidd@gmail.com>
Subject: Re: [Moses-support] Extract list of n-grams from Trie
Language Model that contains a certain word
To: <moses-support@mit.edu>
Message-ID: <002c01d1bef9$14f66040$3ee320c0$@gmail.com>
Content-Type: text/plain; charset="us-ascii"

When trying to understand new code I find it helps to have a working
example, allowing me to step through the code and learn what it does before
I start modifying it. Without this I just have to blindly comment out code
without any real understanding until something works.

To get something to work I had to use a previous version of the file:
https://github.com/kpu/kenlm/blob/26aa867a8201083e01a0894a17c71b4f1ae83ff5/lm/dump_trie_main.cc

After that version the code is modified to "Dump a trie to a trie (barely
tested)" instead of text files.

After removing the code responsible for the vocabulary file, I am left with a
set of files whose lines have the format "Prob WordIndex Backoff", e.g.:
-5.0458198 14842 -0.06544129
-3.35619 231080 -0.06544129
-3.4443479 285446 -0.06544129
-2.0936365 1130471 -0.06544129
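
A minimal sketch of reading one such line back into its three fields, assuming they are whitespace-separated and appear in the order shown above. The DumpRecord struct and ParseDumpLine function are hypothetical names for illustration and are not part of KenLM; the 32-bit unsigned type for the word index is likewise an assumption.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record type. Field order follows the
// "Prob WordIndex Backoff" lines shown above.
struct DumpRecord {
  float prob;          // log probability as printed in the dump
  std::uint32_t word;  // vocabulary index (assumed to fit in 32 bits)
  float backoff;       // backoff weight as printed in the dump
};

// Parses one text line into rec. Returns false if the line does not
// start with three fields of the expected types.
bool ParseDumpLine(const std::string &line, DumpRecord &rec) {
  std::istringstream in(line);
  return static_cast<bool>(in >> rec.prob >> rec.word >> rec.backoff);
}
```

Calling ParseDumpLine("-5.0458198 14842 -0.06544129", rec) fills rec.word with 14842. The word index on its own still says nothing about which string it stands for.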

It turns out that the "CheckedRead" function only reads the WordIndex, which I
suspect is used for the vocabulary file, and I still have no idea how the
actual n-gram can be retrieved:
https://gist.github.com/anonymous/6856d588aa10f9220c5200a4e319b1ea#file-dump_trie_main-cc-L82

Any further ideas on how to progress would be greatly appreciated since I am
not that experienced with C++.
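
For illustration only, here is a rough sketch of the filtering step I am after, assuming two things that I do not yet have: each n-gram available as a sequence of word indices, and a table mapping each index back to its string. WordId, ContainsTarget and RenderNgram are hypothetical names, not KenLM functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

using WordId = std::uint32_t;  // stand-in for the WordIndex type (assumption)

// Returns true if any word index in the n-gram is one of the targets.
bool ContainsTarget(const std::vector<WordId> &ngram,
                    const std::unordered_set<WordId> &targets) {
  for (WordId id : ngram) {
    if (targets.count(id) != 0) return true;
  }
  return false;
}

// Joins the n-gram's words with single spaces, using id_to_word as the
// index-to-string table. Indices outside the table are printed as "<id:N>".
std::string RenderNgram(const std::vector<WordId> &ngram,
                        const std::vector<std::string> &id_to_word) {
  std::string out;
  for (std::size_t i = 0; i < ngram.size(); ++i) {
    if (i != 0) out += ' ';
    const WordId id = ngram[i];
    if (id < id_to_word.size()) {
      out += id_to_word[id];
    } else {
      out += "<id:" + std::to_string(id) + ">";
    }
  }
  return out;
}
```

With the 100 target words resolved to indices once up front, each n-gram needs only a few hash lookups, and the strings are built only for the n-grams that match.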

Thanks,
Graeme

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 116, Issue 6
*********************************************
