Moses-support Digest, Vol 116, Issue 5

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Extract list of n-grams from Trie Language Model that
contains a certain word (Graeme Kidd)

----------------------------------------------------------------------

Message: 1
Date: Sat, 4 Jun 2016 12:11:46 +0100
From: "Graeme Kidd" <graemekidd@gmail.com>
Subject: Re: [Moses-support] Extract list of n-grams from Trie
Language Model that contains a certain word
To: <moses-support@mit.edu>
Message-ID: <001401d1be51$e2cdc6a0$a86953e0$@gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks, that's given me a good starting point. The next problem is that the dump_trie program expects a vocab file, which isn't provided. Any idea how I could create one?

Thanks again,

Graeme

From: Kenneth Heafield [mailto:moses@kheafield.com]
Sent: 04 June 2016 08:00
To: Graeme Kidd <graemekidd@gmail.com>; moses-support@mit.edu
Subject: Re: [Moses-support] Extract list of n-grams from Trie Language Model that contains a certain word

The trie file you have contains conditional probabilities and backoffs but not counts. If you're OK with that, check out/modify the dump_trie program in the bounded-noquant branch of github.com/kpu/kenlm. It can stream, but you will need to run ulimit -v with something above 6 TB, even though physical memory usage will be fine.

For counts, contact me off list.
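
A minimal sketch of working out the number to pass to ulimit -v, added for illustration. bash's ulimit -v takes its value in units of 1024 bytes; reading "6 TB" as binary terabytes (TiB) is an assumption here, and the helper names are hypothetical.

```python
def ulimit_v_kib(size_tib: float) -> int:
    """Convert a size in TiB to the 1024-byte units that bash's `ulimit -v` takes.

    Assumption: the limit is meant in binary terabytes (1 TiB = 1024**4 bytes),
    so one TiB is 1024**3 units of 1024 bytes.
    """
    return int(size_tib * 1024 ** 3)


def ulimit_command(size_tib: float) -> str:
    """Build the shell line to run in the same shell before starting dump_trie."""
    return f"ulimit -v {ulimit_v_kib(size_tib)}"
```

For example, ulimit_command(7) builds a line that sets the limit comfortably above 6 TiB. The limit only caps virtual address space, which is why it has to exceed the size of the file even though physical usage stays small.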

On June 4, 2016 1:42:40 AM GMT+01:00, Graeme Kidd <graemekidd@gmail.com> wrote:

Hi,

This is still all very new to me, so apologies if this is not the correct place to ask this question.

I am wanting to take the English Trie Language Model (5.5TB) created from the Common Crawl data set:

http://data.statmt.org/ngrams/lm/en.trie

Then extract all n-grams that contain a certain word. This needs to be done for a list of 100 words. For example, if I was looking for all n-grams that contained the word "discombobulated", I would want an output file containing each n-gram that contains that word and the number of times that n-gram occurs:

word1 discombobulated 25

word1 discombobulated word3 40
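
The filtering step described above could be sketched roughly as follows, assuming the model has first been dumped to text with one n-gram per line. The line layout (n-gram tokens, a tab, then the numeric fields) and the function name are assumptions for illustration, not the documented output of any particular tool.

```python
from typing import Iterable, Iterator, Tuple


def filter_ngrams(lines: Iterable[str], targets: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (target_word, line) for every line whose n-gram contains a target.

    Assumed layout per line: the n-gram tokens separated by spaces, then a tab,
    then whatever numeric fields the dump provides.
    """
    target_set = set(targets)
    for line in lines:
        line = line.rstrip("\n")
        # Everything before the first tab is treated as the n-gram itself.
        ngram = line.split("\t", 1)[0]
        # An n-gram can match several target words; report it once per word.
        for word in set(ngram.split()) & target_set:
            yield word, line
```

Passing sys.stdin as lines lets the dump be piped straight through, so only one line is held in memory at a time and the 100-word list fits in a small set. Each yielded pair can then be appended to a per-word output file.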

Due to the size of the file, this is something I am keen to get right first time. For this reason, is someone able to give me an example of how this can be done, and would this kind of query be possible with 64GB of RAM?

Thanks,

Graeme

_____

Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------


End of Moses-support Digest, Vol 116, Issue 5
*********************************************
