Moses-support Digest, Vol 101, Issue 70

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. KenLM with 16GB of texts (liling tan)
2. Re: KenLM with 16GB of texts (Marcin Junczys-Dowmunt)


----------------------------------------------------------------------

Message: 1
Date: Wed, 25 Mar 2015 14:17:01 +0100
From: liling tan <alvations@gmail.com>
Subject: [Moses-support] KenLM with 16GB of texts
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAKzPaJLjzz3_pRHsbctobqz7yx1eLU=VRfM_K5vurYNYRkA_DQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Moses dev/users,

Has anyone tried to build a language model from 16 GB of texts?

What does "Last input should have been poison." mean?

Does anyone know how to estimate the size of the language model file
for 16 GB of text with 8-grams? And how big would it get with 5-grams?


We tried to build an 8-gram model from the 16 GB of text and ended up with:


=== 1/5 Counting and sorting n-grams ===
Reading /home/gillin/wmt15/corpus.truecase/train-lm.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 21621391360 bytes == 0x1de6000 @
tcmalloc: large alloc 86485549056 bytes == 0x50ba5a000 @
*****************************=== 1/5 Counting and sorting n-grams ===
Reading /home/gillin/wmt15/corpus.truecase/train-lm.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 14100905984 bytes == 0x2e6c000 @
tcmalloc: large alloc 94006026240 bytes == 0x34bec4000 @
****************************************************************************************************
Unigram tokens 3038737446 types 5924314
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:71091768 2:3162479872 3:5929649664 4:9487439872 5:13835849728 6:18974879744 7:24904527872 8:31624798208
tcmalloc: large alloc 31624798208 bytes == 0x34bec4000 @
tcmalloc: large alloc 3162480640 bytes == 0x2e6c000 @
tcmalloc: large alloc 5929656320 bytes == 0xbf666000 @
tcmalloc: large alloc 9487441920 bytes == 0xaa8e86000 @
tcmalloc: large alloc 13835853824 bytes == 0xcde674000 @
tcmalloc: large alloc 18974883840 bytes == 0x101715a000 @
tcmalloc: large alloc 24904531968 bytes == 0x1940db4000 @
Statistics:
1 5924314 D1=0.709218 D2=1.04888 D3+=1.33462
2 108520273 D1=0.723401 D2=1.06804 D3+=1.36804
3 543892823 D1=0.788765 D2=1.11107 D3+=1.35713
4 1204990660 D1=0.855434 D2=1.17274 D3+=1.36107
5 1716616322 D1=0.907776 D2=1.25272 D3+=1.39455
6 1966436508 D1=0.943121 D2=1.34991 D3+=1.45437
7 2029467690 D1=0.96405 D2=1.44994 D3+=1.5283
8 1997628560 D1=0.863904 D2=1.45784 D3+=1.59832
Memory estimate for binary LM:
type      GB
probing  202 assuming -p 1.5
probing  245 assuming -r models -p 1.5
trie     115 without quantization
trie      69 assuming -q 8 -b 8 quantization
trie      96 assuming -a 22 array pointer compression
trie      49 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
tcmalloc: large alloc 10877861888 bytes == 0x72650000 @
tcmalloc: large alloc 28919783424 bytes == 0x34bec4000 @
tcmalloc: large alloc 48065257472 bytes == 0xa07ad2000 @
tcmalloc: large alloc 62925971456 bytes == 0x34bec4000 @
tcmalloc: large alloc 73060843520 bytes == 0x34bec4000 @
tcmalloc: large alloc 79905144832 bytes == 0x34bec4000 @
Chain sizes: 1:71091768 2:1736324368 3:6017972736 4:9628755968 5:14041935872 6:19257511936 7:25275484160 8:32095852544
tcmalloc: large alloc 9628762112 bytes == 0x19349e6000 @
tcmalloc: large alloc 14041939968 bytes == 0x1b7289a000 @
tcmalloc: large alloc 19257516032 bytes == 0x34bec4000 @
tcmalloc: large alloc 25275490304 bytes == 0x7c7c2a000 @
tcmalloc: large alloc 32095854592 bytes == 0xdaa4c0000 @
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:71091768 2:1736324368 3:5881222144 4:9409955840 5:13722852352 6:18819911680 7:24701134848 8:31366518784
tcmalloc: large alloc 9409961984 bytes == 0x19349e6000 @
tcmalloc: large alloc 13722853376 bytes == 0x1b657f0000 @
tcmalloc: large alloc 18819915776 bytes == 0x34bec4000 @
tcmalloc: large alloc 24701140992 bytes == 0x7adad6000 @
tcmalloc: large alloc 31366520832 bytes == 0xd6dfae000 @
Last input should have been poison.
util/file.cc:274 in void util::ErsatzPWrite(int, const void*, std::size_t, uint64_t) threw FDException'.
No space left on device in /tmp/TuM5Ow (deleted) while writing 13586550656 bytes at offset 49146486784
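
As a rough aid for the size question above, the per-order counts in the Statistics block can be summed to compare an 8-gram model with a 5-gram one. The Python sketch below is not part of KenLM. Its proportional scaling of the 202 GB probing estimate is a crude assumption: it presumes a similar per-entry cost at every order and that an order-5 run would report the same counts for orders 1 to 5. Treat the result as an order-of-magnitude guide only.

```python
# Distinct n-gram counts per order, copied from the "Statistics:" block
# of the log above (8-gram run).
NGRAM_COUNTS = {
    1: 5924314,
    2: 108520273,
    3: 543892823,
    4: 1204990660,
    5: 1716616322,
    6: 1966436508,
    7: 2029467690,
    8: 1997628560,
}

# "probing 202 assuming -p 1.5" from the memory estimate in the same log.
PROBING_GB_ORDER_8 = 202


def total_ngrams(max_order):
    """Sum the n-gram counts for orders 1..max_order."""
    return sum(count for order, count in NGRAM_COUNTS.items() if order <= max_order)


def scaled_probing_estimate_gb(max_order):
    """ASSUMPTION: memory scales linearly with the total number of n-grams.

    This is a simplification for illustration, not how KenLM computes
    its own estimate.
    """
    return PROBING_GB_ORDER_8 * total_ngrams(max_order) / total_ngrams(8)


if __name__ == "__main__":
    for order in (5, 8):
        print(
            f"order {order}: {total_ngrams(order)} n-grams, "
            f"roughly {scaled_probing_estimate_gb(order):.0f} GB (probing, scaled)"
        )
```

Under these assumptions the 5-gram model keeps a bit more than a third of the n-grams of the 8-gram model, so its binary size should shrink by a similar factor.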


Regards,
Liling

------------------------------

Message: 2
Date: Wed, 25 Mar 2015 14:21:59 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] KenLM with 16GB of texts
To: liling tan <alvations@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <4a700d741b32e5892b9153b8aebef69a@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"



Hi,

You do not have enough space in /tmp; see "No space left on device in
/tmp/TuM5Ow". The "poison" message is just a side effect of that failure.
You can use the -T "path to more space" option to set a temporary path
where you have more space. You will probably need around 100-200 GB. (Is
that 16 GB of compressed or uncompressed text? If compressed, then
probably more.)
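
To make the advice above concrete, here is a minimal Python sketch that checks free space in a candidate temporary directory before building an lmplz command line with -T (lmplz is the KenLM estimation binary that prints this kind of log). The directory, corpus and output paths are placeholders, the helper names are hypothetical, and the 200 GB threshold is simply the upper end of the estimate above. The flags -o, -T, --text and --arpa are lmplz options; nothing is executed here.

```python
import shutil

GB = 10**9

# From the error message in the log: the failed write started at this
# offset and had this size, so that one temp file alone needed at least
# their sum in bytes. This is a lower bound, not a full estimate.
FAILED_WRITE_OFFSET = 49146486784
FAILED_WRITE_SIZE = 13586550656
MIN_BYTES_FOR_FAILED_FILE = FAILED_WRITE_OFFSET + FAILED_WRITE_SIZE


def free_gb(path):
    """Free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / GB


def has_enough_space(path, needed_gb):
    """True if `path` has at least `needed_gb` GB free."""
    return free_gb(path) >= needed_gb


def lmplz_command(order, temp_dir, text_path, arpa_path):
    """Build (but do not run) an lmplz invocation that uses `temp_dir`."""
    return [
        "lmplz",
        "-o", str(order),
        "-T", temp_dir,
        "--text", text_path,
        "--arpa", arpa_path,
    ]


if __name__ == "__main__":
    temp_dir = "."            # placeholder: a directory on a large disk
    needed_gb = 200           # assumption: upper end of the 100-200 GB guess
    if has_enough_space(temp_dir, needed_gb):
        print(" ".join(lmplz_command(8, temp_dir, "train-lm.en", "train-lm.arpa")))
    else:
        print(f"Only {free_gb(temp_dir):.0f} GB free in {temp_dir}; "
              f"want about {needed_gb} GB")
```

Checking up front is cheaper than discovering the shortage in step 4/5, several hours into the run.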

Best,

Marcin

On 2015-03-25 14:17, liling tan wrote:

> Dear Moses dev/users,
>
> Has anyone tried to build a language model from 16 GB of texts?
>
> What does "Last input should have been poison." mean?
>
> Does anyone know how to estimate the output size of the language model file given 16GB of texts with 8 grams? How about 5grams, how big will it get?
>
> We've tried to extract 8grams with 16GB of texts and we ended up with:
>
>> [...]
>
> Regards,
> Liling

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 101, Issue 70
**********************************************