Moses-support Digest, Vol 82, Issue 29

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Kenlm, lmplz, pruning singleton n-grams, mmapping error
with build_binary (Kenneth Heafield)
2. Re: Kenlm, lmplz, pruning singleton n-grams, mmapping error
with build_binary (Jonathan Clark)


----------------------------------------------------------------------

Message: 1
Date: Wed, 21 Aug 2013 12:05:15 +0100
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Kenlm, lmplz, pruning singleton n-grams,
mmapping error with build_binary
To: moses-support@mit.edu
Message-ID: <52149EEB.6070200@kheafield.com>
Content-Type: text/plain; charset=ISO-8859-1

On 08/21/13 11:20, Marcin Junczys-Dowmunt wrote:
> Hi,
>
> This should probably go directly to Kenneth, but I guess the answers
> might be interesting for others, too.
>
> 1) Is there a way to simulate the pruning function for singleton n-grams
> (as in IRSTLM) when using lmplz from kenlm? I guess this is not quite
> straightforward with the Improved Kneser-Ney smoothing used by lmplz. If
> I do it manually I probably need an n-gram frequency list, or can I
> somehow infer directly from the generated ARPA file what to remove?

Entropy pruning is a bad idea \cite{entropybad,entropybad2}.

@inproceedings{entropybad,
author={Ciprian Chelba and Thorsten Brants and Will Neveitt and Peng Xu},
year={2010},
pages={2242--2245},
booktitle={Proceedings of Interspeech},
title={Study on Interaction between Entropy Pruning and {K}neser-{N}ey
Smoothing},
}
@inproceedings{entropybad2,
author={Robert C. Moore and Chris Quirk},
title={Less is More: Significance-Based N-gram Selection for Smaller,
Better Language Models},
booktitle={Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing},
pages={746--755},
year={2009},
month={August},
}

Postprocessing to remove count-1 n-grams also won't accomplish what you
want, because statistics like backoff weights change for the n-grams you
keep. The model won't sum to 1 anymore, for example.
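To see why the sum breaks, here is a toy two-word backoff model (an illustrative sketch, not lmplz output): with the bigram present, the conditional distribution over the vocabulary sums to 1; dropping the bigram while keeping the stale backoff weight leaves mass missing.

```python
# Toy ARPA-style backoff bigram model over vocabulary {a, b}.
unigram = {"a": 0.6, "b": 0.4}
bigram = {("a", "b"): 0.5}  # explicit p(b | a)

def backoff_weight(h, bigram):
    # Standard backoff weight: leftover bigram probability mass
    # divided by leftover unigram probability mass for history h.
    seen = [w for (hist, w) in bigram if hist == h]
    return (1 - sum(bigram[(h, w)] for w in seen)) / \
           (1 - sum(unigram[w] for w in seen))

def p(w, h, bigram, bow):
    # Back off to bow(h) * p(w) when the bigram is absent.
    return bigram.get((h, w), bow * unigram[w])

bow_a = backoff_weight("a", bigram)
total = sum(p(w, "a", bigram, bow_a) for w in unigram)
print(total)  # approximately 1.0: the intact model is normalized

# Remove the bigram post hoc, but keep the old backoff weight.
total_pruned = sum(p(w, "a", {}, bow_a) for w in unigram)
print(total_pruned)  # noticeably less than 1: normalization is broken
```

Recomputing the backoff weight for the pruned model would restore normalization, which is exactly the statistic that post-hoc removal fails to update.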

The "correct" way is to remove low-count entries after collecting
discounts but before computing uninterpolated probabilities. And only
remove low-count entries if they are not a substring of another entry.
There's also some question as to whether pruning should be done with raw
counts or adjusted counts. Hermann Ney says both are incorrect.
Currently neither is implemented in lmplz. I look forward to working
with you at MT Marathon.
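The substring constraint above can be sketched as follows (a hypothetical helper operating on a count table, not lmplz code): a count-1 entry is only removable if it does not occur as a contiguous substring of any n-gram that survives pruning.

```python
def prune_singletons(counts):
    """Sketch: drop count-1 n-grams, but keep any that occur as a
    contiguous substring of a retained n-gram. `counts` maps n-gram
    tuples to raw (or adjusted) counts."""
    kept = {ng for ng, c in counts.items() if c > 1}
    # Collect every contiguous substring of every kept n-gram.
    protected = set()
    for ng in kept:
        for i in range(len(ng)):
            for j in range(i + 1, len(ng) + 1):
                protected.add(ng[i:j])
    return {ng: c for ng, c in counts.items() if c > 1 or ng in protected}

counts = {("the",): 5, ("the", "cat"): 1,
          ("the", "cat", "sat"): 2, ("dog",): 1}
pruned = prune_singletons(counts)
# ("the", "cat") survives because it is a substring of the kept
# trigram; ("dog",) is an unprotected singleton and is dropped.
```

As noted above, this step would have to run after collecting discounts but before computing uninterpolated probabilities to give a properly normalized model.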

>
> 2) Another problem: I have generated a 73 GB plain-text ARPA file with
> lmplz. When I run build_binary, I get the following error message right
> away:
>
> ./kenlm/bin/build_binary train.lm.no-tag.de.arpa train.lm.no-tag.de.kenlm
>
> util/mmap.cc:115 in void* util::MapOrThrow(std::size_t, bool, int, bool,
> int, uint64_t) threw ErrnoException because `(ret = mmap(__null, size,
> protect, flags, fd, offset)) == ((void *) -1)'.
> Cannot allocate memory mmap failed for size 36347339336 at offset 0
> Byte: 97 File: train.lm.no-tag.de.arpa
> ERROR
>
> I have mmapped 36 GB files successfully in the past (though not on this
> machine), so this is strange. There is also a lot of free disk space
> available, but memory is rather limited: only 8 GB.

Increase your ulimit -v. Even if ulimit says "unlimited", there's still
a default limit on virtual memory size.
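The same limit can be inspected and raised programmatically; a sketch using Python's resource module (RLIMIT_AS is the address-space limit that "ulimit -v" reports, in kilobytes at the shell):

```python
import resource

# RLIMIT_AS caps the process's virtual address space; mmapping the
# whole 36 GB region fails with ENOMEM if the soft limit is smaller.
needed = 36_347_339_336  # bytes, from the mmap error message above
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if soft != resource.RLIM_INFINITY and soft < needed:
    # Raise the soft limit as far as the hard limit allows; only a
    # privileged process may raise the hard limit itself.
    new_soft = hard if hard == resource.RLIM_INFINITY else min(needed, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
```

If the hard limit is already below the mapping size, the shell-level fix (ulimit -v as root, or /etc/security/limits.conf) is the way out.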

But honestly you really don't want to build a linear probing model with
memory mapping like that. Use the trie data structure, which will write
5 sequential streams.
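Concretely, build_binary takes the data structure as a positional argument before the ARPA file, so the invocation from the original message becomes (sketched here as a Python wrapper; the path and filenames are from the message above):

```python
import subprocess

# "trie" selects the trie layout, which build_binary writes as
# sequential streams instead of one large mmapped probing table.
cmd = ["./kenlm/bin/build_binary", "trie",
       "train.lm.no-tag.de.arpa", "train.lm.no-tag.de.kenlm"]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where kenlm is built
```

The trie variant is also considerably smaller in RAM than the probing table, which matters on an 8 GB machine.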

Also, there's code in a branch to build a trie without mmapping the
entire file, just chunks at a time. I just haven't put the icing on yet,
like a useful command-line interface.

>
> Thanks,
>
> Marcin
>
>
>
> P.S.: Love the speed of lmplz!
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


------------------------------

Message: 2
Date: Wed, 21 Aug 2013 08:07:21 -0700
From: Jonathan Clark <jon.h.clark@gmail.com>
Subject: Re: [Moses-support] Kenlm, lmplz, pruning singleton n-grams,
mmapping error with build_binary
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAC9XMS3LPqFm75EkwGy-ziP3LZW4GxJry1Pr64bbVL6Yk-SGTQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Also, just to be nit-picky: lmplz implements order-interpolated Modified
Kneser-Ney smoothing, not Improved Kneser-Ney smoothing, which is a
smoothing technique unique to IRSTLM; IKN is referred to as "modified
shift beta" in recent versions of IRSTLM.



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 82, Issue 29
*********************************************
