Moses-support Digest, Vol 146, Issue 1

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. 5-gram discount out of range for adjusted count 2 (James Baker)
2. Re: 5-gram discount out of range for adjusted count 2
(Kenneth Heafield)
3. Re: 5-gram discount out of range for adjusted count 2
(James Baker)


----------------------------------------------------------------------

Message: 1
Date: Mon, 3 Dec 2018 11:23:01 +0000
From: James Baker <james.d.baker@gmail.com>
Subject: [Moses-support] 5-gram discount out of range for adjusted
count 2
To: moses-support@mit.edu
Message-ID:
<CAOa=L2zQVymwvnu4DnOtjpgM+nqAGBybj=oii6cc3Oj9NLSAnw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Morning,

I've been trying to train a language model using the following command:

/opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp < lm_data.en > model.lm

But I'm getting the following error:

=== 1/5 Counting and sorting n-grams ===
Reading /opt/model-builder/training/lm_data.en

----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

****************************************************************************************************
Unigram tokens 21187448 types 117756
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296 5:22538960896
terminate called after throwing an instance of 'lm::builder::BadDiscountException'
what(): /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61
in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const
lm::builder::DiscountConfig&) threw BadDiscountException because
`discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
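
For reference, the range check quoted in the exception can be written out. This is a sketch that assumes lmplz estimates modified Kneser-Ney discounts with the usual closed-form formulas of Chen and Goodman. The symbols n_k (the number of distinct n-grams of a given order whose adjusted count is exactly k), Y and D_k are notation introduced for this note, not names taken from the log.

```latex
\[
Y = \frac{n_1}{n_1 + 2 n_2}, \qquad
D_k = k - (k+1)\, Y\, \frac{n_{k+1}}{n_k}, \quad k = 1, 2, 3 .
\]
% The condition in the exception, amount[j] < 0.0 || amount[j] > j,
% corresponds to requiring 0 \le D_k \le k.
\[
D_2 < 0
\iff 3\, Y\, \frac{n_3}{n_2} > 2
\iff 3\, n_1 n_3 > 2\, n_2 \,(n_1 + 2 n_2) .
\]
```

Under that assumption, the reported value of -6.80247 for adjusted count 2 would mean 3 Y n_3 / n_2 is about 8.8, so there are far more 5-grams with adjusted count 3 than the number with adjusted count 2 would normally allow. One plausible source of such a skew is text that is repeated several times, although the log alone does not show which counts are responsible.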

The data I'm training on has come from the OPUS project. I found some
references online to issues when there isn't enough training data, but I
think I have sufficient data and have previously trained on a lot less (and
even on a subset of my current data):

$ wc lm_data.en
1874495 21187448 96148754 lm_data.en

Any ideas what might be causing the problem?

James

------------------------------

Message: 2
Date: Mon, 3 Dec 2018 11:49:02 +0000
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] 5-gram discount out of range for adjusted
count 2
To: moses-support@mit.edu
Message-ID: <125a386b-059f-f73a-8f98-df2d5e83b0de@kheafield.com>
Content-Type: text/plain; charset="utf-8"

Hi,

If I had to guess, you have a lot of duplicated text?

Kenneth


------------------------------

Message: 3
Date: Mon, 3 Dec 2018 12:34:06 +0000
From: James Baker <james.d.baker@gmail.com>
Subject: Re: [Moses-support] 5-gram discount out of range for adjusted
count 2
To: moses@kheafield.com
Cc: moses-support@mit.edu
Message-ID:
<CAOa=L2zsjDR39kfQpZXJFGxetVZ5F7W1o_Mc4SvO=Gn-3kUL-g@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

What would count as duplicated text in this context? The number of duplicated
lines in the document is relatively small, but it's possible that some of the
lines have similar text.

$ wc lm_data.en
1876364 21359196 96962517 lm_data.en
$ sort lm_data.en | uniq > lm_data_uniq.en
$ wc lm_data_uniq.en
1487703 15801025 71344598 lm_data_uniq.en
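
The counts above work out to 1876364 - 1487703 = 388661 repeated lines, about 20.7% of the file. As a minimal sketch of how to get that figure in one pass and to see which lines repeat most often, assuming a POSIX shell with awk, sort, uniq and head available (the function names dup_stats and top_dups are made up for this example and are not part of Moses or KenLM):

```shell
# dup_stats FILE
# Prints: total lines, distinct lines, and the percentage of lines that
# repeat an earlier line. (Hypothetical helper for illustration only.)
dup_stats() {
  awk '!seen[$0]++ { distinct++ }
       END {
         pct = (NR > 0) ? 100 * (NR - distinct) / NR : 0
         printf "%d %d %.1f\n", NR, distinct, pct
       }' "$1"
}

# top_dups FILE [N]
# Prints the N most frequent lines (default 10), each prefixed by its count.
# LC_ALL=C makes sort and uniq compare lines byte by byte.
top_dups() {
  LC_ALL=C sort "$1" | LC_ALL=C uniq -c | sort -rn | head -n "${2:-10}"
}
```

Running dup_stats lm_data.en should agree with the wc figures above, and top_dups lm_data.en 20 would list the 20 most repeated lines. Line-level counts only catch exact repeats, so near-identical lines (for example the same sentence with different punctuation) would not show up here.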

I'd have thought there should be enough unique data in there though, as the
file is a combined version of the following datasets from OPUS:

* GNOME
* OpenSubtitles 2018
* Tanzil
* Tatoeba
* Ubuntu

Thanks,
James


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 146, Issue 1
*********************************************
