Moses-support Digest, Vol 146, Issue 2

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: 5-gram discount out of range for adjusted count 2
(Kenneth Heafield)
2. Message to all list members (EUROPHRAS 2019)


----------------------------------------------------------------------

Message: 1
Date: Mon, 3 Dec 2018 17:39:20 +0000
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] 5-gram discount out of range for adjusted
count 2
To: James Baker <james.d.baker@gmail.com>
Cc: moses-support@mit.edu
Message-ID: <758a5fd1-ff93-fb8d-32a6-92cd45e2a3da@kheafield.com>
Content-Type: text/plain; charset="utf-8"

Hi,

What I think is going on is that the corpus has short sentences. Or at
least that's my stereotype of the GNOME and Ubuntu data from OPUS. So
there are not many ways to extend a 5-gram, which is confusing
Kneser-Ney. You can always duct-tape it with --discount_fallback.

Kenneth
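
For context, here is a minimal sketch of how modified Kneser-Ney discounts are estimated from count-of-counts. It assumes lmplz uses the standard closed-form estimates from Chen and Goodman, and it mirrors the range check quoted in the error message below (`discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j`). The function names and the example counts are hypothetical and are not taken from James's data.

```python
def mkn_discounts(n1, n2, n3, n4):
    """Closed-form modified Kneser-Ney discounts for a single n-gram order.

    n_k is the number of n-grams of this order whose adjusted count is
    exactly k. Returns [D1, D2, D3+].
    """
    if min(n1, n2, n3) <= 0:
        raise ValueError("n1, n2 and n3 must be positive")
    counts = [n1, n2, n3, n4]
    y = n1 / (n1 + 2.0 * n2)
    # D_k = k - (k + 1) * Y * n_{k+1} / n_k for k = 1, 2, 3
    return [k - (k + 1) * y * counts[k] / counts[k - 1] for k in (1, 2, 3)]


def in_range(discounts):
    """Same condition as the quoted check: each D_k must lie in [0, k]."""
    return all(0.0 <= d <= k for k, d in enumerate(discounts, start=1))


if __name__ == "__main__":
    # Hypothetical count-of-counts that fall off smoothly: all discounts valid.
    print(mkn_discounts(1000, 400, 200, 120))
    # Hypothetical skewed case: far more n-grams with count 3 than with
    # count 2, which pushes D2 below zero.
    print(mkn_discounts(1000, 100, 400, 50))
```

When n3 is large relative to n2, the D2 estimate goes negative, which is the kind of failure reported for adjusted count 2 in this thread. Passing --discount_fallback to lmplz, as suggested above, avoids the exception rather than fixing the underlying count distribution.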

On 12/3/18 12:52 PM, James Baker wrote:
> Strangely, if I take a random sample of 75% of that same data, it works
> just fine. I can use that for the time being, but it is a curious "feature"!
>
> James
>
> On Mon, 3 Dec 2018 at 12:34, James Baker <james.d.baker@gmail.com> wrote:
>
> What would constitute duplicated in this context? The number of
> duplicated lines in the document is relatively small, but it's
> possible some of the lines have similar text.
>
>     $ wc lm_data.en
>     1876364 21359196 96962517 lm_data.en
>     $ sort lm_data.en | uniq > lm_data_uniq.en
>     $ wc lm_data_uniq.en
>     1487703 15801025 71344598 lm_data_uniq.en
>
> I'd have thought there should be enough unique data in there though,
> as the file is a combined version of the following datasets from OPUS:
>
> * GNOME
> * OpenSubtitles 2018
> * Tanzil
> * Tatoeba
> * Ubuntu
>
> Thanks,
> James
>
> On Mon, 3 Dec 2018 at 11:58, Kenneth Heafield <moses@kheafield.com> wrote:
>
> Hi,
>
> If I had to guess, you have a lot of duplicated text?
>
> Kenneth
>
> On 12/3/18 11:23 AM, James Baker wrote:
>> Morning,
>>
>> I've been trying to train a language model using the following
>> command:
>>
>>     /opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp < lm_data.en > model.lm
>>
>> But I'm getting the following error:
>>
>>     === 1/5 Counting and sorting n-grams ===
>>     Reading /opt/model-builder/training/lm_data.en
>>     ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>>     ****************************************************************************************************
>>     Unigram tokens 21187448 types 117756
>>     === 2/5 Calculating and sorting adjusted counts ===
>>     Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296 5:22538960896
>>     terminate called after throwing an instance of 'lm::builder::BadDiscountException'
>>     what(): /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
>>     ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
>>
>> The data I'm training on has come from the OPUS project. I
>> found some references online to issues when there isn't enough
>> training data, but I think I have sufficient data and have
>> previously trained on a lot less (and even on a subset of my
>> current data):
>>
>>     $ wc lm_data.en
>>     1874495 21187448 96148754 lm_data.en
>>
>> Any ideas what might be causing the problem?
>>
>> James
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>


------------------------------

Message: 2
Date: Mon, 3 Dec 2018 16:41:01 +0100
From: EUROPHRAS 2019 <europhras2019@gmail.com>
Subject: [Moses-support] Message to all list members
To: <moses-support@mit.edu>
Message-ID:
<CAJW1vc5x=Nm3PWG6n2CLuXf-8zXUBTXP+DffzbhECc2E6ZCQLw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

*** Apologies for cross-posting ***


International Conference EUROPHRAS 2019: *"Computational and Corpus-based
Phraseology"*

Malaga, 25-27 September 2019


*Conference topics*

The forthcoming international conference "Computational and Corpus-based
Phraseology" will take place in Malaga from 25 to 27 September 2019.

The conference will focus on interdisciplinary approaches to phraseology
and will invite submissions on a wide range of topics, including, but not
limited to: corpus-based, psycholinguistic and cognitive approaches to the
study of phraseology, the computational treatment of multi-word
expressions, and practical applications in translation, lexicography and
language learning, teaching and assessment.

*Submissions and publication*


EUROPHRAS 2019 invites three types of submissions: regular papers, short
papers and poster presentations.

- Submissions as regular papers should not exceed 15 pages including
references; their minimum length should be 12 pages. The accepted regular
papers will be published in a Springer LNAI volume which will be available
at the time of the conference. The regular papers should be written in
English.
- Short papers should not exceed 7 pages excluding references. The
accepted short papers will be published as conference e-proceedings with
an ISBN and will also be available at the time of the conference. Short papers
can be in English or Spanish.
- Poster presentations should not exceed 4 pages excluding references.
Accepted papers for poster presentations will be included in the conference
e-proceedings along with the short papers. Poster presentations can be in
English and Spanish.

Each submission will be reviewed by at least three members of the Programme
Committee. The first call for papers will provide details on the submission
procedure and on the conference schedule, including submission and
notification deadlines.

The proceedings will be published as a volume and also in the form of
e-proceedings, both of which will be available at the conference. A call for
submissions describing continuations and/or further details of the studies
presented at EUROPHRAS 2019 will be announced after the conference, and the
accepted papers reporting these new developments will be published as
another volume and/or in a special issue of a journal.


*Keynote Speakers*

Sylviane Granger, Université Catholique de Louvain

Natalie Kübler, Paris Diderot University

Kathrin Steyer, Institute of German Language

Aline Villavicencio, Federal University of Rio Grande do Sul and University
of Essex



*Programme Committee*

The Programme Committee features experts in different aspects of
corpus-based and computational phraseology and includes:


Nicoletta Calzolari, Institute for Computational Linguistics

María Luisa Carrió Pastor, Polytechnic University of Valencia

Sheila Castilho, Dublin City University

Ken Church, Baidu

Jean-Pierre Colson, Université Catholique de Louvain

Gloria Corpas Pastor, University of Malaga (Co-Chair)

Anna Čermáková, Charles University

Dmitrij Dobrovolskij, Russian Language Institute

Peter Ďurčo, University of St. Cyril and Methodius

Natalia Filatkina, University of Trier

Thierry Fontenelle, Translation Centre for the Bodies of the European Union

José Enrique Gargallo, University of Barcelona

Ulrich Heid, University of Hildesheim

Elvira Manero, University of Murcia

Carmen Mellado Blanco, University of Santiago de Compostela

Ruslan Mitkov, University of Wolverhampton (Co-Chair)

Pedro Mogorrón Huerta, University of Alicante

Johanna Monti, "L'Orientale" University of Naples

Esteban T. Montoro, University of Granada

Sara Moze, University of Wolverhampton

Michael Oakes, University of Wolverhampton

Stéphane Patin, Paris Diderot University

Alain Polguère, University of Lorraine

Carlos Ramisch, Laboratoire d'Informatique Fondamentale de Marseille

Mª Ángeles Recio Ariza, University of Salamanca

Ute Römer, Georgia State University

Leonor Ruiz Gurillo, University of Alicante

Julia Sevilla Muñoz, Complutense University of Madrid

Kathrin Steyer, Institute of German Language

Joanna Szerszunowicz, University of Bialystok

Benjamin Tsou, City University of Hong Kong

Agnès Tutin, University of Stendhal

Aline Villavicencio, Federal University of Rio Grande do Sul and University of
Essex


*Organisation*

The forthcoming international conference "Computational and Corpus-based
Phraseology" is jointly organised by the European Association for
Phraseology EUROPHRAS, the University of Malaga (Research Group in
Lexicography and Translation), the University of Wolverhampton (Research
Group in Computational Linguistics) and the Association for Computational
Linguistics - Bulgaria.

*Further information and contact details*

The first call for papers is expected in January 2019 and registration will
be open as from April 2019.

The conference website (http://www.lexytrad.es/europhras2019) will be
updated on a regular basis. For further information, please email
europhras2019@gmail.com

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 146, Issue 2
*********************************************
