Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: EMS results - makes sense ? (Hieu Hoang)
2. Re: EMS results - makes sense ? (Barry Haddow)
3. Editing Phrase table in SMT (SANJANASRI JP)
4. Re: EMS results - makes sense ? (Vincent Nguyen)
5. Re: EMS results - makes sense ? (Kenneth Heafield)
----------------------------------------------------------------------
Message: 1
Date: Wed, 5 Aug 2015 11:43:18 +0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Vincent Nguyen <vnguyen@neuf.fr>, Barry Haddow
<bhaddow@inf.ed.ac.uk>, moses-support <moses-support@mit.edu>
Message-ID: <55C1BE96.2070902@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Are you sure it didn't run out of disk space again? Check the
TRAINING_extract.*.STDERR files for messages.
Also, because extraction and scoring are run in parallel, the error
messages sometimes overwrite each other, so you don't always get a clear
message. You have to use your intuition.
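For example, a quick way to scan those logs (a sketch; the exact path
depends on your EMS working directory, and "no space left" is the usual
ENOSPC wording):

    grep -il -e "error" -e "no space left" TRAINING_extract.*.STDERR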
On 05/08/2015 11:31, Vincent Nguyen wrote:
>
> I increased the disk space, but the hierarchical training crashes for
> another reason I do not understand.
> It happens at build_ttable time.
> I attached the error + config.
> Cheers,
> Vincent
>
> On 04/08/2015 17:28, Barry Haddow wrote:
>> Hi Vincent
>>
>> If you are comparing to the results of WMT11, then you can look at
>> the system descriptions to see what the authors did. In fact it's
>> worth looking at the WMT14 descriptions (WMT15 will be available next
>> month) to see how state-of-the-art systems are built.
>>
>> For fr-en or en-fr, the first thing to look at is the data. There are
>> some large data sets released for WMT and you can get a good gain
>> from just crunching more data (monolingual and parallel).
>> Unfortunately this takes more resources (disk, CPU, etc.) so you may
>> run into trouble here.
>>
>> The hierarchical models are much bigger, so yes, you will need more
>> disk. For fr-en/en-fr it's probably not worth the extra effort.
>>
>> cheers - Barry
>>
>> On 04/08/15 15:58, Vincent Nguyen wrote:
>>> thanks for your insights.
>>>
>>> I am just struck by the BLEU difference between my 26 and the 30 of
>>> WMT11, and some WMT14 results close to 36 or even 39.
>>>
>>> I am currently trying a hierarchical rule set instead of lexical
>>> reordering, wondering if I will get better results, but I get a
>>> "filesystem root low disk space" error message before it crashes.
>>> Is this model taking more disk space in some way?
>>>
>>> Next I will try to use more corpora, including in-domain data from
>>> my internal TMX.
>>>
>>> thanks for your answers.
>>>
>>> On 04/08/2015 16:02, Hieu Hoang wrote:
>>>>
>>>> On 03/08/2015 13:00, Vincent Nguyen wrote:
>>>>> Hi,
>>>>>
>>>>> Just a heads up on some EMS results, to get your experienced
>>>>> opinions.
>>>>>
>>>>> Corpus: Europarlv7 + NC2010
>>>>> fr => en
>>>>> Evaluation NC2011.
>>>>>
>>>>> 1) IRSTLM is much slower than KenLM for training / tuning.
>>>> that sounds right. KenLM is also multithreaded, IRSTLM can only be
>>>> used in single-threaded decoding.
>>>>> 2) BLEU results are almost the same (25.7 with IRSTLM, 26.14 with
>>>>> KenLM)
>>>> true
>>>>> 3) Compact Mode is faster than onDisk in a short test (77
>>>>> segments: 96 seconds vs. 126 seconds)
>>>> true
>>>>> 4) One last thing I do not understand, though:
>>>>> for the sake of checking, I replaced NC2011 with NC2010 in the
>>>>> evaluation (I know this is not a valid test, since NC2010 is part
>>>>> of the training data). I got roughly the same BLEU score. I would
>>>>> have expected a higher score with a test set included in the
>>>>> training corpus.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>>
>>>>> Next steps:
>>>>> What path should I take to get better scores? I read the 'optimize'
>>>>> section of the website, which deals more with speed; of course I
>>>>> will apply all of it, but I am interested in tips for getting more
>>>>> quality, if possible.
>>>> look into domain adaptation if you have multiple training corpora,
>>>> some of which are in-domain and some out-of-domain.
>>>>
>>>> Other than that, getting a good BLEU score is an open research
>>>> question.
>>>>
>>>> Well done on getting this far.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>>
>
--
Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu
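As a footnote to the domain-adaptation suggestion quoted above: in EMS
this is usually set up with one [CORPUS:...] section per corpus, plus an
interpolated language model. A minimal sketch, with section names as in
the stock experiment.perl example configs and placeholder paths:

    # one section per training corpus, in-domain and out-of-domain
    [CORPUS:europarl]
    raw-stem = $data-dir/europarl-v7.fr-en

    [CORPUS:nc]
    raw-stem = $data-dir/news-commentary.fr-en

    # build one LM per corpus, then interpolate them; the interpolation
    # weights are tuned to minimise perplexity on the tuning set
    [INTERPOLATED-LM]
    script = $moses-script-dir/ems/support/interpolate-lm.perl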
------------------------------
Message: 2
Date: Wed, 05 Aug 2015 09:18:58 +0100
From: Barry Haddow <bhaddow@inf.ed.ac.uk>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Hieu Hoang <hieuhoang@gmail.com>, Vincent Nguyen
<vnguyen@neuf.fr>, moses-support <moses-support@mit.edu>
Message-ID: <55C1C6F2.1050907@inf.ed.ac.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Could it be this zlib bug?
http://permalink.gmane.org/gmane.comp.nlp.moses.user/10151
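A rough way to test for that kind of corruption is to integrity-check
the compressed extract files; the path below is a placeholder for
wherever EMS put them:

    for f in /path/to/training/extract.*.gz; do
        gzip -t "$f" || echo "corrupt: $f"
    done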
On 05/08/15 08:43, Hieu Hoang wrote:
> Are you sure it didn't run out of disk space again? Check the
> TRAINING_extract.*.STDERR files for messages.
>
> Also, because extraction and scoring are run in parallel, the error
> messages sometimes overwrite each other, so you don't always get a
> clear message. You have to use your intuition.
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
------------------------------
Message: 3
Date: Wed, 5 Aug 2015 15:06:52 +0530
From: SANJANASRI JP <sanjanasrijp@gmail.com>
Subject: [Moses-support] Editing Phrase table in SMT
To: moses-support@mit.edu
Message-ID:
<CALfZiJc2YS9b3s=ahwLBq0cBg5HYomY6BWq3brbdbGqe7rj3Og@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Dear all,
I am a newbie to SMT. My question is: how can I edit the phrase table
features? In brief, if I add more features to the phrase table manually,
I get an error stating "num feature(5!=4)", since 4 seems to be the
default number of features in the moses.ini file. Even when I try to
edit num-feature=5 in the moses file, I still get an error. I don't know
where to look all this up. Please help: what should I actually do, and
where am I going wrong? I would be really happy to get an earnest reply.
Regards,
SANJANASRI J.P
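For reference, the number of scores on each phrase-table line has to
match the num-features declared on the phrase table's feature line in
moses.ini, and the [weight] section needs one weight per feature. A
minimal sketch, assuming the default four translation-model scores
(names and paths are placeholders):

    # one phrase-table line with the default four scores
    das Haus ||| the house ||| 0.8 0.6 0.7 0.5

    # after adding a fifth score to every line, declare it in moses.ini
    [feature]
    PhraseDictionaryMemory name=TranslationModel0 num-features=5 input-factor=0 output-factor=0 path=/path/to/phrase-table.gz

    [weight]
    TranslationModel0= 0.2 0.2 0.2 0.2 0.2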
------------------------------
Message: 4
Date: Wed, 5 Aug 2015 12:33:33 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Barry Haddow <bhaddow@inf.ed.ac.uk>, Hieu Hoang
<hieuhoang@gmail.com>, moses-support <moses-support@mit.edu>
Message-ID: <55C1E67D.50303@neuf.fr>
Content-Type: text/plain; charset=utf-8; format=flowed
I doubt both explanations.
The extract job finished at 1:42 AM; the other error popped up much later.
And the what() refers to consolidate-main.cpp, not to the gzip stuff.
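For what it's worth, whether the disk was actually full when consolidate
died can be checked directly against the working directory's filesystem
(placeholder path):

    df -h /path/to/working-dir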
On 05/08/2015 10:18, Barry Haddow wrote:
> Could it be this zlib bug?
>
> http://permalink.gmane.org/gmane.comp.nlp.moses.user/10151
>
> On 05/08/15 08:43, Hieu Hoang wrote:
>> are you sure it didn't run out of disk space again? check in the
>> TRAINING_extract.*.STDERR file for messages.
>>
>> Also, because extract and scoring it is run in parallel, the error
>> messages sometimes overwrite each other so you don't get clear
------------------------------
Message: 5
Date: Wed, 05 Aug 2015 11:39:06 +0100
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: moses-support@mit.edu
Message-ID: <55C1E7CA.2090208@kheafield.com>
Content-Type: text/plain; charset=utf-8
Looking for a Perl volunteer to:
1. Always run commands under set -e -o pipefail conditions so errors are
likely to be reported in the return code.
2. Actually check the return code and die on failure.
It shouldn't be guesswork when one runs out of disk space.
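For instance, a minimal sketch of the wrapper being asked for
(hypothetical helper, not existing EMS code; the pipeline at the end is
a placeholder):

    use strict;
    use warnings;

    # Run a shell pipeline under `set -e -o pipefail`, so a failure in
    # the middle of a pipe (e.g. gzip hitting a full disk) shows up in
    # the return code, then die if anything went wrong.
    sub safe_system {
        my ($cmd) = @_;
        my $ret = system('bash', '-c', "set -e -o pipefail; $cmd");
        die "Command failed (exit " . ($ret >> 8) . "): $cmd\n" if $ret != 0;
    }

    safe_system('extract-phrases | gzip > extract.gz');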
On 08/05/15 08:43, Hieu Hoang wrote:
> Are you sure it didn't run out of disk space again? Check the
> TRAINING_extract.*.STDERR files for messages.
>
> Also, because extraction and scoring are run in parallel, the error
> messages sometimes overwrite each other, so you don't always get a
> clear message. You have to use your intuition.
>
>
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 106, Issue 9
*********************************************