Moses-support Digest, Vol 106, Issue 16

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: EMS results - makes sense ? (Philipp Koehn)
2. Re: EMS results - makes sense ? (Dingyuan Wang)


----------------------------------------------------------------------

Message: 1
Date: Thu, 6 Aug 2015 12:00:41 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Barry Haddow <bhaddow@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDCFrKFFx+5DJiuy1OAwaOdRW_+7ZdiVjvAZRPu+ekgAUA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

if you run into memory problems with fast_align, you can
add the following to the [TRAINING] section:

fast-align-max-lines = 1000000

This will run fast_align in chunks of 1 million sentence pairs.
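
As a rough sketch, assuming an otherwise standard EMS configuration, the
relevant part of the config would then contain:

  [TRAINING]
  # process the corpus in chunks of at most 1M sentence pairs to bound memory use
  fast-align-max-lines = 1000000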

-phi


On Thu, Aug 6, 2015 at 7:28 AM, Barry Haddow <bhaddow@inf.ed.ac.uk> wrote:

> Hi Vincent
>
> It's a SIGKILL. Probably means it ran out of memory.
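
One way to confirm an out-of-memory kill, assuming the job ran on a Linux
host whose kernel log you can read, is to look for the OOM killer's trace
around the time of the crash:

  dmesg -T | grep -i -E 'out of memory|killed process'
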
>
> I'd recommend fast_align for this data set. Even if you manage to get it
> running with mgiza it will still take a week or so.
>
> Just add
> fast-align-settings = "-d -o -v"
> to the [TRAINING] section of EMS, and make sure that fast_align is in your
> external-bin-dir.
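
In a typical EMS config that could look roughly like the following; the
directory is only a placeholder for wherever fast_align and the other
alignment tools are installed:

  [GENERAL]
  external-bin-dir = /home/moses/working/bin/training-tools

  [TRAINING]
  fast-align-settings = "-d -o -v"
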
>
> cheers - Barry
>
>
> On 06/08/15 08:40, Vincent Nguyen wrote:
>
>
> so I dropped my hierarchical model since I got an error.
> Switched back to the "more data" approach by adding the Giga FR-EN source,
> but now another error pops up when running GIZA inverse:
>
> Using SCRIPTS_ROOTDIR: /home/moses/mosesdecoder/scripts
> Using multi-thread GIZA
> using gzip
> (2) running giza @ Wed Aug 5 21:03:56 CEST 2015
> (2.1a) running snt2cooc fr-en @ Wed Aug 5 21:03:56 CEST 2015
> Executing: mkdir -p /home/moses/working/training/giza-inverse.7
> Executing: /home/moses/working/bin/training-tools/mgizapp/snt2cooc
> /home/moses/working/training/giza-inverse.7/fr-en.cooc
> /home/moses/working/training/prepared.7/en.vcb
> /home/moses/working/training/prepared.7/fr.vcb
> /home/moses/working/training/prepared.7/fr-en-int-train.snt
> line 1000
> line 2000
>
> ...
> line 6609000
> line 6610000
> ERROR: Execution of:
> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
> /home/moses/working/training/giza-inverse.7/fr-en.cooc
> /home/moses/working/training/prepared.7/en.vcb
> /home/moses/working/training/prepared.7/fr.vcb
> /home/moses/working/training/prepared.7/fr-en-int-train.snt
> died with signal 9, without coredump
>
>
> any clue what signal 9 means ?
>
>
>
> On 04/08/2015 at 17:28, Barry Haddow wrote:
>
> Hi Vincent
>
> If you are comparing to the results of WMT11, then you can look at the
> system descriptions to see what the authors did. In fact it's worth looking
> at the WMT14 descriptions (WMT15 will be available next month) to see how
> state-of-the-art systems are built.
>
> For fr-en or en-fr, the first thing to look at is the data. There are some
> large data sets released for WMT and you can get a good gain from just
> crunching more data (monolingual and parallel). Unfortunately this takes
> more resources (disk, CPU, etc.), so you may run into trouble here.
>
> The hierarchical models are much bigger, so yes, you will need more disk.
> For fr-en/en-fr it's probably not worth the extra effort.
>
> cheers - Barry
>
> On 04/08/15 15:58, Vincent Nguyen wrote:
>
> thanks for your insights.
>
> I am just struck by the BLEU difference between my 26 and the 30 of
> WMT11, and some WMT14 results close to 36 or even 39.
>
> I am currently having trouble with a hierarchical rule set (instead of
> lexical reordering). I was wondering if I would get better results, but I
> get a "filesystem root low disk space" error message before it crashes.
> Does this model take more disk space in some way?
>
> Next I will try to use more corpora, including in-domain data from my
> internal TMX.
>
> thanks for your answers.
>
> On 04/08/2015 at 16:02, Hieu Hoang wrote:
>
>
> On 03/08/2015 13:00, Vincent Nguyen wrote:
>
> Hi,
>
> Just a heads up on some EMS results, to get your experienced opinions.
>
> Corpus: Europarlv7 + NC2010
> fr => en
> Evaluation NC2011.
>
> 1) IRSTLM is much slower than KenLM for training / tuning.
>
> that sounds right. KenLM is also multi-threaded; IRSTLM can only be
> used in single-threaded decoding.
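
For example, with a KenLM-based model the decoder can simply be run with
several threads; the paths here are placeholders:

  ~/mosesdecoder/bin/moses -f ~/working/model/moses.ini -threads 8 < test.fr > test.out.en
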
>
> 2) BLEU results are almost the same (25.7 with IRSTLM, 26.14 with KenLM)
>
> true
>
> 3) Compact mode is faster than on-disk in a short test (77 segments:
> 96 seconds vs. 126 seconds)
>
> true
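
For reference, the compact representations are produced offline with the
Moses tools shown below; the file names and the number of scores are
illustrative only:

  ~/mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 4
  ~/mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 4
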
>
> 4) One last thing I do not understand though:
> As a sanity check, I replaced NC2011 with NC2010 in the evaluation (I know
> that, since NC2010 is part of the training data, this is not a relevant test).
> I got roughly the same BLEU score. I would have expected a higher score
> with a test set included in the training corpus.
>
> Does that make sense?
>
>
> Next steps:
> What path should I take to get better scores? I read the 'optimize'
> section of the website, which deals mostly with speed;
> of course I will apply all of it, but I was interested in tips to
> get more quality if possible.
>
> look into domain adaptation if you have multiple training corpora,
> some of which are in-domain and some out-of-domain.
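
As a rough illustration of the EMS side of that (corpus names and paths are
placeholders), each corpus gets its own section, and the language model can
then be biased towards the in-domain data:

  [CORPUS:europarl]
  raw-stem = $data-dir/europarl-v7.fr-en

  [CORPUS:inhouse]
  raw-stem = $data-dir/inhouse-tmx.fr-en

  # EMS also provides an [INTERPOLATED-LM] section for weighting the
  # per-corpus language models towards an in-domain tuning set
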
>
> Other than that, getting a good BLEU score is an open research question.
>
> Well done on getting this far.
>
>
> Thanks
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150806/11b6016b/attachment-0001.htm

------------------------------

Message: 2
Date: Fri, 7 Aug 2015 00:12:49 +0800
From: Dingyuan Wang <abcdoyle888@gmail.com>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: moses-support <moses-support@mit.edu>, Philipp Koehn <phi@jhu.edu>
Message-ID:
<CAFt8H759dYrEppy=56Gv7ViBfMed+1WK8Udgeb7wETfunWqYig@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

When was 'fast-align-max-lines' added? That's convenient. I previously had
to write a wrapper script to limit the number of lines processed.
Also, is there a way not to run the two directions' fast_align/mgiza jobs in
parallel? I have enough memory to run one at a time, but not two. (I also
wrote a wrapper script to block one.)
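
A wrapper along the lines below (script name, lock-file path, and the
wrapped command are all placeholders) is one way to make the two alignment
runs take turns rather than run concurrently:

  #!/bin/sh
  # run-serialized.sh: wait for an exclusive lock, then run the given command
  exec flock /tmp/word-align.lock "$@"

Invoking both directions through this wrapper makes whichever job starts
second block until the first one releases the lock.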

> On 7 Aug 2015 at 00:01, "Philipp Koehn" <phi@jhu.edu> wrote:
>>
>> Hi,
>>
>> if you run into memory problems with fast_align, you can
>> add the following to the [TRAINING] section:
>>
>> fast-align-max-lines = 1000000
>>
>> This will run fast_align in chunks of 1 million sentence pairs.
>>
>> -phi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150807/f9197814/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 106, Issue 16
**********************************************
