Moses-support Digest, Vol 106, Issue 19

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: EMS results - makes sense ? (Philipp Koehn)
2. Re: EMS results - makes sense ? (Vincent Nguyen)


----------------------------------------------------------------------

Message: 1
Date: Thu, 6 Aug 2015 10:45:23 -0700
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Vincent Nguyen <vnguyen@neuf.fr>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDA4_h_QqVovxgxwRKxL8r9tyJ7oavmahJMXVD16TQxcQQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

I have no hard numbers for this - just give it a try, and if it needs
too much RAM, reduce to 100,000.

-phi
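
For reference, the two EMS settings discussed in this thread combine into
a [TRAINING] stanza roughly like this (a sketch; the 100,000 value is the
smaller batch size suggested above, for when the 1-million default needs
too much RAM):

    [TRAINING]
    # use fast_align instead of mgiza; fast_align must be in external-bin-dir
    fast-align-settings = "-d -o -v"
    # align in batches of 100,000 sentence pairs to bound memory use
    fast-align-max-lines = 100000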

On Thu, Aug 6, 2015 at 10:17 AM, Vincent Nguyen <vnguyen@neuf.fr> wrote:

>
> is there a rough estimate of how many gigs of RAM you need for
> 1-million-sentence-pair batches?
>
> On 06/08/2015 18:00, Philipp Koehn wrote:
>
> Hi,
>
> if you run into memory problems with fast_align, you can
> add the following in the [TRAINING] section:
>
> fast-align-max-lines = 1000000
>
> This will run fast-align in batches of 1 million sentence pairs.
>
> -phi
>
>
> On Thu, Aug 6, 2015 at 7:28 AM, Barry Haddow <bhaddow@inf.ed.ac.uk> wrote:
>
>> Hi Vincent
>>
>> It's a SIGKILL. Probably means it ran out of memory.
>>
>> I'd recommend fast_align for this data set. Even if you manage to get it
>> running with mgiza it will still take a week or so.
>>
>> Just add
>> fast-align-settings = "-d -o -v"
>> to the TRAINING section of ems, and make sure that fast_align is in your
>> external-bin-dir.
>>
>> cheers - Barry
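
Signal 9 is SIGKILL, and when the kernel's OOM killer is the sender it
leaves a trace in the kernel log. A quick check on Linux (a sketch; the
exact log wording varies by kernel):

    # confirm that signal 9 is SIGKILL
    kill -l 9
    # look for evidence that the OOM killer terminated the process
    dmesg | grep -i "killed process"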
>>
>>
>> On 06/08/15 08:40, Vincent Nguyen wrote:
>>
>>
>> so I dropped my hierarchical model since I got an error.
>> Switched back to the "more data" approach by adding the Giga FR-EN source,
>> but now another error pops up when running GIZA inverse:
>>
>> Using SCRIPTS_ROOTDIR: /home/moses/mosesdecoder/scripts
>> Using multi-thread GIZA
>> using gzip
>> (2) running giza @ Wed Aug 5 21:03:56 CEST 2015
>> (2.1a) running snt2cooc fr-en @ Wed Aug 5 21:03:56 CEST 2015
>> Executing: mkdir -p /home/moses/working/training/giza-inverse.7
>> Executing: /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> line 1000
>> line 2000
>>
>> ...
>> line 6609000
>> line 6610000
>> ERROR: Execution of:
>> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> died with signal 9, without coredump
>>
>>
>> any clue what signal 9 means?
>>
>>
>>
>> On 04/08/2015 17:28, Barry Haddow wrote:
>>
>> Hi Vincent
>>
>> If you are comparing to the results of WMT11, then you can look at the
>> system descriptions to see what the authors did. In fact it's worth looking
>> at the WMT14 descriptions (WMT15 will be available next month) to see how
>> state-of-the-art systems are built.
>>
>> For fr-en or en-fr, the first thing to look at is the data. There are
>> some large data sets released for WMT and you can get a good gain from just
>> crunching more data (monolingual and parallel). Unfortunately this takes
>> more resources (disk, CPU, etc.), so you may run into trouble here.
>>
>> The hierarchical models are much bigger so yes you will need more disk.
>> For fr-en/en-fr it's probably not worth the extra effort.
>>
>> cheers - Barry
>>
>> On 04/08/15 15:58, Vincent Nguyen wrote:
>>
>> thanks for your insights.
>>
>> I am just struck by the BLEU difference between my 26 and the 30 of
>> WMT11, and some WMT14 results close to 36 or even 39.
>>
>> I am currently trying a hierarchical rule set instead of lexical
>> reordering, wondering if I will get better results, but I get a
>> "filesystem root low disk space" error message before it crashes.
>> Does this model take more disk space in some way?
>>
>> Next I will try to use more corpora, including in-domain data from my
>> internal TMX.
>>
>> thanks for your answers.
>>
>> On 04/08/2015 16:02, Hieu Hoang wrote:
>>
>>
>> On 03/08/2015 13:00, Vincent Nguyen wrote:
>>
>> Hi,
>>
>> Just a heads up on some EMS results, to get your experienced opinions.
>>
>> Corpus: Europarlv7 + NC2010
>> fr => en
>> Evaluation NC2011.
>>
>> 1) IRSTLM is much slower than KenLM for training / tuning.
>>
>> that sounds right. KenLM is also multithreaded, while IRSTLM can only
>> be used in single-threaded decoding.
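
For completeness, the binarized KenLM model used for fast loading is built
from an ARPA file with the build_binary tool that ships with Moses (a
sketch; file names are illustrative):

    ~/mosesdecoder/bin/build_binary lm.fr-en.arpa lm.fr-en.binlm
    # point the EMS [LM] section (or moses.ini) at the resulting .binlm file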
>>
>> 2) BLEU results are almost the same (25.7 with IRSTLM, 26.14 with KenLM)
>>
>> true
>>
>> 3) Compact mode is faster than on-disk in a short test (77 segments:
>> 96 seconds vs. 126 seconds)
>>
>> true
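
The compact phrase and reordering tables in that comparison are built
offline with the Moses minimal phrase-table tools (a sketch; file names
are illustrative):

    ~/mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -threads 4
    ~/mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table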
>>
>> 4) One last thing I do not understand, though:
>> As a sanity check, I replaced NC2011 with NC2010 in the evaluation (I
>> know that since NC2010 is part of the training data, this is not a
>> meaningful test). I got roughly the same BLEU score. I would have
>> expected a higher score with a test set included in the training corpus.
>>
>> makes sense?
>>
>>
>> Next steps:
>> What path should I take to get better scores? I read the 'optimize'
>> section of the website, which deals more with speed. Of course I will
>> apply all of this, but I was interested in tips for getting more
>> quality if possible.
>>
>> look into domain adaptation if you have multiple training corpora,
>> some in-domain and some out-of-domain (see the sketch below).
>>
>> Other than that, getting a good BLEU score is an open research question.
>>
>> Well done on getting this far.
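
A sketch of what the domain adaptation suggested above can look like in an
EMS config: one [CORPUS:...] section per data source, plus an interpolated
language model whose weights are tuned on the (in-domain) tuning set.
Section and file names below are illustrative:

    [CORPUS:europarl]
    raw-stem = $data-dir/europarl-v7.fr-en

    [CORPUS:tmx]
    raw-stem = $data-dir/internal-tmx.fr-en

    [INTERPOLATED-LM]
    script = $moses-script-dir/ems/support/interpolate-lm.perl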
>>
>>
>> Thanks
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150806/f3e5c782/attachment-0001.htm

------------------------------

Message: 2
Date: Thu, 6 Aug 2015 22:18:59 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Philipp Koehn <phi@jhu.edu>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <55C3C133.1000508@neuf.fr>
Content-Type: text/plain; charset="utf-8"


OK, just for info / the record:
1 million sentence pairs
=> 5 GB for the 1st job
=> 5 GB for the 2nd job (inverse)
Since I have more gigs than that it's fine; now my 8 CPUs are the limit.

thanks
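
For anyone repeating this measurement, peak memory of a single job can be
read from GNU time's report (a sketch; the corpus file name is
illustrative):

    /usr/bin/time -v fast_align -i corpus.fr-en -d -o -v > fwd.align
    # the "Maximum resident set size" line in the report is the peak RAM used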

On 06/08/2015 19:45, Philipp Koehn wrote:
> Hi,
>
> I have no hard numbers for this - just give it a try, and if it needs
> too much RAM, reduce to 100,000.
>
> -phi
>
> ...

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150806/4f87508e/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 106, Issue 19
**********************************************
