Moses-support Digest, Vol 106, Issue 18

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: EMS results - makes sense ? (Vincent Nguyen)
2. Re: EMS results - makes sense ? (Vincent Nguyen)


----------------------------------------------------------------------

Message: 1
Date: Thu, 6 Aug 2015 18:54:05 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Philipp Koehn <phi@jhu.edu>, Barry Haddow <bhaddow@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <55C3912D.5040409@neuf.fr>
Content-Type: text/plain; charset="utf-8"


I am getting a lot of
"unicode non character u-fdd3 is illegal for open interchange"
warnings on STDERR from tokenize / clean with the Giga data set.
Should I just ignore these? Are the affected lines skipped?
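If these lines do turn out to matter, I suppose one option would be to strip
the noncharacters (U+FDD0-U+FDEF) from the raw corpus before tokenizing,
something like this (file names are just placeholders):

    perl -CSD -pe 's/[\x{FDD0}-\x{FDEF}\x{FFFE}\x{FFFF}]//g' < giga.fr > giga.clean.fr

(-CSD makes perl treat stdin/stdout as UTF-8; the character class covers the
noncharacter block plus U+FFFE/U+FFFF.)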

On 06/08/2015 18:00, Philipp Koehn wrote:
> Hi,
>
> if you run into memory problems with fast align, you can
> add the following in the [TRAINING] section:
>
> fast-align-max-lines = 1000000
>
> This will run fast-align in parts of 1 million sentence pairs.
>
> -phi
>
>
> On Thu, Aug 6, 2015 at 7:28 AM, Barry Haddow <bhaddow@inf.ed.ac.uk
> <mailto:bhaddow@inf.ed.ac.uk>> wrote:
>
> Hi Vincent
>
> It's a SIGKILL. Probably means it ran out of memory.
>
> I'd recommend fast_align for this data set. Even if you manage to
> get it running with mgiza it will still take a week or so.
>
> Just add
> fast-align-settings = "-d -o -v"
> to the TRAINING section of ems, and make sure that fast_align is
> in your external-bin-dir.
>
> cheers - Barry
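Putting Barry's and Philipp's suggestions together, the relevant EMS settings
would look roughly like this (the two fast-align keys are the ones given in
this thread; the external-bin-dir comment just restates Barry's point, and
where that setting lives depends on your existing config):

    # in the [TRAINING] section of the EMS config
    fast-align-settings = "-d -o -v"
    fast-align-max-lines = 1000000
    # fast_align itself must be in the directory that external-bin-dir points to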
>
>
> On 06/08/15 08:40, Vincent Nguyen wrote:
>>
>> so I dropped my hierarchical model since I got an error.
>> I switched back to the "more data" approach by adding the Giga FR-EN source,
>> but now another error pops up when running Giza Inverse:
>>
>> Using SCRIPTS_ROOTDIR: /home/moses/mosesdecoder/scripts
>> Using multi-thread GIZA
>> using gzip
>> (2) running giza @ Wed Aug 5 21:03:56 CEST 2015
>> (2.1a) running snt2cooc fr-en @ Wed Aug 5 21:03:56 CEST 2015
>> Executing: mkdir -p /home/moses/working/training/giza-inverse.7
>> Executing:
>> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> line 1000
>> line 2000
>>
>> ...
>> line 6609000
>> line 6610000
>> ERROR: Execution of:
>> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> died with signal 9, without coredump
>>
>>
>> any clue what signal 9 means ?
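Signal 9 is SIGKILL, which on Linux most often means the kernel OOM killer
(as Barry says above). One way to confirm that memory was the cause, on the
machine that ran the job, might be:

    dmesg -T | grep -iE 'out of memory|killed process'

(dmesg may require root on some distributions.)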
>>
>>
>>
>> On 04/08/2015 17:28, Barry Haddow wrote:
>>> Hi Vincent
>>>
>>> If you are comparing to the results of WMT11, then you can look
>>> at the system descriptions to see what the authors did. In fact
>>> it's worth looking at the WMT14 descriptions (WMT15 will be
>>> available next month) to see how state-of-the-art systems are
>>> built.
>>>
>>> For fr-en or en-fr, the first thing to look at is the data.
>>> There are some large data sets released for WMT and you can get
>>> a good gain from just crunching more data (monolingual and
>>> parallel). Unfortunately this takes more resources (disk, cpu
>>> etc) so you may run into trouble here.
>>>
>>> The hierarchical models are much bigger so yes you will need
>>> more disk. For fr-en/en-fr it's probably not worth the extra
>>> effort,
>>>
>>> cheers - Barry
>>>
>>> On 04/08/15 15:58, Vincent Nguyen wrote:
>>>> thanks for your insights.
>>>>
>>>> I am just struck by the BLEU difference between my 26 and the 30 of
>>>> WMT11, and some results of WMT14 close to 36 or even 39.
>>>>
>>>> I am currently having trouble with a hierarchical rule set instead of
>>>> lexical reordering: I am wondering if I will get better results, but I
>>>> get a "filesystem root low disk space" error message before it crashes.
>>>> Does this model take more disk space in some way?
>>>>
>>>> I will next try to use more corpora, including in-domain data from my
>>>> internal TMX.
>>>>
>>>> thanks for your answers.
>>>>
>>>> On 04/08/2015 16:02, Hieu Hoang wrote:
>>>>>
>>>>> On 03/08/2015 13:00, Vincent Nguyen wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just a heads up on some EMS results, to get your experienced
>>>>>> opinions.
>>>>>>
>>>>>> Corpus: Europarlv7 + NC2010
>>>>>> fr => en
>>>>>> Evaluation NC2011.
>>>>>>
>>>>>> 1) IRSTLM is much slower than KenLM for training / tuning.
>>>>> that sounds right. KenLM is also multithreaded; IRSTLM can only be
>>>>> used in single-threaded decoding.
>>>>>> 2) BLEU results are almost the same (25.7 with Irstlm, 26.14
>>>>>> with KenLM)
>>>>> true
>>>>>> 3) Compact Mode is faster than onDisk in a short test (77 segments:
>>>>>> 96 seconds vs 126 seconds)
>>>>> true
>>>>>> 4) One last thing I do not understand though:
>>>>>> As a sanity check, I replaced NC2011 by NC2010 in the evaluation
>>>>>> (I know this is not a meaningful test, since NC2010 is part of the
>>>>>> training data). I got roughly the same BLEU score. I would have
>>>>>> expected a higher score with a test set included in the training
>>>>>> corpus.
>>>>>>
>>>>>> makes sense ?
>>>>>>
>>>>>>
>>>>>> Next steps:
>>>>>> What path should I take to get better scores? I read the 'optimize'
>>>>>> section of the website, which deals mostly with speed; of course I
>>>>>> will apply all of that, but I was interested in tips to get more
>>>>>> quality if possible.
>>>>> look into domain adaptation if you have multiple training corpora,
>>>>> some of which are in-domain and some out-of-domain.
>>>>>
>>>>> Other than that, getting a good BLEU score is an open research
>>>>> question.
>>>>>
>>>>> Well done on getting this far.
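For the "more corpora, some of it in-domain" route (the internal TMX mentioned
above), EMS lets each corpus be declared in its own section, and the per-corpus
language models can then be interpolated with weights tuned on an in-domain
set. A rough sketch with purely illustrative names and paths:

    [CORPUS:europarl]
    raw-stem = $data-dir/europarl-v7.fr-en

    [CORPUS:tmx]
    raw-stem = $data-dir/internal-tmx-export.fr-en

    [LM:tmx]
    raw-corpus = $data-dir/internal-tmx-export.fr-en.en

The $data-dir variable and the file names are assumptions; the
[CORPUS:...] / [LM:...] sections with raw-stem / raw-corpus follow the
standard EMS example configs.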
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
>>>
>>
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150806/463849a4/attachment-0001.htm

------------------------------

Message: 2
Date: Thu, 6 Aug 2015 19:17:00 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] EMS results - makes sense ?
To: Philipp Koehn <phi@jhu.edu>, Barry Haddow <bhaddow@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <55C3968C.5070706@neuf.fr>
Content-Type: text/plain; charset="utf-8"


Is there a rough estimate of how many GB of RAM are needed per batch of
1 million sentence pairs?
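Failing a rule of thumb, I suppose I could measure it on a sample myself,
something like this (placeholder file names, and assuming GNU time is
installed):

    /usr/bin/time -v fast_align -i sample.1M.fr-en -d -o -v > sample.align 2> fa.stderr
    grep "Maximum resident set size" fa.stderr

where sample.1M.fr-en would hold 1 million lines in the usual
"source ||| target" format.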

On 06/08/2015 18:00, Philipp Koehn wrote:
> Hi,
>
> if you run into memory problems with fast align, you can
> add the following in the [TRAINING] section:
>
> fast-align-max-lines = 1000000
>
> This will run fast-align in parts of 1 million sentence pairs.
>
> -phi
>
>
> On Thu, Aug 6, 2015 at 7:28 AM, Barry Haddow <bhaddow@inf.ed.ac.uk
> <mailto:bhaddow@inf.ed.ac.uk>> wrote:
>
> Hi Vincent
>
> It's a SIGKILL. Probably means it ran out of memory.
>
> I'd recommend fast_align for this data set. Even if you manage to
> get it running with mgiza it will still take a week or so.
>
> Just add
> fast-align-settings = "-d -o -v"
> to the TRAINING section of ems, and make sure that fast_align is
> in your external-bin-dir.
>
> cheers - Barry
>
>
> On 06/08/15 08:40, Vincent Nguyen wrote:
>>
>> so I dropped my hierarchical model since I got an error.
>> I switched back to the "more data" approach by adding the Giga FR-EN source,
>> but now another error pops up when running Giza Inverse:
>>
>> Using SCRIPTS_ROOTDIR: /home/moses/mosesdecoder/scripts
>> Using multi-thread GIZA
>> using gzip
>> (2) running giza @ Wed Aug 5 21:03:56 CEST 2015
>> (2.1a) running snt2cooc fr-en @ Wed Aug 5 21:03:56 CEST 2015
>> Executing: mkdir -p /home/moses/working/training/giza-inverse.7
>> Executing:
>> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> line 1000
>> line 2000
>>
>> ...
>> line 6609000
>> line 6610000
>> ERROR: Execution of:
>> /home/moses/working/bin/training-tools/mgizapp/snt2cooc
>> /home/moses/working/training/giza-inverse.7/fr-en.cooc
>> /home/moses/working/training/prepared.7/en.vcb
>> /home/moses/working/training/prepared.7/fr.vcb
>> /home/moses/working/training/prepared.7/fr-en-int-train.snt
>> died with signal 9, without coredump
>>
>>
>> any clue what signal 9 means ?
>>
>>
>>
>> On 04/08/2015 17:28, Barry Haddow wrote:
>>> Hi Vincent
>>>
>>> If you are comparing to the results of WMT11, then you can look
>>> at the system descriptions to see what the authors did. In fact
>>> it's worth looking at the WMT14 descriptions (WMT15 will be
>>> available next month) to see how state-of-the-art systems are
>>> built.
>>>
>>> For fr-en or en-fr, the first thing to look at is the data.
>>> There are some large data sets released for WMT and you can get
>>> a good gain from just crunching more data (monolingual and
>>> parallel). Unfortunately this takes more resources (disk, cpu
>>> etc) so you may run into trouble here.
>>>
>>> The hierarchical models are much bigger so yes you will need
>>> more disk. For fr-en/en-fr it's probably not worth the extra
>>> effort,
>>>
>>> cheers - Barry
>>>
>>> On 04/08/15 15:58, Vincent Nguyen wrote:
>>>> thanks for your insights.
>>>>
>>>> I am just struck by the BLEU difference between my 26 and the 30 of
>>>> WMT11, and some results of WMT14 close to 36 or even 39.
>>>>
>>>> I am currently having trouble with a hierarchical rule set instead of
>>>> lexical reordering: I am wondering if I will get better results, but I
>>>> get a "filesystem root low disk space" error message before it crashes.
>>>> Does this model take more disk space in some way?
>>>>
>>>> I will next try to use more corpora, including in-domain data from my
>>>> internal TMX.
>>>>
>>>> thanks for your answers.
>>>>
>>>> On 04/08/2015 16:02, Hieu Hoang wrote:
>>>>>
>>>>> On 03/08/2015 13:00, Vincent Nguyen wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just a heads up on some EMS results, to get your experienced
>>>>>> opinions.
>>>>>>
>>>>>> Corpus: Europarlv7 + NC2010
>>>>>> fr => en
>>>>>> Evaluation NC2011.
>>>>>>
>>>>>> 1) IRSTLM is much slower than KenLM for training / tuning.
>>>>> that sounds right. KenLM is also multithreaded; IRSTLM can only be
>>>>> used in single-threaded decoding.
>>>>>> 2) BLEU results are almost the same (25.7 with Irstlm, 26.14
>>>>>> with KenLM)
>>>>> true
>>>>>> 3) Compact Mode is faster than onDisk in a short test (77 segments:
>>>>>> 96 seconds vs 126 seconds)
>>>>> true
>>>>>> 4) One last thing I do not understand though:
>>>>>> As a sanity check, I replaced NC2011 by NC2010 in the evaluation
>>>>>> (I know this is not a meaningful test, since NC2010 is part of the
>>>>>> training data). I got roughly the same BLEU score. I would have
>>>>>> expected a higher score with a test set included in the training
>>>>>> corpus.
>>>>>>
>>>>>> makes sense ?
>>>>>>
>>>>>>
>>>>>> Next steps:
>>>>>> What path should I take to get better scores? I read the 'optimize'
>>>>>> section of the website, which deals mostly with speed; of course I
>>>>>> will apply all of that, but I was interested in tips to get more
>>>>>> quality if possible.
>>>>> look into domain adaptation if you have multiple training corpora,
>>>>> some of which are in-domain and some out-of-domain.
>>>>>
>>>>> Other than that, getting a good BLEU score is an open research
>>>>> question.
>>>>>
>>>>> Well done on getting this far.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
>>>
>>
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150806/b06f18aa/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 106, Issue 18
**********************************************
