Moses-support Digest, Vol 104, Issue 77

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: BLEU Score Variance: Which score to use? (Hokage Sama)
2. Re: BLEU Score Variance: Which score to use? (Hokage Sama)


----------------------------------------------------------------------

Message: 1
Date: Mon, 22 Jun 2015 17:53:32 -0500
From: Hokage Sama <nvncbol@gmail.com>
Subject: Re: [Moses-support] BLEU Score Variance: Which score to use?
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAD3ogMaW3Xrp1RZHN4CpE+6aCRKgwfJJu6qQkqcW0JEvKoK8Rw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Ok will do

On 22 June 2015 at 17:47, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:

> I don't think so. However, when you repeat those experiments, you might
> try to identify where two trainings are starting to diverge by pairwise
> comparisons of the same files between two runs. Maybe then we can deduce
> something.
>
> On 23.06.2015 00:25, Hokage Sama wrote:
>
>> Hi, I delete all the files (I think) generated during a training job
>> before rerunning the entire training. Do you think this could cause
>> variation? Here are the commands I run to delete them:
>>
>> rm ~/corpus/train.tok.en
>> rm ~/corpus/train.tok.sm
>> rm ~/corpus/train.true.en
>> rm ~/corpus/train.true.sm
>> rm ~/corpus/train.clean.en
>> rm ~/corpus/train.clean.sm
>> rm ~/corpus/truecase-model.en
>> rm ~/corpus/truecase-model.sm
>> rm ~/corpus/test.tok.en
>> rm ~/corpus/test.tok.sm
>> rm ~/corpus/test.true.en
>> rm ~/corpus/test.true.sm
>> rm -rf ~/working/filtered-test
>> rm ~/working/test.out
>> rm ~/working/test.translated.en
>> rm ~/working/training.out
>> rm -rf ~/working/train/corpus
>> rm -rf ~/working/train/giza.en-sm
>> rm -rf ~/working/train/giza.sm-en
>> rm -rf ~/working/train/model
>>
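[Editor's note: the per-file deletions above can be wrapped in a small script so the cleanup is repeatable. A sketch, assuming the thread's ~/corpus and ~/working layout; the function name is illustrative, not part of Moses:]

```shell
#!/bin/sh
# clean_moses_run CORPUS_DIR WORKING_DIR
# Removes the intermediate files from a previous Moses training run
# (the same files as the rm commands above) so the next run starts clean.
clean_moses_run() {
    corpus=$1
    working=$2
    # Tokenised, truecased, and cleaned corpora plus truecaser models,
    # for both the English (.en) and Samoan (.sm) sides.
    for stem in train.tok train.true train.clean truecase-model test.tok test.true; do
        for lang in en sm; do
            rm -f "$corpus/$stem.$lang"
        done
    done
    # Translation outputs, filtered test set, and model directories.
    rm -f "$working/test.out" "$working/test.translated.en" "$working/training.out"
    rm -rf "$working/filtered-test" \
           "$working/train/corpus" \
           "$working/train/giza.en-sm" \
           "$working/train/giza.sm-en" \
           "$working/train/model"
}

# Example: clean_moses_run "$HOME/corpus" "$HOME/working"
```

Using `rm -f` means a file that is already absent does not abort the cleanup, which matters when a previous run failed partway through.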
>> On 22 June 2015 at 03:35, Marcin Junczys-Dowmunt
>> <junczys@amu.edu.pl> wrote:
>>
>> You're welcome. Take another close look at those varying BLEU
>> scores though. That would make me worry if it happened to me for
>> the same data and the same weights.
>>
>> On 22.06.2015 10:31, Hokage Sama wrote:
>>
>> Ok thanks. Appreciate your help.
>>
>> On 22 June 2015 at 03:22, Marcin Junczys-Dowmunt
>> <junczys@amu.edu.pl> wrote:
>>
>> Difficult to tell with that little data. Once you get beyond
>> 100,000 segments (or 50,000 at least) I would say 2000 per dev
>> (for tuning) and test set, rest for training. With that few
>> segments it's hard to give you any recommendations since it might
>> just not give meaningful results. It's currently a toy model, good
>> for learning and playing around with options. But not good for
>> trying to infer anything from BLEU scores.
>>
>>
>> On 22.06.2015 10:17, Hokage Sama wrote:
>>
>> Yes, the language model was built earlier when I first went through
>> the manual to build a French-English baseline system. So I just
>> reused it for my Samoan-English system.
>> Yes, for all three runs I used the same training and testing files.
>> How can I determine how much parallel data I should set aside for
>> tuning and testing? I have only 10,028 segments (198,385 words)
>> altogether. At the moment I'm using 259 segments for testing and
>> the rest for training.
>>
>> Thanks,
>> Hilton
>>
>>
>>
>>
>>
>>
>
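[Editor's note: Marcin's rule of thumb above (once past roughly 50,000-100,000 segments: 2,000 segments for dev, 2,000 for test, the rest for training) can be applied with standard tools. A sketch, assuming sentence-aligned PREFIX.en / PREFIX.sm files; the file names and function name are illustrative:]

```shell
#!/bin/sh
# split_corpus PREFIX DEV_SIZE TEST_SIZE
# Splits PREFIX.en / PREFIX.sm (one aligned segment per line) into
# dev.*, test.*, and train.* files: the first DEV_SIZE lines become
# the dev set, the next TEST_SIZE the test set, the rest the training set.
split_corpus() {
    prefix=$1; dev=$2; tst=$3
    for lang in en sm; do
        total=$(wc -l < "$prefix.$lang")
        head -n "$dev" "$prefix.$lang" > "dev.$lang"
        head -n "$((dev + tst))" "$prefix.$lang" | tail -n "$tst" > "test.$lang"
        tail -n "$((total - dev - tst))" "$prefix.$lang" > "train.$lang"
    done
}

# Example with the sizes suggested in the thread:
# split_corpus corpus 2000 2000
```

In practice the corpus should first be shuffled (with the same permutation applied to both language sides, to keep the segments aligned) so that dev and test are representative rather than the first lines of the file.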
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150622/05bc14f6/attachment-0001.htm

------------------------------

Message: 2
Date: Mon, 22 Jun 2015 22:06:49 -0500
From: Hokage Sama <nvncbol@gmail.com>
Subject: Re: [Moses-support] BLEU Score Variance: Which score to use?
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAD3ogMYhdENiWmhYYgQfozSd1EymN1Zas64fsyQOC_viDFBumg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

OK, my scores don't vary as much when I run tokenisation, truecasing,
and cleaning only once. I found some differences beginning with the
truecased files. Here are my results now:

BLEU = 16.85, 48.7/21.0/11.7/6.7 (BP=1.000, ratio=1.089, hyp_len=3929, ref_len=3609)
BLEU = 16.82, 48.6/21.1/11.6/6.7 (BP=1.000, ratio=1.085, hyp_len=3914, ref_len=3609)
BLEU = 16.59, 48.3/20.6/11.4/6.7 (BP=1.000, ratio=1.085, hyp_len=3917, ref_len=3609)
BLEU = 16.40, 48.4/20.7/11.3/6.4 (BP=1.000, ratio=1.086, hyp_len=3920, ref_len=3609)
BLEU = 17.25, 49.2/21.6/12.0/6.9 (BP=1.000, ratio=1.090, hyp_len=3935, ref_len=3609)
BLEU = 16.78, 48.9/21.0/11.6/6.7 (BP=1.000, ratio=1.091, hyp_len=3937, ref_len=3609)
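[Editor's note: to quantify the remaining spread, the six scores above can be summarised directly with plain POSIX awk (sample mean and standard deviation):]

```shell
#!/bin/sh
# Mean and sample standard deviation of the six BLEU scores listed above.
printf '%s\n' 16.85 16.82 16.59 16.40 17.25 16.78 |
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
         mean = sum / n
         sd = sqrt((sumsq - sum * sum / n) / (n - 1))
         printf "mean=%.2f sd=%.2f\n", mean, sd
     }'
# prints: mean=16.78 sd=0.29
```

A spread of about 0.3 BLEU on a 259-segment test set is small, which is consistent with the variance disappearing once tokenisation, truecasing, and cleaning are run only once.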

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150622/4588fc31/attachment.htm

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 104, Issue 77
**********************************************
