Moses-support Digest, Vol 104, Issue 60

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Major bug found in Moses (Rico Sennrich)
2. Re: Major bug found in Moses (Rico Sennrich)
3. Re: Major bug found in Moses (Marcin Junczys-Dowmunt)
4. Re: Major bug found in Moses (Rico Sennrich)
5. Re: Major bug found in Moses (Marcin Junczys-Dowmunt)
6. Re: Dependencies in EMS/Experiment.perl (Matthias Huck)


----------------------------------------------------------------------

Message: 1
Date: Fri, 19 Jun 2015 17:22:08 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] Major bug found in Moses
To: moses-support@mit.edu
Message-ID: <loom.20150619T185936-963@post.gmane.org>
Content-Type: text/plain; charset=utf-8

Read, James C <jcread@...> writes:

> So, all I did was filter out the less likely phrase pairs and the BLEU
> score shot up. Was that such a stroke of genius? Was that not blindingly
> obvious??

You are right. The idea is pretty obvious. It roughly corresponds to
'Histogram pruning' in this paper:

Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase
Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL), pp. 972-983.

The idea has been described in the literature before that (for instance,
Johnson et al. (2007) only use the top 30 phrase pairs per source phrase),
and may have been used in practice for even longer. If you read the paper
above, you will find that histogram pruning does not improve translation
quality on a state-of-the-art SMT system, and performs poorly compared to
more advanced pruning techniques.



------------------------------

Message: 2
Date: Fri, 19 Jun 2015 18:25:51 +0100
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] Major bug found in Moses
To: moses-support@mit.edu
Message-ID: <5584509F.9050804@gmx.ch>
Content-Type: text/plain; charset=windows-1252; format=flowed

[sorry for the garbled message before]

You are right. The idea is pretty obvious. It roughly corresponds to
'Histogram pruning' in this paper:

Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase
Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL), pp. 972-983.

The idea has been described in the literature before that (for instance,
Johnson et al. (2007) only use the top 30 phrase pairs per source
phrase), and may have been used in practice for even longer. If you read
the paper above, you will find that histogram pruning does not improve
translation quality on a state-of-the-art SMT system, and performs
poorly compared to more advanced pruning techniques.
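
For illustration, here is roughly what histogram pruning boils down to as
a script. A minimal sketch in Python, not the actual Moses implementation:
it assumes a plain-text phrase table sorted by source phrase (as Moses
produces) and that the direct phrase probability p(t|s) is the third of
the four default scores - adjust SCORE_INDEX if your table differs.

    import sys
    from itertools import groupby

    K = 30           # keep at most K phrase pairs per source phrase
    SCORE_INDEX = 2  # 0-based index of p(t|s) among the scores

    def source(line):
        return line.split(' ||| ', 1)[0]

    def score(line):
        return float(line.split(' ||| ')[2].split()[SCORE_INDEX])

    # the table is sorted by source phrase, so groupby sees each
    # source phrase as one contiguous group
    for src, pairs in groupby(sys.stdin, key=source):
        for line in sorted(pairs, key=score, reverse=True)[:K]:
            sys.stdout.write(line)

Run it as, e.g., zcat phrase-table.gz | python prune.py | gzip >
phrase-table.pruned.gz.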

On 19.06.2015 17:49, Read, James C. wrote:
> So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
>
>



------------------------------

Message: 3
Date: Fri, 19 Jun 2015 19:28:25 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Major bug found in Moses
To: moses-support@mit.edu
Message-ID: <55845139.2040208@amu.edu.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Rico,
while you are at it, some pointers to the more advanced pruning
techniques that do perform better, please :)

On 19.06.2015 19:25, Rico Sennrich wrote:
> [sorry for the garbled message before]
>
> You are right. The idea is pretty obvious. It roughly corresponds to
> 'Histogram pruning' in this paper:
>
> Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase
> Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on
> Empirical Methods in Natural Language Processing and Computational
> Natural Language Learning (EMNLP-CoNLL), pp. 972-983.
>
> The idea has been described in the literature before that (for instance,
> Johnson et al. (2007) only use the top 30 phrase pairs per source
> phrase), and may have been used in practice for even longer. If you read
> the paper above, you will find that histogram pruning does not improve
> translation quality on a state-of-the-art SMT system, and performs
> poorly compared to more advanced pruning techniques.
>
> On 19.06.2015 17:49, Read, James C. wrote:
>> So, all I did was filter out the less likely phrase pairs and the BLEU score shot up. Was that such a stroke of genius? Was that not blindingly obvious?
>>
>>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



------------------------------

Message: 4
Date: Fri, 19 Jun 2015 17:35:51 +0000 (UTC)
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] Major bug found in Moses
To: moses-support@mit.edu
Message-ID: <loom.20150619T192942-209@post.gmane.org>
Content-Type: text/plain; charset=us-ascii

Marcin Junczys-Dowmunt <junczys@...> writes:

>
> Hi Rico,
> since you are at it, some pointers to the more advanced pruning
> techniques that do perform better, please :)
>
> On 19.06.2015 19:25, Rico Sennrich wrote:
> > [sorry for the garbled message before]
> >
> > You are right. The idea is pretty obvious. It roughly corresponds to
> > 'Histogram pruning' in this paper:
> >
> > Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase
> > Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on
> > Empirical Methods in Natural Language Processing and Computational
> > Natural Language Learning (EMNLP-CoNLL), pp. 972-983.
> >
> > The idea has been described in the literature before that (for instance,
> > Johnson et al. (2007) only use the top 30 phrase pairs per source
> > phrase), and may have been used in practice for even longer. If you read
> > the paper above, you will find that histogram pruning does not improve
> > translation quality on a state-of-the-art SMT system, and performs
> > poorly compared to more advanced pruning techniques.


The Zens et al. (2012) paper has a nice overview. Significance pruning
and relative entropy pruning are both effective - you are not guaranteed
improvements over the unpruned system (although Johnson et al. (2007)
do report improvements), but both allow you to reduce the size of your
models substantially with little loss in quality.
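
In case it helps, significance pruning tests, for each phrase pair,
whether source and target phrase co-occur more often than chance would
predict, using Fisher's exact test on a 2x2 contingency table. A minimal
sketch in Python (the real tool lives in the Moses contrib directory as
sigtest-filter, if I remember correctly; the counts here are assumed to
be extracted from the training corpus):

    import math
    from scipy.stats import fisher_exact

    def neg_log_pvalue(c_st, c_s, c_t, n):
        # c_st = joint count of the phrase pair, c_s and c_t the
        # marginal counts, n = number of sentence pairs in the corpus
        table = [[c_st,       c_s - c_st],
                 [c_t - c_st, n - c_s - c_t + c_st]]
        _, p = fisher_exact(table, alternative='greater')
        return -math.log(p) if p > 0 else float('inf')

    def keep(c_st, c_s, c_t, n, epsilon=1e-6):
        # a phrase pair with counts (1,1,1) scores about log(n), so
        # the 'alpha + epsilon' threshold also discards such singletons
        return neg_log_pvalue(c_st, c_s, c_t, n) > math.log(n) + epsilon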



------------------------------

Message: 5
Date: Fri, 19 Jun 2015 19:38:33 +0200
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Major bug found in Moses
To: moses-support@mit.edu
Message-ID: <55845399.3090104@amu.edu.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Ah, OK, I misunderstood. I thought you were talking about pruning
techniques more advanced than the significance method from Johnson et
al., while you were only referring to the 30-best variant.
Cheers,
Marcin

On 19.06.2015 19:35, Rico Sennrich wrote:
> Marcin Junczys-Dowmunt <junczys@...> writes:
>
>> Hi Rico,
>> since you are at it, some pointers to the more advanced pruning
>> techniques that do perform better, please :)
>>
>> On 19.06.2015 19:25, Rico Sennrich wrote:
>>> [sorry for the garbled message before]
>>>
>>> You are right. The idea is pretty obvious. It roughly corresponds to
>>> 'Histogram pruning' in this paper:
>>>
>>> Zens, R., Stanton, D., Xu, P. (2012). A Systematic Comparison of Phrase
>>> Table Pruning Techniques. In Proceedings of the 2012 Joint Conference on
>>> Empirical Methods in Natural Language Processing and Computational
>>> Natural Language Learning (EMNLP-CoNLL), pp. 972-983.
>>>
>>> The idea has been described in the literature before that (for instance,
>>> Johnson et al. (2007) only use the top 30 phrase pairs per source
>>> phrase), and may have been used in practice for even longer. If you read
>>> the paper above, you will find that histogram pruning does not improve
>>> translation quality on a state-of-the-art SMT system, and performs
>>> poorly compared to more advanced pruning techniques.
>
> The Zens et al. (2012) paper has a nice overview. Significance pruning
> and relative entropy pruning are both effective - you are not guaranteed
> improvements over the unpruned system (although Johnson et al. (2007)
> do report improvements), but both allow you to reduce the size of your
> models substantially with little loss in quality.
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



------------------------------

Message: 6
Date: Fri, 19 Jun 2015 19:42:52 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Dependencies in EMS/Experiment.perl
To: Evgeny Matusov <ematusov@apptek.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <1434739372.30904.1184.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"

Hi Evgeny,

If setting TRAINING:config doesn't help, then it might get a bit tricky.
Another thing you can try is setting filtered-config or filtered-dir in
the [TUNING] section.

The next workaround I can think of is pointing to existing files in all
the [CORPUS:*] sections by setting tokenized-stem, clean-stem,
truecased-stem ...

Similarly, in the [LM:*] sections you can set tokenized-corpus and
truecased-corpus etc., if defining lm and/or binlm doesn't make it skip
those steps.
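
Concretely, such a setup might look like the following snippet (the
corpus and LM names and all paths are just placeholders for whatever
your previous run produced):

    [TUNING]
    filtered-dir = /path/to/previous-run/tuning/filtered.1

    [CORPUS:my-corpus]
    tokenized-stem = /path/to/previous-run/training/corpus.tok.1
    clean-stem = /path/to/previous-run/training/corpus.clean.1
    truecased-stem = /path/to/previous-run/training/corpus.truecased.1

    [LM:my-lm]
    tokenized-corpus = /path/to/previous-run/lm/corpus.tok.1
    truecased-corpus = /path/to/previous-run/lm/corpus.truecased.1
    binlm = /path/to/previous-run/lm/corpus.binlm.1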

Cheers,
Matthias


On Fri, 2015-06-19 at 16:41 +0000, Evgeny Matusov wrote:
> Hi,
>
>
> to those of you using Experiment.perl for experiments, maybe you can
> help me solve the following problem:
>
>
> I added a step to filter full segment overlap of evaluation and tuning
> data with the training data. This step removes all sentences from
> each CORPUS which are also found in EVALUATION and TUNING sentences.
> Thus, one of the CORPUS steps depends on EVALUATION and TUNING.
>
>
> Now, I want to exchange the tuning corpus I am using, picking another
> one which was already declared in the EVALUATION section. Thus, the
> filter against which the overlap is checked does not change, hence the
> training data does not need to be filtered again, and therefore neither
> the alignment training nor the LM training nor anything else should be
> repeated; just the tuning step should re-start. However, Experiment.perl
> is not smart enough to realize this. I tried to add a "pass-if" or
> "ignore-if" condition on the filter-overlap step that I declared and
> set a variable to pass it, but this did not help - all steps after it
> are still executed. Setting TRAINING:config to a valid moses.ini file
> helps to prevent the alignment training from running, but not the LM
> training, nor (more importantly) the several cleaning/lowercasing steps
> that follow the overlap step for each training corpus.
>
>
> Is there an easy way to block everything below tuning from being
> repeated, even if the tuning data changes?
>
>
> Thanks,
>
> Evgeny.
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 104, Issue 60
**********************************************
