Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Major bug found in Moses (amittai axelrod)
2. Re: Major bug found in Moses (Lane Schwartz)
----------------------------------------------------------------------
Message: 1
Date: Wed, 17 Jun 2015 14:03:34 -0400
From: amittai axelrod <amittai@umiacs.umd.edu>
Subject: Re: [Moses-support] Major bug found in Moses
To: "Read, James C" <jcread@essex.ac.uk>, Hieu Hoang
<hieuhoang@gmail.com>, Kenneth Heafield <moses@kheafield.com>,
"moses-support@mit.edu" <moses-support@mit.edu>
Cc: "Arnold, Doug" <doug@essex.ac.uk>
Message-ID: <5581B676.10105@umiacs.umd.edu>
Content-Type: text/plain; charset=windows-1252; format=flowed
this is a little hard to follow. "naturally" dropping the LM from the
equation makes the system worse, but "surprisingly" filtering out
suboptimal phrase pairs from the search space makes the system better?
it is not clear what your intuition derives from, though your faith in
it is astonishing.
regarding expectations -- i would expect publishable results to include
a comparison against a standard baseline. the comparison might not be
fair to your proposed system -- but that's not the baseline's fault!
as you are proposing a brand new translation paradigm on the grounds
that the current state of the art is broken, it is incumbent upon
you, the proponent, to show that your method works better than the
current standard. that's how science works.
you can have another baseline in there with no LM if you like, and say
you're isolating some small part of the system, and you can decide not to
tune your hypothesis system, but all baselines have to be tuned.
~amittai
On 6/17/15 13:46, Read, James C wrote:
> Please note that in order for the baseline to be meaningful it also has to use no LM. So, naturally, the scores are lower than those of the baselines you are referring to.
>
> Regarding expectations: are you seriously suggesting that we would expect the translation model to be incapable of finding higher-scoring translations when not filtering out less likely phrase pairs? How high exactly would that rank on your list of desirable qualities for a TM?
>
> James
>
> ________________________________________
> From: amittai axelrod <amittai@umiacs.umd.edu>
> Sent: Wednesday, June 17, 2015 8:20 PM
> To: Read, James C; Hieu Hoang; Kenneth Heafield; moses-support@mit.edu
> Cc: Arnold, Doug
> Subject: Re: [Moses-support] Major bug found in Moses
>
> hi --
>
> you might not be aware, but your emails sound almost belligerently
> confrontational. i can see how you would be frustrated, but starting a
> conversation with "i have found a major bug" and then repeatedly saying
> that "clearly" everything is broken -- that may not be the best way to
> convince the few hundred people on the mailing list of the soundness of
> your approach.
>
> also, your argument could be easily mis-interpreted as "this behavior is
> unexpected to me, ergo this is unexpected behavior", and that will
> unfortunately bias the listener against you, as that is the preferred
> argument structure of conspiracy theorists.
>
> at any rate, "the system" is designed to take a large number of phrase
> pairs and model scores and cobble them together into a translation. it
> does do that. it appears that you have identified a different way of
> doing that cobbling-together, one that uses far fewer models -- so far
> so good!
>
> however, from reading your paper, it seems that your baseline is
> completely unoptimized, so performance gains against it may not show up
> in the real world. as specific examples, Table 1 in your paper shows
> that your baseline French-English system score is 11.36, Spanish-English
> is 7.16, and German-English is 6.70 BLEU. if you compare those baselines
> against published results in those languages from the previous few
> years, you will see that those scores are well off the mark. your
> position will be helped by showing results against a stronger, yet still
> basic, baseline.
>
> what happens if you compare your approach against a vanilla use of the
> Moses pipeline [this includes tuning]?
>
> cheers,
> ~amittai
>
>
>
> On 6/17/15 12:45, Read, James C wrote:
>> Doesn't look like the LM is contributing all that much then, does it?
>>
>> James
>>
>> ________________________________________
>> From: moses-support-bounces@mit.edu <moses-support-bounces@mit.edu> on behalf of Hieu Hoang <hieuhoang@gmail.com>
>> Sent: Wednesday, June 17, 2015 7:35 PM
>> To: Kenneth Heafield; moses-support@mit.edu
>> Subject: Re: [Moses-support] Major bug found in Moses
>>
>> On 17/06/2015 20:13, Kenneth Heafield wrote:
>>> I'll bite.
>>>
>>> The moses.ini files ship with bogus feature weights. One is required to
>>> tune the system to discover good weights for one's system. You did not
>>> tune. The results of an untuned system are meaningless.
>>>
>>> So for example if the feature weights are all zeros, then the scores are
>>> all zero. The system will arbitrarily pick some awful translation from
>>> a large space of translations.
>>>
>>> The filter looks at one feature, p(target | source). So now you've
>>> constrained the awful untuned model to a slightly better region of the
>>> search space.
>>>
>>> In other words, all you've done is a poor approximation to manually
>>> setting the weight to 1.0 on p(target | source) and the rest to 0.
>>>
>>> The problem isn't that you are running without a language model (though
>>> we generally do not care what happens without one). The problem is that
>>> you did not tune the feature weights.
>>>
>>> Moreover, as Marcin is pointing out, I wouldn't necessarily expect
>>> tuning to work without an LM.
>> Tuning does work without an LM. The results aren't half bad. fr-en
>> europarl (phrase-based):
>> with LM: 22.84
>> retuned without LM: 18.33
>>>
>>> On 06/17/15 11:56, Read, James C wrote:
>>>> Actually, the approximation I expect is:
>>>>
>>>> p(e|f)=p(f|e)
>>>>
>>>> Why would you expect this to give poor results if the TM is well trained? Surely the results of my filtering experiments prove otherwise.
>>>>
>>>> James
>>>>
>>>> ________________________________________
>>>> From: moses-support-bounces@mit.edu <moses-support-bounces@mit.edu> on behalf of Rico Sennrich <rico.sennrich@gmx.ch>
>>>> Sent: Wednesday, June 17, 2015 5:32 PM
>>>> To: moses-support@mit.edu
>>>> Subject: Re: [Moses-support] Major bug found in Moses
>>>>
>>>> Read, James C <jcread@...> writes:
>>>>
>>>>> I have been unable to find a logical explanation for this behaviour
>>>>> other than to conclude that there must be some kind of bug in Moses
>>>>> which causes a TM only run of Moses to perform poorly in finding the
>>>>> most likely translations according to the TM when there are less
>>>>> likely phrase pairs included in the race.
>>>> I may have overlooked something, but you seem to have removed the language
>>>> model from your config, and used default weights. your default model will
>>>> thus (roughly) implement the following model:
>>>>
>>>> p(e|f) = p(e|f)*p(f|e)
>>>>
>>>> which is obviously wrong, and will give you poor results. This is not a bug
>>>> in the code, but a poor choice of models and weights. Standard steps in SMT
>>>> (like tuning the model weights on a development set, and including a
>>>> language model) will give you the desired results.
>>>>
>>
>> --
>> Hieu Hoang
>> Researcher
>> New York University, Abu Dhabi
>> http://www.hoang.co.uk/hieu
>>
>
------------------------------
Message: 2
Date: Wed, 17 Jun 2015 13:11:35 -0500
From: Lane Schwartz <dowobeha@gmail.com>
Subject: Re: [Moses-support] Major bug found in Moses
To: "Read, James C" <jcread@essex.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, "Arnold, Doug"
<doug@essex.ac.uk>
Message-ID:
<CABv3vZ=W+UhKS=0345_X_KJbw=mmwdY2toOA94iVgrsk3jNJHw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
James,
The underlying questions that you appear to be posing are these: When the
search space is simplified by decoding without a language model, to what
extent is the decoder able to identify hypotheses that have the best model
score? Second, does filtering the phrase table in a particular way change
the answer to this question? Third, how is the BLEU score (or any other
metric) affected by these choices?
These are valid questions.
Unfortunately, as Kenneth, Amittai, and Hieu have pointed out, the
experiment that you have designed does not provide you with all of what you
need to be able to answer these questions.
Recall that we don't really deal with probabilities when decoding. Yes,
some of our features are trained as probability models. But the decoder
searches using a weighted combination of scores. Lots of them. Even the
phrase table comprises (at least) four distinct scores (phrase
translation score and lexical translation score, in both directions).
Decoding is a search problem. Specifically, it is a search through all
possible translations to attempt to identify the one with the highest score
according to this weighted combination of component scores.
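To make this concrete, here is a minimal sketch (Python, with
illustrative feature names and made-up numbers; a real system reads both
from the models and from moses.ini) of the linear model the decoder
actually ranks hypotheses by:

    import math

    # Illustrative log-space feature values for one hypothesis. The names
    # and numbers are made up for this sketch; a real Moses system gets
    # them from the phrase table, LM, and other feature functions.
    features = {
        "phrase_e_given_f": math.log(0.35),
        "phrase_f_given_e": math.log(0.40),
        "lex_e_given_f": math.log(0.20),
        "lex_f_given_e": math.log(0.25),
        "word_penalty": -3.0,
    }

    def model_score(feats, weights):
        # The decoder ranks hypotheses by this one number and nothing else.
        return sum(weights[name] * feats[name] for name in feats)

    arbitrary_weights = {name: 0.2 for name in features}  # untuned defaults
    tuned_weights = {"phrase_e_given_f": 0.30, "phrase_f_given_e": 0.10,
                     "lex_e_given_f": 0.15, "lex_f_given_e": 0.05,
                     "word_penalty": -0.10}  # the kind of thing tuning finds

    print(model_score(features, arbitrary_weights))
    print(model_score(features, tuned_weights))

Change the weights and the ranking of every hypothesis changes with
them; that is the entire lever that tuning pulls.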
There are two problems, then, that we have to deal with:
First is this. Even if all we care about is the ultimate weighted
combination of component scores, the search space is so vast (the search
problem is NP-complete) that we cannot hope to search it exhaustively in a
reasonable amount of time, even for sentences that are only of moderate
length. This means that we have to resort to pruning.
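As a rough illustration of what pruning does, here is a sketch of
histogram pruning over a single hypothesis stack (a sketch only, not the
actual Moses implementation):

    import heapq

    def prune_stack(stack, stack_size=100):
        # Histogram pruning: keep only the stack_size best hypotheses,
        # where each hypothesis is a (model_score, partial_translation)
        # pair. Everything else is discarded, so the model-best full
        # translation can be lost before it is ever completed.
        return heapq.nlargest(stack_size, stack, key=lambda hyp: hyp[0])

    stack = [(-4.2, "the house"), (-9.7, "house the"), (-5.1, "a house")]
    print(prune_stack(stack, stack_size=2))  # keeps only the two best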
Second is this. We don't really care about finding solutions that are
optimal according to the weighted combination of component scores. We care
about getting translations that are fluent and mean the same thing as the
original sentence. Since we don't know how to measure adequacy and fluency
automatically, we resort to imperfect metrics that can be calculated
automatically, like BLEU. This is fine, but it makes the search problem
(which was already intractably large) even worse.
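For reference, here is a minimal single-sentence, single-reference BLEU
sketch (no smoothing, naive whitespace tokenization; real evaluations
use corpus-level tools such as Moses' multi-bleu.perl):

    import math
    from collections import Counter

    def bleu(candidate, reference, max_n=4):
        # Geometric mean of clipped n-gram precisions, times a brevity
        # penalty that punishes hypotheses shorter than the reference.
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(tuple(cand[i:i + n])
                                  for i in range(len(cand) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            clipped = sum((cand_ngrams & ref_ngrams).values())
            precisions.append(clipped / max(sum(cand_ngrams.values()), 1))
        if min(precisions) == 0:
            return 0.0
        brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
        return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
    print(bleu("the the the the", "the cat sat on the mat"))         # 0.0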
The decoder only knows how to search by finding solutions that are good
according to the weighted combination of component scores. If we want
translations that are good according to some metric (like BLEU), then we
need to attempt to formulate the weights such that solutions that are good
according to the weighted combination of component scores are also good
according to the desired metric (BLEU).
The mechanism by which this is performed is tuning.
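Conceptually, and only conceptually (Moses actually uses line-search and
margin-based tuners such as MERT and batch MIRA, driven by
scripts/training/mert-moses.pl), tuning is an outer loop of the
following shape; decode and bleu here are assumed callables standing in
for the decoder and the metric:

    import random

    def tune(decode, bleu, feature_names, dev_source, dev_refs, trials=100):
        # Toy random-search tuning: try weight vectors, decode the dev set
        # with each, and keep whichever weights give the best metric score.
        # decode(sentences, weights) -> translations;
        # bleu(translations, refs) -> float.
        best_weights, best_score = None, float("-inf")
        for _ in range(trials):
            weights = {name: random.uniform(-1.0, 1.0)
                       for name in feature_names}
            score = bleu(decode(dev_source, weights), dev_refs)
            if score > best_score:
                best_weights, best_score = weights, score
        return best_weights, best_score

Real tuners search far more cleverly than this random probe, but the
contract is the same: weights in, dev-set metric score out, keep the
weights that score best.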
Your decoder, by necessity, is operating using pruning. As such, your
decoder is only operating in a confined region of the overall search space.
The question then is, what region of the search space would you prefer to
have your decoder operate in? If you choose not to run tuning, then you are
choosing to have your decoder operate in an arbitrary region of the search
space. If you choose to run tuning, then you are choosing to have your
decoder operate in a region of the search space that you have reason to
believe contains good translations according to your metric.
Another way to think about this is as follows. If you choose not to run
tuning, and you obtain translations that are good according to the metric
(BLEU), this is great, but it doesn't tell you much. If you obtain
translations that are bad according to the metric, this is to be expected.
What your experiments have shown is this:
The complexity of the search space is greater when you use all available
phrase pairs than it is when you pre-select only the best phrase pairs.
When you choose not to tune and not to use an LM, and then decode in the
simpler space, you get better BLEU scores than when you decode in the more
complex space.
This is not a surprising result. It is in fact the expected result.
Why is this the expected result? Two reasons.
First, because search involves pruning. If you simplify the search space
(by allowing the decoder to search using only the best phrase pairs), then
it becomes easier for the decoder to find translations that are closer to
optimal according to the weighted combination of scores, simply because the
decoder is searching through a much smaller (and higher quality) sub-region
of the search space.
Second, because by choosing not to tune, the weights with which you are
decoding are arbitrary. Not tuning effectively says, "I don't care whether
or not my decoder scores correspond with my metric scores."
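To make the first reason concrete, the kind of phrase-table filtering
under discussion amounts to something like the following sketch. It
assumes the Moses text phrase-table format (source ||| target ||| s1 s2
s3 s4 ...); which column holds p(target | source) depends on how the
table was built, so score_column is an assumption to verify against
your own table:

    from collections import defaultdict

    def filter_phrase_table(lines, top_n=20, score_column=2):
        # Keep only the top_n target phrases per source phrase, ranked by
        # one translation score. score_column=2 picks the third score,
        # conventionally p(e|f) in Moses tables, but verify for yours.
        table = defaultdict(list)
        for line in lines:
            fields = line.rstrip("\n").split(" ||| ")
            source, scores = fields[0], fields[2].split()
            table[source].append((float(scores[score_column]), line))
        for entries in table.values():
            entries.sort(key=lambda entry: entry[0], reverse=True)
            for _, line in entries[:top_n]:
                yield line

Decoding with the filtered table then searches exactly the smaller,
higher-scoring sub-region described above.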
I hope this helps. I know it can be very discouraging when papers get
rejected. It is certainly possible that there are bugs in Moses. But the
experiment that you have run does not provide any evidence of that so far.
I know it seems incredible that people could not care about a very large
BLEU point swing. But if the baseline with tuning and with an LM is (for
example) 35 BLEU, and you show that no tuning and no LM gets you 29 BLEU
with a filtered TM and 6 BLEU with an unfiltered TM, that's not necessarily
a surprising or very interesting result.
Lane
On Wed, Jun 17, 2015 at 11:24 AM, Read, James C <jcread@essex.ac.uk> wrote:
>
> Which features would you like me to tune? The whole purpose of the
> exercise was to eliminate all variables except the TM and to keep constant
> those that could not be eliminated so that I could see which types of
> phrase pairs contribute most to increases in BLEU score in a TM only setup.
>
> Now you are saying I have to tune, but tuning won't work without an LM. So
> how do you expect a researcher to be able to understand how well the TM
> component of the system is working if you are going to insist that I must
> include an LM for tuning to work?
>
> Clearly the system is broken. It is designed to work well with an LM and
> poorly without, when clearly good results can be obtained with a
> functional TM and well-chosen phrase pairs.
>
> James
>
> ________________________________________
> From: moses-support-bounces@mit.edu <moses-support-bounces@mit.edu> on
> behalf of Kenneth Heafield <moses@kheafield.com>
> Sent: Wednesday, June 17, 2015 7:13 PM
> To: moses-support@mit.edu
> Subject: Re: [Moses-support] Major bug found in Moses
>
> I'll bite.
>
> The moses.ini files ship with bogus feature weights. One is required to
> tune the system to discover good weights for one's system. You did not
> tune. The results of an untuned system are meaningless.
>
> So for example if the feature weights are all zeros, then the scores are
> all zero. The system will arbitrarily pick some awful translation from
> a large space of translations.
>
> The filter looks at one feature, p(target | source). So now you've
> constrained the awful untuned model to a slightly better region of the
> search space.
>
> In other words, all you've done is a poor approximation to manually
> setting the weight to 1.0 on p(target | source) and the rest to 0.
>
> The problem isn't that you are running without a language model (though
> we generally do not care what happens without one). The problem is that
> you did not tune the feature weights.
>
> Moreover, as Marcin is pointing out, I wouldn't necessarily expect
> tuning to work without an LM.
>
> On 06/17/15 11:56, Read, James C wrote:
> > Actually, the approximation I expect is:
> >
> > p(e|f)=p(f|e)
> >
> > Why would you expect this to give poor results if the TM is well
> > trained? Surely the results of my filtering experiments prove otherwise.
> >
> > James
> >
> > ________________________________________
> > From: moses-support-bounces@mit.edu <moses-support-bounces@mit.edu> on
> > behalf of Rico Sennrich <rico.sennrich@gmx.ch>
> > Sent: Wednesday, June 17, 2015 5:32 PM
> > To: moses-support@mit.edu
> > Subject: Re: [Moses-support] Major bug found in Moses
> >
> > Read, James C <jcread@...> writes:
> >
> >> I have been unable to find a logical explanation for this behaviour
> >> other than to conclude that there must be some kind of bug in Moses
> >> which causes a TM only run of Moses to perform poorly in finding the
> >> most likely translations according to the TM when there are less
> >> likely phrase pairs included in the race.
> > I may have overlooked something, but you seem to have removed the
> > language model from your config, and used default weights. your default
> > model will thus (roughly) implement the following model:
> >
> > p(e|f) = p(e|f)*p(f|e)
> >
> > which is obviously wrong, and will give you poor results. This is not a
> > bug in the code, but a poor choice of models and weights. Standard steps
> > in SMT (like tuning the model weights on a development set, and
> > including a language model) will give you the desired results.
> >
--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 104, Issue 38
**********************************************