Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: decoder question (Ulrich Germann)
2. low BLEU Score (EN->DE) (Raphael Hoeps)
----------------------------------------------------------------------
Message: 1
Date: Sat, 5 Dec 2015 17:08:59 +0000
From: Ulrich Germann <ulrich.germann@gmail.com>
Subject: Re: [Moses-support] decoder question
To: Vincent Nguyen <vnguyen@neuf.fr>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAHQSRUqzaZD88oSp4oipmPq8A4wDw8A=v=3sHjWig56Srobu=A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Vincent,
most LM training pipelines add the <s> and </s> (or whatever other symbols
are used) by themselves; do not add them to the input unless the
instructions for the particular LM specifically tell you so.
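To illustrate the point, here is a minimal stdlib-only Python sketch (a hypothetical helper for illustration, not actual KenLM/SRILM code) of how a toolkit wraps each training line in boundary symbols itself; pre-adding <s>/</s> to the data would therefore produce doubled boundary tokens:

```python
from collections import Counter

def bigram_counts(lines):
    """Count bigrams the way most LM toolkits do: each training line
    is implicitly wrapped in <s> ... </s> by the toolkit itself."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts

# If the corpus already contains the symbols, they get doubled:
doubled = bigram_counts(["<s> a b </s>"])
# doubled[("<s>", "<s>")] == 1  -- a bigram that should never occur
```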
Whether sentences in the LM training data should end in punctuation or not
is a matter of corpus cleanup: The LM training data should be as similar as
possible to what you would like your translation output to be.
- Uli
On Sat, Dec 5, 2015 at 12:40 PM, Vincent Nguyen <vnguyen@neuf.fr> wrote:
>
> Tom, Uli thank you guys.
> Very clear.
>
> Now how important is it to have a "end of sentence" delimiter in the
> language model (not talking of the LF stuff) ?
> Should each line in the LM end with a "." or equivalent (exclamation
> mark, question mark, ...)?
> I saw some LM (especially for ASR) where the training text ends each
> line with a specific delimiter </s>
>
> On 05/12/2015 03:39, Tom Hoar wrote:
> > Here's another perspective. The concept of what should be translated
> > as a "sentence" during production depends on the training data and
> > tuning set that created the model. I like Ulrich's input. The period,
> > question mark, exclamation mark, etc. are just tokens. The newline
> > marker tells moses, "start translating a new job with all the tokens
> > before me."
> >
> > Let's say you train your translation model with a parallel corpus broken
> > down into paired part-of-speech phrases (noun phrases, verb phrases,
> > object phrases, etc.). Then build your language model using the target
> > half of the part-of-speech corpus. Finally, tune your SMT model using
> > this TM/LM pair and a tuning set with part-of-speech pairs. Your
> > translation production input should also be broken into those same
> > part-of-speech phrases to achieve optimal results. With such a model,
> > you will get degraded results if you translate a complete sentence or
> > a paragraph (multiple sentences).
> >
> > Here's a modified approach. Train a translation model with the same
> > part-of-speech parallel corpus. Then, use a different version of the
> > target language corpus with complete sentences (i.e. broken by sentence
> > breaks like full-stops, question marks, etc.). Next, tune your SMT model
> > with a tuning set of paired complete sentences that match the LM's
> > breaks. The tuning process optimizes performance for that type of input.
> > Therefore, your optimized translation results will mirror the LM corpus
> > and matched tuning set. You will get degraded results if you translate
> > part-of-speech phrases, multiple sentences, or complete paragraphs.
> >
> > We call these "things" models because they're supposed to be a miniature
> > representation of a larger universe. So, you'll always get the best
> > results when your production input matches the input side of your tuning
> > set.
> >
> > Re newline markers, I think Ulrich's "Mac: CR" is for the legacy Mac OS.
> > The current OS X uses Posix/Linux LF. We have not tested our
> > cross-platform updates with the older Mac CR and I suspect it will not
> > work. So I suggest using either CRLF or LF, which we have tested
> > extensively across Windows and Posix systems.
> >
> > Tom
> >
> >
> > On 12/5/2015 6:13 AM, moses-support-request@mit.edu wrote:
> >> Date: Fri, 4 Dec 2015 23:13:10 +0000
> >> From: Ulrich Germann<ulrich.germann@gmail.com>
> >> Subject: Re: [Moses-support] decoder question
> >> To: Vincent Nguyen<vnguyen@neuf.fr>
> >> Cc: moses-support<moses-support@mit.edu>
> >>
> >> Hi Vincent,
> >>
> >> as far as Moses is concerned, the end of a sentence is marked by
> >> whatever the end-of-line marker is on the respective OS (Win: CRLF,
> >> Linux: LF, Mac: CR, apparently). A period is treated as a plain old
> >> token. The purpose of the sentence splitter that Kenneth mentioned is
> >> to tell Moses what the "sentence" boundaries are.
> >>
> >> The language model has a concept of sentences beginning and ending
> >> and usually doesn't like periods anywhere except at the end of a
> >> sentence, so it'll down-vote translation hypotheses containing
> >> isolated periods.
> >>
> >> - Uli
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
--
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh
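As a footnote to the line-ending discussion above, a front end that feeds the decoder can normalize all three newline conventions before splitting input into one translation job per line. A minimal Python sketch (a hypothetical helper for illustration, not part of Moses):

```python
def split_segments(raw: bytes) -> list:
    """Normalize Windows (CRLF) and legacy-Mac (CR) line endings to LF,
    then treat each non-empty line as one translation job."""
    text = raw.decode("utf-8")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [line for line in text.split("\n") if line]

# All three conventions yield the same jobs:
split_segments(b"First sentence .\r\nSecond sentence .\rThird sentence .\n")
```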
------------------------------
Message: 2
Date: Sun, 6 Dec 2015 11:25:47 +0100
From: Raphael Hoeps <raphael.hoeps@gmx.net>
Subject: [Moses-support] low BLEU Score (EN->DE)
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <56640D2B.9060902@gmx.net>
Content-Type: text/plain; charset=utf-8; format=flowed
Hi,
two weeks ago you already helped me find a very stupid mistake in my
Moses system setup. Now everything seems to work quite well, but I am a
little surprised by my (pretty low?) BLEU score of only 10.
I stuck to this tutorial, but did it for translating English to German:
http://www.statmt.org/moses/?n=Moses.Baseline
I used the same English corpora and the corresponding German ones as is
done in the tutorial, but I cut the development corpus down from 2000 to
500 lines to speed up the process (my laptop is quite old).
So, do you think a BLEU score of 10 is realistic, or did I make a
mistake? In the tutorial the score was 24, but of course it will be
lower in an EN->DE system. Did anyone else ever set up an EN->DE
baseline system? What scores did you get?
Thanks a lot,
Raphael
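For readers wanting to sanity-check a score like this, corpus-level BLEU (clipped 4-gram precisions with a brevity penalty) can be computed with a short stdlib-only sketch. This is a simplified reimplementation for illustration, not the multi-bleu.perl script the Moses tutorial actually uses:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list, with multiplicities."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    up to max_n, multiplied by a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # clipped matches: each hypothesis n-gram counts at most
            # as often as it appears in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # some n-gram order had no match at all
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```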
------------------------------
End of Moses-support Digest, Vol 110, Issue 15
**********************************************