Moses-support Digest, Vol 110, Issue 3

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Best way to mark unknowns in nbest-list (Jeremy Gwinnup)
2. Re: Training script documentation (Philipp Koehn)
3. Re: Best way to mark unknowns in nbest-list (Jeremy Gwinnup)
4. Re: Best way to mark unknowns in nbest-list (Ulrich Germann)


----------------------------------------------------------------------

Message: 1
Date: Wed, 2 Dec 2015 12:47:13 -0500
From: Jeremy Gwinnup <jeremy@gwinnup.org>
Subject: Re: [Moses-support] Best way to mark unknowns in nbest-list
To: ugermann@inf.ed.ac.uk
Cc: moses-support@mit.edu
Message-ID: <BB2D4D75-850A-4B23-96AD-DA096FD61BA7@gwinnup.org>
Content-Type: text/plain; charset="utf-8"

Uli,

--mark-unknown will apply the specified prefix or suffix to unknowns in the final output, but it won't output these markers in the nbest lists. Doing the same for nbest lists shouldn't be hard, I just need to find the right place so I don't have to replicate this code for the different decoding algorithms.

Thanks!
-Jeremy
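
One way to check whether the markers actually reach an nbest list is to scan the hypothesis field of each entry. The following is a minimal sketch, assuming the usual Moses nbest layout of fields separated by "|||" with the hypothesis in the second field. The function name `count_marked_hypotheses` and the sample lines are hypothetical and only for illustration; the default prefix and suffix are the XML-style values suggested in the quoted message below.

```python
def count_marked_hypotheses(nbest_lines, prefix="<unk>", suffix="</unk>"):
    """Return (total, marked): entries seen, and entries with a marked token.

    Assumes each entry looks like: id ||| hypothesis ||| features ||| score
    """
    total = 0
    marked = 0
    min_len = len(prefix) + len(suffix)
    for line in nbest_lines:
        fields = line.split("|||")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total += 1
        tokens = fields[1].split()
        if any(
            len(t) >= min_len and t.startswith(prefix) and t.endswith(suffix)
            for t in tokens
        ):
            marked += 1
    return total, marked


if __name__ == "__main__":
    # Made-up entries, not real decoder output.
    sample = [
        "0 ||| das ist <unk>foo</unk> ||| LM0= -1.0 ||| -1.0\n",
        "0 ||| das ist foo ||| LM0= -2.0 ||| -2.0\n",
    ]
    print(count_marked_hypotheses(sample))
```

If the second number stays at zero while the final output contains markers, the nbest path is not applying them.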


> On Dec 2, 2015, at 12:39 PM, Ulrich Germann <ulrich.germann@gmail.com> wrote:
>
> Have you tried specifying
>
> --mark-unknown
>
> on the command line? This will (i.e. should ;-)) prefix unknown words in the output with UNK
>
> you can set begin and end label with --unknown-word-prefix and --unknown-word-suffix.
>
> For example
>
> --unknown-word-prefix '<unk>'
> --unknown-word-suffix '</unk>'
>
> would give you XML-style markup.
>
> - Uli
>
> On Wed, Dec 2, 2015 at 4:36 PM, Jeremy Gwinnup <jeremy@gwinnup.org> wrote:
> Hi,
>
> I'd like to be able to mark unknown words in nbest lists - where is a good place to dig into the code so that it works with both phrase-based and chart decoding?
>
> Thanks!
> -Jeremy
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
> --
> Ulrich Germann
> Senior Researcher
> School of Informatics
> University of Edinburgh


------------------------------

Message: 2
Date: Wed, 2 Dec 2015 13:31:30 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Training script documentation
To: "Read, James C" <jcread@essex.ac.uk>
Cc: Moses Support <moses-support@mit.edu>
Message-ID:
<CAAFADDAaYSt601NjOKZe9qQN355Yx=6r6koYuBwANqsObDWKdA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

the script expects tokenized data, and word alignment will fail if
sentences are too long or if the two sides of a sentence pair differ
greatly in length (e.g., a 1-word sentence translated as a 70-word
sentence). Filtering out such pairs is what the cleaning script does. It
also removes spurious spaces, which may throw some processing steps off.
Also, the provided tokenizer deals with special characters like "|". If
you do not use this tokenizer, you should run
scripts/tokenizer/escape-special-chars.perl to escape them.

Truecasing is optional. Many do lowercasing.

It does not matter to the training script how you prepare the data, so you
do not have to explicitly run these steps. You may already have tokenized
data, so no need to run the tokenizer.

Whatever you specify with "-corpus" (full path!) should work, as long as
the issues spelled out in the first paragraph above are addressed.

-phi
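
The checks described above can be sketched as follows. This is a rough illustration, not the cleaning script shipped with Moses: the function names and the limits `min_len`, `max_len` and `max_ratio` are assumptions chosen for the example, and special-character escaping is left to escape-special-chars.perl.

```python
def clean_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return the whitespace-normalized (src, tgt) pair, or None to drop it.

    The limits are illustrative defaults, not values taken from Moses.
    """
    src_tokens = src.split()  # split() also discards spurious spaces
    tgt_tokens = tgt.split()
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)

    # Drop pairs with an empty or overly long side.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return None

    # Drop pairs with a length mismatch, e.g. 1 word vs. 70 words.
    if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
        return None

    return " ".join(src_tokens), " ".join(tgt_tokens)


def clean_corpus(pairs, **limits):
    """Apply clean_pair to every (src, tgt) pair and keep the survivors."""
    cleaned = (clean_pair(s, t, **limits) for s, t in pairs)
    return [pair for pair in cleaned if pair is not None]
```

Both sides of a pair are kept or dropped together, so the two output files stay line-aligned, which is what the word aligner needs.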

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <jcread@essex.ac.uk> wrote:

> In the past I've never been able to get the training script to run to
> completion without rigorously following the instructions here
> http://www.statmt.org/moses/?n=moses.baseline
>
>
> 1) Tokenise
>
> 2) Train truecaser
>
> 3) Truecase
>
> 4) Clean
>
>
> What if somebody wants to just tokenize and clean without truecasing or
> just clean without tokenizing? Why should the script bomb out? Is this
> something to do with formats required by early stages of the training
> process?
>
>
> James
>
>
> NOTE: This is not an open invitation to discuss why somebody would want to
> train models without tokenizing or truecasing. This is nothing more than a
> request for technical assistance.
>

------------------------------

Message: 3
Date: Wed, 2 Dec 2015 13:39:55 -0500
From: Jeremy Gwinnup <jeremy@gwinnup.org>
Subject: Re: [Moses-support] Best way to mark unknowns in nbest-list
To: ugermann@inf.ed.ac.uk
Cc: moses-support@mit.edu
Message-ID: <EE03BC49-94E8-45FE-8A1B-7B56E5738AAA@gwinnup.org>
Content-Type: text/plain; charset=utf-8

Uli,

I just tried the suggested settings on a couple systems:

search algo :
1 (cube pruning) - yields UNK in both final output and nbest
3 (CYK+) - yields UNK in final output but not nbest
5 (inc-search) - yields UNK in final output but not nbest

I'm guessing since factors aren't supported in the chart-based algorithms the bug you mention still happens?

Thanks again for the help!
-Jeremy


> On Dec 2, 2015, at 1:08 PM, Ulrich Germann <ulrich.germann@gmail.com> wrote:
>
> Hi Jeremy,
>
> looks like a bug to me at line 1755 in moses/Manager.cpp, when reportAllFactors is true.
>
> Theoretically, one could have the streaming operator for phrases check if mark-unknown is set and use the markup, but I'm trying to eliminate the dependence on global variables in the decoder, so I'm more inclined to have / use a toString() function that gets the requested behaviour as a function parameter.
>
> For the time being, a quick hack around the issue might be to set --report-all-factors to false and set --output-factors to the factors you want in the output. (If --output-factors is not set, Moses should default to just printing the first factor, but I haven't tried that out).
>
> - Uli
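
The design idea in the quoted message, passing the requested markup behaviour into a string-conversion function instead of reading it from global decoder state, might look roughly like this. It is a sketch in Python rather than the decoder's C++, and `UnknownMarkup`, `Word` and `phrase_to_string` are hypothetical names, not part of the Moses code base. The default prefix "UNK" follows the behaviour described earlier in the thread.

```python
from dataclasses import dataclass
from typing import List, NamedTuple


@dataclass(frozen=True)
class UnknownMarkup:
    """Markup settings handed to the formatter explicitly (hypothetical)."""
    enabled: bool = False
    prefix: str = "UNK"
    suffix: str = ""


class Word(NamedTuple):
    surface: str
    is_unknown: bool


def phrase_to_string(words: List[Word], markup: UnknownMarkup) -> str:
    """Render a phrase, wrapping unknown words if markup is enabled.

    No global option is consulted: every caller (final output, nbest
    list, any search algorithm) passes the same settings object.
    """
    out = []
    for word in words:
        if markup.enabled and word.is_unknown:
            out.append(f"{markup.prefix}{word.surface}{markup.suffix}")
        else:
            out.append(word.surface)
    return " ".join(out)


if __name__ == "__main__":
    phrase = [Word("das", False), Word("Haus", True)]
    print(phrase_to_string(phrase, UnknownMarkup(True, "<unk>", "</unk>")))
    print(phrase_to_string(phrase, UnknownMarkup()))
```

Because the settings travel with the call, the final output and the nbest output cannot disagree about markup unless a caller passes different settings.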




------------------------------

Message: 4
Date: Wed, 2 Dec 2015 21:01:55 +0000
From: Ulrich Germann <ulrich.germann@gmail.com>
Subject: Re: [Moses-support] Best way to mark unknowns in nbest-list
To: Jeremy Gwinnup <jeremy@gwinnup.org>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAHQSRUpa9gkOEjwm8a=49H7n7Z_T1okFQ6bMRfWF=T1nyi32uw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

That one required a bug fix. Pull and try again, please.

- Uli

On Wed, Dec 2, 2015 at 6:39 PM, Jeremy Gwinnup <jeremy@gwinnup.org> wrote:

> Uli,
>
> I just tried the suggested settings on a couple systems:
>
> search algo :
> 1 (cube pruning) - yields UNK in both final output and nbest
> 3 (CYK+) - yields UNK in final output but not nbest
> 5 (inc-search) - yields UNK in final output but not nbest
>
> I'm guessing since factors aren't supported in the chart-based algorithms
> the bug you mention still happens?
>
> Thanks again for the help!
> -Jeremy
>
>
> > On Dec 2, 2015, at 1:08 PM, Ulrich Germann <ulrich.germann@gmail.com>
> wrote:
> >
> > Hi Jeremy,
> >
> > looks like a bug to me at line 1755 in moses/Manager.cpp, when
> reportAllFactors is true.
> >
> > Theoretically, one could have the streaming operator for phrases check
> if mark-unknown is set and use the markup, but I'm trying to eliminate the
> dependence on global variables in the decoder, so I'm more inclined to have
> / use a toString() function that gets the requested behaviour as a function
> parameter.
> >
> > For the time being, a quick hack around the issue might be to set
> --report-all-factors to false and set --output-factors to the factors you
> want in the output. (If --output-factors is not set, Moses should default
> to just printing the first factor, but I haven't tried that out).
> >
> > - Uli
>
>
>


--
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 110, Issue 3
*********************************************
