Moses-support Digest, Vol 84, Issue 28

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Placeholders (Tom Hoar)
2. Re: tokenizer.perl to not tokenize exclude URLs (Barry Haddow)


----------------------------------------------------------------------

Message: 1
Date: Thu, 17 Oct 2013 10:25:03 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Placeholders
To: Moses-Support <moses-support@mit.edu>
Message-ID: <525F588F.80306@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"

The reality is that the current --xml-input functionality straddles the
fence between the schema-less and defined-schema worlds. It's "<anytag/>
except <wall/> and <zone/> and <ne/>." Moses currently supports only
four functions with XML markup: specifying alternate translation, walls,
zones and named entities. I'm not sure a full XML parser is necessary
for four functions, but the chance of accidental conflicts grows with
the number of functions.

It seems more efficient to assign a tag name to the only current
function that doesn't have a reserved tag name. Then, the undefined tag
names become the exception that Moses ignores.

Tom
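
The proposal above can be sketched in a few lines of Python. This is only an illustration, not Moses code: the function name classify_tags, the wrapper element <s> and the fixed_tag parameter are assumptions made for the example. With no fixed tag, every non-reserved tag name is treated as an alternate translation (the current "<anytag/>" behaviour as described above); with a fixed tag, undefined names become the exception that is ignored.

```python
import xml.etree.ElementTree as ET

# Reserved tag names and their functions, as listed in this thread.
RESERVED = {"wall": "wall", "zone": "zone", "ne": "named entity"}


def classify_tags(sentence, fixed_tag=None):
    """Return (tag, role) pairs for every tag in one input sentence.

    fixed_tag=None mimics the "<anytag/>" behaviour: any non-reserved
    tag is an alternate translation. With fixed_tag set (hypothetical),
    only that name carries a translation and other names are ignored.
    """
    # Wrap the sentence in a throwaway root so it parses as one document.
    root = ET.fromstring("<s>" + sentence + "</s>")
    result = []
    for element in root.iter():
        if element is root:
            continue
        if element.tag in RESERVED:
            role = RESERVED[element.tag]
        elif fixed_tag is None or element.tag == fixed_tag:
            role = "alternate translation"
        else:
            role = "ignored"
        result.append((element.tag, role))
    return result


sentence = ('das ist <np translation="a house">ein Haus</np> <wall/> '
            'und <foo translation="x">y</foo>')
print(classify_tags(sentence))                  # current behaviour
print(classify_tags(sentence, fixed_tag="np"))  # proposed behaviour
```

With a fixed tag, an unrelated tag such as <foo/> can no longer collide with a future reserved name, which is the accidental conflict the message worries about.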


On 10/16/2013 11:16 PM, Achim Ruopp wrote:
>
> <anytag/> is XML-compliant in schema-less XML (as long as the tag
> name complies with http://www.w3.org/TR/REC-xml/#NT-Name)
>
> IMHO Moses input (with the -xml-input option) should stay schema-less,
> or we should define a schema. Right now I can't see a pressing reason
> to define a schema.
>
> In any case it would be good to parse the input (with the -xml-input
> option) with a proper XML parser, e.g.
>
> http://www.boost.org/doc/libs/1_54_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser
>
>
> There are probably better XML parsers, but Moses already requires
> Boost. Using an XML parser could also solve some of the character
> escaping uncertainty.
>
> Achim
>
> From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
> On Behalf Of support@precisiontranslationtools.com
> Sent: Tuesday, October 15, 2013 10:25 PM
> To: moses-support@mit.edu
> Subject: Re: [Moses-support] Placeholders
>
> A change from <anytag/> will no doubt disrupt existing pipelines.
> Communicating the change with the new release will be a great help.
>
> On 2013-10-15 01:35, Hieu Hoang wrote:
>
> They're good ideas. I'll have a think about them if I get round to it.
>
> I would also want to minimise the work I have to do, and the disruption
> to people's existing pipelines.
>
> On 15 October 2013 01:33, Tom Hoar
> <tahoar@precisiontranslationtools.com
> <mailto:tahoar@precisiontranslationtools.com>> wrote:
>
> I agree that <anytag/> could cause problems, especially with the growing
> list of reserved tag names (ne, wall, zone). I wholeheartedly support a
> fixed tag, but I'm not sure "option" is it. What about <np/> (already in
> the manual) or <xml-markup/> or <xml-input/> or <moses/>?
>
> Here's another idea. The -xml-input flag supports the values "exclusive",
> "inclusive", "ignore" and "pass-through". What about changing it to a
> boolean flag and using the old values as XML tags: <exclusive/>,
> <inclusive/> and <ignore/>? Then one invocation of Moses would support
> all modes on a per-sentence basis. Just a thought. I think this would also
> be easier if you dropped the "pass-through" option, because there is no
> need for backwards compatibility.
>
> Another idea, although on a slightly different subject: Moses'
> -monotone-at-punctuation flag would be more useful if we could
> define/override the punctuation & symbols that we want it to use. I'm
> not sure how best to accomplish this.
>
> Tom
>
>
>
>
> On 10/15/2013 04:07 AM, Hieu Hoang wrote:
> > In fact, we're thinking of changing <anytag/> to something fixed, like
> > <option/>
> >
> > The <anytag/> behaviour isn't good XML and will cause problems in the
> > future
> >
> > Any opinions on this gratefully received
> >
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
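
Achim's point about character escaping can be illustrated with a short Python sketch. It uses Python's standard xml.etree.ElementTree rather than Boost, and the helper names mark_up and read_back and the default tag "np" are assumptions for the example. It shows only that a real XML serialiser and parser agree on escaping; whether the Moses decoder unescapes input the same way is not established in this thread.

```python
import xml.etree.ElementTree as ET


def mark_up(source_phrase, translation, tag="np"):
    """Serialise one phrase with a translation attribute.

    The serialiser escapes &, < and > in the text and additionally
    the double quote in the attribute value.
    """
    element = ET.Element(tag, {"translation": translation})
    element.text = source_phrase
    return ET.tostring(element, encoding="unicode")


def read_back(markup):
    """Parse the markup and return (source_phrase, translation), unescaped."""
    element = ET.fromstring(markup)
    return element.text, element.get("translation")


markup = mark_up("AT&T <corp>", 'say "hi" & bye')
print(markup)             # escaped form, safe to embed in the input
print(read_back(markup))  # original strings recovered by the parser
```

Because both directions go through the same library, the caller never has to decide by hand which characters need escaping.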


------------------------------

Message: 2
Date: Thu, 17 Oct 2013 14:38:31 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] tokenizer.perl to not tokenize exclude
URLs
To: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Cc: Moses-support <moses-support@mit.edu>
Message-ID: <525FE857.9030205@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Lefty

Thanks for pointing that out - I fixed it.

cheers - Barry

On 16/10/13 14:09, Eleftherios Avramidis wrote:
> Hi Barry,
>
> I found a typo/bug that explains why it hasn't worked so far here: the
> help message of tokenizer.perl said that the parameter is "-protect",
> but in fact it is "-protected".
>
> best
> Lefteris
>
>
>
> On 14/10/13 21:38, Barry Haddow wrote:
>> Hi Lefty
>>
>> For the 'protect' option, the format is one regular expression per
>> line. For example if you use a file with one line like this:
>>
>> http://\S+
>>
>> then it should protect some URLs from tokenisation. It works for me.
>> If you have problems then send me the file.
>>
>> For the -a option, I think the detokeniser should put the hyphens
>> back together again, but I have not checked.
>>
>> cheers - Barry
>>
>> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>>> Hi,
>>>
>>> I see tokenizer.perl now offers an option for excluding URLs and other
>>> expressions. " -protect FILE ... specify file with patters to be
>>> protected in tokenisation." Unfortunately there is no explanation of
>>> how this optional file should be formatted. I tried several ways of
>>> writing regular expressions for URLs, but URLs still come out tokenized.
>>> Could you provide an example?
>>>
>>> My second question concerns the -a option, for aggressive hyphen
>>> splitting. Does the detokenizer offer a similar option, to rejoin the
>>> separated hyphens?
>>>
>>> cheers
>>> Lefteris
>>>
>>
>
>
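
The patterns-file format described in this thread (one regular expression per line) can be tried out with the Python sketch below. It is not tokenizer.perl: the tokeniser here is a toy that only splits off punctuation, and the function names and the PROTECTED placeholder scheme are assumptions made for the example. It shows one plausible way to protect matches: replace them with placeholders before tokenising, then restore them afterwards.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per line; blank lines are skipped."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def tokenize_with_protection(text, patterns):
    """Toy tokeniser that leaves every pattern match as a single token."""
    protected = []

    def stash(match):
        # Remember the matched text and leave a placeholder token behind.
        protected.append(match.group(0))
        return " PROTECTED%d " % (len(protected) - 1)

    for pattern in patterns:
        text = pattern.sub(stash, text)

    # Toy tokenisation: put spaces around every punctuation character.
    text = re.sub(r"([^\w\s])", r" \1 ", text)

    tokens = []
    for token in text.split():
        if re.fullmatch(r"PROTECTED\d+", token):
            tokens.append(protected[int(token[len("PROTECTED"):])])
        else:
            tokens.append(token)
    return tokens


patterns = load_patterns(["http://\\S+"])  # same pattern as in the thread
print(tokenize_with_protection("See http://example.com/a?b=1 now!", patterns))
# ['See', 'http://example.com/a?b=1', 'now', '!']
```

Note that a pattern such as http://\S+ also swallows punctuation that directly follows a URL, which may be one reason it protects only "some URLs".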


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 84, Issue 28
*********************************************
