Moses-support Digest, Vol 84, Issue 22

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. tokenizer.perl to not tokenize exclude URLs
(Eleftherios Avramidis)
2. Re: tokenizer.perl to not tokenize exclude URLs (Barry Haddow)
3. Re: Placeholders (Hieu Hoang)
4. Re: Placeholders (Tom Hoar)


----------------------------------------------------------------------

Message: 1
Date: Mon, 14 Oct 2013 20:22:37 +0200
From: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Subject: [Moses-support] tokenizer.perl to not tokenize exclude URLs
To: Moses-support <moses-support@mit.edu>
Message-ID: <525C366D.3060309@dfki.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi,

I see tokenizer.perl now offers an option for excluding URLs and other
expressions. " -protect FILE ... specify file with patters to be
protected in tokenisation." Unfortunately there is no explanation of how
this optional file should be formatted. I tried several ways of writing regular
expressions for URLs, but URLs still come out tokenized. Could you
provide an example?

My second question concerns the -a option, for aggressive hyphen
splitting. Does the detokenizer offer a similar option, to reconstruct
separated hyphens?

cheers
Lefteris

--
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806

Fax. +49-30 238 95-1810

-------------------------------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------------------------------------



------------------------------

Message: 2
Date: Mon, 14 Oct 2013 20:38:37 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] tokenizer.perl to not tokenize exclude
URLs
To: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Cc: Moses-support <moses-support@mit.edu>
Message-ID: <525C483D.9080300@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Lefty

For the 'protect' option, the format is one regular expression per line.
For example if you use a file with one line like this:

http://\S+

then it should protect some URLs from tokenisation. It works for me. If
you have problems then send me the file.
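
To illustrate the one-regular-expression-per-line format described above, here is a minimal Python sketch of how pattern protection can work in principle. It is not the tokenizer.perl implementation: the function names (load_patterns, protect_and_tokenize), the PROTECTED marker format, and the toy punctuation-splitting rule are all assumptions made for this example.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def protect_and_tokenize(text, patterns):
    """Tokenize text while leaving every protected match untouched."""
    protected = []

    def stash(match):
        # Swap the match for an alphanumeric marker the toy tokenizer cannot split.
        protected.append(match.group(0))
        return "PROTECTED%dX" % (len(protected) - 1)

    for pattern in patterns:
        text = pattern.sub(stash, text)

    # Toy tokenizer (an assumption): put spaces around punctuation characters.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # Put the protected strings back in place of their markers.
    return re.sub(r"PROTECTED(\d+)X", lambda m: protected[int(m.group(1))], text)


# The list below stands in for the lines of a protect file.
patterns = load_patterns(["http://\\S+"])
print(protect_and_tokenize("See http://example.com/a?b=1 now, please.", patterns))
```

Note that a pattern such as http://\S+ also swallows punctuation that directly follows a URL, which may be one reason it protects only "some" URLs.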

For the -a option, I think the detokeniser should put the hyphens back
together again, but I have not checked.
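
As a rough illustration of the round trip in question, the Python sketch below assumes that aggressive splitting marks each split hyphen with a " @-@ " token and that detokenization simply collapses that token again. The marker and both function names are assumptions for this example and should be checked against your copy of the scripts.

```python
import re


def split_hyphens(text):
    """Aggressive splitting: 'well-known' becomes 'well @-@ known'."""
    # Only hyphens with a word character on both sides are split.
    return re.sub(r"(\w)-(?=\w)", r"\1 @-@ ", text)


def rejoin_hyphens(text):
    """Undo the split by collapsing ' @-@ ' back into '-'."""
    return text.replace(" @-@ ", "-")


original = "a well-known state-of-the-art system"
split = split_hyphens(original)
print(split)
print(rejoin_hyphens(split) == original)
```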

cheers - Barry

On 14/10/13 19:22, Eleftherios Avramidis wrote:
> Hi,
>
> I see tokenizer.perl now offers an option for excluding URLs and other
> expressions. " -protect FILE ... specify file with patters to be
> protected in tokenisation." Unfortunately there is no explanation of how
> this optional file should be formatted. I tried several ways of writing regular
> expressions for URLs, but URLs still come out tokenized. Could you
> provide an example?
>
> My second question concerns the -a option, for aggressive hyphen
> splitting. Does the detokenizer offer a similar option, to reconstruct
> separated hyphens?
>
> cheers
> Lefteris
>



------------------------------

Message: 3
Date: Mon, 14 Oct 2013 22:07:48 +0100
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Placeholders
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <F6555D94-6277-4384-8F75-0C4D8079E7C8@gmail.com>
Content-Type: text/plain; charset="us-ascii"

Hi tom

Sent while bumping into things

> On 13 Oct 2013, at 17:01, Tom Hoar <tahoar@precisiontranslationtools.com> wrote:
>
> Thanks Hieu and Achim for the new feature. I think it's great. Some questions:
>
> 1) When invoking mert-moses.pl to tune a model prepared with placeholders, and the dev set includes placeholders, it looks like the new moses command line options (-placeholder-factor 1 -xml-input exclusive) should be placed in the "--decoder-flags" or in the config file. Can you confirm?
Yep, they are decoder flags.
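
A small sketch of how such a tuning command could be assembled, written in Python so the quoting is explicit. The file paths are hypothetical, and the positional argument order (source, reference, decoder, config) is an assumption about mert-moses.pl that should be checked against its usage message. Only the flag values come from this thread.

```python
import shlex

# Hypothetical paths; replace them with your own files.
dev_source = "dev.src"
dev_reference = "dev.ref"
decoder = "bin/moses"
config = "model/moses.ini"

# Flags quoted from the thread; they are handed through to the decoder.
decoder_flags = "-placeholder-factor 1 -xml-input exclusive"

command = [
    "mert-moses.pl",
    dev_source,
    dev_reference,
    decoder,
    config,
    "--decoder-flags",
    decoder_flags,  # one argument, so it needs quoting on a shell command line
]

# Build a shell-safe command line without running anything.
command_line = " ".join(shlex.quote(arg) for arg in command)
print(command_line)
```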
>
> 2) Are there any limits as to what escape sequences are used as placeholders? Your example was @num@. Could this just as easily be %(num)s if carried through all the necessary steps?
No limit on what the placeholder 'word' should be

There can also be multiple, different placeholder words. @num@ for numbers, %(date) for dates, :place: for place names etc
>
> 3) If we change your example to
>
> "you owe me $ 42.85 ."
>
> and update the ph_numbers.perl to re-format numbers with the target language formatting
>
> "you owe me $ <ne translation="@num@" entity="42,85">@num@</ne> ."
>
> would the corresponding translated output include the 42,85?
Yes, 42,85 will be the output.

The placeholder script should be language-pair specific. There are flags to specify the source and target language in the script, but I don't think they are used at the moment. You should extend it.
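
A minimal Python sketch of the kind of markup step discussed in question 3: each number is wrapped in an <ne> tag whose entity attribute carries the value reformatted for the target language. This is not ph_numbers.perl. The number pattern, the decimal-comma rule, and the function names are assumptions for illustration only.

```python
import re

# Assumed number pattern: digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_target_format(number):
    """Assumed target convention: decimal comma instead of decimal point."""
    return number.replace(".", ",")


def mark_numbers(sentence, symbol="@num@"):
    """Wrap each number in an <ne> tag carrying the reformatted value."""
    def wrap(match):
        entity = to_target_format(match.group(0))
        return '<ne translation="%s" entity="%s">%s</ne>' % (symbol, entity, symbol)

    return NUMBER.sub(wrap, sentence)


print(mark_numbers("you owe me $ 42.85 ."))
```

A language-pair-specific version would replace to_target_format with rules chosen per target language.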
>
> 4) If the entity="" value must include reserved/special characters, such as &, <, >, or Moses restricted vertical bar | , should they be escaped within the quotes like the tokenizer.perl and escape-special-chars.perl scripts escape them?
Dunno. Haven't kicked the tyres on this yet.

You should err on the safe side and escape it. Also, since you have to unescape the whole output sentence, not escaping it may cause you problems.
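
One way to stay on the safe side is to escape the entity value before it goes into the attribute and to unescape the output afterwards. The Python sketch below uses a character mapping that is assumed to mirror the one in the Moses escaping scripts; verify it against escape-special-chars.perl before relying on it. Whether the decoder handles escaped entity values correctly is untested, as noted above.

```python
# Assumed mapping; "&" comes first so later entities are not escaped twice.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_entity(value):
    """Escape reserved characters before writing an entity="..." value."""
    for raw, escaped in ESCAPES:
        value = value.replace(raw, escaped)
    return value


def unescape_entity(value):
    """Reverse the escaping; "&amp;" is restored last."""
    for raw, escaped in reversed(ESCAPES):
        value = value.replace(escaped, raw)
    return value


escaped = escape_entity("AT&T | <b>")
print(escaped)
print(unescape_entity(escaped))
```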
>
> 5) The last I recall, the --xml-input option wasn't particular about what XML tag is used. Is this still true, the example could be <anytag/> and still work the same?

No, it must be <ne ..>

In fact, we're thinking of changing <anytag/> to something fixed, like <option/>

The <anytag/> behaviour isn't good XML and will cause problems in the future

Any opinions on this gratefully received

>
> 6) Any chance to backport this feature to RELEASE-1.0? How much work do you think would be involved? If we choose to do the backport, can you point us in the right direction and do you want the updates for a RELEASE-1.1?
Can't add this to release 1. It depends on stuff that's only in the current github code

The current code will read most ini files you create with release 1, so that should lessen your pain

However, it would be good if you can move to release 2.0; it would cause fewer headaches for you and me. The ini file shouldn't change from what we have now in github.
>
> Thanks,
> Tom
>
>
>
>
>> On 10/10/2013 08:30 PM, Hieu Hoang wrote:
>>
>>
>>
>>> On 10 October 2013 13:33, Nicola Bertoldi <bertoldi@fbk.eu> wrote:
>>> Hi Hieu
>>>
>>> I read the documentation
>>> and you mention that you enable the exclusive mode of xml-input
>>>
>>> I see few issues:
>>>
>>> - you mention that you enable the exclusive mode of xml-input;
>>> this can conflict with other usage of xml-input which instead require the inclusive mode.
>>> do you have any comments on that?
>>
>> it can be exclusive, inclusive or anything else except pass-through. It just requires the XML handling to run
>>
>>>
>>> - when you use the exclusive mode you force the translation of the span (@num@) with "100"
>>> and other larger spans including @num@ are not allowed
>>> am I right?
>>> If yes, what is the advantage of having phrase pairs including other words
>>
>> it doesn't create XML options, it just needs the XML parsing to run.
>>
>>>
>>> - what is the meaning of "-placeholder-factor 1" ?
>> It stores the original text in the source factor 1. The placeholder symbol is in the factor 0, or whatever the translation model was configured to use.
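
A conceptual Python sketch of that factor layout: for a marked-up token, factor 0 holds the placeholder symbol and factor 1 holds the original text. This only illustrates the idea and is not how Moses represents factors internally. The regular expression and the to_factors name are assumptions for this example.

```python
import re

# Matches one <ne> element in the attribute order used in this thread.
NE_TAG = re.compile(r'<ne translation="([^"]*)" entity="([^"]*)">[^<]*</ne>')


def to_factors(token):
    """Return {factor_index: text} for one input token."""
    match = NE_TAG.fullmatch(token)
    if match:
        symbol, original = match.groups()
        # Factor 0: placeholder symbol; factor 1: original text.
        return {0: symbol, 1: original}
    # Ordinary words only carry their surface form in factor 0.
    return {0: token}


print(to_factors('<ne translation="@num@" entity="42,85">@num@</ne>'))
print(to_factors("owe"))
```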
>>
>>>
>>>
>>> Nicola Bertoldi
>>>
>>>
>>>
>>>
>>> On Oct 10, 2013, at 1:05 PM, Hieu Hoang wrote:
>>>
>>> Hi all
>>>
>>> Achim and I have been working on adding support for placeholders into Moses. That is, replacing a number, date, or named entity with a symbol eg. @num@, -date-, =named-entity=. We think it would be especially useful for commercial users of Moses, and for people translating text with lots of numbers, dates etc.
>>>
>>> It is now supported in the Moses training and decoding pipeline. See the following URL for more details.
>>> h
>>>
>>> --
>>> Hieu Hoang
>>> Research Associate
>>> University of Edinburgh
>>> http://www.hoang.co.uk/hieu
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>> --
>> Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>>
>>

------------------------------

Message: 4
Date: Tue, 15 Oct 2013 08:33:05 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Placeholders
To: moses-support@mit.edu
Message-ID: <525C9B51.8080504@precisiontranslationtools.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I agree that <anytag/> could cause problems, especially with the growing
list of reserved tag names (ne, wall, zone). I wholeheartedly support a
fixed tag, but I'm not sure "option" is it. What about <np/> (already in
the manual) or <xml-markup/> or <xml-input/> or <moses/>?

Here's another idea. The -xml-input flag supports values "exclusive,"
"inclusive," "ignore" and "pass-through." What about changing the flag
to a boolean flag. Then, use the value as the xml tags: <exclusive/>,
<inclusive/> and <ignore/> so the one invocation of Moses would support
all modes on a per-sentence basis. Just a thought. I think this would also
be easier if you dropped the "pass-through" option, since there would be no
need for backwards compatibility.

Another idea, although slightly different subject. Moses'
-monotone-at-punctuation flag would be more useful if we could
define/override the punctuation & symbols that we want it to use. Not
sure how to best accomplish this.

Tom



On 10/15/2013 04:07 AM, Hieu Hoang wrote:
> In fact, we're thinking of changing <anytag/> to something fixed, like
> <option/>
>
> The <anytag/> behaviour isn't good XML and will cause problems in the
> future
>
> Any opinions on this gratefully received
>



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 84, Issue 22
*********************************************
