Moses-support Digest, Vol 84, Issue 23

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: tokenizer.perl to not tokenize exclude URLs (Tom Hoar)
2. Re: Placeholders (Hieu Hoang)
3. Re: tokenizer.perl to not tokenize exclude URLs (Barry Haddow)
4. How to tell EMS to use an existing LM (Lane Schwartz)
5. Re: How to tell EMS to use an existing LM (Eleftherios Avramidis)


----------------------------------------------------------------------

Message: 1
Date: Tue, 15 Oct 2013 08:47:46 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] tokenizer.perl to not tokenize exclude
URLs
To: moses-support@mit.edu
Message-ID: <525C9EC2.7050504@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

Thanks! This is another handy new feature. I suggest that the "placeholders"
functionality (separate thread) combined with this "protect" option could be
a killer combination: escape URLs with a token, for example @URL@, before
tokenization, then protect this token during tokenization. You won't have to
"fix" it afterwards, and you can define alternate URL translations at Moses
runtime (example.com => example.ca).
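
As a rough sketch of that workflow (the @URL@ token, the simple URL pattern
and the script below are only illustrative, not part of tokenizer.perl or of
the placeholder work):

#!/usr/bin/perl
# escape-urls.perl -- hypothetical pre-processing step: replace every URL
# with a placeholder token before tokenization and save the originals to a
# side file, so a later step can restore (or localise) them after decoding.
use strict;
use warnings;

my $url_re = qr{https?://\S+}i;    # deliberately simple URL pattern
open(my $map, '>', 'urls.map') or die "Cannot write urls.map: $!";

while (my $line = <STDIN>) {
    # each URL becomes @URL@; the original is written to urls.map in order
    $line =~ s/($url_re)/print $map "$1\n"; '@URL@'/ge;
    print $line;
}
close($map);

The reverse step would read urls.map and substitute the saved (or rewritten,
e.g. example.com => example.ca) URLs back in for each @URL@, in order.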

BTW, here's a more focused regular expression we use to identify URLs.

(?i)\b((?:(?:(?:[a-z32][\w-]{1,6}:{1}/{2,3})[a-z0-9.\-_]+(:\d{1,5})?(/?))([^\s<>',\?\.]*([\.][a-z]{2,4})?)*(?:\?[^\s<>',\.]+)?))

Here's another that works nicely. We found it at:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
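
A rough usage sketch: as Barry describes in the quoted reply below, the
patterns go into a file with one regular expression per line, which is then
passed to the tokenizer via -protect (file names and paths here are just
placeholders):

  # url-patterns.txt -- one regular expression per line
  http://\S+

  # tokenize, protecting anything that matches a pattern in the file
  tokenizer.perl -l en -protect url-patterns.txt < corpus.en > corpus.tok.en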




On 10/15/2013 02:38 AM, Barry Haddow wrote:
> Hi Lefty
>
> For the 'protect' option, the format is one regular expression per line.
> For example if you use a file with one line like this:
>
> http://\S+
>
> then it should protect some URLs from tokenisation. It works for me. If
> you have problems then send me the file.
>
> For the -a option, I think the detokeniser should put the hyphens back
> together again, but I have not checked.
>
> cheers - Barry
>
> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>> Hi,
>>
>> I see tokenizer.perl now offers an option for excluding URLs and other
>> expressions: "-protect FILE ... specify file with patterns to be
>> protected in tokenisation." Unfortunately there is no explanation of how
>> this optional file should be formatted. I tried several ways of writing
>> regular expressions for URLs, but URLs still come out tokenized. Could
>> you provide an example?
>>
>> My second question concerns the -a option, for aggressive hyphen
>> splitting. Does the detokenizer offer a similar option, to reconstruct
>> separated hyphens?
>>
>> cheers
>> Lefteris
>>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



------------------------------

Message: 2
Date: Tue, 15 Oct 2013 08:35:39 +0000
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Placeholders
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbi1u_0XYx5F+DhDhw-VY2Z4-=6_G7QGViAWn1GqYfsdPA@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

They're good ideas. I'll have a think about them if I get round to doing it.

I would also want to minimise the work I have to do, and minimise the
disruption to people's existing pipelines.


On 15 October 2013 01:33, Tom Hoar <tahoar@precisiontranslationtools.com> wrote:

> I agree that <anytag/> could cause problems, especially with the growing
> list of reserved tag names (ne, wall, zone). I wholeheartedly support a
> fixed tag, but I'm not sure "option" is it. What about <np/> (already in
> the manual) or <xml-markup/> or <xml-input/> or <moses/>?
>
> Here's another idea. The -xml-input flag supports the values "exclusive,"
> "inclusive," "ignore" and "pass-through." What about changing it to a
> boolean flag and then using the mode names as the xml tags: <exclusive/>,
> <inclusive/> and <ignore/>, so that one invocation of Moses would support
> all modes on a per-sentence basis? Just a thought. I think this would also
> be easier if you dropped the "pass-through" option, since there's no need
> for backwards compatibility.
>
> Another idea, on a slightly different subject: Moses'
> -monotone-at-punctuation flag would be more useful if we could
> define/override the punctuation and symbols that we want it to use. I'm
> not sure how best to accomplish this.
>
> Tom
>
>
>
> On 10/15/2013 04:07 AM, Hieu Hoang wrote:
> > In fact, we're thinking of changing <anytag/> to something fixed, like
> > <option/>.
> >
> > The <anytag/> behaviour isn't good XML and will cause problems in the
> > future.
> >
> > Any opinions on this gratefully received.
> >
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20131015/3eb30855/attachment-0001.htm

------------------------------

Message: 3
Date: Tue, 15 Oct 2013 09:42:44 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] tokenizer.perl to not tokenize exclude
URLs
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support@mit.edu
Message-ID: <525D0004.2090000@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Tom

The implementation of 'protected' segments was fairly quick and simple, so
the expressions below would need some adjustment; in particular, you'd at
least have to turn their capturing groups into non-capturing groups.
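
For example, a fragment of the first expression quoted below, such as

  (:\d{1,5})?

would need to be rewritten with a non-capturing group:

  (?::\d{1,5})?

i.e. each plain ( ... ) grouping becomes (?: ... ).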

The protected segments were sufficient for my purposes, but if anyone
wants to improve them, feel free...

cheers - Barry

On 15/10/13 02:47, Tom Hoar wrote:
> Thanks! This is another handy new feature. I suggest that the "placeholders"
> functionality (separate thread) combined with this "protect" option could be
> a killer combination: escape URLs with a token, for example @URL@, before
> tokenization, then protect this token during tokenization. You won't have to
> "fix" it afterwards, and you can define alternate URL translations at Moses
> runtime (example.com => example.ca).
>
> BTW, here's a more focused regular expression we use to identify URLs.
>
> (?i)\b((?:(?:(?:[a-z32][\w-]{1,6}:{1}/{2,3})[a-z0-9.\-_]+(:\d{1,5})?(/?))([^\s<>',\?\.]*([\.][a-z]{2,4})?)*(?:\?[^\s<>',\.]+)?))
>
> Here's another that works nicely. We found it at:
> http://daringfireball.net/2010/07/improved_regex_for_matching_urls
>
> (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
>
>
>
>
> On 10/15/2013 02:38 AM, Barry Haddow wrote:
>> Hi Lefty
>>
>> For the 'protect' option, the format is one regular expression per line.
>> For example if you use a file with one line like this:
>>
>> http://\S+
>>
>> then it should protect some URLs from tokenisation. It works for me. If
>> you have problems then send me the file.
>>
>> For the -a option, I think the detokeniser should put the hyphens back
>> together again, but I have not checked.
>>
>> cheers - Barry
>>
>> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>>> Hi,
>>>
>>> I see tokenizer.perl now offers an option for excluding URLs and other
>>> expressions: "-protect FILE ... specify file with patterns to be
>>> protected in tokenisation." Unfortunately there is no explanation of how
>>> this optional file should be formatted. I tried several ways of writing
>>> regular expressions for URLs, but URLs still come out tokenized. Could
>>> you provide an example?
>>>
>>> My second question concerns the -a option, for aggressive hyphen
>>> splitting. Does the detokenizer offer a similar option, to reconstruct
>>> separated hyphens?
>>>
>>> cheers
>>> Lefteris
>>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

Message: 4
Date: Tue, 15 Oct 2013 09:40:41 -0400
From: Lane Schwartz <dowobeha@gmail.com>
Subject: [Moses-support] How to tell EMS to use an existing LM
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CABv3vZnu_ZKUOwFTdECHh6cqu3brc075w2KrrEGwiDgDoEg8AA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

I'm running Moses v1, and I have some existing already-trained LM
files that I'd like to use.

How can I tell EMS to use an existing LM file (presumably at the same
time telling EMS what LM type it is)?

Thanks,
Lane


------------------------------

Message: 5
Date: Tue, 15 Oct 2013 16:04:52 +0200
From: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Subject: Re: [Moses-support] How to tell EMS to use an existing LM
To: moses-support@mit.edu
Message-ID: <525D4B84.9030006@dfki.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Lane,

In the [LM:...] section which describes the particular language model, you
uncomment the '#lm =' line and point it directly to your already-trained
model. The type can be defined with the parameter 'type'.

From the example configuration:

[LM:europarl]
### command to run to get raw corpus files
#
#get-corpus-script = ""

### raw corpus (untokenized)
#
raw-corpus = $wmt12-data/training/europarl-v7.$output-extension

### tokenized corpus files (may contain long sentences)
#
#tokenized-corpus =

### if corpus preparation should be skipped,
# point to the prepared language model
#
#lm =
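
So once the language model is already built, the corpus-preparation settings
can stay commented out and the section reduces to something like this (the
path below is only a placeholder for your existing model file):

[LM:europarl]
### if corpus preparation should be skipped,
# point to the prepared language model
#
lm = /path/to/your/existing-lm.lm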




best
Lefteris

On 15/10/13 15:40, Lane Schwartz wrote:
> I'm running Moses v1, and I have some existing already-trained LM
> files that I'd like to use.
>
> How can I tell EMS to use an existing LM file (presumably at the same
> time telling EMS what LM type it is)?
>
> Thanks,
> Lane
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


--
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806

Fax. +49-30 238 95-1810

-------------------------------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------------------------------------



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 84, Issue 23
*********************************************
