Moses-support Digest, Vol 84, Issue 26

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Moses chart MERT crashes when run with ems (Hieu Hoang)
2. Re: tokenizer.perl to not tokenize exclude URLs
(Eleftherios Avramidis)
3. Re: XLIFF support in the M4Loc project (Achim Ruopp)
4. Re: XLIFF support in the M4Loc project (Tom Hoar)


----------------------------------------------------------------------

Message: 1
Date: Wed, 16 Oct 2013 13:00:33 +0000
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Moses chart MERT crashes when run with
ems
To: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Cc: Moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbh6Jc8WA=qb9VAbg0UUcx+K_nsXo-5kagvVpoheOhc6gQ@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Please send the EMS config file.

The problem is probably the binarization of the phrase table.

Your config probably contains the line
ttable-binarizer = "$moses-src-dir/bin/CreateOnDiskPt 1 1 5 100 2"
It should now be
ttable-binarizer = "$moses-src-dir/bin/CreateOnDiskPt 1 1 4 100 2"
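If you would rather patch the config with a script than by hand, the change can be sketched in Python roughly as follows. This is only an illustration: the function name is hypothetical, and it assumes the third numeric argument of CreateOnDiskPt is the number of scores, which is what the failed NumScores check in the error suggests.

```python
import re

# Matches the ttable-binarizer line up to (but not including) the third
# numeric argument of CreateOnDiskPt, assumed here to be the score count.
BINARIZER_LINE = re.compile(
    r'^([ \t]*ttable-binarizer[ \t]*=[ \t]*".*?CreateOnDiskPt\s+\d+\s+\d+\s+)\d+',
    re.MULTILINE,
)


def set_binarizer_num_scores(config_text: str, num_scores: int) -> str:
    """Return config_text with the CreateOnDiskPt score count replaced."""
    # A function replacement avoids ambiguity between the group reference
    # and the digits that follow it.
    return BINARIZER_LINE.sub(lambda m: m.group(1) + str(num_scores), config_text)


if __name__ == "__main__":
    old = 'ttable-binarizer = "$moses-src-dir/bin/CreateOnDiskPt 1 1 5 100 2"\n'
    print(set_binarizer_num_scores(old, 4), end="")
```

Lines that do not mention CreateOnDiskPt are left untouched, so it should be safe to run over the whole config file.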





On 16 October 2013 12:50, Eleftherios Avramidis <
eleftherios.avramidis@dfki.de> wrote:

> Hi
>
> Moses chart crashed while running tuning as part of a default EMS
> pipeline. Exactly the same settings run perfectly for phrase-based Moses.
> The error was:
>
>
> Start loading text SCFG phrase table. Moses format : [0.000] seconds
>
> max-chart-span: 20
>
> max-chart-span: 1000
>
> Check obj->GetMisc("NumScores") == m_numScoreComponents failed in
> moses/TranslationModel/RuleTable/PhraseDictionaryOnDisk.cpp:91
>
> Aborted (core dumped)
>
> Exit code: 134
>
> The decoder died. CONFIG WAS -weight-overwrite 'PhrasePenalty0= 0.057143
> WordPenalty0= -0.285714 TranslationModel0= 0.057143 0.057143 0.057143
> 0.057143 TranslationModel1= 0.285714 LM0= 0.142857'
>
>
> best
> Lefteris
>
> --
> MSc. Inf. Eleftherios Avramidis
> DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
> Tel. +49-30 238 95-1806
>
> Fax. +49-30 238 95-1810
>
> -------------------------------------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------------------------------------
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------

Message: 2
Date: Wed, 16 Oct 2013 15:09:19 +0200
From: Eleftherios Avramidis <eleftherios.avramidis@dfki.de>
Subject: Re: [Moses-support] tokenizer.perl to not tokenize exclude
URLs
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Cc: Moses-support <moses-support@mit.edu>
Message-ID: <525E8FFF.6090804@dfki.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Barry,

I found a typo/bug that explains why it hasn't worked so far here: the
help message of tokenizer.perl said that the parameter is "-protect",
but in fact it is "-protected".
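For anyone else wondering what a pattern file like the one Barry describes below is meant to achieve, here is a rough Python sketch of the idea. It is not how tokenizer.perl is implemented; the function name, the placeholder scheme and the toy tokenizer are all assumptions made for illustration. Each pattern is one regular expression, as in a file with one expression per line.

```python
import re


def protect_and_tokenize(text, patterns):
    """Tokenize text, leaving spans matched by any pattern untouched."""
    protected = []

    def stash(match):
        # Swap the protected span for an alphanumeric placeholder that the
        # toy tokenizer below will not split.
        protected.append(match.group(0))
        return f" PROTECTED{len(protected) - 1}X "

    for pattern in patterns:
        text = re.sub(pattern, stash, text)

    # Toy tokenizer (a stand-in for tokenizer.perl): put spaces around
    # every character that is neither alphanumeric nor whitespace.
    text = re.sub(r"([^\w\s])", r" \1 ", text)

    # Put the protected spans back in place of their placeholders.
    for i, original in enumerate(protected):
        text = text.replace(f"PROTECTED{i}X", original)
    return " ".join(text.split())


if __name__ == "__main__":
    line = "See http://www.statmt.org/moses/ now."
    print(protect_and_tokenize(line, [r"http://\S+"]))
    print(protect_and_tokenize(line, []))
```

With the pattern the URL comes through in one piece, and without it the colon, slashes and dots are split off as separate tokens.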

best
Lefteris



On 14/10/13 21:38, Barry Haddow wrote:
> Hi Lefty
>
> For the 'protect' option, the format is one regular expression per
> line. For example if you use a file with one line like this:
>
> http://\S+
>
> then it should protect some URLs from tokenisation. It works for me.
> If you have problems then send me the file.
>
> For the -a option, I think the detokeniser should put the hyphens back
> together again, but I have not checked.
>
> cheers - Barry
>
> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>> Hi,
>>
>> I see tokenizer.perl now offers an option for excluding URLs and other
>> expressions. " -protect FILE ... specify file with patters to be
>> protected in tokenisation." Unfortunately there is no explanation of how
>> this optional file should be formatted. I tried several ways of
>> writing regular expressions for URLs, but URLs still come out
>> tokenized. Could you provide an example?
>>
>> My second question concerns the -a option, for aggressive hyphen
>> splitting. Does the detokenizer offer a similar option, to reconstruct
>> separated hyphens?
>>
>> cheers
>> Lefteris
>>
>


--
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806

Fax. +49-30 238 95-1810




------------------------------

Message: 3
Date: Wed, 16 Oct 2013 10:20:54 -0400
From: "Achim Ruopp" <achimru@gmail.com>
Subject: Re: [Moses-support] XLIFF support in the M4Loc project
To: "'John Tinsley'" <jtinsley@computing.dcu.ie>,
<moses-support@mit.edu>
Message-ID: <00af01ceca7a$ede21280$c9a63780$@com>
Content-Type: text/plain; charset="us-ascii"

Hi John,

The M4Loc tool chain handles only the subset of XLIFF inline tags that is
generated by the Okapi Moses Text Filter:

http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter

The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk



If you aren't using the Okapi tools, you can still use their library, I
believe, to convert

das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept
id="1">&lt;/b&gt;</ept>

into

das ist ein <g id="1">kleines haus</g>

and apply the reverse process to the translation.
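To make the conversion concrete, here is a rough regex-based sketch in Python. It is not the Okapi library's API; the function names are hypothetical, and it assumes well-formed, non-nested bpt/ept pairs with matching ids. The native codes are kept in a dictionary so the reverse step can restore them.

```python
import re

# One <bpt>...</bpt> text <ept>...</ept> pair sharing the same id.
BPT_EPT = re.compile(
    r'<bpt id="(?P<id>[^"]+)">(?P<open>.*?)</bpt>'
    r"(?P<text>.*?)"
    r'<ept id="(?P=id)">(?P<close>.*?)</ept>',
    re.DOTALL,
)
G_TAG = re.compile(r'<g id="([^"]+)">(.*?)</g>', re.DOTALL)


def to_g_tags(segment):
    """Replace bpt/ept pairs with <g> and remember the native codes by id."""
    codes = {}

    def repl(m):
        codes[m.group("id")] = (m.group("open"), m.group("close"))
        return f'<g id="{m.group("id")}">{m.group("text")}</g>'

    return BPT_EPT.sub(repl, segment), codes


def from_g_tags(segment, codes):
    """Reverse step: expand <g> back into the original bpt/ept pair."""

    def repl(m):
        tag_id, text = m.group(1), m.group(2)
        open_code, close_code = codes[tag_id]
        return (
            f'<bpt id="{tag_id}">{open_code}</bpt>{text}'
            f'<ept id="{tag_id}">{close_code}</ept>'
        )

    return G_TAG.sub(repl, segment)


if __name__ == "__main__":
    src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
    converted, codes = to_g_tags(src)
    print(converted)
    print(from_g_tags(converted, codes) == src)
```

In practice the reverse step would run on the translated segment after reinsert.pm has placed the g tags, using the codes saved from the source.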



Alternatively you could modify M4Loc to handle all XLIFF inline tagging

http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine

But I think this would be messier, and by using Okapi you also get
support for future XLIFF versions (e.g. 2.0).



Achim



From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
On Behalf Of John Tinsley
Sent: Wednesday, October 16, 2013 7:20 AM
To: moses-support@mit.edu
Subject: [Moses-support] XLIFF support in the M4Loc project



Hi folks,



I'm having a little trouble with XLIFF handling using some of the M4Loc
tools, specifically 'reinsert.pm' for replacing inline markup after
translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).



It works fine for simple tags where the text between the tags *should* be
translated, e.g.



src: das ist ein <bx id="1">kleines haus</bx>

tgt: this is |0-1| a |2-2| small |3-3| house |4-4|



output: this is a <bx id="1"> small house </bx>



However, there are often examples of paired tags (kind of like markup around
markup) which are not handled, e.g.



das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept
id="1">&lt;/b&gt;</ept>



In this case, the <bpt> and <ept> tags are paired, and everything inside
each of them (e.g. &lt;b&gt;) should be stripped out, but this doesn't
appear to happen.



Is there another tool in the project that handles this kind of markup or is
it not supported?



Thanks

John


--

John Tinsley


------------------------------

Message: 4
Date: Wed, 16 Oct 2013 22:47:02 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] XLIFF support in the M4Loc project
To: moses-support@mit.edu
Message-ID: <525EB4F6.6010401@precisiontranslationtools.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi John,

If you're looking to completely remove these inline elements, you can
remove the tags, then unescape their contents, and run a second pass to
remove the HTML tags. That works if the contents of the bpt/ept tags are
HTML. However, they could be RTF or some other markup language. We've
found it's safe to simply use a regex pattern to remove everything from
<bpt> through </bpt> and from <ept> through </ept>.
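Both approaches can be sketched in Python roughly as follows. This is an illustration only, not our production code: the function names are hypothetical, and the patterns assume the elements are well formed and not nested.

```python
import html
import re

# A whole <bpt>...</bpt> or <ept>...</ept> element, including its content.
PAIRED_CODE = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)
# Only the opening or closing bpt/ept tag itself.
CODE_TAG = re.compile(r"</?(?:bpt|ept)\b[^>]*>")
# Any remaining tag after unescaping (assumes the native codes are HTML).
ANY_TAG = re.compile(r"<[^>]+>")


def strip_paired_codes(segment):
    """Single pass: drop each bpt/ept element together with its content."""
    return PAIRED_CODE.sub("", segment)


def strip_codes_two_pass(segment):
    """Two passes: remove the tags, unescape, then remove the HTML tags."""
    without_tags = CODE_TAG.sub("", segment)
    return ANY_TAG.sub("", html.unescape(without_tags))


if __name__ == "__main__":
    src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
    print(strip_paired_codes(src))
    print(strip_codes_two_pass(src))
```

The single-pass version does not care what markup language the native codes are written in, which is why it is the safer of the two.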

These tags are not generated by Okapi, but other tools do create them.
So if you're looking to regenerate these and other tags created by other
tools in the translated output, I think you're out of luck for now.
We're developing a tool that supports all XLIFF 1.2 inline elements
during translation, but it will not be published as open source. It's
scheduled for completion by the end of the year.

Hi Achim,

Can you verify the "lb" tag you included in your list? I reviewed the
XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft
spec that was published yesterday:
https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf.
It's a significant departure from 1.2! Any/all solutions that were
developed for 1.2's inline elements will need to be totally re-thought
and re-written.




------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 84, Issue 26
*********************************************
