Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: XLIFF support in the M4Loc project (Achim Ruopp)
----------------------------------------------------------------------
Message: 1
Date: Tue, 22 Oct 2013 23:09:22 -0400
From: "Achim Ruopp" <achimru@gmail.com>
Subject: Re: [Moses-support] XLIFF support in the M4Loc project
To: "'John Tinsley'" <jtinsley@computing.dcu.ie>, "'Tom Hoar'"
<tahoar@precisiontranslationtools.com>
Cc: moses-support@mit.edu
Message-ID: <00c601cecf9d$46aa3ce0$d3feb6a0$@com>
Content-Type: text/plain; charset="us-ascii"
Hi John,
M4Loc/Okapi can only deal with markup surrounding tokens. In fact, to work
properly, the markup has to be separated from tokens with whitespace, which
the tokenizer wrapper wrap_tokenizer.pm does as part of the overall m4loc.pm
umbrella script. So in your example:
<g id="1"> A </g> <g id="2"> 4 </g>
Shouldn't "A" and "4" be considered two separate tokens for the purpose of
MT?
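A minimal sketch of what that whitespace separation looks like, assuming a
simple regex is enough to find the tags. This is not the wrap_tokenizer.pm
implementation; the function name space_out_tags is made up for illustration.

```python
import re

# Any XML-style tag: "<", then anything up to the next ">".
TAG = re.compile(r"<[^>]+>")

def space_out_tags(segment: str) -> str:
    """Pad every tag with spaces, then collapse runs of whitespace."""
    padded = TAG.sub(lambda m: f" {m.group(0)} ", segment)
    return " ".join(padded.split())

print(space_out_tags('<g id="1">A</g><g id="2">4</g>'))
```

Once the tags are padded like this, a whitespace tokenizer sees "A" and "4" as
separate tokens.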
It might be worth investigating if the whole construct can be replaced with
a placeholder (recently added to Moses). However, placeholders and markup
handling with M4Loc/Okapi likely won't play nicely together yet. There is a
work item in the M4Loc issue tracker:
http://code.google.com/p/m4loc/issues/detail?id=45
You also might want to try the tag preservation method that leaves tags in
place during the decoding process (m4loc.pm option "-o t"). This would
certainly preserve the tag order in your example, but might lead to lower
translation quality overall (although some recent tests have shown it to
perform pretty well on some test data in terms of BLEU).
Achim
From: johntins@gmail.com [mailto:johntins@gmail.com] On Behalf Of John
Tinsley
Sent: Tuesday, October 22, 2013 8:23 AM
To: Tom Hoar; Achim Ruopp
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] XLIFF support in the M4Loc project
Hi folks,
I tried the chain of tools in M4Loc/Okapi and it worked with relative
success (it solved the original issue I had) but there seems to be one type
of markup it cannot manage, for example:
If we have a token, for example A4, that is marked-up in the following way:
<g id="1">A</g><g id="2">4</g>
the word alignment information is not sufficient to reinsert these tags
because there are tags *within* the token. So the output we get is like the
following:
<g id="1">A4</g><g id="2"></g>
i.e. the whole token is wrapped in the first tag and the second tag is
either empty or wrapping the next word (incorrectly). This can have a
knock-on effect if there are more tags in the same sentence.
Is this known/solved somehow or am I out of luck? The only possible solution
I can imagine would be using some sort of character-based alignment to
reinsert the tags...
Cheers,
John
On 16 October 2013 16:47, Tom Hoar <tahoar@precisiontranslationtools.com>
wrote:
Hi John,
If you're looking to completely remove these inline elements, you remove the
tags, then unescape their contents, and run a second pass to remove the HTML
tags. That works if the contents of the bpt/ept tags are HTML. However, they
could be RTF or some other markup language. We've found it's safe to simply
use a regex pattern to remove everything between the <bpt> .... </bpt> and
<ept> .... </ept>.
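A rough sketch of such a regex pass, assuming bpt/ept elements are never
nested inside one another. The pattern and the name strip_native_codes are
illustrative, not taken from any existing tool.

```python
import re

# A whole <bpt>...</bpt> or <ept>...</ept> element, content included.
# Non-greedy so neighbouring elements are matched one at a time;
# DOTALL in case an element spans a line break.
NATIVE_CODE = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)

def strip_native_codes(segment: str) -> str:
    """Remove <bpt>/<ept> elements together with everything inside them."""
    return NATIVE_CODE.sub("", segment)

segment = 'das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>'
print(strip_native_codes(segment))
```

Because the content is discarded along with the tags, it does not matter
whether it is HTML, RTF, or something else.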
These tags are not generated by Okapi, but other tools do create them. So if
you're looking to regenerate these and other tags created by other tools in
the translated output, I think you're out of luck for now. We're developing
a tool that supports all XLIFF 1.2 inline elements during translation, but
it will not be published as open source. It's scheduled for completion by
the end of the year.
Hi Achim,
Can you verify the "lb" tag you included in your list? I reviewed the XLIFF
1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec that
was published yesterday:
https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf
It's a significant departure from 1.2! Any/all solutions
that were developed for 1.2's inline elements will need to be totally
re-thought and re-written.
On 10/16/2013 09:20 PM, Achim Ruopp wrote:
Hi John,
The M4Loc tool chain only handles a subset of XLIFF inline tags generated by
the Okapi Moses Text Filter
http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter
The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk
If you aren't using the Okapi tools, you can still use their library, I
believe, to convert
das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>
into
das ist ein <g id="1">kleines haus</g>
and apply the reverse process to the translation.
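To illustrate the conversion itself (not the Okapi library, whose API is not
shown in this thread), here is a plain-regex sketch for the simple case of one
non-nested bpt/ept pair per id. The name bpt_ept_to_g is hypothetical.

```python
import re

# <bpt id="N">...</bpt>, the enclosed text, then <ept id="N">...</ept>
# with the same id as the opener.
PAIRED = re.compile(
    r'<bpt\s+id="(?P<id>[^"]+)"[^>]*>.*?</bpt>'
    r'(?P<text>.*?)'
    r'<ept\s+id="(?P=id)"[^>]*>.*?</ept>',
    re.DOTALL,
)

def bpt_ept_to_g(segment: str) -> str:
    """Rewrite each bpt/ept pair as a single <g id="N">...</g> element."""
    return PAIRED.sub(r'<g id="\g<id>">\g<text></g>', segment)

segment = 'das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>'
print(bpt_ept_to_g(segment))
```

This sketch throws away the native codes inside bpt/ept, so reversing it on
the translation would require storing them somewhere, keyed by id, before the
substitution.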
Alternatively you could modify M4Loc to handle all XLIFF inline tagging
http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine
But I think that this would be messier, and by using Okapi you also get
future XLIFF (e.g. 2.0) support.
Achim
From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
On Behalf Of John Tinsley
Sent: Wednesday, October 16, 2013 7:20 AM
To: moses-support@mit.edu
Subject: [Moses-support] XLIFF support in the M4Loc project
Hi folks,
I'm having a little trouble with XLIFF handling using some of the M4Loc
tools, specifically 'reinsert.pm' for replacing inline markup after
translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).
It works fine for simple tags where the text between the tags *should* be
translated, e.g.
src: das ist ein <bx id="1">kleines haus</bx>
tgt: this is |0-1| a |2-2| small |3-3| house |4-4|
output: this is a <bx id="1"> small house </bx>
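For readers unfamiliar with the idea, the reinsertion above can be sketched
roughly as follows. This is not the reinsert.pm algorithm; it assumes a single
tag pair, a known source span for that pair (word indices 3-4 here), and
monotone translation. The names parse_trace and reinsert are hypothetical.

```python
import re

# One target phrase followed by its source span, e.g. "this is |0-1|".
TRACE = re.compile(r"(.*?)\s*\|(\d+)-(\d+)\|")

def parse_trace(trace):
    """Return [(target_words, src_start, src_end), ...] from the trace."""
    return [
        (m.group(1).split(), int(m.group(2)), int(m.group(3)))
        for m in TRACE.finditer(trace)
    ]

def reinsert(trace, open_tag, close_tag, tag_start, tag_end):
    """Wrap target phrases whose source span lies inside the tagged span."""
    out, inside = [], False
    for words, start, end in parse_trace(trace):
        covered = tag_start <= start and end <= tag_end
        if covered and not inside:
            out.append(open_tag)
            inside = True
        elif not covered and inside:
            out.append(close_tag)
            inside = False
        out.extend(words)
    if inside:
        out.append(close_tag)
    return " ".join(out)

tgt = "this is |0-1| a |2-2| small |3-3| house |4-4|"
print(reinsert(tgt, '<bx id="1">', "</bx>", 3, 4))
```

Tag positions here are expressed purely as source word indices, mapped to the
target through the phrase spans in the trace.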
However, there are often examples of paired tags (kind of like markup around
markup) which are not handled, e.g.
das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>
In this case, the <bpt> and <ept> tags are paired, and everything in between
both sets of tags should be stripped out (e.g. <b>), but this doesn't appear
to happen.
Is there another tool in the project that handles this kind of markup or is
it not supported?
Thanks
John
--
John Tinsley
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
--
Dr. John Tinsley
Research Integration Officer
Centre for Next Generation Localisation (CNGL)
Dublin City University
web: http://www.iptranslator.com
email: jtinsley@computing.dcu.ie
phone: +353 (0)1 7006916
------------------------------
End of Moses-support Digest, Vol 84, Issue 33
*********************************************