Moses-support Digest, Vol 84, Issue 32

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: XLIFF support in the M4Loc project (John Tinsley)


----------------------------------------------------------------------

Message: 1
Date: Tue, 22 Oct 2013 13:22:54 +0100
From: John Tinsley <jtinsley@computing.dcu.ie>
Subject: Re: [Moses-support] XLIFF support in the M4Loc project
To: Tom Hoar <tahoar@precisiontranslationtools.com>, Achim Ruopp
<achimru@gmail.com>
Cc: moses-support@mit.edu
Message-ID:
<CAHfkK=5-eLm3FFft3n7GbXtpG4FdmmtXMV2qM9xzoN+nteDVUA@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi folks,

I tried the chain of tools in M4Loc/Okapi and it worked with relative
success (it solved the original issue I had), but there seems to be one type
of markup it cannot manage.

If we have a token, for example *A4*, that is marked up in the following
way:

<g id="1">A</g><g id="2">4</g>

the word alignment information is not sufficient to reinsert these tags
because there are tags *within* the token. So the output we get is like the
following:

<g id="1">A4</g><g id="2"></g>

i.e. the whole token is wrapped in the first tag and the second tag is
either empty or wrapping the next word (incorrectly). This can have a
knock-on effect if there are more tags in the same sentence.

Is this known/solved somehow or am I out of luck? The only possible
solution I can imagine would be using some sort of character-based
alignment to reinsert the tags...

Cheers,
John
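
The character-based idea raised above can be sketched roughly as follows.
This is only a minimal illustration, not part of M4Loc or Okapi: the function
names (split_markup, align_offsets, reinsert) are hypothetical, it handles
only flat, non-nested <g> tags with an id attribute, and it uses Python's
difflib as a stand-in for a real character-level aligner. It assumes the
source and target tokens have already been paired by the word alignment.

```python
import re
from difflib import SequenceMatcher

# Matches <g id="N"> and </g> only; other XLIFF inline tags are out of scope here.
TAG = re.compile(r'<(/?)g(?: id="(\d+)")?>')

def split_markup(marked):
    """Return (plain_text, spans); spans are (id, start, end) character offsets."""
    plain, spans, open_tags = [], [], []
    pos = length = 0
    for m in TAG.finditer(marked):
        text = marked[pos:m.start()]
        plain.append(text)
        length += len(text)
        if m.group(1):                      # closing tag: finish the open span
            tag_id, start = open_tags.pop()
            spans.append((tag_id, start, length))
        else:                               # opening tag: remember where it starts
            open_tags.append((m.group(2), length))
        pos = m.end()
    plain.append(marked[pos:])
    return "".join(plain), spans

def align_offsets(src, tgt):
    """Map source character boundaries to target boundaries via matching blocks."""
    mapping = {0: 0, len(src): len(tgt)}
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size + 1):
            mapping.setdefault(a + k, b + k)
    return mapping

def reinsert(marked_src, tgt):
    """Re-apply flat <g> spans from a marked-up source token onto a target token."""
    plain, spans = split_markup(marked_src)
    mapping = align_offsets(plain, tgt)

    def to_tgt(off):
        # Fall back to the nearest aligned boundary at or before this offset.
        return mapping[max(k for k in mapping if k <= off)]

    out, prev = [], 0
    for tag_id, start, end in sorted(spans, key=lambda s: s[1]):
        ts, te = to_tgt(start), to_tgt(end)
        out.append(tgt[prev:ts])
        out.append(f'<g id="{tag_id}">{tgt[ts:te]}</g>')
        prev = te
    out.append(tgt[prev:])
    return "".join(out)

print(reinsert('<g id="1">A</g><g id="2">4</g>', "A4"))
```

For a token that passes through translation unchanged, such as A4, this
restores both tags at their original character positions. Tokens that are
actually translated would need a proper character or subword alignment
rather than the string matching used here.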


On 16 October 2013 16:47, Tom Hoar <tahoar@precisiontranslationtools.com> wrote:

> Hi John,
>
> If you're looking to completely remove these inline elements, you remove
> the tags, then unescape their contents, and run a second pass to remove the
> html tags. That works if the contents of the bpt/ept tags are html. However
> they could be RTF or some other markup language. We've found it's safe to
> simply use a regex pattern to remove everything between the <bpt> ....
> </bpt> and <ept> .... </ept>.
>
> These tags are not generated by Okapi, but other tools do create them. So
> if you're looking to regenerate these and other tags created by other tools
> in the translated output, I think you're out of luck for now. We're
> developing a tool that supports all XLIFF 1.2 inline elements during
> translation, but it will not be published as open source. It's scheduled
> for completion by the end of the year.
>
> Hi Achim,
>
> Can you verify the "lb" tag you included in your list? I reviewed the
> XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec
> that was published yesterday:
> https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf.
> It's a significant departure from 1.2! Any/all solutions that were
> developed for 1.2's inline elements will need to be totally re-thought and
> re-written.
>
>
>
>
> On 10/16/2013 09:20 PM, Achim Ruopp wrote:
>
> Hi John,
>
> The M4Loc tool chain only handles a subset of XLIFF inline tags generated
> by the Okapi Moses Text Filter
>
> http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter
>
> The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk
>
> If you aren't using the Okapi tools, you can still use their library, I
> believe, to convert
>
> das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>
>
> into
>
> das ist ein <g id="1">kleines haus</g>
>
> and apply the reverse process to the translation.
>
> Alternatively you could modify M4Loc to handle all XLIFF inline tagging
>
> http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine
>
> But I think that this would be more messy and with using Okapi you also
> get future XLIFF (e.g. 2.0) support.
>
> Achim
>
> From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu] On Behalf Of John Tinsley
> Sent: Wednesday, October 16, 2013 7:20 AM
> To: moses-support@mit.edu
> Subject: [Moses-support] XLIFF support in the M4Loc project
>
> Hi folks,
>
> I'm having a little trouble with XLIFF handling using some of the M4Loc
> tools, specifically 'reinsert.pm' for replacing inline markup after
> translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).
>
> It works fine for simple tags where the text between the tags *should* be
> translated, e.g.
>
> src: das ist ein <bx id="1">kleines haus</bx>
> tgt: this is |0-1| a |2-2| small |3-3| house |4-4|
>
> output: this is a <bx id="1"> small house </bx>
>
> However, there are often examples of paired tags (kind of like markup
> around markup) which are not handled, e.g.
>
> das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>
>
> In this case, the <bpt> and <ept> tags are paired, and everything in
> between both sets of tags should be stripped out, e.g. &lt;b&gt;, but
> this doesn't appear to be the case.
>
> Is there another tool in the project that handles this kind of markup or
> is it not supported?
>
> Thanks
>
> John
>
> --
> John Tinsley
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
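
Tom's suggestion in the quoted thread, removing everything between
<bpt> ... </bpt> and <ept> ... </ept> with a regex, can be sketched as below.
This is a minimal illustration only, not code from M4Loc or Okapi: the
pattern and the function name strip_bpt_ept are assumptions, and it discards
the tags, so they cannot be regenerated in the translated output.

```python
import re

# Lazily match a whole <bpt ...>...</bpt> or <ept ...>...</ept> element,
# including its escaped native-markup content.
BPT_EPT = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)

def strip_bpt_ept(segment):
    """Remove <bpt>/<ept> elements and their contents from an XLIFF segment."""
    return BPT_EPT.sub("", segment)

source = ('das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus'
          '<ept id="1">&lt;/b&gt;</ept>')
print(strip_bpt_ept(source))
```

On the example segment from the thread this leaves just the translatable
text, "das ist ein kleines haus".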


--
Dr. John Tinsley
Research Integration Officer
Centre for Next Generation Localisation (CNGL)
Dublin City University

web: http://www.iptranslator.com
email: jtinsley@computing.dcu.ie
phone: +353 (0)1 7006916
--

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 84, Issue 32
*********************************************
