Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Preparing TMX files for use in Moses (Tom Hoar)
2. Re: Preparing TMX files for use in Moses (Sašo Kuntaric)
3. Re: Preparing TMX files for use in Moses (Tom Hoar)
----------------------------------------------------------------------
Message: 1
Date: Sun, 13 Mar 2016 18:03:13 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Preparing TMX files for use in Moses
To: moses-support@mit.edu
Message-ID: <56E548F1.60802@precisiontranslationtools.com>
Content-Type: text/plain; charset="windows-1252"
I don't know the tmx2txt.pl script, but I can suggest where to look for
problems.
The most frequent problem we have when extracting data from TMX files
comes from files that don't comply with the TMX specification,
especially in how they use the srclang attribute. The spec
states this about how to identify the source language:
"the <tuv> holding the source segment will have its xml:lang
attribute set to the same value as srclang. (except if srclang is
set to "*all*"). If a <tu> element does not have a srclang attribute
specified, it uses the one defined in the <header> element."
Sadly, many TMX creation tools, including tools from SDL, do not
properly identify the source language. Each tool that looks for the
source language TUV according to the spec handles erroneous TMX segments
in its own way. So, you need to learn how your TMX declares the srclang
attribute, and then study the script to see where there's a mismatch.
You can see how we managed these sloppy TMX files in this post, only a
week old: https://pttools.freshdesk.com/discussions/topics/6000034251
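The check described above can be sketched in Python. This is a minimal illustration, not part of tmx2txt.pl or any Moses tool: the sample TMX content and the helper names (`tuv_lang`, `inspect_tmx`, `source_segments`) are made up for the example, and comparing language codes case-insensitively is an assumption about how a tolerant extractor might behave.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: the header says "en-US" but the <tuv> says "EN-US".
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-US" datatype="plaintext" segtype="sentence"
          adminlang="en-US" creationtool="example" creationtoolversion="1"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="sl-SI"><seg>Pozdravljen, svet.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

def tuv_lang(tuv):
    """Return the language code of a <tuv> (xml:lang, or plain lang as a fallback)."""
    return tuv.get(XML_LANG) or tuv.get("lang")

def inspect_tmx(root):
    """Return the header srclang and a count of each <tuv> language code."""
    header = root.find("header")
    header_srclang = header.get("srclang") if header is not None else None
    langs = Counter(tuv_lang(tuv) for tuv in root.iter("tuv"))
    return header_srclang, langs

def source_segments(root):
    """Yield source segments, using the <tu> srclang or the header value."""
    header_srclang, _ = inspect_tmx(root)
    for tu in root.iter("tu"):
        srclang = tu.get("srclang") or header_srclang
        for tuv in tu.findall("tuv"):
            lang = tuv_lang(tuv)
            # Assumption: compare codes case-insensitively to tolerate sloppy files.
            if srclang and lang and lang.lower() == srclang.lower():
                seg = tuv.find("seg")
                yield "".join(seg.itertext()) if seg is not None else ""

root = ET.fromstring(SAMPLE_TMX)
header_srclang, langs = inspect_tmx(root)
print("header srclang:", header_srclang)
print("tuv languages:", dict(langs))
print("source segments:", list(source_segments(root)))
```

Running the same functions on a real export (for example with `ET.parse(path).getroot()`) shows at a glance whether the header srclang and the <tuv> language codes are spelled the same way.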
Hope this helps.
Tom
On 3/12/2016 8:57 PM, moses-support-request@mit.edu wrote:
> Date: Sat, 12 Mar 2016 13:42:05 +0100
> From: Sašo Kuntaric <saso.kuntaric@gmail.com>
> Subject: [Moses-support] Preparing TMX files for use in Moses
> To: moses-support@mit.edu
>
> Hi all,
>
> I have a question that is not connected directly to Moses. I am trying to
> prepare the corpora for training my engine. I have exported a few of my TMs
> to the TMX format and now I am trying to create two separate UTF-8 text
> files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
> get empty text files for both (the former tool claims that the input file
> can't be read). Are there any special settings I need to use when extracting
> the TMX files? I am using SDL Trados Studio 2015 for exporting the files.
>
> Has anyone come across anything like this?
>
> -- lp, Sašo
------------------------------
Message: 2
Date: Sun, 13 Mar 2016 12:24:48 +0100
From: Sašo Kuntaric <saso.kuntaric@gmail.com>
Subject: Re: [Moses-support] Preparing TMX files for use in Moses
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support@mit.edu
Message-ID:
<CANsquDppD9SbZJJOuKLBNC4R_fJNwfXV5mZrTYGLtSvk1Fo-fg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Thank you for your reply.
It's one of those mistakes that is hard to admit because it is so
trivial: I mistyped the language code (EN-US instead of en-US), since
I am mostly a Windows user. The script works fine now, and I can
confirm that it works well with Studio-exported TMX files.
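The mix-up described above (EN-US typed where the file says en-US) can be caught before running an extraction script. A minimal sketch, assuming the language codes present in the TMX are already known; `resolve_lang` is a hypothetical helper, not an option of tmx2txt.pl:

```python
def resolve_lang(requested, available):
    """Return the code from `available` that matches `requested` ignoring case.

    Returns None when no code matches, so the caller can report the
    codes that actually occur in the file.
    """
    by_folded = {code.casefold(): code for code in available}
    return by_folded.get(requested.casefold())

available = ["en-US", "sl-SI"]
print(resolve_lang("EN-US", available))  # en-US
print(resolve_lang("de-DE", available))  # None
```

Passing the resolved spelling to a case-sensitive tool avoids silently getting empty output files.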
I do have another question, about training the truecaser. The example
on the Moses homepage uses a truecase-model.en file, but that file is
downloaded with the example files. If I want to train a truecaser for
Slovenian, how do I get the truecase-model file? Is it something I need
to create myself, and if so, how do I go about it?
Thanks in advance for the replies.
Best regards,
Sašo
--
lp,
Sašo
------------------------------
Message: 3
Date: Sun, 13 Mar 2016 19:43:30 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Preparing TMX files for use in Moses
To: moses-support@mit.edu
Message-ID:
<80CB7B1F-6EB2-4459-9BC9-387D5C5E19E7@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"
I don't use the truecaser, but it works much like the recaser, so I'd start there. Training the recaser starts with preparing a monolingual corpus in the target language.
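To make the idea of a casing model trained from a monolingual corpus concrete, here is a simplified Python sketch. It is not the Moses implementation (Moses provides its own training script, train-truecaser.perl, under scripts/recaser/). The function names and the tiny tokenized Slovenian corpus are invented for the example, and the rule of recasing only the sentence-initial token is a simplifying assumption.

```python
from collections import Counter, defaultdict

def train_truecase_model(sentences):
    """Map each lowercased word to its most frequent casing.

    The first token of every sentence is skipped, because its capital
    letter usually reflects position rather than the word's own case.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}

def truecase(sentence, model):
    """Recase only the sentence-initial token; unknown words stay as written."""
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        tokens[0] = model.get(first.lower(), first)
    return " ".join(tokens)

# Invented, already tokenized monolingual corpus in the target language.
corpus = [
    "Danes je Ljubljana lepa .",
    "Potem je bila Ljubljana mokra .",
    "To je danes dobro .",
    "Tudi danes je tako .",
]
model = train_truecase_model(corpus)
print(truecase("Danes je sonce .", model))
print(truecase("Ljubljana je mesto .", model))
```

In this toy corpus "danes" only occurs in lowercase inside sentences, so a sentence-initial "Danes" is lowercased, while "Ljubljana" keeps its capital because that is its most frequent form.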
------------------------------
End of Moses-support Digest, Vol 113, Issue 40
**********************************************