Moses-support Digest, Vol 113, Issue 44

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Preparing TMX files for use in Moses (Per Tunedal)
2. Re: apostrophe: detokenization or corpus issue ? (Vincent Nguyen)
3. Can moses.ini include all decoding parameters? (Yuqi Zhang)
4. Fwd: Moses-support post from btpg71@gmail.com requires
approval (Hieu Hoang)


----------------------------------------------------------------------

Message: 1
Date: Mon, 14 Mar 2016 09:05:08 +0100
From: Per Tunedal <per.tunedal@operamail.com>
Subject: Re: [Moses-support] Preparing TMX files for use in Moses
To: moses-support@mit.edu
Message-ID:
<1457942708.2908001.548276514.7AF6AFBA@webmail.messagingengine.com>
Content-Type: text/plain; charset="us-ascii"

Hi,
I had some problems with TMX extraction scripts and wrote my own. You might find it useful:

https://github.com/havet/TMX2Moses

It simply disregards the specification in the header and reads the
source and target language from the <tu> elements.

Works on single TMX-files as well as on folders containing TMX-files.
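
As a rough illustration of that approach (this is not the code from TMX2Moses), the Python sketch below ignores the header's srclang and pairs segments by the xml:lang of each <tuv>. The function name extract_pairs, the sample data, and the matching on the primary language subtag are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: the header's srclang does not match any <tuv>.
SAMPLE = """<tmx version="1.4">
  <header srclang="de-DE" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Hello world</seg></tuv>
      <tuv xml:lang="sl-SI"><seg>Pozdravljen, svet</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs, ignoring the header's srclang."""
    pairs = []
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is None:
                continue
            # Join text across inline tags and collapse whitespace.
            text = " ".join("".join(seg.itertext()).split())
            # Key on the primary subtag so "en-US" counts as "en".
            segments[lang.split("-")[0]] = text
        if segments.get(src) and segments.get(tgt):
            pairs.append((segments[src], segments[tgt]))
    return pairs


pairs = extract_pairs(SAMPLE, "en", "sl")
```

Writing the first element of each pair to one file and the second to another, one segment per line, gives the two parallel text files that Moses training expects.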

Yours,
Per Tunedal

On Sun, Mar 13, 2016, at 12:03, Tom Hoar wrote:
> I don't know the tmx2txt.pl script, but I can suggest where to look
> for problems.
>
> The most frequent problem we have when extracting data from TMX
> files comes from files that don't comply with the TMX specification,
> especially with regard to the srclang attribute. The spec states
> this about how to identify the source language:
>
>> "the <tuv> holding the source segment will have its xml:lang
>> attribute set to the same value as srclang. (except if srclang is
>> set to "*all*"). If a <tu> element does not have a srclang
>> attribute specified, it uses the one defined in the <header>
>> element."
>
> Sadly, many TMX creation tools, including tools from SDL, do not
> properly identify the source language. Each tool that looks for the
> source language TUV according to the spec handles erroneous TMX
> segments in its own way. So, you need to learn how your TMX declares
> the srclang attribute, and then study the script to see where
> there's a mismatch.
>
> You can see how we managed these sloppy TMX files in this post, only
> a week old:
> https://pttools.freshdesk.com/discussions/topics/6000034251
>
> Hope this helps.
>
> Tom
>
> On 3/12/2016 8:57 PM, moses-support-request@mit.edu wrote:
>> Date: Sat, 12 Mar 2016 13:42:05 +0100
>> From: Sašo Kuntaric <saso.kuntaric@gmail.com>
>> Subject: [Moses-support] Preparing TMX files for use in Moses
>> To: moses-support@mit.edu
>>
>> Hi all,
>>
>> I have a question that is not directly connected to Moses. I am
>> trying to prepare the corpora for training my engine. I have exported
>> a few of my TMs to the TMX format and now I am trying to create two
>> separate UTF-8 text files. I have tried it with the
>> extract-tmx-corpus and tmx2txt.pl tools. I get empty text files from
>> both (the former tool claims that the input file can't be read). Are
>> there any special settings I need to set when extracting the TMX
>> files? I am using SDL Trados Studio 2015 for exporting the files.
>>
>> Has anyone come across anything like this?
>>
>> --
>> Best regards,
>>
>> Sašo
>>
>
> _________________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160314/f926e8e2/attachment-0001.html

------------------------------

Message: 2
Date: Mon, 14 Mar 2016 10:01:58 +0100
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] apostrophe: detokenization or corpus
issue ?
To: Philipp Koehn <phi@jhu.edu>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <56E67E06.3050100@neuf.fr>
Content-Type: text/plain; charset="utf-8"


I think I found the culprit.
This is very tricky: it's not a detokenizer issue but a
"normalize-punctuation | tokenizer" issue.

The normalize-punctuation script converts the special apostrophe (UTF-8
sequence E2 80 99) to a plain ' only when it is surrounded by [a-z] on
both sides:

s/([a-z])‘([a-z])/$1\'$2/gi;
s/([a-z])’([a-z])/$1\'$2/gi;

The problem is that when the apostrophe is followed by an accented
character like é or â (UTF-8 sequences C3 A9 and C3 A2), these rules
do not match. The script then converts the remaining apostrophes to
double quotes ("):

s/‘/\"/g;
s/‚/\"/g;
s/’/\"/g;

Either we need to correct the [a-z] character class, or maybe change the
last three conversions so that they produce the regular ' no matter what.
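
To make the failure mode concrete, here is a small Python sketch of the behaviour described above. It is not the Moses normalize-punctuation script. Both function names are hypothetical, and the second function is only one possible way to widen the character class to any Unicode letter.

```python
import re

APOSTROPHE = "\u2019"  # right single quotation mark, UTF-8 bytes E2 80 99


def normalize_ascii_only(text):
    """Mimic the described rules: [a-z] context first, double quote otherwise."""
    text = re.sub("([a-z])" + APOSTROPHE + "([a-z])", r"\1'\2", text,
                  flags=re.IGNORECASE)
    # Any apostrophe that did not match the rule above becomes a double quote.
    return text.replace(APOSTROPHE, '"')


def normalize_any_letter(text):
    """Variant (assumption): accept any Unicode letter around the apostrophe."""
    letter = r"[^\W\d_]"  # word character that is neither a digit nor "_"
    text = re.sub("(" + letter + ")" + APOSTROPHE + "(" + letter + ")",
                  r"\1'\2", text)
    return text.replace(APOSTROPHE, '"')


plain = "l" + APOSTROPHE + "avion"        # apostrophe followed by ASCII "a"
accented = "d" + APOSTROPHE + "\u00e2ge"  # apostrophe followed by "â"

ascii_plain = normalize_ascii_only(plain)
ascii_accented = normalize_ascii_only(accented)
unicode_accented = normalize_any_letter(accented)
```

With the ASCII-only class, the accented case falls through to the double-quote rule. With the wider class, it receives a plain apostrophe like the unaccented case does.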

Hope this is clear.



On 10/03/2016 13:00, Philipp Koehn wrote:
> Hi,
>
> I do not think that the detokenizer would cause conversion of ' to ".
> You can check the raw output of the decoder, and see how it is
> changed by the detokenizer.
>
> -phi
>
> On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen <vnguyen@neuf.fr
> <mailto:vnguyen@neuf.fr>> wrote:
>
> Hi,
>
> I got the following situation:
>
> This group age
> is sometimes translated as:
> ce groupe d'âge (correct)
> ce groupe d" âge (incorrect)
> ce groupe d "âge (incorrect)
>
> I am wondering if this is more a detokenizer issue or a corpus
> issue, or both.
>
> Technically in French, there shouldn't be any space before or
> after the apostrophe.
> In the Europarl Corpus, as well as in the News2014 one, there are some
> instances with a space before or after.
>
> Then I have the feeling that the decoder gets a &apos; with
> surrounding spaces, leading the detokenizer to transform it into "
>
> Anyone with a similar issue ?
>
> thanks.
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160314/080ff5f3/attachment-0001.html

------------------------------

Message: 3
Date: Mon, 14 Mar 2016 11:35:14 +0100
From: Yuqi Zhang <zhang.yuqiyu@gmail.com>
Subject: [Moses-support] Can moses.ini include all decoding
parameters?
To: moses-support@mit.edu
Message-ID:
<CADF5gOZx6N9NH8BgAUL2GDK+7ykT71Rvfc7yELqUzhXGCu_NLw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi All,

Is moses.ini the parameter file for ALL parameters of the decoding
process?
E.g., can I also set cube-pruning parameters in moses.ini, like this?

[search-algorithm]
1
[cube-pruning-pop-limit]
2000
[s]
2000
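
The layout used above, a bracketed parameter name followed by its values on the next lines, can be read with a few lines of Python. This is only a sketch of that layout, not the Moses parser. The function name parse_ini_sections and the treatment of lines starting with "#" as comments are assumptions made for the example.

```python
EXAMPLE = """\
[search-algorithm]
1
[cube-pruning-pop-limit]
2000
[s]
2000
"""


def parse_ini_sections(text):
    """Map each [name] header to the list of value lines that follow it."""
    params = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and (assumed) "#" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            params[current] = []
        elif current is not None:
            params[current].append(line)
    return params


params = parse_ini_sections(EXAMPLE)
```

Here params maps "search-algorithm" to ["1"], and both "cube-pruning-pop-limit" and "s" to ["2000"].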

Thanks!
Best regards,
Yuqi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160314/9c356879/attachment-0001.html

------------------------------

Message: 4
Date: Mon, 14 Mar 2016 10:42:40 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: [Moses-support] Fwd: Moses-support post from btpg71@gmail.com
requires approval
To: btpg71@gmail.com, moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbjnovyH7_=voWEybBQhi1PSo6cXfSSE7LLNm1CLefdfhg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Please subscribe to the Moses mailing list before posting to it. You can
subscribe here:
http://mailman.mit.edu/mailman/listinfo/moses-support
To answer your question, each language supported by the tokenizer has its
own file in
scripts/share/nonbreaking_prefixes
There is currently no file for Hindi. If you create one, please consider
sharing it with everyone.


Hieu Hoang
http://www.hoang.co.uk/hieu

---------- Forwarded message ----------
From: <moses-support-owner@mit.edu>
Date: 13 March 2016 at 18:13
Subject: Moses-support post from btpg71@gmail.com requires approval
To: moses-support-owner@mit.edu


As list administrator, your authorization is requested for the
following mailing list posting:

List: Moses-support@mit.edu
From: btpg71@gmail.com
Subject: Regarding moses
Reason: Post by non-member to a members-only list

At your convenience, visit:

http://mailman.mit.edu/mailman/admindb/moses-support

to approve or deny the request.


---------- Forwarded message ----------
From: Parul gupta <btpg71@gmail.com>
To: moses-support@mit.edu
Cc:
Date: Sun, 13 Mar 2016 23:43:45 +0530
Subject: Regarding moses
Hello sir,

I'm working with Moses and I'm having a problem with Hindi tokenization.
mosesdecoder reports that there are no abbreviations for 'hi'. How can I
tokenize Hindi?

Thanks !


---------- Forwarded message ----------
From: moses-support-request@mit.edu
To:
Cc:
Date:
Subject: confirm 39584649f4a0aa41db998e82e6d0b7f74c9d70fc
If you reply to this message, keeping the Subject: header intact,
Mailman will discard the held message. Do this if the message is
spam. If you reply to this message and include an Approved: header
with the list password in it, the message will be approved for posting
to the list. The Approved: header can also appear in the first line
of the body of the reply.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160314/43b4d8dd/attachment.html

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 113, Issue 44
**********************************************
