Moses-support Digest, Vol 92, Issue 17

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. What preprocessing scripts to use at what stage and in what
order for Chinese-English translation? (Gideon Wenniger)


----------------------------------------------------------------------

Message: 1
Date: Fri, 6 Jun 2014 17:20:58 +0200
From: Gideon Wenniger <gemdbw@hotmail.com>
Subject: [Moses-support] What preprocessing scripts to use at what
stage and in what order for Chinese-English translation?
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <DUB118-W32543605E0E7AD838A8B1D12C0@phx.gbl>
Content-Type: text/plain; charset="windows-1256"

Dear Moses support,
This is a follow-up to my earlier mail,
"Problems with segmentation mismatch and many unknown words for Chinese translation?".
I have since run some experiments with the MultiUN data, which is already in Simplified Chinese.
Results have improved a bit, but I still have problems, particularly with numbers and punctuation.

As Vincent Wang pointed out, there is a script "escape-special-chars.perl" in the
/mosesdecoder/scripts/tokenizer directory, which could make a difference.
There are actually more scripts in that directory:
  deescape-special-chars.perl
  detokenizer.perl
  escape-special-chars.perl
  lowercase.perl
  normalize-punctuation.perl
  replace-unicode-punctuation.perl
  tokenizer.perl

I was wondering if anybody could tell me which of these scripts to use, and at what stage in the preprocessing pipeline.

My own best guess is that normalize-punctuation.perl is possibly essential for Moses but optional for other decoders such as Joshua,
because of the different grammar format.

I also guess that it helps to run replace-unicode-punctuation.perl followed by normalize-punctuation.perl on the lowercased input
before feeding it to the segmenter or tokenizer.
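
To make the order in this guess concrete, here is a minimal Python sketch that only assembles the three stages into a pipeline string for inspection. The order is my guess and not a confirmed Moses recipe. The directory argument is a placeholder, the helper names build_commands and as_shell_pipeline are made up for illustration, and I assume each script reads stdin and writes stdout. Any language options the scripts accept are deliberately left out, since I am not sure of them.

```python
import shlex
from pathlib import Path

# Order taken from the guess above; it is NOT a confirmed Moses recipe.
GUESSED_ORDER = [
    "lowercase.perl",
    "replace-unicode-punctuation.perl",
    "normalize-punctuation.perl",
]


def build_commands(tokenizer_dir, order=GUESSED_ORDER):
    """Return one argv list per stage, e.g. ['perl', '<dir>/lowercase.perl'].

    No script options are passed; add them after checking each script.
    """
    base = Path(tokenizer_dir)
    return [["perl", (base / name).as_posix()] for name in order]


def as_shell_pipeline(commands):
    """Render the stages as a single 'a | b | c' string for inspection."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Placeholder path; point it at a local Moses checkout.
    cmds = build_commands("mosesdecoder/scripts/tokenizer")
    print(as_shell_pipeline(cmds))
```

The output of this chain would then go to the segmenter or tokenizer. Changing the order to test a different hypothesis only requires editing GUESSED_ORDER.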

Does anybody know whether this understanding is right, or whether these scripts should be used in another way, in particular for Chinese-English translation?
(Documentation on these scripts seems to be limited. I also searched with "grep" but could not find where these scripts are used as part of any larger
preprocessing script in the Moses codebase.)
Thanks in advance.

Kind regards,

Gideon Wenniger


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 92, Issue 17
*********************************************
