Moses-support Digest, Vol 91, Issue 46

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Problems with segmentation mismatch and many unknown words
for Chinese translation (Gideon Wenniger)
2. Feature Function (Jianri Li)
3. Fwd:Feature Function (Jianri Li)
4. Re: Problems with segmentation mismatch and many unknown
words for Chinese translation (Prashant Mathur)


----------------------------------------------------------------------

Message: 1
Date: Wed, 28 May 2014 16:45:25 +0200
From: Gideon Wenniger <gemdbw@hotmail.com>
Subject: [Moses-support] Problems with segmentation mismatch and many
unknown words for Chinese translation
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <DUB118-W443331B2AFD5F63C7D25CED1250@phx.gbl>
Content-Type: text/plain; charset="iso-2022-jp"

Dear Sir/Madam,
As my Machine Translation research focuses mainly on improving word order,
I would like to run experiments with Chinese, i.e. Chinese-English, as much work
on reordering has focused on this language pair and it would be good
to compare against it.

Unfortunately, while I have been able to go through all the steps of pre-processing
and data preparation, my Hiero baseline system is so far performing really badly.
While most researchers have reported scores around 30 BLEU when training on
the LDC Hong Kong Hansards, Laws & News corpus and evaluating on the various
NIST test sets, I get scores of only around 20 BLEU.
Furthermore, the high unknown-word rate I obtain suggests that something is going
wrong in the segmentation, or that there is a mismatch between the training and test
data, but I don't know why.
What I have been doing so far is pretty standard, I believe:

================

- Train on Hong Kong Hansards, Laws & News; test on NIST (News) data

- I converted all the Hansards data to Unicode, and converted from Traditional Chinese to
Simplified Chinese using a conversion table I found online (from http://www.mandarintools.com/)

- For tokenization of English I simply use the tokenization script from the Moses codebase

- I use the Stanford segmenter with the Peking University CRF model to do the segmentation of Chinese

- I use MGIZA++ to do the word alignment, with the grow-diag-final-and heuristic

- I evaluate using MultEval, with 4 references

================
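The unknown-word problem described above can be quantified before decoding at all. Below is a minimal sketch (file paths and names would be placeholders, not the actual corpus files) that computes the out-of-vocabulary rate of a segmented test set against the training vocabulary:

```python
# Compute the OOV (unknown-word) rate of a tokenized test set
# against the vocabulary of a tokenized training corpus.

def vocab(lines):
    """Set of token types in whitespace-tokenized lines."""
    v = set()
    for line in lines:
        v.update(line.split())
    return v

def oov_rate(train_lines, test_lines):
    """Fraction of test tokens that never occur in the training data."""
    train_vocab = vocab(train_lines)
    total = unknown = 0
    for line in test_lines:
        for tok in line.split():
            total += 1
            if tok not in train_vocab:
                unknown += 1
    return unknown / total if total else 0.0
```

Running this over the segmented training and test files before alignment and decoding would show immediately whether the segmenter, the Traditional-to-Simplified conversion, or a genuine corpus mismatch is inflating the unknown-word count.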

Does anybody have any ideas about what could be going wrong?
Are there some important details about preprocessing Chinese before feeding it to the Stanford segmenter that I might have missed?

One minor detail I discovered is that there are two types of commas (the Chinese full-width one and the European one) throughout the data.
I could normalize them, but it should not make too much of a difference, as commas are always single tokens anyway.
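If one did decide to normalize, the full-width Chinese comma is U+FF0C and the enumeration comma is U+3001; a minimal sketch of such a normalization follows (which marks to map, and whether to normalize at all, is a judgment call, and this table is only an illustration):

```python
# Map common full-width Chinese punctuation to ASCII counterparts.
# This mapping is illustrative, not a standard preprocessing recipe.
FULLWIDTH_TO_ASCII = {
    "\uFF0C": ",",   # fullwidth comma
    "\u3001": ",",   # ideographic (enumeration) comma
    "\u3002": ".",   # ideographic full stop
    "\uFF1F": "?",   # fullwidth question mark
    "\uFF01": "!",   # fullwidth exclamation mark
}

def normalize_punct(text):
    """Replace full-width punctuation characters with ASCII equivalents."""
    return "".join(FULLWIDTH_TO_ASCII.get(ch, ch) for ch in text)
```

Whatever normalization is chosen, it must be applied identically to training, tuning, and test data, or it will itself create a vocabulary mismatch.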

Also, I noticed that some words are apparently spelled differently in the Hong Kong data than in the NIST data, e.g.
the test data contains:
?? ??
Doha Deadlock

And Google Translate suggests:
"Did you mean" ?? ??

As it turns out, the version suggested by Google Translate (??) is present 49 times in the training data, while the one found in the
relevant test sentence (??) occurs 0 times.
But I have no idea whether this issue of alternative spellings for the same word is specific to the data I use, whether it is the real problem (and if so, how to solve it), or whether it is something else entirely.
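Checking how often each spelling variant occurs in the training data can be scripted; a small sketch is below (the variant strings in the usage example are hypothetical placeholders, since the actual characters did not survive the mail encoding):

```python
from collections import Counter

def token_counts(lines):
    """Count occurrences of every token type in whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def compare_variants(train_lines, variants):
    """Report how often each candidate spelling variant appears in training."""
    counts = token_counts(train_lines)
    return {v: counts[v] for v in variants}
```

For example, compare_variants(train, ["variantA", "variantB"]) would show whether the test-set spelling ever occurs in training; mapping rare variants onto a frequent counterpart is one possible (lossy) workaround, but only if the variants really are the same word.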

I would be very grateful for any help, hints or tips from people who have more knowledge of Chinese or more experience with the preprocessing, the relevant issues and their solutions.

Kind regards,

Gideon Wenniger





------------------------------

Message: 2
Date: Wed, 28 May 2014 23:50:21 +0900
From: Jianri Li <skywalker@postech.ac.kr>
Subject: [Moses-support] Feature Function
To: moses-support@mit.edu
Message-ID: <1401288621123.93663.postech@postech.ac.kr>
Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20140528/d52a2abb/attachment-0001.htm

------------------------------

Message: 3
Date: Thu, 29 May 2014 00:12:15 +0900
From: Jianri Li <skywalker@postech.ac.kr>
Subject: [Moses-support] Fwd:Feature Function
To: moses-support@mit.edu
Message-ID: <1401289935913.95344.postech@postech.ac.kr>
Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20140529/d115641c/attachment-0001.htm

------------------------------

Message: 4
Date: Wed, 28 May 2014 17:19:37 +0200
From: Prashant Mathur <prashant@fbk.eu>
Subject: Re: [Moses-support] Problems with segmentation mismatch and
many unknown words for Chinese translation
To: Gideon Wenniger <gemdbw@hotmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAK3pNhJjBpzsY+nY+eeNaLRr_3MaQxnGJ5R-nceS1P0Vif+gyA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Gideon,

I am also running Chinese-English experiments, and I am getting BLEU
scores of around 9 points; but in my case I am using one particular training
corpus which doesn't have a high overlap with the NIST test sets, so it is
bound to give low scores, at least in my case.
Could you re-check that you are using exactly the same corpora as the ones in the
papers that you say report BLEU scores of over 30, and that you use the
same type of models?

Regarding the unknown words, I am having the same problem.

I would also like to hear from people experienced with the ZH-EN pair whether there
are more preprocessing steps needed beyond the usual ones.
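One candidate extra step, given the Traditional-to-Simplified conversion mentioned earlier in the thread, is to scan the converted corpus for residual Traditional forms. A minimal sketch follows; the sample character set is tiny and purely illustrative, and a real check would use a full conversion table such as the mandarintools one:

```python
# Scan a supposedly Simplified-Chinese corpus for leftover Traditional
# characters. The sample set below is a tiny illustration; a real
# check should use a full Traditional->Simplified mapping table.
TRADITIONAL_SAMPLE = set("國學會語書體東馬門")  # Simplified: 国 学 会 语 书 体 东 马 门

def find_traditional(lines, traditional_chars=TRADITIONAL_SAMPLE):
    """Return (index, line) pairs still containing sampled Traditional characters."""
    hits = []
    for i, line in enumerate(lines):
        if any(ch in traditional_chars for ch in line):
            hits.append((i, line))
    return hits
```

If such lines turn up after conversion, the conversion table's coverage (or the pipeline order of conversion vs. segmentation) would be worth re-checking, since every unconverted character multiplies the unknown-word rate downstream.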

Thanks,
Prashant



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 91, Issue 46
*********************************************
