Moses-support Digest, Vol 91, Issue 49

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Problems with segmentation mismatch and many unknown
words for Chinese translation (Hieu Hoang)
2. Re: Problems with segmentation mismatch and many unknown
words for Chinese translation (Matthias Huck)
3. Re: Problems with segmentation mismatch and many unknown
words for Chinese translation (Tom Hoar)


----------------------------------------------------------------------

Message: 1
Date: Thu, 29 May 2014 14:24:29 +0100
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Problems with segmentation mismatch and
many unknown words for Chinese translation
To: Gideon Wenniger <gemdbw@hotmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAEKMkbi9uQV8iXuDLPVbZRHxinva8vf-9XgjBaC6kYmLLWUk7w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I don't think the Moses tokenizer has specific Chinese handling, so it applies
English tokenization rules to Chinese instead. This would probably give very
poor results.

If you know of a good Chinese tokenizer, please let us know or add it to
Moses.


On 28 May 2014 15:45, Gideon Wenniger <gemdbw@hotmail.com> wrote:

> Dear Sir/Madam,
> As my machine translation research focuses mainly on improving word
> order, I would like to run experiments with Chinese, i.e.
> Chinese-English, as much work on reordering has focused on this
> language pair and it would be good to compare to it.
>
> Unfortunately, while I have been able to go through all the steps of
> pre-processing and data preparation, my Hiero baseline system so far is
> performing really badly. While most researchers have reported scores
> around 30 BLEU when training on the LDC Hong Kong Hansards, Laws & News
> corpus and evaluating on the various NIST test sets, I get scores only
> around 20 BLEU. Furthermore, the high unknown-word rate I obtain
> suggests something is going wrong in the segmentation, or that there is
> a mismatch between the training and testing data, but I don't know why.
> What I have been doing so far is pretty standard, I believe:
>
> ================
>
> - Train on Hong Kong Hansards, Laws & News, test on NIST (News) data
>
> - I converted all the Hansards data to Unicode, and converted from
> Traditional Chinese to Simplified Chinese using a conversion table I
> found online (from http://www.mandarintools.com/)
>
> - For tokenization of English I simply use the tokenization script
> from the Moses codebase
>
> - I use the Stanford segmenter with the Peking University CRF model
> to do the segmentation of Chinese
>
> - I use MGiza++ to do the word alignment, with the grow-diag-final-and
> heuristic
>
> - I evaluate using MultEval, with 4 references
>
> ================
>
> Does anybody have any idea what could be going wrong? Are there some
> important details about preprocessing Chinese before feeding it to the
> Stanford Segmenter that I might have missed?
>
> Two minor details I discovered: first, there are two types of commas
> (the Chinese and the European one) throughout the data. I could
> normalize them, but it should not make too much difference, as commas
> are always single tokens anyway.
>
> Also, I noticed that some words are apparently spelled differently in
> the Hong Kong data than in the NIST data; e.g., the test data contains:
> ?? ??
> Doha Deadlock
>
> And Google Translate suggests:
> "Did you mean" ?? ??
>
> As it turns out, the version suggested by Google Translate (??) is
> present 49 times in the training data, while the one found in the
> relevant test sentence (??) occurs 0 times. But I have no idea whether
> this issue of alternative spellings for the same word is specific to
> the data I use, whether it is the real problem (and if so, how to solve
> it), or whether it is still something else.
>
> I would be very grateful for any help, hints or tips from people who
> have more knowledge about Chinese or more experience with the
> preprocessing, relevant issues and solutions.
>
> Kind regards,
>
> Gideon Wenniger
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------

Message: 2
Date: Thu, 29 May 2014 15:00:54 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Problems with segmentation mismatch and
many unknown words for Chinese translation
To: Gideon Wenniger <gemdbw@hotmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <1401372054.2309.1820.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"

Hi Gideon,

I still tend to believe that there's some issue with your preprocessing.
Or maybe there's a mismatch in the way you preprocessed your training
and test data? The OOV rates on MT06 and MT08 are very low in the
systems built by us at RWTH (cf. the numbers I sent you as a reply to
your request a few months ago). Can you tell us your OOV rates based on
the preprocessed data? You should also measure LM perplexity and look
into your phrase extraction heuristics. If you see low OOV rates based
on the vocabulary in the training data, but the translations contain a
large number of OOVs, then the word alignment or extraction
configuration might be suboptimal. Did you eliminate other potential
sources of errors? In particular, does your postprocessing do what you
think it does and is the scoring tool really employing all four
reference translations? Is your search space large enough? Was the
phrase table pre-pruned too much?
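
For what it's worth, the token-level OOV rate is only a few lines of Python
(a sketch; train.zh and test.zh are placeholders for your segmented training
and test files):

    from collections import Counter

    # Vocabulary of the segmented training data.
    with open('train.zh', encoding='utf-8') as f:
        train_vocab = {tok for line in f for tok in line.split()}

    # Count running tokens and OOV tokens in the segmented test data.
    total = oov = 0
    oov_types = Counter()
    with open('test.zh', encoding='utf-8') as f:
        for line in f:
            for tok in line.split():
                total += 1
                if tok not in train_vocab:
                    oov += 1
                    oov_types[tok] += 1

    print(f"OOV rate: {100.0 * oov / total:.2f}% ({oov}/{total} running tokens)")
    print("Most frequent OOVs:", oov_types.most_common(10))

Eyeballing the most frequent OOVs usually tells you immediately whether you
are looking at a segmentation mismatch, an encoding problem, or genuinely
unseen vocabulary.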

We tried a few segmenters for Chinese at RWTH Aachen. Results are for
instance published in our IWSLT 2011 system description paper:

J. Wuebker, M. Huck, S. Mansour, M. Freitag, M. Feng, S. Peitz, C.
Schmidt, and H. Ney. The RWTH Aachen Machine Translation System for
IWSLT 2011. In International Workshop on Spoken Language Translation
(IWSLT), pages 106-113, San Francisco, California, USA, December 2011.
http://www-i6.informatik.rwth-aachen.de/publications/download/756/Wuebker-IWSLT-2011.pdf

This is on TED data as provided for the IWSLT evaluation campaign, not
on the NIST test sets (and not with four references).

What exactly is the NIST 2007 ZH-EN test set you're talking about?

Cheers,
Matthias


On Thu, 2014-05-29 at 14:51 +0200, Gideon Wenniger wrote:
> Hi Prashant,
>
> Thanks for your response!
>
> I agree that the scores of course depend on the details of the
> training and test set.
> In my comparison, to arrive at these indicative scores, I looked
> among others at:
>
> "Using Syntactic Head Information in Hierarchical Phrase-Based
> Translation" (Li et al., 2012)
> http://aclweb.org/anthology/W/W12/W12-3128.pdf
>
> "A Phrase Orientation Model for Hierarchical Machine
> Translation" (Huck et al., 2013)
> http://www-i6.informatik.rwth-aachen.de/publications/download/870/Huck-WMT%202013-2013.pdf
>
> While I may not have exactly the same setup as those approaches, my
> setup is similar enough to make the comparison sensible.
> For example, the first paper explicitly states that it uses the LDC
> (Hansards, Laws and News) dataset, while the second paper uses a
> similar amount of training data (3 million sentence pairs), and both
> papers evaluate on news data from NIST.
> I was also training with the LDC (Hansards, Laws and News) dataset
> and evaluating on the NIST 2007 ZH-EN test set (with 4 references).
>
> Another contact of mine mentioned that the mapping from Traditional
> to Simplified Chinese could still be tricky. I have been living under
> the assumption that this mapping is one-to-one (only in the
> Traditional Chinese -> Simplified Chinese direction!), so that this
> should be OK if you have a proper mapping table (I used the one from
> http://www.mandarintools.com/ as mentioned before).
>
> Nevertheless, I am now trying to use just the MultiUN data available
> from http://opus.lingfil.uu.se/, which is about 9.6 million sentence
> pairs for Chinese-English. As far as I know this data is all in
> Simplified Chinese.
>
> Using the Stanford segmenter, I also still want to try the Peking
> University segmentation model (as opposed to the Chinese Treebank
> model); however, I don't necessarily expect a lot from that.
> The results reported in "Unsupervised Tokenization for Machine
> Translation" (Chung and Gildea, 2009)
> http://www.cs.rochester.edu/~gildea/pubs/chung-gildea-emnlp09.pdf
> suggest that the Chinese Treebank model works better, at least for
> phrase-based systems, but I don't know whether a similar comparison
> has been done for hierarchical systems.
>
> While still working on redoing the alignment with the UN data, one
> thing that gives a bit of hope is that at least the number of unique
> words has gone up quite a bit (possibly just because I am using more
> data now):
>
> 604405 Chinese words
> 388769 English words
>
> In comparison, earlier with the Hong Kong Hansards, Laws and News
> data I got:
> 206432 Chinese words
> 125085 English words
>
> On the other hand, I would intuitively still expect the LDC Hansards,
> Laws and News corpora together to be more varied and hence provide
> wider coverage, at least for a news test set, but maybe that is not
> true.
>
> Like you, I would be very interested in what other people have to add
> to this discussion, because I feel a lot like I am reinventing the
> wheel, while many people must know how to do this well and which
> details are important to avoid too many unknown words, etc.
>
>
> Gideon
>
>
>
>
>
> ______________________________________________________________________
> From: prashant@fbk.eu
> Date: Wed, 28 May 2014 17:19:37 +0200
> Subject: Re: [Moses-support] Problems with segmentation mismatch and
> many unknown words for Chinese translation
> To: gemdbw@hotmail.com
> CC: moses-support@mit.edu
>
> Hi Gideon,
>
>
> I am also doing experiments with Chinese-English, and I am getting
> BLEU scores of around 9 points; but in my case I am using one
> particular training corpus which doesn't have a high overlap with the
> NIST test sets, so it is bound to give low scores.
> Could you re-check that you are using the exact same corpora as the
> ones in the papers that you say report BLEU scores of over 30? And
> that you use the same type of models?
>
>
> Regarding the unknown words, I am also having the same problem.
>
>
> I would also like to know from people experienced with the ZH-EN pair
> whether there are more steps in preprocessing apart from the usual ones.
>
>
> Thanks,
> Prashant
>
>
> On Wed, May 28, 2014 at 4:45 PM, Gideon Wenniger <gemdbw@hotmail.com>
> wrote:
> [...]
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

Message: 3
Date: Thu, 29 May 2014 21:18:56 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Problems with segmentation mismatch and
many unknown words for Chinese translation
To: moses-support@mit.edu
Message-ID: <538741D0.2020200@precisiontranslationtools.com>
Content-Type: text/plain; charset="iso-8859-1"

We've found the Stanford Segmenter, which you're using, is about as good
as it gets for Chinese, at least in our commercial work with commercial
TMs as training corpora and Moses in phrase-based mode. Otherwise
everything is very close to your configuration. Our customers typically
realize BLEU scores > 60 with very low OOV rates. So I don't think your
problem is in the segmenter itself; your other tools and configurations
look fine.

Your training corpus (the LDC Hong Kong Hansards, Laws & News corpus)
includes a mix of Traditional and Simplified Chinese. The Stanford
Segmenter with the Peking University CRF model was trained on Simplified
Chinese. Conversion of Traditional Chinese to Simplified Chinese using a
table is imperfect at best. This conversion likely creates tokens that
don't match the Peking corpus that trained the Segmenter.
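
If you want to check how lossy the table actually is, a quick sketch along
these lines will flag Traditional characters whose mapping is ambiguous
(the file name and the two-column tab-separated format are assumptions;
adapt them to the mandarintools table you downloaded):

    from collections import defaultdict

    # Collect all Simplified targets listed for each Traditional character.
    mapping = defaultdict(set)
    with open('trad2simp.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) >= 2:
                mapping[parts[0]].add(parts[1])

    ambiguous = {t: s for t, s in mapping.items() if len(s) > 1}
    print(f"{len(mapping)} mapped characters, "
          f"{len(ambiguous)} with more than one target")

Any character with more than one target contradicts the one-to-one
assumption, and any Traditional character in your corpus that is absent from
the table will pass through unconverted and become an OOV.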

First, I tend to agree with Matthias. Verify that there is no mismatch or
inconsistency between the way you prepared your training corpus and the
way you prepared your test sets.

Second, what kind of data is in your test sets? I.e., are you working
with data that you extracted from the greater corpus, or independently
collected data? If it's independent data, that probably explains your
OOV rate. If it's extracted data, that supports the hypothesis that your
preparation toolchains are out of sync.
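
One cheap way to spot out-of-sync toolchains is to compare the punctuation
profiles of the two files, since normalization differences (e.g. fullwidth
vs. ASCII commas, as Gideon already noticed) show up there first. A sketch,
with placeholder file names:

    from collections import Counter

    # Count fullwidth and ASCII punctuation in a preprocessed file.
    PUNCT = set(',，.。?？!！:：;；()（）')

    def punct_profile(path):
        counts = Counter()
        with open(path, encoding='utf-8') as f:
            for line in f:
                counts.update(ch for ch in line if ch in PUNCT)
        return counts

    print('train:', punct_profile('train.zh').most_common())
    print('test: ', punct_profile('test.zh').most_common())

If the training side is full of fullwidth punctuation and the test side is
not (or vice versa), the two files did not go through the same pipeline.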


On 05/29/2014 08:24 PM, Hieu Hoang wrote:
> I don't think the Moses tokenizer has specific Chinese handling, so it
> applies English tokenization rules to Chinese instead. This would
> probably give very poor results.
>
> If you know of a good Chinese tokenizer, please let us know or add it
> to Moses.
>
>
> On 28 May 2014 15:45, Gideon Wenniger <gemdbw@hotmail.com> wrote:
>
> [...]
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 91, Issue 49
*********************************************
