Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Moses Training issue (Hieu Hoang)
2. Re: Configuring LMs (Kenneth Heafield)
3. Re: Moses-support post from lars.bungum@idi.ntnu.no requires
approval (ULStudent:GIOVANNI.GALLO)
4. Re: Problems with segmentation mismatch and many unknown
words for Chinese translation (Gideon Wenniger)
----------------------------------------------------------------------
Message: 1
Date: Thu, 29 May 2014 00:40:55 +0100
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Moses Training issue
To: Mohsen Afshin <mafshin89@gmail.com>, moses-support@mit.edu
Message-ID: <53867407.4080503@gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Are you sure you are running GIZA++ on the cleaned data, i.e. data from
which sentence pairs with wildly different lengths have been discarded?
GIZA++ really doesn't like such sentences.
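For reference, Moses ships a corpus-cleaning script (scripts/training/clean-corpus-n.perl) for exactly this step; the invocation in the comment below is illustrative, and the stand-alone filter only sketches the idea (the helper name, the 9:1 ratio, and the 80-token cap are assumptions, not Moses defaults):

```shell
# Typical Moses invocation (paths and limits are illustrative):
#   perl ~/mosesdecoder/scripts/training/clean-corpus-n.perl \
#       corpus fr en corpus.clean 1 80
#
# A minimal stand-alone sketch of the same idea: read tab-separated
# sentence pairs on stdin and keep only pairs where both sides have
# 1-80 tokens and the lengths are within a 9:1 ratio.
clean_pairs() {
  awk -F'\t' '{
    nf = split($1, f, " ")   # source-side token count
    ne = split($2, e, " ")   # target-side token count
    if (nf >= 1 && ne >= 1 && nf <= 80 && ne <= 80 &&
        nf <= 9 * ne && ne <= 9 * nf)
      print
  }'
}
```

For example, `paste train.fr train.en | clean_pairs` would print only the pairs GIZA++ is likely to accept.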
On 26/05/14 06:53, Mohsen Afshin wrote:
> Hi Moses devs
>
> I just followed the guide on installation of Moses here
> <http://www.statmt.org/moses/?n=moses.baseline> step by step.
> Everything worked fine till the step "Training the Translation System"
> where there should be a generated "moses.ini" but it doesn't exist.
>
> I get the following error in "training.out" file.
>
> ERROR: Giza did not produce the output file
> /home/mohsen/working/train/giza.fr-en/fr-en.A3.final. Is your
> corpus clean (reasonably-sized sentences)? at
> /home/mohsen/mosesdecoder/scripts/training/train-model.perl line 1191.
>
>
> Here is the training.out file :
> https://www.dropbox.com/s/osuaowjg5cfsuci/training.out
> <https://www.dropbox.com/s/osuaowjg5cfsuci/training.out>
>
> --
> "Mathematics is the queen of the sciences and number theory is the
> queen of mathematics."
> --Gauss
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
Message: 2
Date: Wed, 28 May 2014 16:49:23 -0700
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Configuring LMs
To: moses-support@mit.edu
Message-ID: <53867603.9000901@kheafield.com>
Content-Type: text/plain; charset=ISO-8859-1
KenLM will happily consume an ARPA file produced by SRILM. However, you
created a binary file with SRILM.
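A hedged sketch of the distinction (filenames are illustrative; assumes SRILM and KenLM are installed):

```shell
# 1. Train a trigram LM with SRILM; -lm writes a plain-text ARPA file,
#    which KenLM can read directly.
ngram-count -order 3 -text corpus.en -lm model.arpa -kndiscount -interpolate

# 2. Optionally convert the ARPA file to KenLM's own binary format for
#    faster loading (this is distinct from SRILM's binary format):
build_binary model.arpa model.binlm
```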
On 05/28/14 02:41, Lars Bungum wrote:
> Still, it works to use the KENLM feature with the IRSTLM LM (but not
> SRILM, as I discovered).
------------------------------
Message: 3
Date: Thu, 29 May 2014 09:13:02 +0000
From: "ULStudent:GIOVANNI.GALLO" <12064866@studentmail.ul.ie>
Subject: Re: [Moses-support] Moses-support post from
lars.bungum@idi.ntnu.no requires approval
To: Hieu Hoang <hieuhoang@gmail.com>, "moses-support@mit.edu"
<moses-support@mit.edu>, "lars.bungum@idi.ntnu.no"
<lars.bungum@idi.ntnu.no>
Message-ID: <1401354781943.10828@studentmail.ul.ie>
Content-Type: text/plain; charset="iso-8859-1"
Hi everybody,
Just jumping into the conversation to ask a couple of questions. Why do you say "tcmalloc is now
called tcmalloc_minimal4. Moses can't use it yet"? I just installed Ubuntu 14.04 a couple of days ago and Moses compiled successfully: should I expect that it won't work?
Thanks.
Giancarlo
________________________________________
From: moses-support-bounces@mit.edu <moses-support-bounces@mit.edu> on behalf of Hieu Hoang <hieuhoang@gmail.com>
Sent: Monday, 26 May 2014, 23:03
To: moses-support@mit.edu; lars.bungum@idi.ntnu.no
Subject: Re: [Moses-support] Moses-support post from lars.bungum@idi.ntnu.no requires approval
Hi Lars,
Please subscribe to the Moses mailing list before posting to it. You can
subscribe here:
http://mailman.mit.edu/mailman/listinfo/moses-support
To answer your question: bjam links the Moses decoder against tcmalloc
automatically if it is installed on the computer Moses was compiled on.
You can see this when you ask bjam for details about the compiler
options it is using:
./bjam -d2 ...
....
g++ ... -ltcmalloc_minimal ...
....
However, I've just noticed that on the new Ubuntu 14.04, tcmalloc is now
called tcmalloc_minimal4. Moses can't use it yet. We'll get round to
fixing this issue.
On 26/05/14 16:14, moses-support-owner@mit.edu wrote:
> As list administrator, your authorization is requested for the
> following mailing list posting:
>
> List: Moses-support@mit.edu
> From: lars.bungum@idi.ntnu.no
> Subject: Using tcmalloc with bjam
> Reason: Post by non-member to a members-only list
>
> At your convenience, visit:
>
> http://mailman.mit.edu/mailman/admindb/moses-support
>
> to approve or deny the request.
------------------------------
Message: 4
Date: Thu, 29 May 2014 14:51:26 +0200
From: Gideon Wenniger <gemdbw@hotmail.com>
Subject: Re: [Moses-support] Problems with segmentation mismatch and
many unknown words for Chinese translation
To: Prashant Mathur <prashant@fbk.eu>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <DUB118-W327C62718E9AD52D2BF08ED1240@phx.gbl>
Content-Type: text/plain; charset="iso-2022-jp"
Hi Prashant,
Thanks for your response!
I agree that the scores of course depend on the details of the training and test sets.
In the comparison used to arrive at these indicative scores, I looked among others at:
"Using Syntactic Head Information in Hierarchical Phrase-Based Translation" (Li et al., 2012)
http://aclweb.org/anthology/W/W12/W12-3128.pdf
"A Phrase Orientation Model for Hierarchical Machine Translation" (Huck et al., 2013)
http://www-i6.informatik.rwth-aachen.de/publications/download/870/Huck-WMT%202013-2013.pdf
While I may not have exactly the same setup as those approaches, my setup is similar enough to make the comparison sensible.
For example, the first paper explicitly states that it uses the LDC (Hansards, Laws and News) dataset,
while the second paper uses a similar amount of training data (3 million sentence pairs), and both papers evaluate on News data
from NIST.
I was also training with the LDC (Hansards, Laws and News) dataset, and evaluating on the NIST 2007 ZH-EN test set (with 4 references).
Another contact of mine mentioned that the mapping from traditional to simplified Chinese could still be tricky. I have been living under the assumption
that this mapping is one-to-one (only in the traditional Chinese -> simplified Chinese direction!),
so this should be fine provided you have a proper mapping table (I used the one from http://www.mandarintools.com/, as
mentioned before).
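Applying such a one-to-one mapping table can be sketched as follows (a minimal sketch; the helper name, and the table format of two tab-separated columns, traditional<TAB>simplified, are assumptions):

```shell
# Convert traditional to simplified Chinese on stdin, using a
# tab-separated mapping table whose path is given as the argument.
trad_to_simp() {
  awk -v tab="$1" '
    BEGIN {
      # Load the mapping table into an associative array.
      while ((getline line < tab) > 0) {
        n = split(line, m, "\t")
        if (n == 2) map[m[1]] = m[2]
      }
    }
    {
      # Replace every traditional character by its simplified form.
      for (k in map) gsub(k, map[k])
      print
    }'
}
```

For example, `trad_to_simp trad2simp.tsv < corpus.trad.zh > corpus.simp.zh` (filenames illustrative). This relies on the mapping being one-to-one in the traditional-to-simplified direction, as noted above.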
Nevertheless, I am now trying to use just the MultiUN data available from http://opus.lingfil.uu.se/,
which is about 9.6 million sentence pairs for Chinese-English. As far as I know, this data is all in simplified Chinese.
With the Stanford segmenter, I also still want to try the Peking University segmentation model (as opposed to the Chinese Treebank model);
however, I don't necessarily expect much from that.
The results reported in "Unsupervised Tokenization for Machine Translation" (Chung and Gildea, 2009)
http://www.cs.rochester.edu/~gildea/pubs/chung-gildea-emnlp09.pdf
suggest that the Chinese Treebank model works better, at least for phrase-based systems, but I don't know whether
a similar comparison has been done for hierarchical systems.
While I am still redoing the alignment with the UN data, one thing that gives a bit of hope is that at least the number of unique words
has gone up quite a bit (possibly just because I am now using more data):
604405 Chinese words
388769 English words
In comparison, earlier with the Hong Kong Hansards, Laws and News data I got:
206432 Chinese words
125085 English words
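Unique-word counts like these can be computed with a small pipeline (a sketch; it assumes one whitespace-tokenized sentence per line, and the helper name is made up):

```shell
# Count the vocabulary size (number of distinct tokens) of text on stdin.
vocab_size() {
  tr -s ' ' '\n' | sort -u | grep -c .
}
```

For example, `vocab_size < corpus.zh` would print one number, the count of distinct tokens.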
On the other hand, I would intuitively still expect the LDC Hansards, Laws and News corpora together to be more varied and
hence to provide wider coverage, at least for a news test set, but maybe that is not true.
Like you, I would be very interested in what other people have to add to this discussion, because I feel very much like I am reinventing the wheel,
while many people must know how to do this well and which details are important for avoiding too many unknown words, etc.
Gideon
From: prashant@fbk.eu
Date: Wed, 28 May 2014 17:19:37 +0200
Subject: Re: [Moses-support] Problems with segmentation mismatch and many unknown words for Chinese translation
To: gemdbw@hotmail.com
CC: moses-support@mit.edu
Hi Gideon,
I am also doing experiments with Chinese-English, and I am getting BLEU scores of around 9 points; but in my case I am using one particular training corpus that doesn't have high overlap with the NIST test sets, so it is bound to give low scores, at least in my case.
Could you re-check whether you are using exactly the same corpora as the papers that you say report BLEU scores of over 30, and whether you use the same type of models?
Regarding the unknown words, I am also having the same problem.
I would also like to know from experienced people with ZH-EN pair if there are more steps in preprocessing apart from the usual.
Thanks,
Prashant
On Wed, May 28, 2014 at 4:45 PM, Gideon Wenniger <gemdbw@hotmail.com> wrote:
Dear Sir/Madam,
As my Machine Translation research focuses mainly on improving word order,
I would like to run experiments with Chinese, i.e. Chinese-English, as much work
on reordering has focused on this language pair and it would be good
to be able to compare with it.
Unfortunately, while I have been able to go through all the steps of preprocessing
and data preparation, my Hiero baseline system so far performs really badly.
While most researchers have reported scores around 30 BLEU when training on
the LDC Hong Kong Hansards, Laws & News corpus and evaluating on the various
NIST test sets, I get scores of only around 20 BLEU.
Furthermore, the high unknown word rate I obtain suggests something is going
wrong in the segmentation, or there is a mismatch between the training and testing
data, but I don't know why.
What I have been doing so far is, I believe, pretty standard:
================
- Train on Hong Kong Hansards, Laws, News; test on NIST (News) data
- I converted all the Hansards data to Unicode, and converted from Traditional Chinese to
Simplified Chinese using a conversion table I found online (from http://www.mandarintools.com/)
- For tokenization of English, I simply use the tokenization script from the Moses codebase
- I use the Stanford segmenter with the Peking University CRF model to do the segmentation of Chinese
- I use MGIZA++ to do the word alignment, using the grow-diag-final-and heuristic
- I evaluate using MultEval, with 4 references
================
Does anybody have any ideas what could be going wrong?
Some important details about preprocessing Chinese before feeding to the Stanford Segmenter I might have missed?
Two minor details I discovered: first, there are two types of commas (the Chinese one and the European one) throughout the data.
I could normalize them, but it should not make too much difference, as commas are always single tokens anyway.
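That normalization would be a one-liner in any case; a minimal sketch (only the fullwidth comma U+FF0C is mapped, and the helper name is made up):

```shell
# Replace the Chinese (fullwidth) comma with the European one on stdin.
normalize_commas() {
  sed 's/，/,/g'
}
```

For example, `normalize_commas < corpus.zh > corpus.norm.zh` (filenames illustrative); other fullwidth punctuation would need analogous rules.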
Also, I noticed that some words are apparently spelled differently in the Hong Kong data than in the NIST data; e.g.
the test data contains:
?? ??
Doha Deadlock
And Google Translate suggests:
"Did you mean" ?? ??
As it turns out, the version suggested by Google Translate (??) is present 49 times in the training data, while the one found in the
relevant test sentence (??) occurs 0 times.
But I have no idea whether this issue of alternative spellings for the same word is specific to the data I use, whether it is the real problem (and if so, how to solve it), or whether it is something else entirely.
I would be very grateful for any help, hints or tips from people who have more knowledge of Chinese or more experience with the preprocessing, the relevant issues and their solutions.
Kind regards,
Gideon Wenniger
------------------------------
End of Moses-support Digest, Vol 91, Issue 48
*********************************************