Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: is there a way to remove a bad entry in the phrase table
? (Vincent Nguyen)
2. Re: is there a way to remove a bad entry in the phrase table
? (Matthias Huck)
3. Final Call for Participation: WAT2015 (The 2nd Workshop on
Asian Translation) (Toshiaki Nakazawa)
4. Issue with alignment (gang tang)
----------------------------------------------------------------------
Message: 1
Date: Thu, 24 Sep 2015 22:37:58 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Matthias Huck <mhuck@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <56045F26.8020503@neuf.fr>
Content-Type: text/plain; charset=utf-8; format=flowed
Thanks, Matthias, for the detailed explanation.
I think I have most of it in mind, except that I don't really understand
how this one works:
"Difficult sentences generally have worse model score than easy ones but
may still be useful for training."
But yes, what you describe is more or less what I did to better
understand the mechanism.
And I know I have to tune with in-domain data for a proper end result.
Cheers,
Vincent
On 24/09/2015 22:13, Matthias Huck wrote:
> Hi Vincent,
>
> This is a different topic, and I'm not completely clear about what
> exactly you did here. Did you decode the source side of the parallel
> training data, conduct sentence selection by applying a threshold on the
> decoder score, and extract a new phrase table from the selected fraction
> of the original parallel training data? If this is the case, I have some
> comments:
>
>
> - Be careful when you translate training data. The system knows these
> sentences and does things like frequently applying long singleton
> phrases that have been extracted from the very same sentence.
> https://aclweb.org/anthology/P/P10/P10-1049.pdf
>
> - Longer sentences may have worse model score than shorter sentences.
> Consider normalizing by sentence length if you use model score for data
> selection.
> Difficult sentences generally have worse model score than easy ones but
> may still be useful for training. You possibly keep the parts of the
> data that are easy to translate or are highly redundant in the corpus.
>
> - You probably see no out-of-vocabulary words (OOVs) when translating
> training data, or very few of them (depending on word alignment, phrase
> extraction method, and phrase table pruning), but be aware that if there
> are OOVs, this may affect the model score a lot.
>
> - Check to what extent the sentence selection reduces the vocabulary of
> your system.
>
>
> Last but not least, two more general comments:
>
> - You need dev and test sets that are similar to the type of real-world
> documents that you're building your system for. Don't tune on Europarl
> if you eventually want to translate pharmaceutical patents, for
> instance. Try to collect in-domain training data as well.
>
> - In case you have in-domain and out-of-domain training corpora, you can
> try modified Moore-Lewis filtering for data selection.
> https://aclweb.org/anthology/D/D11/D11-1033.pdf
>
>
> Cheers,
> Matthias
>
>
> On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
>> This is an interesting subject...
>>
>> As a matter of fact, I have done several tests.
>> I came to this need after realizing that, even though my results were
>> good in a "standard dev + test set" setting,
>> I got some strange results with real-world documents.
>> That's why I investigated.
>>
>> But you are right that removing some so-called bad entries could have
>> unexpected results.
>>
>> For instance, here is a test I did:
>>
>> I trained a fr-en model on Europarl v7 (2 million sentences).
>> I tuned with a subset of 3K sentences.
>> I ran an evaluation on the full 2 million lines.
>> Then I removed the 90K sentences for which the score was less than 0.2
>> and retrained on the remaining 1,917,853 sentences.
>>
>> In the end I got a higher percentage of sentences with a score above 0.2,
>> but at > 0.3 the two corpora become similar, and at > 0.4 the initial
>> corpus is better.
>>
>> Just weird.
>
>
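To make the length-normalization advice quoted above concrete, here is a
minimal sketch of score-based data selection, including a check of how
much the selection shrinks the vocabulary (Matthias's last point). This
is a sketch under assumptions, not the method actually used in this
thread: the file names are hypothetical, scores.txt is assumed to hold
one decoder model score per line in corpus order, and the 0.2 threshold
(echoing the quoted experiment) would need tuning for any real setup.

    # Hypothetical inputs: tokenized corpus (corpus.fr / corpus.en) and
    # scores.txt with one decoder model score per line, in corpus order.
    THRESHOLD = 0.2  # illustrative; taken from the experiment quoted above

    kept = total = 0
    vocab_before, vocab_after = set(), set()
    with open("corpus.fr") as src, open("corpus.en") as tgt, \
         open("scores.txt") as scores, \
         open("filtered.fr", "w") as out_src, \
         open("filtered.en", "w") as out_tgt:
        for f, e, s in zip(src, tgt, scores):
            total += 1
            tokens = f.split()
            vocab_before.update(tokens)
            # Normalize by source length so long sentences are not
            # penalized simply for being long.
            norm_score = float(s) / max(len(tokens), 1)
            if norm_score >= THRESHOLD:
                kept += 1
                vocab_after.update(tokens)
                out_src.write(f)
                out_tgt.write(e)
    print("kept %d of %d sentence pairs" % (kept, total))
    print("source vocabulary: %d -> %d" % (len(vocab_before), len(vocab_after)))

If the vocabulary shrinks sharply, the selection is probably discarding
exactly the rare words the system would need at test time.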
------------------------------
Message: 2
Date: Thu, 24 Sep 2015 22:15:16 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Vincent Nguyen <vnguyen@neuf.fr>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <1443129316.13101.945.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"
Hi Vincent,
On Thu, 2015-09-24 at 22:37 +0200, Vincent Nguyen wrote:
> Thanks, Matthias, for the detailed explanation.
> I think I have most of it in mind, except that I don't really understand
> how this one works:
>
> "Difficult sentences generally have worse model score than easy ones but
> may still be useful for training."
Well, your data selection method may discard training instances that are
somehow hard to decode, e.g. because of complex sentence structure or
because of rare vocabulary. But that doesn't necessarily mean that it's
bad sentence pairs that you're removing. You should manually inspect
some samples if possible.
I haven't tried it, but I suspect that you'd get a higher decoder score on
the 1-best decoder output for the first of the following two input sentences:
(1) " Merci ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! "
(2) " Je l' ai v?cu moi-m?me en personne quand j' ai eu mon dipl?me ? Barnard College en 2002 . "
(Just as a simple made-up example.)
If we assume that you have a correct English target sentence for both of
those sentences in your training data, I wonder which of the two you
could learn more from?
If you're doing what I think, then you're also basically just assessing
whether the source side of the sentence pair is easy to translate. Does
this tell you anything about the target sentence? The target side might
be misaligned or in a different third language if your data is noisy.
Cheers,
Matthias
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
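Since modified Moore-Lewis filtering was suggested earlier in this thread,
here is a minimal sketch of the core scoring idea. It assumes KenLM's
Python bindings and two language models trained beforehand (one on
in-domain text, one on a similarly sized out-of-domain sample); the file
names are hypothetical. The "modified" variant in the cited paper applies
this on both source and target sides and sums the differences; the sketch
scores the source side only.

    import kenlm

    # Hypothetical model paths; both LMs are trained beforehand.
    lm_in = kenlm.Model("indomain.arpa")    # in-domain LM
    lm_out = kenlm.Model("outdomain.arpa")  # LM over an out-of-domain sample

    def moore_lewis(sentence):
        # Per-word cross-entropy difference H_in(s) - H_out(s).
        # KenLM scores are log10 probabilities, hence the sign flip and
        # the length normalization. Lower values look more in-domain.
        n = max(len(sentence.split()), 1)
        return (lm_out.score(sentence) - lm_in.score(sentence)) / n

    with open("outofdomain.fr") as f:
        sentences = [line.rstrip("\n") for line in f]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: moore_lewis(sentences[i]))
    selected = [sentences[i] for i in ranked[: len(ranked) // 4]]

How much to keep (the best quarter here is arbitrary) is a knob that would
normally be chosen on a held-out dev set.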
------------------------------
Message: 3
Date: Fri, 25 Sep 2015 18:05:04 +0900
From: Toshiaki Nakazawa <nakazawa@pa.jst.jp>
Subject: [Moses-support] Final Call for Participation: WAT2015 (The
2nd Workshop on Asian Translation)
To: moses-support@mit.edu
Message-ID: <m2h9mikgxr.wl-nakazawa@pa.jst.jp>
Content-Type: text/plain; charset=UTF-8
Dear MT researchers and users,
This is the final call for participation in the 2nd Workshop on Asian
Translation (WAT2015). The workshop will be held on October 16, 2015
in Kyoto, Japan. The registration fee is FREE. Please feel free to
join our workshop!
Best regards,
---------------------------------------------------------------------------
WAT 2015
(The 2nd Workshop on Asian Translation)
http://lotus.kuee.kyoto-u.ac.jp/WAT/
October 16, 2015, Kyoto, Japan
Following the success of the previous Workshop on Asian Translation
(WAT2014), WAT2015 brings together machine translation researchers and
users to try, evaluate, share, and discuss brand-new ideas in machine
translation. We are working toward the practical use of machine
translation among all Asian countries.
For the 2nd WAT, we have added new translation subtasks,
Chinese-to-Japanese and Korean-to-Japanese patent translation, in
addition to the subtasks conducted in WAT2014.
PROGRAM
-------
10:30 - 10:40 Welcome
10:40 - 11:25 Invited talk I: Eiichiro Sumita
11:25 - 11:30 Break
11:30 - 11:50 Overview of WAT2015
11:50 - 12:30 Oral Presentation I (2 systems)
12:30 - 13:30 Lunch
13:30 - 14:15 Invited talk II: Haizhou Li
14:15 - 14:20 Break
14:20 - 15:00 Oral Presentation II (2 systems)
15:00 - 15:15 Poster Booster Session
15:15 - 16:45 Poster Presentation (all systems)
16:45 - 16:55 Closing
16:55 - 17:00 Commemorative photo
INVITED TALK
------------
We are planning to have two invited talks as follows:
Speaker 1: Dr. Eiichiro Sumita
Associate Director General of Universal Communication Research Institute
and Director of Multilingual Translation Laboratory (NICT)
Title: Government project for multi-lingual speech translation system to bridge the language barrier in Japan
Time: 10:40 - 11:25
Speaker 2: Dr. Haizhou Li
Research Director of the Institute for Infocomm Research in Singapore
Principal Scientist and Department Head of Human Language Technology
Title: Adequacy-Fluency Metrics: Evaluating MT in the Continuous Space Model Framework
Time: 13:30 - 14:15
Please visit the WAT homepage for the speakers' biographies.
http://lotus.kuee.kyoto-u.ac.jp/WAT/#invited-talk.html
REGISTRATION
------------
There is no need to register in advance. The registration fee is FREE
for everyone, including participants.
ORGANIZERS
----------
Toshiaki Nakazawa (Japan Science and Technology Agency (JST))
Hideya Mino (National Institute of Information and Communications Technology (NICT))
Isao Goto (Japan Broadcasting Corporation (NHK))
Graham Neubig (Nara Institute of Science and Technology (NAIST))
Eiichiro Sumita (National Institute of Information and Communications Technology (NICT))
Sadao Kurohashi (Kyoto University)
CONTACT
-------
wat@nlp.ist.i.kyoto-u.ac.jp
---------------------------------------------------------------------------
--
Toshiaki Nakazawa (Researcher)
Japan Science and Technology Agency (JST)
(@ Graduate School of Informatics, Kyoto University)
Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
tel: +81-75-753-5346, fax: +81-75-753-5962
nakazawa@pa.jst.jp / nakazawa@nlp.ist.i.kyoto-u.ac.jp
------------------------------
Message: 4
Date: Fri, 25 Sep 2015 19:34:13 +0800 (CST)
From: "gang tang" <gangtang2014@126.com>
Subject: [Moses-support] Issue with alignment
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <4355a352.6db8.150044838e7.Coremail.gangtang2014@126.com>
Content-Type: text/plain; charset="gbk"
Dear all,
I have a problem with alignment. I'd greatly appreciate it if anyone could help me solve it.
I have the following corpus:
"sandalo camufluge" -> "camufluge sandal"
"sandalo daino" -> "daino sandal"
"sandalo madras" -> "madras sandal"
"sandalo vernice" -> "vernice sandal"
The alignment software I used was GIZA++, and the alignment result was
always 0-0 1-1, which means that "sandalo" was never aligned with
"sandal". After training, the phrase translation table always had entries
such as "sandalo" -> "camufluge", "sandalo" -> "daino", "sandalo" ->
"madras", and "sandalo" -> "vernice", but no "sandalo" -> "sandal". Is
there any way to solve this problem? Could I add more data to get
"sandalo" aligned with "sandal" and translated as "sandal"? How should I
tune the system?
Thanks for your attention,
Gang
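A common workaround, offered here only as a sketch, is to append each
known dictionary pair as its own one-word "sentence pair", repeated a few
times, so that the aligner sees "sandalo" and "sandal" co-occurring in
isolation. The file names and the repetition count below are illustrative:

    # Hypothetical file names for the tokenized training corpus.
    dictionary = [("sandalo", "sandal")]  # extend with other known pairs
    REPEAT = 10  # how often to repeat each entry; worth experimenting with

    with open("corpus.it", "a") as src, open("corpus.en", "a") as tgt:
        for it_word, en_word in dictionary:
            for _ in range(REPEAT):
                src.write(it_word + "\n")
                tgt.write(en_word + "\n")

After appending, rerun word alignment and phrase extraction. Note that
tuning will not fix this: tuning only adjusts feature weights, and it
cannot create a phrase pair that the alignment and extraction stages
never produced.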
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 107, Issue 60
**********************************************