Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. POS Tag set in factored model (Mukund Roy)
2. Re: Dyer's Fast Align (Aleš Tamchyna)
3. Re: POS Tag set in factored model (Rajen Chatterjee)
4. Re: Word alignment heuristic values (Sara Stymne)
----------------------------------------------------------------------
Message: 1
Date: Mon, 1 Dec 2014 13:49:04 +0530
From: Mukund Roy <mukundkumarroy@cdac.in>
Subject: [Moses-support] POS Tag set in factored model
To: moses-support@mit.edu
Message-ID: <20141201134904.487d8ea0@controller.noida.cdac.in>
Content-Type: text/plain; charset=US-ASCII
Dear Sir
While experimenting with a factored model for the English-Hindi pair, I
found that the baseline phrase-based model produced a BLEU score of
around 24, but the factored model produced a BLEU score of only 3.5. I
tried several combinations of parameters (translation factors,
generation factors, reordering factors, decoding steps, etc.) using
three factors on both the source and target side (surface, lemma, POS),
but the BLEU score hovered between 3.5 and 4.0.
For Hindi we used an in-house POS tagset, and for English we used the
Penn tagset. My question is: could the different tagsets be the reason
for the low score?
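For reference, one of the combinations we tried looked roughly like
this (a simplified sketch; paths, the language model option, and other
settings are omitted):

  # each token in the training data carries three factors, e.g.:
  #   houses|house|NNS are|be|VBP expensive|expensive|JJ
  train-model.perl ... --f en --e hi \
      --alignment-factors 0-0 \
      --translation-factors 0,1,2-0,1,2 \
      --decoding-steps t0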
Thanks & Regards
Mukund Roy
------------------------------
Message: 2
Date: Mon, 1 Dec 2014 10:55:16 +0100
From: Aleš Tamchyna <a.tamchyna@gmail.com>
Subject: Re: [Moses-support] Dyer's Fast Align
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: "<moses-support@mit.edu>" <moses-support@mit.edu>
Message-ID:
<CAAUuB+0MXumQbYbmF=dUsnHsZhfantxk-=qqwJB6CR8DFu3vYA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
we've used fast_align for a number of experiments. In terms of BLEU, it's
usually on par with GIZA++ (IBM4) alignment but it's much (much) faster.
Note that fast_align doesn't support multithreading though. The quality of
the actual alignments is probably worse (as reported even in the paper) but
it doesn't seem to make a difference for final MT quality.
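If you want to try it, a typical run looks something like this (a
sketch; flags as in the fast_align README, file names are
placeholders):

  # input: one sentence pair per line, formatted "source ||| target"
  fast_align -i corpus.src-tgt -d -o -v > forward.align
  fast_align -i corpus.src-tgt -d -o -v -r > reverse.align
  # symmetrize the two directions with atools (ships with fast_align)
  atools -i forward.align -j reverse.align -c grow-diag-final-and > aligned.gdfa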
Best,
Ales
On Sun, Nov 30, 2014 at 10:07 AM, Tom Hoar <
tahoar@precisiontranslationtools.com> wrote:
> I've read the NAACL 2013 paper on Dyer's Fast Align
> (http://www.ark.cs.cmu.edu/cdyer/fast_valign.pdf) and it seems pretty
> straightforward.
>
> There's a comment on statmt.org
> (http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc13) that it's
> faster and maybe better, "especially for language pairs without much
> large-scale reordering."
>
> Other than the risk associated with the reordering, has anyone uncovered
> any other potential drawbacks of using Fast Align? For example, although
> BerkeleyAligner is nice, its multi-threading is buggy and tends to
> randomly fault when using a large thread pool.
------------------------------
Message: 3
Date: Mon, 1 Dec 2014 11:11:15 +0100
From: Rajen Chatterjee <rajen.k.chatterjee@gmail.com>
Subject: Re: [Moses-support] POS Tag set in factored model
To: Mukund Roy <mukundkumarroy@cdac.in>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAC4-+Nz6WQgAB18do9va97JQ=-uK6sA3jN-vVn_axWYpw5V96Q@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Mukund,
I have done experiments similar to yours for an en-hi factored model.
In general the scores may differ by +/- 2 BLEU, but surely not 24 vs. 4.
Is your *target test set* in tagged format? (While tagging the target
data, it is easy to mistakenly tag all of train, dev, and test.) If that
is the case, remove the factors from the test set and score it again. If
it is not, repeat the phrase-based experiment within the factored setup
by specifying the translation factor as 0-0, and then inspect the phrase
table to make sure you have a correct translation model.
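A quick sanity check could look something like this (assuming '|' is
your factor separator; file names are placeholders):

  # a plain (unfactored) test set should contain no '|' separators
  grep -c '|' test.hi
  # if factors are present, keep only the surface form (the first factor)
  sed 's/|[^ ]*//g' test.hi > test.surface.hi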
On Mon, Dec 1, 2014 at 9:19 AM, Mukund Roy <mukundkumarroy@cdac.in> wrote:
>
> Dear Sir
>
> While experimenting with a factored model for the English-Hindi pair, I
> found that the baseline phrase-based model produced a BLEU score of
> around 24, but the factored model produced a BLEU score of only 3.5. I
> tried several combinations of parameters (translation factors,
> generation factors, reordering factors, decoding steps, etc.) using
> three factors on both the source and target side (surface, lemma, POS),
> but the BLEU score hovered between 3.5 and 4.0.
>
> For Hindi we used an in-house POS tagset, and for English we used the
> Penn tagset. My question is: could the different tagsets be the reason
> for the low score?
>
>
> Thanks & Regards
> Mukund Roy
--
-Regards,
Rajen Chatterjee.
------------------------------
Message: 4
Date: Mon, 1 Dec 2014 13:39:32 +0100
From: Sara Stymne <sara.stymne@lingfil.uu.se>
Subject: Re: [Moses-support] Word alignment heuristic values
To: <moses-support@mit.edu>
Message-ID: <547C6184.3070907@lingfil.uu.se>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Hi Tom,
It's not exactly what you asked for, but I recently published a paper
in which we compared different alignment models and symmetrization
heuristics, both for the translation task and for two reordering tasks,
for German-English. It might be of interest:
Estimating Word Alignment Quality for SMT Reordering Tasks. Sara Stymne,
Jörg Tiedemann, Joakim Nivre. WMT 2014.
Similar experiments for French-English give the best BLEU scores for
GIZA++ model 4 and fast_align with grow-diag and grow-diag-final-and.
Some other papers that discuss different alignments, and to some
extent symmetrization, are:
Patrik Lambert, Simon Petitrenaud, Yanjun Ma, and Andy Way. 2012. What
types of word alignment improve statistical machine translation? Machine
Translation, 26(4).
Necip Fazil Ayan and Bonnie J. Dorr. 2006. Going beyond AER: An
extensive analysis of word alignments and their impact on MT. ACL 2006.
Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment
quality for statistical machine translation. Computational Linguistics,
33(3).
As a general trend it seems that the precision of alignment links is
more important with small corpora, and recall is more important with
larger corpora.
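As a rough illustration of that trade-off, using the atools utility
that ships with fast_align (file names are placeholders; check the
atools usage message for the exact command names):

  # intersection of the two directional alignments: high precision, low recall
  atools -i fwd.align -j rev.align -c intersect > intersect.align
  # union: high recall, low precision
  atools -i fwd.align -j rev.align -c union > union.align
  # grow-diag-final-and: a compromise between the two
  atools -i fwd.align -j rev.align -c grow-diag-final-and > gdfa.align

In Moses training the same choice is made through train-model.perl's
--alignment switch.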
Best,
Sara
Den 2014-11-30 10:25, Tom Hoar skrev:
> When using (M)GIZA++, train-model.perl defines 9 heuristic values for
> word alignment: intersect, union, grow, grow-final, grow-diag,
> grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
>
> I found two different heuristic values for word alignments when using
> BerkeleyAligner: softunion, low-posterior
>
> Also, the recaser script uses a method to create word alignments:
> word-to-word
>
> Some recent moses-support exchanges got me wondering. Does anyone know
> of any references that describe the strengths/weaknesses of each of
> these heuristics and the best use cases for each? For example, use for
> specific language pairs, or use when training data is scarce, etc.
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 98, Issue 1
********************************************