Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: is there a way to remove a bad entry in the phrase table
? (Vincent Nguyen)
2. Re: is there a way to remove a bad entry in the phrase table
? (Matthias Huck)
----------------------------------------------------------------------
Message: 1
Date: Thu, 24 Sep 2015 18:19:09 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Matthias Huck <mhuck@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <5604227D.4040506@neuf.fr>
Content-Type: text/plain; charset=utf-8; format=flowed
This is an interesting subject.

As a matter of fact, I have done several tests.
I ran into this need after realizing that, even though my results were
good in a standard "dev + test set" situation, I was getting strange
results with real-world documents.
That's why I investigated.

But you are right: removing so-called bad entries can have unexpected
results.

For instance, here is a test I did:

I trained a fr-en model on Europarl v7 (2 million sentences).
I tuned with a subset of 3K sentences.
I ran an evaluation on the full 2 million lines.
Then I removed the 90K sentences for which the score was below 0.2 and
retrained on the remaining 1,917,853 sentences.

In the end I got a higher percentage of sentences with a score above 0.2,
but at > 0.3 the two systems are similar, and at > 0.4 the initial
corpus is better.

Just weird.
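In case it's useful, the selection step itself boils down to something
like this (just a sketch; scores.txt is assumed to hold one score per
line, aligned line-by-line with the tokenized corpus files, and all file
names are made up):

  # keep only sentence pairs whose score is at least 0.2
  paste scores.txt corpus.fr corpus.en \
    | awk -F'\t' '$1 >= 0.2 { print $2 > "filtered.fr"; print $3 > "filtered.en" }'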
On 24/09/2015 at 16:42, Matthias Huck wrote:
> Hi,
>
> If your analysis revealed that there's an issue with only a few specific
> entries, then write regular expressions and grep them out. However, you
> risk that those entries are only a problem on the devtest set you're
> looking at, while on different input data other bad translation options
> will pop up.
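> Grepping them out could look something like this (the pattern is only
> an illustration; adapt it to the entries you actually want to drop):
>
>   # drop every entry whose target side is exactly "One Million Roofs"
>   zcat phrase-table.gz | grep -v ' ||| One Million Roofs ||| ' | gzip > phrase-table.filtered.gz
>
> Just double-check that the pattern doesn't also match entries you want
> to keep.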
>
> On Thu, 2015-09-24 at 16:08 +0200, Vincent Nguyen wrote:
>> Matthias,
>>
>> Pruning:
>> I use a cube pruning pop limit of 400 instead of the default values
>> (1000 or 5000).
>> I use MinScore 0.001.
> It seems to me that something like MinScore 2:0.001 should be effective
> for most of the bad phrases you copied into your original mail as an
> example.
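> (If you train with EMS, that would be something like
> score-settings = "--MinScore 2:0.001" in the TRAINING section; with
> train-model.perl directly you can pass it via -score-options. Treat the
> exact spelling as a sketch and check the documentation of your version.)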
>
>> I tried sigtest filtering once; it never worked.
> Why not?
>
>> table-limit=20
>> I have the feeling this is only for CreateOnDiskPt.
>> Am I wrong?
>> Does it work with ProcessPhrasetableMin?
> I think it works. The decoder does this, not the phrase table binarizer.
> You could run a simple experiment to verify: add
> -feature-overwrite 'TranslationModel0 table-limit=20' (or equivalent) to
> your decoder call.
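> A minimal check could look like this (paths are placeholders):
>
>   # decode once with the override, once without, and compare the outputs
>   moses -f moses.ini -feature-overwrite 'TranslationModel0 table-limit=20' < dev.fr > dev.out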
>
> Cheers,
> Matthias
>
>
>> On 24/09/2015 at 15:21, Matthias Huck wrote:
>>> Hi Vincent,
>>>
>>> Pruning the phrase table will discard many bad entries.
>>>
>>> The decoder is typically configured to load no more than a maximum
>>> number of translation options per distinct source side. Use
>>> table-limit=20 as a parameter to your translation model feature to limit
>>> the number of candidates to the top 20.
>>>
>>> Alternatively you can pre-prune the phrase table. The following page
>>> provides instructions:
>>> http://www.statmt.org/moses/?n=Advanced.RuleTables
>>>
>>> In case you want to remove just a handful of individual entries, I
>>> recommend grep -v on the Linux command line.
>>>
>>> Cheers,
>>> Matthias
>>>
>>>
>>> On Thu, 2015-09-24 at 11:05 +0100, Hieu Hoang wrote:
>>>> I've just added a new feature function that lets you give a list
>>>> of rules that you don't want to be used:
>>>> " 1 ||| One Million Roofs
>>>>
>>>> oui ||| no
>>>>
>>>> To use this list, add the following to your moses.ini file:
>>>>
>>>> [feature]
>>>> DeleteRules path=/path/to/list
>>>>
>>>> Not tested.
>>>>
>>>>
>>>>
>>>> Hieu Hoang
>>>> http://www.hoang.co.uk/hieu
>>>>
>>>>
>>>> On 24 September 2015 at 10:11, Vincent Nguyen <vnguyen@neuf.fr> wrote:
>>>>
>>>> Well, at times it does. The sequence
>>>> " 1 "
>>>> became
>>>> One Million Roofs
>>>> which is completely off.
>>>>
>>>>
>>>> " 1 " . ||| one . ||| 4.77044e-05 2.56689e-08 0.103519 0.0135382 ||| 1-0 3-1 ||| 2170 1 1 ||| |||
>>>> " 1 " une ||| " 1 " meaning ||| 0.0517593 0.00140486 0.103519 5.98457e-06 ||| 0-0 1-1 0-2 2-2 2-3 ||| 2 1 1 ||| |||
>>>> " 1 " ||| " 1 " meaning ||| 0.0517593 0.121628 0.0517593 5.98457e-06 ||| 0-0 1-1 0-2 2-2 2-3 ||| 2 2 1 ||| |||
>>>> " 1 " ||| one ||| 1.34779e-06 2.65512e-08 0.0517593 0.0141179 ||| 1-0 ||| 76806 2 1 ||| |||
>>>> " 1 + ||| ' one @-@ on ||| 0.0517593 8.76241e-09 0.0345062 2.43009e-07 ||| 0-0 2-0 1-1 ||| 2 3 1 ||| |||
>>>> " 1 + ||| ' one @-@ ||| 0.0129398 8.76241e-09 0.0345062 1.65217e-05 ||| 0-0 2-0 1-1 ||| 8 3 1 ||| |||
>>>> " 1 + ||| ' one ||| 0.000685554 8.76241e-09 0.0345062 0.00189493 ||| 0-0 2-0 1-1 ||| 151 3 1 ||| |||
>>>> " 1 . ||| '1 . ||| 0.103519 0.241693 0.0345062 5.37965e-05 ||| 0-0 1-0 2-1 ||| 1 3 1 ||| |||
>>>> " 1 . ||| " 1 . ||| 0.508332 0.34958 0.338888 0.180103 ||| 0-0 1-1 2-2 ||| 2 3 2 ||| |||
>>>> " 1 billion de dollars ||| $ 1 trillion of ||| 0.0207037 2.46862e-05 0.103519 0.0679424 ||| 4-0 1-1 2-2 3-3 ||| 5 1 1 ||| |||
>>>> " 1 billion de ||| 1 trillion of ||| 0.0345062 5.93019e-05 0.103519 0.161697 ||| 1-0 2-1 3-2 ||| 3 1 1 ||| |||
>>>> " 1 billion ||| 1 trillion ||| 0.00108967 0.000131965 0.103519 0.536768 ||| 1-0 2-1 ||| 95 1 1 ||| |||
>>>> " 1 milliard $ , ||| $ 1 billion ||| 0.00199074 2.23776e-06 0.103519 0.420148 ||| 3-0 1-1 2-2 ||| 52 1 1 ||| |||
>>>> " 1 milliard $ ||| $ 1 billion ||| 0.00199074 3.32223e-05 0.103519 0.420148 ||| 3-0 1-1 2-2 ||| 52 1 1 ||| |||
>>>> " 1 milliard d' euros ||| EUR 1 billion ||| 0.00026749 3.23583e-05 0.103519 0.179568 ||| 4-0 1-1 2-2 3-2 ||| 387 1 1 ||| |||
>>>> " 1 milliard d' ||| 1 billion ||| 0.000137475 6.11551e-05 0.103519 0.25129 ||| 1-0 2-1 3-1 ||| 753 1 1 ||| |||
>>>> " 1 milliard de dollars ||| $ 1 billion ||| 0.0195512 2.47433e-05 0.508332 0.105231 ||| 0-0 4-0 1-1 2-2 ||| 52 2 2 ||| |||
>>>> " 1 milliard de personnes ||| one billion people ||| 0.00252484 9.77577e-09 0.103519 0.00258395 ||| 2-0 1-1 2-1 4-2 ||| 41 1 1 ||| |||
>>>> " 1 milliard de ||| 1 billion of ||| 0.00941078 0.000159942 0.0517593 0.15086 ||| 1-0 2-1 3-2 ||| 11 2 1 ||| |||
>>>> " 1 milliard de ||| one billion ||| 0.000509944 4.32371e-08 0.0517593 0.00492989 ||| 2-0 1-1 2-1 ||| 203 2 1 ||| |||
>>>> " 1 milliard ||| 1 billion ||| 0.0026678 0.000355919 0.502213 0.500792 ||| 1-0 2-1 ||| 753 4 3 ||| |||
>>>> " 1 milliard ||| one billion ||| 0.000509944 3.43309e-07 0.0258796 0.00492989 ||| 2-0 1-1 2-1 ||| 203 4 1 ||| |||
>>>> " 1 million $ ||| $ 1 million ||| 0.0172531 1.31973e-05 0.103519 0.221619 ||| 0-0 3-0 1-1 2-2 ||| 6 1 1 ||| |||
>>>> " 1 million de toits ||| one million solar roofs ||| 0.0517593 5.86831e-10 0.103519 1.43348e-10 ||| 2-0 1-1 4-3 ||| 2 1 1 ||| |||
>>>> " 1 million de ||| one million solar ||| 0.0258796 9.85876e-10 0.0517593 3.44036e-10 ||| 2-0 1-1 ||| 4 2 1 ||| |||
>>>> " 1 million de ||| one million ||| 0.00021344 9.85876e-10 0.0517593 0.000202374 ||| 2-0 1-1 ||| 485 2 1 ||| |||
>>>> " 1 million ||| one million solar ||| 0.0258796 7.82802e-09 0.0517593 3.44036e-10 ||| 2-0 1-1 ||| 4 2 1 ||| |||
>>>> " 1 million ||| one million ||| 0.00021344 7.82802e-09 0.0517593 0.000202374 ||| 2-0 1-1 ||| 485 2 1 ||| |||
>>>> " 1 ou 2 % ||| one or two percent ||| 0.0258796 6.85867e-09 0.103519 1.36871e-06 ||| 1-0 2-1 3-2 4-3 ||| 4 1 1 ||| |||
>>>> " 1 ou 2 ||| one or two ||| 0.000164315 2.30435e-08 0.103519 0.00032742 ||| 1-0 2-1 3-2 ||| 630 1 1 ||| |||
>>>> " 1 ou ||| one or ||| 8.83264e-05 3.76903e-06 0.103519 0.0112293 ||| 1-0 2-1 ||| 1172 1 1 ||| |||
>>>> " 1 seul coup , ||| ' 1 shot , ||| 0.103519 1.88862e-06 0.103519 0.00165224 ||| 0-0 1-1 3-2 4-3 ||| 1 1 1 ||| |||
>>>> " 1 seul coup ||| ' 1 shot ||| 0.103519 2.45247e-06 0.103519 0.00222575 ||| 0-0 1-1 3-2 ||| 1 1 1 ||| |||
>>>> " 1 seul ||| ' 1 ||| 0.0129398 2.78897e-05 0.103519 0.214656 ||| 0-0 1-1 ||| 8 1 1 ||| |||
>>>> " 1 ||| ' 1 ||| 0.127083 0.278063 0.0391025 0.214656 ||| 0-0 1-1 ||| 8 26 2 ||| |||
>>>> " 1 ||| '1 ||| 0.103519 0.25 0.00398148 5.61e-05 ||| 0-0 1-0 ||| 1 26 1 ||| |||
>>>> " 1 ||| " 1 ||| 0.503492 0.361595 0.11619 0.187815 ||| 0-0 1-1 ||| 6 26 4 ||| |||
>>>> " 1 ||| 1 ||| 0.0010136 0.00278649 0.461538 0.805151 ||| 1-0 ||| 11839 26 12 ||| |||
>>>> " 1 ||| One Million Roofs ||| 0.103519 0.00213892 0.00398148 3.32314e-15 ||| 0-0 1-0 0-1 0-2 ||| 1 26 1 ||| |||
>>>> " 1 ||| hardly 1 ||| 0.0258796 0.00278649 0.00398148 1.73108e-05 ||| 1-1 ||| 4 26 1 ||| |||
>>>> " 1 ||| million solar ||| 0.0345062 3.55949e-06 0.00398148 3.29783e-09 ||| 1-0 ||| 3 26 1 ||| |||
>>>> " 1 ||| million ||| 5.83433e-06 3.55949e-06 0.00398148 0.0019399 ||| 1-0 ||| 17743 26 1 ||| |||
>>>> " 1 ||| of 1 ||| 0.000263406 0.00278649 0.00398148 0.0270917 ||| 1-1 ||| 393 26 1 ||| |||
>>>> " 1 ||| one ||| 1.32368e-05 5.22671e-06 0.0391025 0.0141179 ||| 1-0 ||| 76806 26 2 ||| |||
>>>> " 1,1 % ||| 1.1 % ||| 0.0022504 0.00241746 0.103519 0.875731 ||| 1-0 2-1 ||| 46 1 1 ||| |||
>>>> " 1,1 milliard d' euros ||| EUR 1.1 billion ||| 0.00544835 6.98053e-05 0.0517593 0.110019 ||| 3-0 4-0 1-1 2-1 2-2 ||| 19 2 1 ||| |||
>>>> " 1,1 milliard d' euros ||| by EUR 1.1 billion ||| 0.0345062 6.98053e-05 0.0517593 0.000791519 ||| 3-1 4-1 1-2 2-2 2-3 ||| 3 2 1 ||| |||
>>>>
>>>>
>>>>
>>>> On 24/09/2015 at 09:54, Felipe Sánchez Martínez wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > This is quite common. If you look at the scores, they are
>>>> > pretty low when they do not make sense, so, even though they
>>>> > are in the phrase table, most probably they will never be
>>>> > used for translation. I would not bother.
>>>> >
>>>> > Cheers
>>>> > --
>>>> > Felipe
>>>> >
>>>> > On 23/09/15 at 16:50, Vincent Nguyen wrote:
>>>> > > I agree and would like to.
>>>> > > But this is tricky; look at the first 30 lines of my phrase
>>>> > > table below.
>>>> > >
>>>> > > This happens a lot at the start of the table, where there are
>>>> > > &apos; or other odd codes and the EN/FR pairs do not match.
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > ! ! ! ! ||| ! ! ! ! ||| 0.103413 0.132185 0.103413 0.401758 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||
>>>> > > ! ! ! ) ||| ! ! ! ) ||| 0.339323 0.167884 0.508985 0.4246 ||| 0-0 1-0 2-0 2-1 2-2 3-3 ||| 3 2 2 ||| |||
>>>> > > ! ! ! ||| ! ! ! ||| 0.501834 0.219223 0.716905 0.50463 ||| 0-0 1-1 2-2 ||| 10 7 6 ||| |||
>>>> > > ! ! ! ||| budget ! ! ! ||| 0.0517067 0.219223 0.0147733 4.50635e-05 ||| 0-1 1-2 2-3 ||| 2 7 1 ||| |||
>>>> > > ! ! ) , ||| ! ! ) - , ||| 0.103413 0.111989 0.103413 0.00192967 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||
>>>> > > ! ! ) ||| ! ! ) ||| 0.103413 0.278429 0.103413 0.533321 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||
>>>> > > ! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||
>>>> > > ! ! ||| . ||| 4.65922e-08 6.71089e-07 0.00795487 0.140779 ||| 0-0 1-0 ||| 2.21954e+06 13 1 ||| |||
>>>> > > ! ! ||| budget ! ! ||| 0.0517067 0.363573 0.00795487 5.66022e-05 ||| 0-1 1-2 ||| 2 13 1 ||| |||
>>>> > > ! ! ||| nécessaire ! ! ||| 0.103413 0.363573 0.00795487 0.000130572 ||| 0-1 1-2 ||| 1 13 1 ||| |||
>>>> > > ! [ never again ! ||| ! ||| 6.51628e-06 5.42074e-13 0.103413 0.796143 ||| 0-0 4-0 ||| 15870 1 1 ||| |||
>>>> > > ! ] this is ||| tel est ||| 7.38667e-05 9.16191e-11 0.103413 0.00147917 ||| 2-0 3-1 ||| 1400 1 1 ||| |||
>>>> > > ! ] this ||| tel ||| 1.09594e-05 1.44188e-10 0.103413 0.0035893 ||| 2-0 ||| 9436 1 1 ||| |||
>>>> > > ! ] ||| ! ] ||| 0.103413 0.352335 0.103413 0.472387 ||| 0-0 1-1 ||| 1 1 1 ||| |||
>>>> > > ! & quot ; ||| ! " . et ||| 0.0517067 2.36396e-12 0.0517067 1.88268e-05 ||| 0-0 1-1 2-1 3-3 ||| 2 2 1 ||| |||
>>>> > > ! & quot ; ||| ! " ||| 0.000222394 1.44515e-11 0.0517067 0.518419 ||| 0-0 2-1 ||| 465 2 1 ||| |||
>>>> > > ! & quot ||| ! " . ||| 0.000662906 8.30626e-09 0.0344711 0.00232791 ||| 0-0 1-1 2-1 ||| 156 3 1 ||| |||
>>>> > > ! & quot ||| ! " ||| 0.00218918 8.30626e-09 0.339323 0.518419 ||| 0-0 2-1 ||| 465 3 2 ||| |||
>>>> > > ! & ||| ! ||| 6.51628e-06 7.21755e-05 0.103413 0.796143 ||| 0-0 ||| 15870 1 1 ||| |||
>>>> > > ! ' ] , addressed ||| ! " adressé ||| 0.103413 3.70838e-07 0.103413 0.00596848 ||| 0-0 1-1 2-1 4-2 ||| 1 1 1 ||| |||
>>>> > > ! ' ] , ||| ! " ||| 0.000222394 2.49698e-06 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
>>>> > > ! ' ] ||| ! " ||| 0.000222394 3.57128e-05 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
>>>> > > ! ' ' Alstom shares ||| l' on constate un dysfonctionnement ||| 0.0344711 5.62605e-16 0.103413 1.03361e-14 ||| 1-0 2-0 1-1 3-4 4-4 ||| 3 1 1 ||| |||
>>>> > > ! ' ' ||| l' on constate un ||| 0.0147733 1.56906e-11 0.0129267 2.2766e-12 ||| 1-0 2-0 1-1 ||| 7 8 1 ||| |||
>>>> > > ! ' ' ||| l' on constate ||| 0.000984889 1.56906e-11 0.0129267 2.36929e-10 ||| 1-0 2-0 1-1 ||| 105 8 1 ||| |||
>>>> > > ! ' ' ||| l' on ||| 6.76656e-06 1.56906e-11 0.0129267 6.18613e-06 ||| 1-0 2-0 1-1 ||| 15283 8 1 ||| |||
>>>> > > ! ' ' ||| ou que l' on constate ||| 0.0344711 1.56906e-11 0.0129267 4.69534e-15 ||| 1-2 2-2 1-3 ||| 3 8 1 ||| |||
>>>> > > ! ' ' ||| ou que l' on ||| 0.00304157 1.56906e-11 0.0129267 1.22594e-10 ||| 1-2 2-2 1-3 ||| 34 8 1 ||| |||
>>>> > > ! ' ' ||| que l' on constate un ||| 0.0344711 1.56906e-11 0.0129267 4.56092e-14 ||| 1-1 2-1 1-2 ||| 3 8 1 ||| |||
>>>> > > ! ' ' ||| que l' on constate ||| 0.00323167 1.56906e-11 0.0129267 4.74661e-12 ||| 1-1 2-1 1-2 ||| 32 8 1 ||| |||
>>>> > >
>>>> > >
>>>> > >
>>>> > > On 23/09/2015 at 15:12, Tom Hoar wrote:
>>>> > > > Vincent,
>>>> > > >
>>>> > > > If you suspect bad entries, isn't it better to address
>>>> > > > the root of the
>>>> > > > problem and prepare your training corpus better?
>>>> > > >
>>>> > > >
>>>> > > > On 9/23/2015 6:46 PM, moses-support-request@mit.edu
>>>> > > > wrote:
>>>> > > > > Date: Tue, 22 Sep 2015 20:24:02 +0200
>>>> > > > > From: Philipp Koehn<phi@jhu.edu>
>>>> > > > > Subject: Re: [Moses-support] is there a way to remove
>>>> > > > > a bad entry in
>>>> > > > > the phrase table ?
>>>> > > > > To: Vincent Nguyen<vnguyen@neuf.fr>
>>>> > > > > Cc: moses-support<moses-support@mit.edu>
>>>> > > > >
>>>> > > > > Hi,
>>>> > > > >
>>>> > > > > you can remove it manually (just edit the text file),
>>>> > > > > there will be no
>>>> > > > > negative consequences.
>>>> > > > >
>>>> > > > > However, it is not a realistic strategy to try to
>>>> > > > > remove by hand every
>>>> > > > > offending phrase table entry.
>>>> > > > >
>>>> > > > > -phi
>>>> > > > >
>>>> > > > > On Tue, Sep 22, 2015 at 4:05 PM, Vincent
>>>> > > > > Nguyen<vnguyen@neuf.fr> wrote:
>>>> > > > >
>>>> > > > > > >Hi,
>>>> > > > > > >
>>>> > > > > > >I was wondering if after an analysis of the
>>>> > > > > > BLEU-Annotation file we
>>>> > > > > > >realize that there must be a bad entry in the
>>>> > > > > > phrase table,
>>>> > > > > > >we could remove it manually or in some other
>>>> > > > > > ways ?
>>>> > > > > > >
>>>> > > > > > >Thanks.
>>>> > > > > > >V.
>>>> > > > > > >_______________________________________________
>>>> > > > > > >Moses-support mailing list
>>>> > > > > > >Moses-support@mit.edu
>>>> > > > > > >http://mailman.mit.edu/mailman/listinfo/moses-support
>>>> > > > > > >
>>>> > > >
>>>> > > > --
>>>> > > > Best regards,
>>>> > > >
>>>> > > > Tom Hoar
>>>> > > > Chief Executive Officer
>>>> > > > /*Precision Translation Tools Pte Ltd*/
>>>> > > > Singapore/Thailand
>>>> > > > Web: www.precisiontranslationtools.com
>>>> > > > <http://www.precisiontranslationtools.com>
>>>> > > > Thailand Mobile: +66 87 345-1875
>>>> > > > Skype: tahoar
>>>> > > >
>>>> > > >
>>>> > > > _______________________________________________
>>>> > > > Moses-support mailing list
>>>> > > > Moses-support@mit.edu
>>>> > > > http://mailman.mit.edu/mailman/listinfo/moses-support
>>>> > >
>>>> > >
>>>> > >
>>>> > > _______________________________________________
>>>> > > Moses-support mailing list
>>>> > > Moses-support@mit.edu
>>>> > > http://mailman.mit.edu/mailman/listinfo/moses-support
>>>> > >
>>>> >
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>
>
------------------------------
Message: 2
Date: Thu, 24 Sep 2015 21:13:34 +0100
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Vincent Nguyen <vnguyen@neuf.fr>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <1443125614.13101.924.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"
Hi Vincent,
This is a different topic, and I'm not completely clear about what
exactly you did here. Did you decode the source side of the parallel
training data, conduct sentence selection by applying a threshold on the
decoder score, and extract a new phrase table from the selected fraction
of the original parallel training data? If this is the case, I have some
comments:
- Be careful when you translate training data. The system knows these
sentences and does things like frequently applying long singleton
phrases that have been extracted from the very same sentence.
https://aclweb.org/anthology/P/P10/P10-1049.pdf
- Longer sentences may have worse model scores than shorter ones.
Consider normalizing by sentence length if you use the model score for
data selection (see the sketch after this list).
Difficult sentences generally have worse model scores than easy ones but
may still be useful for training. You may end up keeping only the parts
of the data that are easy to translate or highly redundant in the corpus.
- You probably see no out-of-vocabulary words (OOVs) when translating
training data, or very few of them (depending on word alignment, phrase
extraction method, and phrase table pruning), but be aware that if there
are OOVs, this may affect the model score a lot.
- Check to what extent the sentence selection reduces the vocabulary of
your system (also sketched below).
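Two quick sketches for the last two points (all file names are made up;
adapt them to your setup):

  # normalize model scores by sentence length, assuming a tab-separated
  # file with the score in column 1 and the tokenized sentence in column 2
  awk -F'\t' '{ n = split($2, w, " "); print $1 / n "\t" $2 }' scored.txt > scored.norm.txt

  # compare vocabulary sizes before and after selection
  tr ' ' '\n' < full.fr     | sort -u | wc -l
  tr ' ' '\n' < selected.fr | sort -u | wc -l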
Last but not least, two more general comments:
- You need dev and test sets that are similar to the type of real-world
documents that you're building your system for. Don't tune on Europarl
if you eventually want to translate pharmaceutical patents, for
instance. Try to collect in-domain training data as well.
- In case you have in-domain and out-of-domain training corpora, you can
try modified Moore-Lewis filtering for data selection.
https://aclweb.org/anthology/D/D11/D11-1033.pdf
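Once you have per-sentence language model scores, the selection step
itself is simple. A sketch, assuming lm_in.scores and lm_out.scores hold
one total log-probability per line from the in-domain and out-of-domain
LMs, aligned with the corpus (file names and the cutoff are placeholders):

  # rank sentences by log P_in - log P_out and keep the top 500k
  paste lm_in.scores lm_out.scores corpus.fr \
    | awk -F'\t' '{ print ($1 - $2) "\t" $3 }' \
    | sort -g -r | head -n 500000 | cut -f2- > selected.fr

Sentences that the in-domain LM prefers relative to the out-of-domain LM
come out on top, which is the core idea of the method.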
Cheers,
Matthias
On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
> This is an interesting subject.
>
> As a matter of fact, I have done several tests.
> I ran into this need after realizing that, even though my results were
> good in a standard "dev + test set" situation, I was getting strange
> results with real-world documents.
> That's why I investigated.
>
> But you are right: removing so-called bad entries can have unexpected
> results.
>
> For instance, here is a test I did:
>
> I trained a fr-en model on Europarl v7 (2 million sentences).
> I tuned with a subset of 3K sentences.
> I ran an evaluation on the full 2 million lines.
> Then I removed the 90K sentences for which the score was below 0.2 and
> retrained on the remaining 1,917,853 sentences.
>
> In the end I got a higher percentage of sentences with a score above 0.2,
> but at > 0.3 the two systems are similar, and at > 0.4 the initial
> corpus is better.
>
> Just weird.
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 107, Issue 59
**********************************************