Moses-support Digest, Vol 162, Issue 12

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: phrase-table with ' " and other strage things.
Additional corpus cleaning necessary? (Philipp Koehn)
2. Re: phrase-table with ' " and other strage things.
Additional corpus cleaning necessary? (Artem Shevchenko)

----------------------------------------------------------------------

Message: 1
Date: Thu, 16 Apr 2020 17:47:36 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] phrase-table with ' " and other
strage things. Additional corpus cleaning necessary?
To: Artem Shevchenko <shevart@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDA4hu_U7YfwiPv_DLOrUoiArq8sL3s7f+XB4D2MU0A+nw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

These items are introduced by the tokenizer. They are used to escape
characters that have special meaning in (some) Moses components.

They should show up in the phrase table, as you show them. Any input text
that is pre-processed with the tokenizer will have them, and any output
that is post-processed with the detokenizer will have the original
characters restored.
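
The escaping described above can be sketched in Python. This is a minimal
illustration, not the Moses implementation: the entity table is an assumption
based on the escapes applied by the Moses tokenizer script, and the function
names `escape_special` and `unescape_special` are hypothetical.

```python
# Assumed mapping from special characters to the entities written by the
# tokenizer. "&" is listed first so that it is escaped before any entity
# is produced, and unescaped last (the list is reversed when unescaping).
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special(text: str) -> str:
    """Replace special characters with entities (tokenizer side)."""
    for char, entity in ESCAPES:
        text = text.replace(char, entity)
    return text


def unescape_special(text: str) -> str:
    """Restore the original characters (detokenizer side)."""
    for char, entity in reversed(ESCAPES):
        text = text.replace(entity, char)
    return text


tokenized = escape_special('" ( Das sind sehr')
print(tokenized)                    # escaped form, as stored in the phrase table
print(unescape_special(tokenized))  # original characters restored
```

Under this sketch, the escaped entities are ordinary tokens during training
and decoding, and only the final detokenization step turns them back into
quotation marks.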

-phi

On Sat, Apr 4, 2020 at 7:44 PM Artem Shevchenko <shevart@gmail.com> wrote:

> Hello,
>
> following the manual for baseline creation, I have trained the model
> using the Europarl v9 de-en pair.
> Now I observe that the resulting phrase table contains a lot of noise.
>
> E.g. a lot of "&apos; ", "&quot;" which seem to distort the model and
> decoder.
> E.g. truecasing did not work properly with those special symbols:
>
> &quot; ( Das sind sehr ||| &apos; ( these are very ||| 0.5 2.47962e-05
> 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||
>
> Did you do any additional purification of the corpus before training?
> Please share your experience.
>
> Artem Shevchenko
>
>

------------------------------

Message: 2
Date: Fri, 17 Apr 2020 11:27:59 +0200
From: Artem Shevchenko <shevart@gmail.com>
Subject: Re: [Moses-support] phrase-table with ' " and other
strage things. Additional corpus cleaning necessary?
To: Philipp Koehn <phi@jhu.edu>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CACmqYH2XHihcowcr8JYuOwdO18Twweg5OuWWgpN7CDMrwUfLww@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,

thank you for your response.
However I'm not quite sure I understand it right.

My observation is that these special symbols are harmful in the training
corpus: the truecaser and decoder, for example, get confused by them and
do not work properly.
In the example I gave:
&quot; ( Das sind sehr ||| &apos; ( these are very ||| 0.5 2.47962e-05
0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||

it is not correct that "Das" in the German "Das sind sehr" is translated
into lowercase "these".

Also, the resulting entry is very specific because of the quotation marks,
so such entries are just "noise": they increase the size of the phrase
table without adding any value.
It would be much better to have a translation table without quotation
marks, like:
das sind sehr ||| these are very ||| 0.5 2.47962e-05 0.333333 7.4064e-05
||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||
The quotation marks can be translated separately as well, like:
&quot; ||| &apos; ||| ......

So either the tokenizer does not work well, or one needs additional
purification steps afterwards to produce a clean corpus for training.
Right?
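
To inspect entries like the ones quoted above, a phrase-table line can be
split on the "|||" field separator. The following is a rough sketch, assuming
the field order seen in the example (source phrase, target phrase, scores,
word alignment, counts) and writing the tokenizer's escapes as `&quot;` and
`&apos;`. The `PhraseEntry` class and `parse_phrase_line` function are
hypothetical names, not part of Moses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhraseEntry:
    source: List[str]                 # source-side tokens
    target: List[str]                 # target-side tokens
    scores: List[float]               # translation model scores
    alignment: List[Tuple[int, int]]  # (source index, target index) pairs


def parse_phrase_line(line: str) -> PhraseEntry:
    """Split one phrase-table line into its first four fields."""
    fields = [field.strip() for field in line.split("|||")]
    source, target, scores, alignment = fields[:4]
    return PhraseEntry(
        source=source.split(),
        target=target.split(),
        scores=[float(s) for s in scores.split()],
        alignment=[
            (int(pair.split("-")[0]), int(pair.split("-")[1]))
            for pair in alignment.split()
        ],
    )


line = (
    "&quot; ( Das sind sehr ||| &apos; ( these are very ||| "
    "0.5 2.47962e-05 0.333333 7.4064e-05 ||| "
    "0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||"
)
entry = parse_phrase_line(line)

# Print each aligned token pair, e.g. which target token "Das" maps to.
for src_idx, tgt_idx in entry.alignment:
    print(entry.source[src_idx], "->", entry.target[tgt_idx])
```

Printing the aligned pairs makes it easy to see which tokens in an entry are
escaped punctuation and which are actual words.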

Regards
Artem Shevchenko


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 162, Issue 12
**********************************************
