Moses-support Digest, Vol 145, Issue 2

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: German tokenizer may fail with numeric endings (Ergun Bicici)
2. Re: German tokenizer may fail with numeric endings (Ozan Çağlayan)
3. Re: German tokenizer may fail with numeric endings (Ergun Bicici)
4. Re: German tokenizer may fail with numeric endings (Hieu Hoang)


----------------------------------------------------------------------

Message: 1
Date: Wed, 7 Nov 2018 00:41:47 +0300
From: Ergun Bicici <bicici@gmail.com>
Subject: Re: [Moses-support] German tokenizer may fail with numeric
endings
To: ozancag@gmail.com
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAB59qTOnp1iVxKtfHxpZerbt3EACZ-2NhBHi=vze76Xyn9rvrQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

There might be a rule that prevents the split. The scripts contain
language-specific tokenization rules, which are checked in sequence.

Did you try all 1-99? :)

On Mon, Nov 5, 2018 at 9:15 PM Ozan Çağlayan <ozancag@gmail.com> wrote:

> Hello,
>
> I just discovered that the German tokenizer does not split the final <dot>
> if preceded by a number. This is because of the nonbreaking prefixes file
> which lists ordinals in the form '<number>.'. Since the list is between
> 1-99, for numbers > 99, the tokenizer works correctly. Here's a sentence
> from europarl:
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *3.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *3.*
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *100.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *100 .*
>
>
>
> --
> Ozan Caglayan
> PhD student @ University of Le Mans
> Team LST -- Language and Speech Technology
> http://www.ozancaglayan.com
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


--

Regards,
Ergun

------------------------------

Message: 2
Date: Tue, 6 Nov 2018 23:39:26 +0100
From: Ozan Çağlayan <ozancag@gmail.com>
Subject: Re: [Moses-support] German tokenizer may fail with numeric
endings
To: undisclosed-recipients:;
Cc: moses-support@mit.edu
Message-ID:
<CAFub=KT0s3oX-acd=DPLc5EbAiH8bC2w=E9ErLMJax4_CscCsg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Yes, the rules come from the nonbreaking_prefixes files, which are text
files listing the prefixes that, when followed by a <dot>, should not be
split from it. But I think this rule should not be applied if the prefix is
actually a suffix of the sentence. Similar situations arise for French and
other languages as well. In French, "sec." is a non-breaking prefix, the
abbreviation for "seconds", but "sec" also means "dry". So if a sentence
ends with the "dry" meaning of "sec", the <dot> is not tokenized either.

As the size of the corpora goes to infinity, this means that every
non-breaking prefix for a language will end up, with its <dot> attached,
in the model vocabulary for NMT.
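
To make the behaviour described above concrete, here is a minimal Python
sketch. It is a simplified model, not the actual logic of tokenizer.perl:
the function name split_periods is hypothetical, and the prefix set is an
assumed stand-in for the German nonbreaking prefixes file that holds only
the ordinals 1-99 mentioned in this thread.

```python
# Simplified model of the behaviour described in this thread.
# It is NOT the actual logic of tokenizer.perl.

# Assumption: stand-in for the German nonbreaking prefixes file,
# containing only the ordinals 1-99.
NONBREAKING_PREFIXES = {str(n) for n in range(1, 100)}


def split_periods(sentence, prefixes=NONBREAKING_PREFIXES):
    """Detach a trailing '.' from each word unless the word is a listed prefix."""
    out = []
    for word in sentence.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            out.append(stem + " .")  # ordinary word: split off the period
        else:
            out.append(word)         # listed prefix (or no period): unchanged
    return " ".join(out)


print(split_periods("die Änderungsanträge 2 und 3."))    # "3." stays attached
print(split_periods("die Änderungsanträge 2 and 100.".replace("and", "und")))  # "100 ." is split
```

Because "3" is in the set, the sentence-final period is treated as part of
an ordinal, while "100" is not listed and is split as expected.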

------------------------------

Message: 3
Date: Wed, 7 Nov 2018 09:33:15 +0300
From: Ergun Bicici <bicici@gmail.com>
Subject: Re: [Moses-support] German tokenizer may fail with numeric
endings
To: Ozan Çağlayan <ozancag@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAB59qTMDESQPLEea=DWZp8tsoO_PYqT_Br2Vbwj_h+Me_hSs5w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

The funny part is trying all of 1-99 :)

Regarding "prefix is actually a suffix of the sentence": this need not hold,
since there can be itemized lists, e.g. "1. one microsoft way from 9 to 1."
Such sentences are frequently found in Europarl.



--

Regards,
Ergun

------------------------------

Message: 4
Date: Wed, 7 Nov 2018 08:57:57 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] German tokenizer may fail with numeric
endings
To: Ozan Çağlayan <ozancag@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbhRZZ-WBYMcuk-DMMjKuAtP=jMPezh58jwLkSFCJ2yvnw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I think you have a point. If you change tokenizer.perl to avoid applying
the non-breaking prefix rule to the last word, please send me the change.
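
As a rough illustration of that change, the Python sketch below exempts the
last word of a sentence from the non-breaking prefix rule. It is a
hypothetical model, not a patch to tokenizer.perl: the function name, the
exempt_last flag, and the small prefix set (ordinals 1-99 plus "sec") are
assumptions made for the example, and the French sentence is invented for
illustration.

```python
# Hypothetical sketch of the proposed change; not a patch to tokenizer.perl.

# Assumption: stand-in for a nonbreaking prefixes file.
PREFIXES = {str(n) for n in range(1, 100)} | {"sec"}


def tokenize_periods(sentence, prefixes=PREFIXES, exempt_last=True):
    """Split trailing periods, optionally ignoring the prefix list for the last word."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        stem = word[:-1]
        is_last = i == len(words) - 1
        # A listed prefix keeps its period, except at the end of the sentence
        # when exempt_last is enabled.
        protected = stem in prefixes and not (exempt_last and is_last)
        if word.endswith(".") and stem and not protected:
            out.append(stem + " .")
        else:
            out.append(word)
    return " ".join(out)


print(tokenize_periods("1. one microsoft way from 9 to 1."))
print(tokenize_periods("Le vin est sec."))
print(tokenize_periods("Le vin est sec.", exempt_last=False))
```

With the exemption, the leading list marker "1." is left intact while the
sentence-final "1." and "sec." are split; with exempt_last=False the final
period stays attached, which is the behaviour reported in this thread.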

Hieu Hoang
Sent while bumping into things


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 145, Issue 2
*********************************************
