Moses-support Digest, Vol 112, Issue 3

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Polysynthetic languages? (Marcin Junczys-Dowmunt)
2. Re: Polysynthetic languages? (Rico Sennrich)
3. Re: Polysynthetic languages? (Marcin Junczys-Dowmunt)


----------------------------------------------------------------------

Message: 1
Date: Mon, 01 Feb 2016 14:31:14 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Polysynthetic languages?
To: Michael Joyner <mjoyner@vbservices.net>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <b32f18022889d2275c1ddee0e82ae999@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"



Hi Mike,

Maybe take a look at Rico's tool for handling unknown words in neural
machine translation. I have been playing around with it for
Russian-English in standard phrase-based SMT, with some success. I am
just not sure whether your small corpora will be enough to learn useful
segmentations, though.

It's an unsupervised method for word segmentation. For Russian-English I
created a code dictionary of the 100,000 most frequent segments per
language. Unseen tokens will get segmented. The segmentation is not
necessarily similar to a linguistically correct segmentation, though. You
will probably want to try smaller numbers.
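
To make the idea concrete, here is a minimal sketch of a byte pair
encoding style segmenter of the kind the tool implements. It is an
illustration under stated assumptions, not the tool's actual code: the
function names (learn_merges, merge_pair, segment) and the "</w>"
end-of-word symbol are hypothetical choices made for this example, and
the number of merges plays the role of the dictionary size mentioned
above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge operations from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word symbol
    # ("</w>" is an assumption of this sketch).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    # Drop the end-of-word symbol before returning the segments.
    return [s.replace("</w>", "") for s in symbols if s != "</w>"]


if __name__ == "__main__":
    merges = learn_merges(["abab"] * 3 + ["ab"] * 2, num_merges=10)
    print(segment("abab", merges))  # a frequent word stays whole
    print(segment("abba", merges))  # an unseen word is split into known segments
```

With a small corpus, a low num_merges keeps the segments short and
frequent, which is the reason for trying smaller numbers.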

Best,

Marcin

On 2016-02-01 14:12, Michael Joyner wrote:

> I am trying to use Moses with Cherokee, using the New Testament and Genesis as the primary corpus. I am feeding it the WEB and BBE as the English texts at the moment.
>
> As Cherokee uses bound pronouns, has no articles, and has almost no analogues of prepositions (these features are mostly expressed as verb infixes), is there a corpus-adjustment technique that can improve the phrase mapping between Cherokee and English?
>
> I am currently doing Cherokee => English.
>
> Thanks, Mike
> --
>
> WEB: World English Bible (Public Domain)
> BBE: Basic English Bible (Public Domain)
>
> * Learn the Cherokee language: http://jalagigawoni.gnomio.com/ [2]
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support [1]



Links:
------
[1] http://mailman.mit.edu/mailman/listinfo/moses-support
[2] http://jalagigawoni.gnomio.com/

------------------------------

Message: 2
Date: Mon, 1 Feb 2016 14:04:48 +0000
From: Rico Sennrich <rico.sennrich@gmx.ch>
Subject: Re: [Moses-support] Polysynthetic languages?
To: moses-support@mit.edu
Message-ID: <56AF6600.9080007@gmx.ch>
Content-Type: text/plain; charset="utf-8"

Hi Mike,

Here's a link to the tool Marcin mentioned:
https://github.com/rsennrich/subword-nmt

I haven't tried it on phrase-based MT myself, but feel free to give it a
try.
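
In a phrase-based pipeline, the segmented output has to be turned back
into whole words after decoding. The sketch below assumes the "@@"
continuation-marker convention that subword-nmt uses by default; the
function names (mark, restore) are hypothetical and only illustrate the
idea.

```python
import re


def mark(segments, marker="@@"):
    """Join the segments of one word, marking every non-final segment."""
    return " ".join([s + marker for s in segments[:-1]] + segments[-1:])


def restore(line, marker="@@"):
    """Undo the segmentation in a line by removing markers and the space after them."""
    return re.sub(re.escape(marker) + r"( |$)", "", line)


if __name__ == "__main__":
    marked = mark(["ab", "b", "a"])
    print(marked)                        # segments of one word, joined by markers
    print(restore(marked + " is here"))  # the word is whole again
```

Both sides of the training corpus would be segmented before training, and
restore would be applied to the decoder output.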

You could also try other unsupervised morpheme segmenters like
morfessor: https://github.com/aalto-speech/morfessor

I don't know if there are any segmentation methods specific to Cherokee.

best wishes,
Rico


------------------------------

Message: 3
Date: Mon, 01 Feb 2016 15:48:06 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Polysynthetic languages?
To: Rico Sennrich <rico.sennrich@gmx.ch>
Cc: moses-support@mit.edu
Message-ID: <5471ca2ba66b286a74443ada0d1016b9@amu.edu.pl>
Content-Type: text/plain; charset="utf-8"



Oh yes! The link, good catch :)


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 112, Issue 3
*********************************************
