Moses-support Digest, Vol 137, Issue 9

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. How to improve BLEU score with low-resource language?
(Petro ORYNYCZ-GLEASON)
2. Re: How to improve BLEU score with low-resource language?
(Aileen Joan Vicente)
3. Re: How to improve BLEU score with low-resource language?
(amittai)


----------------------------------------------------------------------

Message: 1
Date: Tue, 20 Mar 2018 18:29:52 +0100
From: Petro ORYNYCZ-GLEASON <pgleasonjr@gmail.com>
Subject: [Moses-support] How to improve BLEU score with low-resource
language?
To: moses-support@mit.edu
Cc: Michael Decerbo <michaeldecerbo@gmail.com>
Message-ID:
<CAM00gxiRq9qZR-HQCoghQKHC=Tf3mCj+_M02Re5+UZqJtysk3A@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

Dear Colleagues,
We are using Moses to revitalize Lemko, an endangered low-resource
language. We have 70,000 Lemko words in 3,387 segments perfectly
translated into native English and perfectly aligned.
Our current BLEU score is about 0.10.
As far as hardware goes, we're using the cloud: an Amazon EC2 p2.xlarge
(1 GPU, 4 vCPUs, 61 GiB RAM).
Questions:
- How should we divide our precious 3,387 bilingual segments into
training, tuning, and test data? What ratio is ideal?
- Considering that, at this point, bilingual content is much dearer to
us than processing power (AWS costs us USD 0.90 per hour, while
translation costs us USD 0.15 per word), how do we make the most of
what we've got?
- Is there anything we could do other than the default settings that
might lead to a large improvement in the BLEU score?

Current training model:
~/workspace/mosesdecoder/scripts/training/train-model.perl \
--parallel --mgiza-cpus 4 \
-root-dir train \
--corpus ~/corpus/train.ru-en.clean \
--f ru --e en \
--alignment grow-diag-final-and \
--reordering msd-bidirectional-fe \
--lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
-external-bin-dir ~/workspace/bin/training-tools/mgizapp

Current tuning model:
~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini \
--mertdir ~/workspace/mosesdecoder/bin/ \
--decoder-flags="-threads 4"

Thanks for your help!


------------------------------

Message: 2
Date: Wed, 21 Mar 2018 07:06:28 +0800
From: Aileen Joan Vicente <aovicente@up.edu.ph>
Subject: Re: [Moses-support] How to improve BLEU score with
low-resource language?
To: Petro ORYNYCZ-GLEASON <pgleasonjr@gmail.com>
Cc: moses-support <moses-support@mit.edu>, Michael Decerbo
<michaeldecerbo@gmail.com>
Message-ID:
<CAHEHrW1mfu38zLq2d3Vm=s0AqzDDm=21YP49yy_Xd9Tu4nHmqg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Would love to hear inputs from others. I am working on a low-resource
Chavacano corpus too.


------------------------------

Message: 3
Date: Tue, 20 Mar 2018 21:19:04 -0400
From: amittai <amittai@umiacs.umd.edu>
Subject: Re: [Moses-support] How to improve BLEU score with
low-resource language?
To: Aileen Joan Vicente <aovicente@up.edu.ph>, Petro ORYNYCZ-GLEASON
<pgleasonjr@gmail.com>
Cc: moses-support <moses-support@mit.edu>, Michael Decerbo
<michaeldecerbo@gmail.com>
Message-ID: <2e680c3083d6d52b1151e5ff7baeab55@umiacs.umd.edu>
Content-Type: text/plain; charset=US-ASCII; format=flowed

Hi --

For what it's worth, those are wonderful goals, and I hope you succeed.
The silence on the mailing list is just a large number of people not
wanting to be the one to point out that you are in dire need of more
data ;) But it sounds like you know that, and want to build what you
can with what you've got. Here are my opinions; others might have
different takes.

-- I'd say a Train/Tune/Test ratio of 80%:10%:10% is textbook. This
means a tuning set and a test set of roughly 300 lines each. That's
tiny, but we used test sets that size around 2005, so there is good
precedent. If you can afford it, try some cross-validation with
different tune/test subsets. (A small splitting sketch follows after
this list.)

-- What to do with the budget? I'd spend 90% of the money on getting
more bilingual data, and then build an off-the-shelf MT system with
the rest.

-- Not sure that many internal system settings can compensate for a
fundamental lack of data. I'd make sure my pre- and post-processing
setup made my output look as fluent as possible. MT systems can make
very consistent mistakes in the output, and some of them can be patched
up with regexes.
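
For the split itself, here's a minimal sketch of one way to do it in
Python (the file names are placeholders for your own tokenized, aligned
corpus; the fixed random seed just makes the split reproducible):

import random

# Placeholder file names; substitute your own aligned corpus files.
SRC, TGT = "corpus.lemko", "corpus.en"

with open(SRC, encoding="utf-8") as f:
    src_lines = f.read().splitlines()
with open(TGT, encoding="utf-8") as f:
    tgt_lines = f.read().splitlines()
assert len(src_lines) == len(tgt_lines), "corpus sides must stay aligned"

pairs = list(zip(src_lines, tgt_lines))
random.Random(1234).shuffle(pairs)   # fixed seed => reproducible split

n = len(pairs)
n_tune = n_test = n // 10            # roughly 10% each
subsets = {
    "test":  pairs[:n_test],
    "tune":  pairs[n_test:n_test + n_tune],
    "train": pairs[n_test + n_tune:],   # the remaining ~80%
}

for name, subset in subsets.items():
    with open(name + ".lemko", "w", encoding="utf-8") as fs, \
         open(name + ".en", "w", encoding="utf-8") as ft:
        for s, t in subset:
            fs.write(s + "\n")
            ft.write(t + "\n")

Shuffling before splitting matters if the corpus is ordered by document
or topic; otherwise the tune and test sets won't be representative of
the whole.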

Point #2 raises the question of _which_ data to have translated... If it
were me building a Lemko--EN system, I'd do it like this:

1. Find a (monolingual) English corpus of exactly the kind of data I
want my MT system to be good at. (We're being realistic, right? Lemko
translations of the entire internet will have to wait until after we
have good Lemko translations of e.g. government forms, street signs, or
tourist phrases.) Accuracy is more important than size.
Let's call this the REPRESENTATIVE (REPR) corpus.

2. Assemble a (monolingual) English corpus of plausible data, meaning
sentences that look like they might be helpful (so not, say, the UN
corpus) and that I could pay to have some of them translated. If
nothing else, this can just be corpus #1 (REPR), but I'd make it as
large as I could without extra effort.
Call this the UNADAPTED or AVAILABLE (AVAIL) corpus.

3. Put my bilingual Lemko--EN data in a small pile, and call it the
SEED. Maybe pat it on the head, too, and tell it I'm working to find
some friends. This is the data I already have translated.

I want to eventually be able to bilingually model the REPR corpus (by
training a system). I can't do that, and I can't use my Lemko data to
figure out how, either. What I *can* do is use my English data to figure
out:

What sentences from AVAIL should I add to SEED in order to better model
REPR?

Monolingually, this means:
"I want to build a LM on {SEED plus some data}, and I want the LM to
have the lowest possible perplexity on REPR. Which sentences should I
add to SEED from AVAIL in order to do that?"

The English sentences I move from AVAIL to SEED in order to better model
REPR are precisely the sentences that I should pay to have translated.
This is because these are the sentences in AVAIL with the most
information about the REPR corpus that is not already in SEED.
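
To make the idea concrete, here is a toy sketch in Python: a plain
unigram cross-entropy difference that scores each AVAIL sentence by
"likely under an LM estimated on REPR, unlikely under one estimated on
SEED". This is only an illustration of the intuition, not the cynical
algorithm itself.

import math
from collections import Counter

def unigram_logprob(lines):
    """Add-one-smoothed unigram log-probabilities from a list of sentences."""
    counts = Counter(tok for line in lines for tok in line.split())
    total = sum(counts.values())
    vocab = len(counts) + 1                       # +1 for unseen tokens
    return lambda tok: math.log((counts.get(tok, 0) + 1) / (total + vocab))

def score(sentence, lp_repr, lp_seed):
    """Cross-entropy difference: lower = more REPR-like, less SEED-like."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return sum(lp_seed(t) - lp_repr(t) for t in toks) / len(toks)

def rank_avail(repr_lines, seed_lines, avail_lines):
    """Return AVAIL sentences, most useful (for modelling REPR) first."""
    lp_repr = unigram_logprob(repr_lines)
    lp_seed = unigram_logprob(seed_lines)
    return sorted(avail_lines, key=lambda s: score(s, lp_repr, lp_seed))

A real selection method uses proper language models and re-scores as
sentences are added to SEED; this sketch only ranks once, but it
produces the same kind of "translate these first" list.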

I've written a tool that can do this:
https://github.com/amittai/cynical

There might be other tools, and they might be better, but I'm not aware
of them. It'll output the sentences in AVAIL, ranked by how useful
they are to me. I'd go down the list and translate as many as I could
afford. If at some point in the future I got more money, I could
continue bootstrapping by re-running the algorithm with the larger SEED
corpus containing all my translated data.

"Cynical selection" was originally intended for regular domain
adaptation stuff, but it can also do the monolingual corpus-growing that
you might want. Documentation is mostly inside the code at the moment.
For now, to run it, edit the bash wrapper script to point to your
files, etc., and then just run 'bash amittai-cynical-wrapper.sh'.

I think these settings might be useful:
task_distribution_file="representative_sentences.en"
unadapted_distribution_file="all_plausible_data.en"
seed_corpus_file="bilingual_data.en"
available_corpus_file=$unadapted_distribution_file
batchmode=0 ## disable it!
numlines=50000 ## stop after 50k lines, or whatever you think your budget allows for

It can be quite memory-intensive if AVAIL is large. If you have hardware
constraints, try playing with the following settings:
mincount=20 ## if REPR is really big, increase mincount
and set $save_memory=1 in the selection script itself.

If you (or anyone) run into difficulties, just open a GitHub issue here:
https://github.com/amittai/cynical/issues
and I'd be more than happy to help debug, clarify, walk through steps,
etc.

Cheers,
~amittai


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 137, Issue 9
*********************************************
