Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: How to improve BLEU score with low-resource language?
(Petro ORYNYCZ-GLEASON)
2. Re: How to improve BLEU score with low-resource language?
(Scherrer, Yves)
----------------------------------------------------------------------
Message: 1
Date: Mon, 26 Mar 2018 20:11:43 +0200
From: Petro ORYNYCZ-GLEASON <pgleasonjr@gmail.com>
Subject: Re: [Moses-support] How to improve BLEU score with
low-resource language?
To: amittai <amittai@umiacs.umd.edu>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAM00gxgf0srEV8A6bxrN+eC0VqYndfUFkKRyY06F4-1oaQTgCg@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Hi Amittai,
Thanks for all your help. I've already implemented a lot of your
suggestions and hope to implement more this month or next.
The output looks a lot better after including some language models as
you suggested. That dinged the internal BLEU score a bit, but it's
worth it.
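For anyone following along, the extra language models can be built and
binarized with the KenLM tools that ship with Moses, roughly like this (the
paths and the 3-gram order here are only illustrative):

# estimate a 3-gram LM from truecased monolingual English
~/workspace/mosesdecoder/bin/lmplz -o 3 < ~/corpus/mono.true.en > ~/lm/mono.arpa.en
# binarize it so Moses can load it quickly
~/workspace/mosesdecoder/bin/build_binary ~/lm/mono.arpa.en ~/lm/mono.blm.en

The binarized model can then be passed to train-model.perl as an additional
--lm argument.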
I've implemented the suggested train/tune/test ratio of 80%:10%:10%, added
a (monolingual) English corpus, and look forward to doing cross-validation
with different tune/test subsets.
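In case it helps anyone with a similarly small corpus, the split can be done
while keeping the sentence pairs aligned with something like this (filenames
are illustrative; 2,709 + 339 + 339 = 3,387 segments):

# glue the two sides together so shuffling keeps the pairs aligned
paste corpus.lemko corpus.en | shuf > corpus.both
# 80% train, 10% tune, 10% test
head -n 2709 corpus.both        > train.both
sed -n '2710,3048p' corpus.both > tune.both
tail -n +3049 corpus.both       > test.both
# split each piece back into its two sides
for f in train tune test; do
    cut -f1 $f.both > $f.lemko
    cut -f2 $f.both > $f.en
done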
I see where you're coming from with finding a monolingual English
corpus of exactly the kind of data we want the MT system to be good at,
and assembling a monolingual English corpus of plausible data. I'll
have to confer with the community on that to see where the most urgent
need is. I've starred and watched your repo
https://github.com/amittai/cynical, looking forward to taking it for a
spin.
Yes, looking forward to getting an off-the-shelf system or even phone
apps off the ground.
I'm wondering if Polish data could be used with copious amounts of
regex to get a dramatic BLEU score improvement.
Thanks so much again for helping to revitalize this endangered
low-resource language.
Best regards,
Petro
On 21 March 2018 at 02:19, amittai <amittai@umiacs.umd.edu> wrote:
> Hi --
>
> For what it's worth, those are wonderful goals, and I hope you succeed.
> The silence on the mailing list is a large number of people not wanting to
> be the one pointing out that you are in dire need of more data ;)
> But, it sounds like you know that, and want to build what you can with what
> you've got. Here are my opinions, others might have different takes.
>
> -- I'd say a Train/Tune/Test ratio of 80%:10%:10% is textbook. This
> means a tuning and test set of about 300-ish lines. That's tiny,
> but we used test sets that size around 2005, so there is good
> precedent. If you can afford it, try some cross-validation with
> different tune/test subsets.
>
> -- What to do with the budget? I'd spend 90% of the money on getting
> more bilingual data, and then build an off-the-shelf MT system with
> the rest.
>
> -- Not sure that many internal system settings can compensate for a
> fundamental lack of data. I'd make sure my pre- and post-processing setup
> made my output look as fluent as possible. MT systems can make very
> consistent mistakes in the output, and some of them can be patched up with
> regexes.
>
> Point #2 raises the question of _which_ data to have translated... If it
> were me building a Lemko--EN system, I'd do it like this:
>
> 1. Find a (monolingual) English corpus of exactly the kind of data I want my
> MT system to be good at. (We're being realistic, right? Lemko translations
> of the entire internet will have to wait until after we have good Lemko
> translations of e.g. government forms, street signs, or tourist phrases).
> Accuracy is more important than size.
> Let's call this the REPRESENTATIVE (REPR) corpus.
>
> 2. Assemble a (monolingual) English corpus of plausible data, meaning
> sentences that look like they might be helpful (i.e. not the UN corpus) and
> that I could pay to have (some of them) translated. If nothing else, this
> can just be corpus #1 (REPR), but I'd make it as large as I could without
> extra effort.
> Call this the UNADAPTED or AVAILABLE (AVAIL) corpus.
>
> 3. Put my bilingual Lemko--EN data in a small pile, and call it the SEED.
> Maybe pat it on the head, too, and tell it I'm working to find some friends.
> This is the data I already have translated.
>
> I want to eventually be able to bilingually model the REPR corpus (by
> training a system). I can't do that, and I can't use my Lemko data to figure
> out how, either. What I *can* do is use my English data to figure out:
>
> What sentences from AVAIL should I add to SEED in order to better model
> REPR?
>
> Monolingually, this means:
> "I want to build a LM on {SEED plus some data}, and I want the LM to have
> the lowest possible perplexity on REPR. Which sentences should I add to SEED
> from AVAIL in order to do that?"
>
> The English sentences I move from AVAIL to SEED in order to better model
> REPR are precisely the sentences that I should pay to have translated. This
> is because these are the sentences in AVAIL with the most information about
> the REPR corpus that is not already in SEED.
>
> I've written a tool that can do this:
> https://github.com/amittai/cynical
>
> There might be other tools, and they might be better, but I'm not aware of
> them. It'll output the sentences in AVAIL, but in order of how useful they
> are to me. I'd go down the list, and translate as many as I could afford. If
> at some point in the future I got more money, I could continue bootstrapping
> by re-running the algorithm with the larger SEED corpus containing all my
> translated data.
>
> "Cynical selection" was originally intended for regular domain adaptation
> stuff, but it can also do the monolingual corpus-growing that you might
> want. Documentation is mostly inside the code at the moment. For now, to run
> it, edit the bash wrapper script to point to your files etc., and then just
> run 'bash amittai-cynical-wrapper.sh'.
>
> I think these settings might be useful:
> task_distribution_file="representative_sentences.en"
> unadapted_distribution_file="all_plausible_data.en"
> seed_corpus_file="bilingual_data.en"
> available_corpus_file=$unadapted_distribution_file
> batchmode=0 ## disable it!
> numlines=50000 ## stop after 50k lines, or whatever your budget allows for
>
> It can be quite memory intensive if AVAIL is large. If you have hardware
> constraints, try playing with the following settings:
> mincount=20 ## if REPR is really big, increase mincount
> and set
> $save_memory=1 in the selection script itself.
>
> If you (or anyone) run into difficulties, just open a github issue here:
> https://github.com/amittai/cynical/issues
> and I'd be more than happy to help debug, clarify, walk through steps, etc.
>
> Cheers,
> ~amittai
>
>
> On 2018-03-20 19:06, Aileen Joan Vicente wrote:
>>
>> Would love to hear inputs from others. I am working on a low-resource
>> Chavacano corpus too.
>>
>> On Wed, Mar 21, 2018 at 1:29 AM, Petro ORYNYCZ-GLEASON
>> <pgleasonjr@gmail.com> wrote:
>>
>>> Dear Colleagues,
>>> We are using Moses to revitalize Lemko, an endangered low-resource
>>> language. We have 70,000 Lemko words in 3,387 segments perfectly
>>> translated into native English and perfectly aligned.
>>> Current BLEU score is about 0.10.
>>> As far as hardware goes, we're using the cloud: Amazon EC2 p2.xlarge
>>> (1 GPU, 4 vCPUs, 61 GiB RAM).
>>> Questions:
>>> - How should we divide our precious 3,387 bilingual segments into training,
>>> tuning, and testing data? What ratio is ideal?
>>> - Considering that at this point, bilingual content is much dearer to us
>>> than processing power (Amazon AWS costs us USD 0.90 per hour, while
>>> translation costs us USD 0.15 per word), how do we make the most of
>>> what we've got?
>>> - Is there anything we could do other than the default settings that
>>> might lead to a large improvement in the BLEU score?
>>>
>>> Current training model:
>>> ~/workspace/mosesdecoder/scripts/training/train-model.perl \
>>> --parallel --mgiza-cpus 4 \
>>> -root-dir train \
>>> --corpus ~/corpus/train.ru-en.clean \
>>> --f ru --e en \
>>> --alignment grow-diag-final-and \
>>> --reordering msd-bidirectional-fe \
>>> --lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
>>> -external-bin-dir ~/workspace/bin/training-tools/mgizapp
>>>
>>> Current tuning model:
>>> ~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
>>> ~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
>>> ~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini
>>> --mertdir ~/workspace/mosesdecoder/bin/ \
>>> --decoder-flags="-threads 4"
>>>
>>> Thanks for your help!
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
Message: 2
Date: Mon, 26 Mar 2018 19:17:11 +0000
From: "Scherrer, Yves" <yves.scherrer@helsinki.fi>
Subject: Re: [Moses-support] How to improve BLEU score with
low-resource language?
To: moses-support <moses-support@mit.edu>
Message-ID: <86CFE4E5-6751-44C7-B0E0-D92D6496CF52@helsinki.fi>
Content-Type: text/plain; charset="utf-8"
Hi Petro,
I'm a bit late to the discussion, but I'd nevertheless like to add my thoughts, especially as you hint at a possible solution yourself:
> I'm wondering if Polish data could be used with copious amounts of
> regex to get a dramatic BLEU score improvement.
Indeed, 3300 Lemko-English sentence pairs is not a lot, but you're in the comfortable position that Lemko is closely related to two official EU languages, Polish and Slovak (Ukrainian might also help, but I don't know a lot about the data situation there). With this, you have essentially two options:
1. Go for a classical pivot approach, by training a Lemko => Polish system and a Polish => English system and feeding the output of the former to the latter (a small decoding sketch follows after these two options). The first step could be done on the character level, requiring less parallel data (this system would basically learn the "copious amounts of regex" you're referring to). See for example Jörg Tiedemann: Character-based pivot translation for under-resourced languages and domains, EACL 2012. This approach requires some Lemko-Polish (or Slovak) parallel data though, which you may not have.
2. Use a "domain adaptation" approach, where you'd start by creating a Polish-English MT system and gradually mix in some Lemko data during the training process. In this approach, you wouldn't need any Lemko-Polish data, but it might be a bit trickier to get it working, as the Lemko data will be outnumbered by the Polish data.
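Just to make the pivot decoding step concrete, it would basically chain two Moses systems, something along these lines (the moses.ini paths are of course hypothetical):

# hypothetical pivot pipeline: Lemko -> Polish, then Polish -> English
~/workspace/mosesdecoder/bin/moses -f lemko-pl/moses.ini < input.lemko > pivot.pl
~/workspace/mosesdecoder/bin/moses -f pl-en/moses.ini < pivot.pl > output.en

For the character-level variant, the Lemko => Polish step would operate on character sequences, so the text would have to be split into characters before decoding and rejoined afterwards.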
These ideas can easily be combined with Amittai's suggestions, as they would basically create a better SEED model to get started.
Oh, and if you happen to be interested in morphological tagging for Lemko, you might want to have a look at this:
Yves Scherrer & Achim Rabus: Multi-source morphosyntactic tagging for Spoken Rusyn, VarDial workshop, EACL 2017.
Best of luck in your endeavors, and apologies for the shameless self-promotion :D
Yves
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 137, Issue 12
**********************************************