Moses-support Digest, Vol 112, Issue 7

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Polysynthetic languages? (Michael Joyner)
2. Re: Error while using config.factored (Philipp Koehn)


----------------------------------------------------------------------

Message: 1
Date: Mon, 1 Feb 2016 13:30:06 -0500
From: Michael Joyner <mjoyner@vbservices.net>
Subject: Re: [Moses-support] Polysynthetic languages?
To: Rico Sennrich <rico.sennrich@gmx.ch>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAdxTGh+ZWSerdyRPRVf3yds28ZWRAfMLEH8adgJ0EfL8K4mVQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

So how does that work?

Does it just take all the words from the corpus and guess "infix themes"? Or
do I have to supply pre-tagged data?

On Mon, Feb 1, 2016 at 9:04 AM, Rico Sennrich <rico.sennrich@gmx.ch> wrote:

> Hi Mike,
>
> here's a link to the tool Marcin mentioned:
> https://github.com/rsennrich/subword-nmt
>
> I haven't tried it on phrase-based MT myself, but feel free to give it a
> try.
>
> You could also try other unsupervised morpheme segmenters like morfessor:
> https://github.com/aalto-speech/morfessor
>
> I don't know if there are any segmentation methods specific to Cherokee.
>
> best wishes,
> Rico
>
>
> On 01.02.2016 13:31, Marcin Junczys-Dowmunt wrote:
>
> Hi Mike,
>
> Maybe take a look at Rico's tool for handling unknown words in neural
> machine translation. I have been playing around with that for
> Russian-English and standard phrase-based SMT with some success. I am just
> not sure if your small corpora will be enough to learn useful segmentations
> though.
>
> It's an unsupervised method for word segmentation. For Russian-English I
> created a code dictionary of the 100,000 most frequent segments per
> language. Unseen tokens will get segmented. The segmentation is not
> necessarily similar to a linguistically correct segmentation, though. You
> will probably want to try smaller numbers.
>
> Best,
>
> Marcin
>
> On 2016-02-01 at 14:12, Michael Joyner wrote:
>
> I am trying to use Moses with Cherokee, using the New Testament and
> Genesis as the primary corpus. I am feeding it the WEB and BBE as source
> English texts at the moment.
>
> As Cherokee uses bound pronouns, has no articles, and has almost no
> preposition analogues (these features are mostly verb infixes), is there a
> technique for corpus adjustment that could improve the phrase mapping
> between Cherokee and English?
>
> I am currently doing Cherokee => English.
>
> Thanks, Mike
> --
>
> WEB: World English Bible (Public Domain)
> BBE: Basic English Bible (Public Domain)
>
> - Learn the Cherokee language: http://jalagigawoni.gnomio.com/
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
>
>
>
>
>
>
>


--

- Learn the Cherokee language: http://jalagigawoni.gnomio.com/
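The unsupervised segmentation idea discussed above (the byte-pair-encoding approach behind subword-nmt) can be illustrated with a toy sketch. This is not the subword-nmt implementation, just the core merge loop, and the sample vocabulary is invented:

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: join the pair wherever it occurs as whole symbols."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn merge operations from a {word: frequency} dict."""
    # Start from single characters, with an end-of-word marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, 3)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w')
```

Applying the learned merges to unseen words segments them into known subword units, which is what makes the approach attractive for a small corpus in a morphologically rich language: no pre-tagged data is needed, only raw text.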
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160201/a00e48ff/attachment-0001.html

------------------------------

Message: 2
Date: Mon, 1 Feb 2016 13:38:00 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Error while using config.factored
To: Sunayana Gawde <sunayanagawde17@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDArNvZmRTTerP1J9fj+FcbqOjnYejDV+vzDne_ocXH+rw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

[CORPUS:train1]

Comment out
get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
(it would have to point to an actual script), since you already specify
factorized-stem = $wmt12-data/train


[LM:nc=pos]

I had some problems with "=" in corpus names, so it may be better to go with
[LM:nc-pos]

What is the file "kn.lm"?
factorized-corpus = $wmt12-data/kn.lm
Did you already train a language model?
(1) if yes:
lm = $wmt12-data/kn.lm
(2) if no:
factorized-corpus = $wmt12-data/train.$output-extension

You should also have a surface word language model:
[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension


[EVALUATION:test]

You should specify

factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word

and not the sgm specifications.

The reference translations should not be factorized but should contain only
surface forms; the same applies to tuning:

[TUNING]

tokenized-input = $wmt12-data/tune.kn.just-word


-phi
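Putting these fixes together, the affected sections would look roughly like this. This is a sketch assembled from the advice above, not a verified working EMS configuration; the paths are the ones quoted in the thread:

```ini
[CORPUS:train1]
# get-corpus-script commented out; the prepared data is given directly
factorized-stem = $wmt12-data/train

# surface word language model
[LM:nc]
order = 5
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

# POS language model ("=" removed from the section name)
[LM:nc-pos]
factors = "pos"
order = 7
settings = "-interpolate -unk"
factorized-corpus = $wmt12-data/train.$output-extension

[TUNING]
tokenized-input = $wmt12-data/tune.kn.just-word

[EVALUATION:test]
factorized-input = $wmt12-data/test.en
tokenized-reference = $wmt12-data/test.kn.just-word
```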

On Mon, Feb 1, 2016 at 1:03 PM, Sunayana Gawde <sunayanagawde17@gmail.com>
wrote:

> Sir,
>
> Here is my config file:
>
> On Mon, Feb 1, 2016 at 11:29 PM, Philipp Koehn <phi@jhu.edu> wrote:
>
>> Hi,
>>
>> can you send me your full config file?
>>
>> The example factored model has a surface LM and a POS LM - so that is what
>> those files are.
>>
>> Using the same data for language modelling as for translation model
>> training is fine.
>>
>> -phi
>>
>> On Mon, Feb 1, 2016 at 12:07 PM, Sunayana Gawde <
>> sunayanagawde17@gmail.com> wrote:
>>
>>> Sir,
>>>
>>> I have already replaced "\" with "|", but it still gives me the same error.
>>>
>>> I downloaded the sample data from the statmt.org website (factored corpus);
>>> it contains surface.lm and pos.lm.
>>>
>>> What are these files? Do I need to have them?
>>>
>>> I have my language model file, which contains the same text data as my
>>> target train file (48,500 lines).
>>>
>>> On Mon, Feb 1, 2016 at 9:28 PM, Philipp Koehn <phi@jhu.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> one thing that will likely come up: the Moses factored setup uses the bar
>>>> character "|" to separate factors, while you seem to be using the
>>>> backslash "\". So you will have to change that in your data.
>>>>
>>>> Otherwise you seem to be on the right track - yes, you need to split
>>>> your data into train/tune/test and your splits look reasonable (I'd prefer
>>>> a larger tune set for more stability, though).
>>>>
>>>> -phi
>>>>
>>>> On Mon, Feb 1, 2016 at 9:41 AM, Sunayana Gawde <
>>>> sunayanagawde17@gmail.com> wrote:
>>>>
>>>>> Sir,
>>>>>
>>>>> I figured out that I need some additional input files for factored
>>>>> models.
>>>>>
>>>>> What I had was text data of this type:
>>>>>
>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>
>>>>> and the same parallel data in Konkani, with POS tags as well.
>>>>>
>>>>> I split the whole data into train (48,500), tune (500), and test (1,000)
>>>>> sentences, so I have six files in total, with extensions .en and .kn.
>>>>>
>>>>> I have one more file, which is a language model in Konkani (kn.lm).
>>>>>
>>>>> So what more do I need to run a config.factored file?
>>>>>
>>>>> Your suggestions will be greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mon, Feb 1, 2016 at 3:13 PM, Sunayana Gawde <
>>>>> sunayanagawde17@gmail.com> wrote:
>>>>>
>>>>>> Yeah, that error is gone now. Thanks.
>>>>>> But now I get this error:
>>>>>>
>>>>>> BUGGY CONFIG LINE (40): in : get-corpus-script
>>>>>> 1 ERROR IN CONFIG FILE at
>>>>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl line 363,
>>>>>> <INI> line 698.
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 3:03 PM, Sunayana Gawde <
>>>>>> sunayanagawde17@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks. I made the changes, but I still get this error:
>>>>>>>
>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>
>>>>>>> On Sat, Jan 30, 2016 at 10:34 PM, Philipp Koehn <phi@jhu.edu> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> remove the IGNORE here:
>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>
>>>>>>>> and add an IGNORE here:
>>>>>>>> [LM:nc]
>>>>>>>>
>>>>>>>> Also, your current configuration does not have a surface word
>>>>>>>> language model.
>>>>>>>> You can do without one, but I would expect better results with one.
>>>>>>>>
>>>>>>>> -phi
>>>>>>>>
>>>>>>>> On Sat, Jan 30, 2016 at 2:28 AM, Sunayana Gawde <
>>>>>>>> sunayanagawde17@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sir,
>>>>>>>>>
>>>>>>>>> Here is the corpus section of my config file:
>>>>>>>>>
>>>>>>>>> [CORPUS]
>>>>>>>>>
>>>>>>>>> ### long sentences are filtered out, since they slow down GIZA++
>>>>>>>>> # and are a less reliable source of data. set here the maximum
>>>>>>>>> # length of a sentence
>>>>>>>>> #
>>>>>>>>> max-sentence-length = 50
>>>>>>>>>
>>>>>>>>> [CORPUS:train1] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>> ### raw corpus files (untokenized, but sentence aligned)
>>>>>>>>> #
>>>>>>>>> #raw-stem = $wmt12-data/training/europarl-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-stem =
>>>>>>>>>
>>>>>>>>> ### if sentence filtering should be skipped,
>>>>>>>>> # point to the clean training data
>>>>>>>>> #
>>>>>>>>> clean-stem = $wmt12-data/train
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared training data
>>>>>>>>> #
>>>>>>>>> #lowercased-stem =
>>>>>>>>>
>>>>>>>>> [CORPUS:nc] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/news-commentary-v7.$pair-extension
>>>>>>>>>
>>>>>>>>> [CORPUS:un] IGNORE
>>>>>>>>> #raw-stem = $wmt12-data/training/undoc.2000.$pair-extension
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------------------------------
>>>>>>>>> And here is my LM section:
>>>>>>>>>
>>>>>>>>> # srilm
>>>>>>>>> lm-training = $srilm-dir/ngram-count
>>>>>>>>> settings = "-interpolate -kndiscount -unk"
>>>>>>>>>
>>>>>>>>> # order of the language model
>>>>>>>>> order = 5
>>>>>>>>>
>>>>>>>>> ### tool to be used for training randomized language model from
>>>>>>>>> scratch
>>>>>>>>> # (more commonly, a SRILM is trained)
>>>>>>>>> #
>>>>>>>>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### script to use for binary table format for irstlm or kenlm
>>>>>>>>> # (default: no binarization)
>>>>>>>>>
>>>>>>>>> # irstlm
>>>>>>>>> #lm-binarizer = $irstlm-dir/compile-lm
>>>>>>>>>
>>>>>>>>> # kenlm, also set type to 8
>>>>>>>>> lm-binarizer = $moses-bin-dir/build_binary
>>>>>>>>> type = 8
>>>>>>>>>
>>>>>>>>> ### script to create quantized language model format (irstlm)
>>>>>>>>> # (default: no quantization)
>>>>>>>>> #
>>>>>>>>> #lm-quantizer = $irstlm-dir/quantize-lm
>>>>>>>>>
>>>>>>>>> ### script to use for converting into randomized table format
>>>>>>>>> # (default: no randomization)
>>>>>>>>> #
>>>>>>>>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>>>>>>>>
>>>>>>>>> ### each language model to be used has its own section here
>>>>>>>>>
>>>>>>>>> [LM:europarl] IGNORE
>>>>>>>>>
>>>>>>>>> ### command to run to get raw corpus files
>>>>>>>>> #
>>>>>>>>> #get-corpus-script = ""
>>>>>>>>>
>>>>>>>>> ### raw corpus (untokenized)
>>>>>>>>> #
>>>>>>>>> #raw-corpus = $wmt12-data/training/europarl-v7.$output-extension
>>>>>>>>>
>>>>>>>>> ### tokenized corpus files (may contain long sentences)
>>>>>>>>> #
>>>>>>>>> #tokenized-corpus =
>>>>>>>>>
>>>>>>>>> ### if corpus preparation should be skipped,
>>>>>>>>> # point to the prepared language model
>>>>>>>>> #
>>>>>>>>> #lm =
>>>>>>>>>
>>>>>>>>> [LM:nc]
>>>>>>>>> #raw-corpus =
>>>>>>>>> $wmt12-data/training/news-commentary-v7.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:un] IGNORE
>>>>>>>>> #raw-corpus =
>>>>>>>>> $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>>>>>>>>
>>>>>>>>> [LM:news] IGNORE
>>>>>>>>> #raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>>>>>>>>
>>>>>>>>> [LM:nc=pos]
>>>>>>>>> factors = "pos"
>>>>>>>>> order = 7
>>>>>>>>> settings = "-interpolate -unk"
>>>>>>>>> clean-corpus = $wmt12-data/kn.lm
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------------------------------------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> Here kn.lm is my language model, and the training files are named
>>>>>>>>> train.en and train.kn.
>>>>>>>>> At the beginning I specified the path to my data files as:
>>>>>>>>> wmt12-data = /home/development/sunayana/POS-eng-kon/corpus
>>>>>>>>>
>>>>>>>>> where the corpus folder contains all the training, tune, LM, and test
>>>>>>>>> files.
>>>>>>>>>
>>>>>>>>> I don't understand how to define GENERAL:get-corpus-script.
>>>>>>>>>
>>>>>>>>> Please guide me with this. Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 29, 2016 at 10:28 PM, Philipp Koehn <phi@jhu.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> you are not properly specifying your training data in the config
>>>>>>>>>> file.
>>>>>>>>>> Can you double check or post the [CORPUS] and [LM] sections of
>>>>>>>>>> your config file?
>>>>>>>>>>
>>>>>>>>>> -phi
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 28, 2016 at 6:04 AM, Sunayana Gawde <
>>>>>>>>>> sunayanagawde17@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>>
>>>>>>>>>>> I am using EMS and the config.factored file from moses website.
>>>>>>>>>>>
>>>>>>>>>>> My train, tune, and test data is POS-tagged data in the
>>>>>>>>>>> following format:
>>>>>>>>>>>
>>>>>>>>>>> In\IN Shimla\NNP Ice\NNP Skating\NNP Ring\NNP ,\, Roller\NNP
>>>>>>>>>>> Skatin\NNP Ring\NNP etc\NN .\: are\VBP major\JJ skating\NN ring\NN .\.
>>>>>>>>>>>
>>>>>>>>>>> When I run the command:
>>>>>>>>>>>
>>>>>>>>>>> nohup nice
>>>>>>>>>>> /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl -config
>>>>>>>>>>> config.POSen-kn &> log &
>>>>>>>>>>>
>>>>>>>>>>> I get this error in the log file:
>>>>>>>>>>> ERROR: you need to define GENERAL:get-corpus-script
>>>>>>>>>>>
>>>>>>>>>>> Please help me.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Ms. Sunayana R. Gawde.
>>>>>>>>>>> DCST, Goa University.
>>>>>>>>>>> Please don't print this e-mail unless you really need to.
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Moses-support mailing list
>>>>>>>>>>> Moses-support@mit.edu
>>>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20160201/ff039969/attachment.html
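The factor-separator fix Philipp mentions earlier in this thread (backslash to bar) is a one-line transformation over the tagged data. A minimal sketch, assuming the backslash occurs only as the word/tag separator, as in the sample sentences quoted above:

```python
def backslash_to_bar(line: str) -> str:
    r"""Convert word\TAG pairs to Moses's word|TAG factor format.

    Assumes "\" appears only as the separator between a token and
    its POS tag, as in the thread's sample data.
    """
    return line.replace("\\", "|")

sample = r"In\IN Shimla\NNP are\VBP major\JJ skating\NN ring\NN .\."
print(backslash_to_bar(sample))
# In|IN Shimla|NNP are|VBP major|JJ skating|NN ring|NN .|.
```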

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 112, Issue 7
*********************************************
