Moses-support Digest, Vol 112, Issue 10

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Problem with processPhraseTableMin (Jeremy Gwinnup)
2. Re: Problem with processPhraseTableMin (Marcin Junczys-Dowmunt)
3. Error in factored models, get-corpus crashed (Sunayana Gawde)
4. Call for participation: WMT 2016 Shared Task on Cross-lingual
Pronoun Prediction (Jorg Tiedemann)


----------------------------------------------------------------------

Message: 1
Date: Tue, 2 Feb 2016 12:16:30 -0500
From: Jeremy Gwinnup <jeremy@gwinnup.org>
Subject: Re: [Moses-support] Problem with processPhraseTableMin
To: moses-support@mit.edu
Message-ID: <F0017DCB-B6AB-4E52-B264-BF8D9EEB80A4@gwinnup.org>
Content-Type: text/plain; charset=utf-8

Marcin,

I was able to use -T with processLexicalTableMin successfully. I also tried processPhraseTableMin using a local tmp dir with 200G free and it still crashed at step 3 with the huge malloc message. Phrase table is nothing fancy - just standard 4 scores and 3 domain indicator features. Here's a complete output with more info about the phrase table:

Phrase table in question:

-rw-rw-r-- 1 jgwinnup scream 2.2G Feb 1 23:58 phrase-table.1.gz

Machine in question has 1TB RAM/32 cores - should be more than enough for the job

Moses git-rev ends with: 80572b4 (Jan. 27)

1tqoct1:model> $MOSES/bin/processPhraseTableMin -in phrase-table.1.gz -out phrase-table.1 -threads all -nscores 7 -T /tmp_with_200G_free
WARNING: You are using a nonstandard number of scores (7) with PREnc. Set the index of P(t|s) with -rankscore int if it is not 2.
Used options:
Text phrase table will be read from: phrase-table.1.gz
Output phrase table will be written to: phrase-table.1.minphr
Step size for source landmark phrases: 2^10=1024
Source phrase fingerprint size: 16 bits / P(fp)=1.52588e-05
Selected target phrase encoding: Huffman + PREnc
Maxiumum allowed rank for PREnc: 100
Number of score components in phrase table: 7
Single Huffman code set for score components: no
Using score quantization: no
Explicitly included alignment information: yes
Running with 32 threads

Pass 1/3: Creating hash function for rank assignment
..................................................[5000000]
..................................................[10000000]
..................................................[15000000]
..................................................[20000000]
..................................................[25000000]
..................................................[30000000]
..................................................[35000000]
..................................................[40000000]
..................................................[45000000]
....

Pass 2/3: Creating source phrase index + Encoding target phrases
..................................................[5000000]
..................................................[10000000]
..................................................[15000000]
..................................................[20000000]
..................................................[25000000]
..................................................[30000000]
..................................................[35000000]
..................................................[40000000]
..................................................[45000000]
....

Intermezzo: Calculating Huffman code sets
Creating Huffman codes for 471366 target phrase symbols
tcmalloc: large alloc 13808820224 bytes == 0xb0592000 @
tcmalloc: large alloc 27617640448 bytes == 0x3e86b0000 @
tcmalloc: large alloc 5187358422106112 bytes == (nil) @
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
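
For scale, the three sizes reported by tcmalloc can be converted to binary units and compared with the RAM on the machine. The following is only a minimal Python sketch: the helper name `human_size` is made up for illustration, and "1TB" is taken to mean 1 TiB. It shows that the first two requests fit comfortably in memory, while the third is several thousand times larger than the installed RAM, which suggests the requested size itself is suspect, although the log alone does not say why.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


# Sizes copied from the three "tcmalloc: large alloc" lines in the log.
requests = [13808820224, 27617640448, 5187358422106112]

# Assumption: the "1TB RAM" mentioned in the message is treated as 1 TiB.
ram_bytes = 1024 ** 4

for req in requests:
    print(f"{human_size(req):>12}  ({req / ram_bytes:.4f} x RAM)")
```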




> On Feb 2, 2016, at 10:21 AM, Jeremy Gwinnup <jeremy@gwinnup.org> wrote:
>
> Hi,
>
> I'm having a problem using processPhraseTableMin to compress a phrase table with 7 scores - the program consistently coredumps at step 3 - command and relevant output below. Is there anything I'm doing glaringly wrong?
>
> Thanks!
> -Jeremy
>
> Command:
>
> 1tqoct1:model> $MOSES/bin/processPhraseTableMin -in phrase-table.1.gz -out phrase-table.1 -threads all -nscores 7
>
> Once we get to step 3:
>
> Intermezzo: Calculating Huffman code sets
> Creating Huffman codes for 471366 target phrase symbols
> tcmalloc: large alloc 13983629312 bytes == 0xb14ce000 @
> tcmalloc: large alloc 27967250432 bytes == 0x3f3ca4000 @
> tcmalloc: large alloc 15681406635450368 bytes == (nil) @
> terminate called after throwing an instance of 'std::bad_alloc'
> what(): std::bad_alloc
>
> Top looked like this when the program ran into trouble:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 27416 jgwinnup 20 0 45.9g 30g 4.0g R 10.6 3.0 1589:17 processPhraseTa




------------------------------

Message: 2
Date: Tue, 2 Feb 2016 18:21:34 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Problem with processPhraseTableMin
To: moses-support@mit.edu
Message-ID: <56B0E59E.2070402@amu.edu.pl>
Content-Type: text/plain; charset=utf-8; format=flowed

Looks fine, I had no problems running it with 18 or more domain
indicators. Your machine is certainly more than suitable. Just one
remark: using more than 8-12 threads usually slows things down, but
should not cause crashes. Any chance I could have a look at that table?





------------------------------

Message: 3
Date: Wed, 3 Feb 2016 14:02:25 +0530
From: Sunayana Gawde <sunayanagawde17@gmail.com>
Subject: [Moses-support] Error in factored models, get-corpus crashed
To: moses-support@mit.edu
Message-ID:
<CANQTV3SaR0UN64knywVd7C3kpz-JR4DGZdOxK6y9-mfnAa2SKA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I am developing an MT system for English to Konkani, and my corpus has POS
tags for each word. I am using the same config file from the statmt.org website,
and after making the necessary changes to it, I run this command:

nohup nice /usr/local/bin/smt/mosesdecoder-3.0/scripts/ems/experiment.perl
-config config.en-kn -exec &> log &

But then when I check the log file, I see this error:

EXECUTE STEPS
number of steps doable or running: 1 at Tue Feb 2 19:11:28 IST 2016
doable: CORPUS:train1:get-corpus
executing
/home/development/sunayana/POS-eng-kon/steps/3/CORPUS_train1_get-corpus.3
via sh (1 active)
step CORPUS:train1:get-corpus crashed
number of steps doable or running: 0 at Tue Feb 2 19:11:35 IST 2016

Please tell me how to remove this error and run my system successfully.

thanks

--
*Regards*

Ms. Sunayana R. Gawde.

DCST, Goa University.
*Please don't print this e-mail unless you really need to.*

------------------------------

Message: 4
Date: Wed, 3 Feb 2016 13:55:43 +0200
From: Jorg Tiedemann <tiedeman@gmail.com>
Subject: [Moses-support] Call for participation: WMT 2016 Shared Task
on Cross-lingual Pronoun Prediction
To: moses-support <moses-support@mit.edu>
Cc: mt-list@eamt.org
Message-ID: <F9685527-F504-4FB9-B10F-A0C2CA450153@gmail.com>
Content-Type: text/plain; charset="utf-8"

WMT 2016 Shared Task on Cross-lingual Pronoun Prediction

CALL FOR PARTICIPATION

========================================================
WMT 2016 Shared Task on Cross-lingual Pronoun Prediction
========================================================

Website: http://www.statmt.org/wmt16/pronoun-task.html
At WMT 2016 (collocated with ACL 2016)

We are pleased to announce an exciting cross-lingual pronoun prediction task for people interested in (discourse-aware) machine translation, anaphora resolution and machine learning in general.

In the cross-lingual pronoun prediction task, participants are asked to predict a target-language pronoun given a source-language pronoun in the context of a sentence. For example, in the English-to-French sub-task, the goal is to predict the correct translation of "it" or "they" into French (ce, elle, elles, il, ils, ça, cela, on, OTHER). You may use any type of information that can be extracted from the documents. We provide training and development data and a simple baseline system using an N-gram language model.

Participants are invited to submit systems for the English-French and English-German language pairs, for both directions.

More details can be found below, and on our website: http://www.statmt.org/wmt16/pronoun-task.html


Important Dates:

2nd February 2016, Release of training data
4th April 2016, Release of test data
11th April 2016, System submission
8th May 2016, Paper submission deadline
5th June 2016, Notification of acceptance
22nd June 2016, Camera-ready deadline


Mailing list: https://groups.google.com/forum/#!forum/wmt-2016-cross-lingual-pronoun-prediction-shared-task

-------------------------------------------------------------------------
Acknowledgements:
The organisation of this task has received support from the following project: Discourse-Oriented Statistical Machine Translation funded by the Swedish Research Council (2012-916)
-------------------------------------------------------------------------

=========================
Detailed Task Description
=========================

OVERVIEW

Pronoun translation poses a problem for current state-of-the-art SMT systems as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, and to differences in where pronouns may be used. Translation divergences typically lead to mistakes in SMT, as when translating the English "it" into French ("il", "elle", or "cela"?) or into German ("er", "sie", or "es"?). One way to model pronoun translation is to treat it as a cross-lingual pronoun prediction task.

We propose such a task, which asks participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provide a lemmatised target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. In the translation, the words aligned to a subset of the source-language third-person pronouns are substituted by placeholders. The aim of the task is to predict, for each placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the documents.

The cross-lingual pronoun prediction task will be similar to the task of the same name at DiscoMT 2015:

http://www.idiap.ch/workshop/DiscoMT/shared-task

Participants are invited to submit systems for the English-French and English-German language pairs, for both directions.


TASK DESCRIPTION

In the cross-lingual pronoun prediction task, you are given a source-language document with a lemmatised and POS-tagged human-authored translation and a set of word alignments between the two languages. In the translation, the lemmatised tokens aligned to the source-language third-person pronouns are substituted by placeholders. Your task is to predict, for each placeholder, the fully inflected word token that should replace the placeholder from a small, closed set of classes. That is, in the case of English-to-German or English-to-French translation, you provide the fully inflected German or French translation of the English pronoun in the context sketched by the lemmatised and tagged target side. You may use any type of information that you can extract from the documents.

Lemmatised and POS-tagged target-language data is provided in place of fully inflected text. The provision of lemmatised data is intended both to provide a challenging task, and to simulate a scenario that is more closely aligned with working with machine translation system output. POS tags provide additional information which may be useful in the disambiguation of lemmas (e.g. noun vs. verb, etc.) and in the detection of patterns of pronoun use.

The pronoun prediction task will be run for the following sub-tasks:

* English-to-German
* German-to-English
* English-to-French
* French-to-English

Details of the source-language pronouns and the prediction classes that exist for each of the above sub-tasks are provided in the following section. The different combinations of source-language pronoun and target-language prediction classes represent some of the different problems that SMT systems face when translating pronouns for a given language pair and translation direction.

The task will be evaluated automatically by matching the predictions against the words found in the reference translation by computing the overall accuracy and precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
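
To illustrate why the macro-averaged F-score penalises systems that ignore infrequent classes, here is a minimal Python sketch. It is not the official scorer: the function names and the toy data are made up for illustration, and averaging over every label that occurs in either the reference or the predictions is an assumption, so the official evaluation script may differ in detail.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f1)} for each label in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # reference label g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F-scores."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def accuracy(gold, pred):
    """Fraction of placeholders predicted correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data (hypothetical): a frequent class and an infrequent one.
gold = ["il"] * 8 + ["elle"] * 2
always_il = ["il"] * 10  # a system that only ever predicts the frequent class

print("accuracy:", accuracy(gold, always_il))
print("macro F1:", macro_f1(gold, always_il))
```

In this toy setting the system that always predicts the frequent class is right on 8 of 10 items, yet its F-score on the infrequent class is zero, so the macro average is pulled well below the accuracy.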

The data supplied for the classification task consists of parallel source-target text with word alignments. In the target-language text, a subset of the words aligned to source-language occurrences of a specified set of pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the source-language word the placeholder is aligned to. Your task is to predict one of the classes listed in the relevant source-target section below, for each occurrence of a placeholder.
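
As a rough illustration of the placeholder convention, the following Python sketch locates REPLACE_xx tokens in a target sentence and looks up the source word each one points to. The exact file layout is described on the website, so several things here are assumptions: whitespace-separated tokens, a decimal xx, zero-based source indices, and the example sentence pair, which is invented.

```python
import re

# Assumption: a placeholder is a whole token of the form REPLACE_<digits>.
PLACEHOLDER = re.compile(r"^REPLACE_(\d+)$")


def find_placeholders(target_tokens):
    """Return (target_position, source_index) pairs for each REPLACE_xx token."""
    found = []
    for pos, tok in enumerate(target_tokens):
        match = PLACEHOLDER.match(tok)
        if match:
            found.append((pos, int(match.group(1))))
    return found


# Hypothetical sentence pair; the target side is lemmatised.
source = "it is raining".split()
target = "REPLACE_0 pleuvoir".split()

for pos, src_idx in find_placeholders(target):
    # Assumption: source indices are zero-based.
    print(f"target position {pos} <- source word {source[src_idx]!r}")
```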

The training, development and test datasets have been filtered to remove non-subject position pronouns. Additional filtering has also been applied to the test set to remove erroneous pronoun examples and thereby ensure the fair and accurate evaluation of system performance. For more information on the format of the data files and their filtering, please see the website.

The complete test data for the classification task, including reference translations and word alignments, will be released on 4th April 2016. Your submission is due on 11th April 2016.


SOURCE-LANGUAGE PRONOUN SETS AND TARGET-LANGUAGE PREDICTION CLASS DETAILS

The following sections describe the set of source-language pronouns and target-language classes to be predicted, for each of the four sub-tasks. Please note that the sub-tasks are asymmetric in terms of the source-language pronouns and prediction classes. The selection of the source-language pronouns and their target-language prediction classes for each sub-task is based on the variation that is possible when translating a given source-language pronoun. For example, when translating the English pronoun "it" into French, a decision must be made as to the gender of the French pronoun, with "il" and "elle" both providing valid options. The translation of the English pronouns "he" and "she" into French, however, does not require such a decision. These may simply be mapped 1-to-1, as "il" and "elle" respectively. The translation of "he" and "she" from English into French is therefore not considered an "interesting" problem and as such, these pronouns are excluded from the source-language set for the English->French sub-task. In the opposite translation, the French pronoun "il" may be translated as "it" or "he", and "elle" as "it" or "she". As a decision must be taken as to the appropriate target-language translation of "il" and "elle", these are included in the set of source-language pronouns for the French->English sub-task.

You should *always* predict either a word token or "OTHER". See prediction class lists below for a list of word tokens to predict for each sub-task.

English-to-French

This sub-task will concentrate on the translation of subject position "it" and "they" from English into French. The following prediction classes exist for this sub-task:

* ce: The French pronoun ce (sometimes with elided vowel as c') as in the expression c'est "it is"
* elle: Feminine singular subject pronoun
* elles: Feminine plural subject pronoun
* il: Masculine singular subject pronoun
* ils: Masculine plural subject pronoun
* cela: Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", and the rare elided form "ç'"
* on: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted

French-to-English

This sub-task will concentrate on the translation of subject position "elle", "elles", "il", and "ils" from French into English. The following prediction classes exist for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted

English-to-German

This sub-task will concentrate on the translation of subject position "it" and "they" from English into German. The following prediction classes exist for this sub-task:

* er: Masculine singular subject pronoun
* sie: Feminine singular subject pronoun
* es: Neuter singular subject pronoun
* man: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted

German-to-English

This sub-task will concentrate on the translation of subject position "er", "sie" and "es" from German into English. The following prediction classes exist for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* you: Second person pronoun (with both generic and deictic uses)
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 112, Issue 10
**********************************************
