Moses-support Digest, Vol 150, Issue 12

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. CfP+Updates: Shared Task: Parallel Corpus Filtering for
Low-Resource Conditions (Philipp Koehn)

----------------------------------------------------------------------

Message: 1
Date: Fri, 19 Apr 2019 23:06:40 -0400
From: Philipp Koehn <phi@jhu.edu>
Subject: [Moses-support] CfP+Updates: Shared Task: Parallel Corpus
Filtering for Low-Resource Conditions
To: "corpora@uib.no" <CORPORA@uib.no>, Moses Support
<moses-support@mit.edu>, <wmt-tasks@googlegroups.com>, Multiple
recipients of list <mt_list@nist.gov>
Message-ID:
<CAAFADDCCWZwQMuZ7__XfZ6yKo9s__eudDvfFandmWyxmgNsaCw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

THIRD CALL FOR PARTICIPATION
*Shared Task: Parallel Corpus Filtering for Low-Resource Conditions*
at the Fourth Conference on Machine Translation (WMT19)
http://statmt.org/wmt19/parallel-corpus-filtering.html

*UPDATES*

- There is an updated, higher-quality version of the Nepali Paracrawl
corpus
- We published scores for a baseline trained on off-the-shelf Zipporah
- We will also train systems on a 1-million-word subset in the official
evaluation

This new shared task tackles the problem of cleaning noisy parallel corpora.
Following the WMT18 shared task on parallel corpus filtering
<http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we now pose
the problem under more challenging low-resource conditions. Instead of
German-English, this year there are two low-resource language pairs:
Nepali-English and Sinhala-English.
Otherwise, the shared task follows the same set-up: given a noisy parallel
corpus (crawled from the web), participants develop methods to filter it
down to a smaller set of high-quality sentence pairs.

*DETAILS*
We provide a very noisy 35.5 million-word (English token count)
Nepali-English corpus and a 59.6 million-word Sinhala-English corpus crawled
from the web as part of the Paracrawl <http://paracrawl.eu/> project. We
ask participants to provide a score for each sentence pair in each of the
noisy parallel sets. The scores will be used to subsample sentence pairs
that amount to 5 million English words. The quality of the resulting subsets
is determined by the quality of a statistical machine translation system
(Moses, phrase-based) and a neural machine translation system (fairseq)
trained on this data. The quality of each machine translation system is
measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia
translations <https://github.com/facebookresearch/flores> for
Sinhala-English and Nepali-English.
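
As a rough illustration of how per-pair scores could be turned into a
subsample under a word budget, here is a minimal sketch. It is not the
official subsampling script: the function name subsample_by_score, the
(source, English) tuple layout, and whitespace token counting are all
assumptions made for this example.

```python
from typing import List, Tuple


def subsample_by_score(
    pairs: List[Tuple[str, str]],
    scores: List[float],
    max_english_words: int,
) -> List[Tuple[str, str]]:
    """Keep the highest-scoring pairs until the English word budget is reached.

    Hypothetical helper: each pair is assumed to be (source, english), and
    English words are counted by splitting on whitespace.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pair indices from highest to lowest score.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected: List[Tuple[str, str]] = []
    total_words = 0
    for i in ranked:
        n_words = len(pairs[i][1].split())
        # Stop at the first pair that would push the total over the budget.
        if total_words + n_words > max_english_words:
            break
        selected.append(pairs[i])
        total_words += n_words
    return selected
```

Under these assumptions, calling the function with max_english_words set to
5000000 would correspond to the 5-million-word condition described above.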

We also provide links to training data for the two language pairs. This
existing data comes from a variety of sources and is of mixed quality and
relevance. We provide a script to fetch and compose the training data.

Note that the task addresses the challenge of *data quality* and *not
domain-relatedness* of the data for a particular use case. While we provide
a development set and a development test set that are also drawn from
Wikipedia articles, these may be very different from the final official test
set in terms of topics.

The provided raw parallel corpora are the outcome of a processing pipeline
that aimed for high recall at the cost of precision, so they are very
noisy. They exhibit noise of all kinds (wrong language in source and
target, sentence pairs that are not translations of each other, bad
language, incomplete or bad translations, etc.).

*IMPORTANT DATES*
Release of raw parallel data: February 8, 2019
Submission deadline for subsampled sets: May 10, 2019
System descriptions due: May 17, 2019
Announcement of results: June 3, 2019
Paper notification: June 7, 2019
Camera-ready for system descriptions: June 17, 2019

*ORGANIZERS*
Philipp Koehn (Johns Hopkins University / University of Edinburgh)
Francisco (Paco) Guzmán (Facebook)
Vishrav Chaudhary (Facebook)
Juan Pino (Facebook)

More information is available at
http://statmt.org/wmt19/parallel-corpus-filtering.html

As with other WMT tasks, intending participants are encouraged to
register at https://groups.google.com/forum/#!forum/wmt-tasks for
discussions and announcements.

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 150, Issue 12
**********************************************
