Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. 2nd CFP: WAT2020 (The 7th Workshop on Asian Translation)
(Toshiaki Nakazawa)
2. Re: Moses-support Digest, Vol 166, Issue 5 (G/her G/libanos)
----------------------------------------------------------------------
Message: 1
Date: Wed, 12 Aug 2020 09:34:04 +0900
From: Toshiaki Nakazawa <nakazawa@logos.t.u-tokyo.ac.jp>
Subject: [Moses-support] 2nd CFP: WAT2020 (The 7th Workshop on Asian
Translation)
To: <moses-support@mit.edu>
Message-ID:
<CAMMh7mpe+d7SYXn=Yf65C0aD5HsoyYK3N+k5cD0MkkEh1ki6FA@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Dear all MT researchers/users,
I'm Toshiaki Nakazawa from The University of Tokyo, Japan. This is the
second call for participation in the MT shared tasks and for research
papers at the 7th Workshop on Asian Translation (WAT2020), a workshop
of AACL-IJCNLP 2020. If you are working on machine translation, please
join us.
UPDATES
--------------
- details of the Japanese <--> English multimodal task are out
https://nlab-mpg.github.io/wat2020-mmt-jp/
- newly added two document-level translation tasks:
ParaNatCom: English --> Japanese Scientific Paper
http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2020/aspec_doc.html
BSD Corpus: English <--> Japanese Business Scene Dialogue
http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2020/bsd.html
IMPORTANT DATES
---------------
August 28, 2020 Translation Task Submission Deadline
September 18, 2020 - Research Paper Submission Deadline
October 23, 2020 - Notification of Acceptance for Research Papers
October 23, 2020 System Description Paper Submission Deadline
October 30, 2020 Review Feedback of System Description Papers
November 6, 2020 - Camera-ready Deadline
December 4-7, 2020 Workshop Dates (one of these days)
* All deadlines are calculated at 11:59PM UTC-12
Best regards,
---------------------------------------------------------------------------
WAT2020
(The 7th Workshop on Asian Translation)
in conjunction with AACL-IJCNLP2020
http://lotus.kuee.kyoto-u.ac.jp/WAT/
December 4-7, 2020, Suzhou, China (ONLINE)
Following the success of the previous WAT workshops (WAT2014 --
WAT2019), WAT2020 will bring together machine translation researchers
and users to try, evaluate, share and discuss brand-new ideas about
machine translation. For the 7th WAT, we will include the following
new translation tasks:
* Document-level translation tasks
- English --> Japanese scientific paper abstract task
- English <--> Japanese business scene dialogue task
* Japanese <--> English multimodal task
* Document-level test set for Japanese <--> English newswire task
* Hindi/Thai/Malay/Indonesian <--> English IT-domain and Wikinews task
* Odia <--> English mixed-domain task
together with the following continuing tasks:
* English/Chinese <--> Japanese scientific paper task
* English/Chinese/Korean <--> Japanese patent task
* English <--> Japanese newswire task
* Russian <--> Japanese news commentary task
* Myanmar <--> English mixed-domain task
* Khmer <--> English mixed-domain task
* Indian language <--> English mixed-domain multilingual translation task
* English --> Hindi multimodal task
In addition to the shared tasks, the workshop will also feature
scientific papers on topics related to machine translation,
especially for Asian languages. Topics of interest include, but are
not limited to:
- analysis of the automatic/human evaluation results in the past WAT workshops
- word-/phrase-/syntax-/semantics-/rule-based, neural and hybrid
machine translation
- Asian language processing
- incorporating linguistic information into machine translation
- decoding algorithms
- system combination
- error analysis
- manual and automatic machine translation evaluation
- machine translation applications
- quality estimation
- domain adaptation
- machine translation for low resource languages
- language resources
************************* IMPORTANT NOTICE *************************
Participants of the previous workshops are also required to sign up for WAT2020.
********************************************************************
TRANSLATION TASKS
-----------------
The task is to improve text translation quality for scientific
papers and patent documents. Participants choose any of the subtasks
in which they would like to participate and translate the test data
using their machine translation systems. The WAT organizers will
evaluate the submitted results using automatic and human evaluation.
We will also provide a baseline machine translation system.
Tasks:
Document-level Translation tasks: (NEW!)
ParaNatCom: English --> Japanese Scientific Paper
BSD Corpus: English <--> Japanese Business Scene Dialogue
Scientific Paper: [Asian Scientific Paper Excerpt Corpus (ASPEC)]
English/Chinese <--> Japanese
Patent: [Japan Patent Office Patent Corpus 2.0 (JPC2)]
English/Chinese/Korean <--> Japanese
Newswire: [JIJI Corpus] (document-level testset is newly added)
Japanese <--> English
News Commentary:
Japanese <--> Russian (Japanese <--> English and English <-->
Russian included)
IT Documentation and Wikinews: [SAP-NICT Corpus]
Hindi/Thai/Malay/Indonesian <--> English [ALT and other mixed corpora] NEW!!
Mixed domain:
Myanmar <--> English [UCSY and ALT corpora]
Khmer <--> English [ECCC and ALT corpora]
Indic:
Indian Language <--> English multilingual [Assorted Corpus from
various sources]
Odia <--> English [UFAL (EnOdia) corpus] NEW!!
Multimodal:
English --> Hindi Multimodal [Hindi Visual Genome corpus]
Japanese <--> English Multimodal [Flickr30kEnt-JP corpus] NEW!!
Dataset:
* Scientific paper
WAT uses ASPEC for the dataset including training, development,
development test and test data. Participants of the scientific papers
subtask must obtain a copy of ASPEC by themselves. ASPEC consists of
approximately 3 million Japanese-English parallel sentences from paper
abstracts (ASPEC-JE) and approximately 0.7 million Japanese-Chinese
parallel sentences from paper excerpts (ASPEC-JC).
* Patent
WAT uses the JPO Patent Corpus, which was constructed by the Japan
Patent Office (JPO). This corpus consists of 1 million English-Japanese
parallel sentences, 1 million Chinese-Japanese parallel sentences, and
1 million Korean-Japanese parallel sentences from patent descriptions
in four categories. Participants of the patent tasks are required to
obtain it from the WAT2019 site of the JPO Patent Corpus.
- English/Chinese/Korean <--> Japanese:
These tasks evaluate the performance of a translation model in the
same way as the other translation tasks. Unlike the previous tasks at
WAT2015, WAT2016 and WAT2017, the new test sets for these tasks
consist of (a) patent documents published between 2011 and 2013, which
were used in past years' WAT, and (b) documents published between 2016
and 2017, for each language pair. We will also evaluate performance on
section (a) so as to compare against systems submitted in past years'
WAT.
- Chinese --> Japanese expression pattern task:
This task evaluates the performance of a translation model for each
predefined category of expression patterns, which corresponds to the
title of invention (TIT), abstract (ABS), scope of claims (CLM) or
description (DES). The test set for this task consists of sentences,
each of which is annotated with the corresponding category of
expression patterns.
* Newswire
WAT uses the JIJI Corpus, which was constructed by Jiji Press Ltd. in
collaboration with the National Institute of Information and
Communications Technology (NICT). This corpus consists of 200K
Japanese-English parallel sentences from Jiji Press news in various
categories. At WAT2020, the organizers added a new document-level
translation test set, which consists of manually filtered test and
reference sentences and the document-level context of the test
sentences. Participants of the newswire subtask are required to obtain
it from the WAT2020 site of the JIJI Corpus.
* News Commentary
WAT uses a manually aligned and cleaned Japanese <--> Russian corpus
from the News Commentary domain to study extremely low resource
situations for distant language pairs. The parallel corpus contains
around 12,000 lines. This year, we invite participants to utilize any
existing monolingual or parallel corpora from WMT 2020 in addition to
those listed on the WAT website. In particular, solutions focusing on
monolingual pretraining and multilingualism are encouraged.
* IT and Wikinews
- Hindi/Thai/Malay/Indonesian <--> English
In collaboration with SAP and NICT, WAT is organizing a pilot
translation task between English and each of Hindi, Thai, Malay and
Indonesian. The evaluation data belong to the IT domain (Software
Documentation) and the Wikinews domain (Asian Language Treebank).
Participants will be expected to train systems and submit translations
for all language pairs (to and from English) and both domains using
any existing monolingual or parallel data. Given the growing focus on
a universal translation model for multiple languages and domains, WAT
encourages a single multilingual and multi-domain model for all
language pairs and both domains (IT as well as Wikinews). Additional
details will be given on the WAT 2020 website.
* Mixed domain
- Myanmar (Burmese) <--> English
WAT uses the UCSY Corpus and the ALT Corpus. The UCSY corpus and a
portion of the ALT corpus are used as training data, totaling around
220,000 lines of sentences and phrases. The development and test data
are from the ALT corpus.
- Khmer <--> English
WAT uses the ECCC Corpus and the ALT Corpus. The ECCC corpus and a
portion of the ALT corpus are used as training data, totaling around
120,000 lines of sentences and phrases. The development and test data
are from the ALT corpus.
* Indic
- Odia <--> English
For the first time, WAT is organizing a translation task for the
low-resource language Odia. WAT will use the OdiEnCorp 2.0 corpus
collected by researchers at the Idiap Research Institute and UFAL. The
training data contain around 98K parallel sentences covering different
domains.
- Indian language <--> English multilingual translation task. This
task is being revived after 2018 with major revisions. There has been
an increase in the available datasets for Indian languages in the last
couple of years along with major advances in multilingual learning.
The task will involve training a single model for multiple Indian
languages to English (and vice-versa) translation. The goal is to
encourage exploration of methods which utilize language relatedness to
improve translation quality for low-resource languages while having a
single, compact translation model. The training set would be compiled
from many publicly available datasets spanning 7-8 Indian languages.
* Multimodal
- English --> Hindi Multimodal (Visual Genome)
After the warm response from participants in the "WAT 2019 Multimodal
Translation Task", WAT will continue organizing a multimodal English
--> Hindi translation task, where the input is text and an image
and the output is a caption (text). The training set contains
around 30,000 segments. Additional details will be given on the task
website.
- Japanese <--> English Multimodal (Flickr30kEnt-JP)
Details of this task will be announced later. We will use the
Flickr30kEnt-JP corpus for this task.
https://github.com/nlab-mpg/Flickr30kEnt-JP
EVALUATION
----------
Automatic evaluation:
We provide an automatic evaluation server. It is free for
everyone, but you need to create an account to submit for evaluation.
Viewing the list of evaluation results does not require an account.
Sign-up: http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2020/
Eval. result: http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/
Human evaluation:
Both crowdsourcing evaluation and JPO adequacy evaluation will be
carried out for selected subtasks and selected submitted systems (the
details will be announced later).
INVITED TALK
------------
TBA
ORGANIZERS
----------
Toshiaki Nakazawa, The University of Tokyo, Japan
Hideki Nakayama, The University of Tokyo, Japan
Chenchen Ding, National Institute of Information and Communications
Technology (NICT), Japan
Raj Dabre, National Institute of Information and Communications
Technology (NICT), Japan
Hiroshi Manabe, National Institute of Information and Communications
Technology (NICT), Japan
Anoop Kunchukuttan, Microsoft, India
Win Pa Pa, University of Computer Studies, Yangon (UCSY), Myanmar
Ondřej Bojar, Charles University, Prague, Czech Republic
Shantipriya Parida, Idiap Research Institute, Martigny, Switzerland
Isao Goto, Japan Broadcasting Corporation (NHK), Japan
Hideya Mino, Japan Broadcasting Corporation (NHK), Japan
Katsuhito Sudoh, Nara Institute of Science and Technology (NAIST), Japan
Sadao Kurohashi, Kyoto University, Japan
Pushpak Bhattacharyya, Indian Institute of Technology Bombay (IITB), India
CONTACT
-------
wat-organizer@googlegroups.com
------------------------------
Message: 2
Date: Wed, 12 Aug 2020 06:40:58 -0700
From: "G/her G/libanos" <gerizaba@gmail.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 166, Issue 5
To: moses-support@mit.edu
Message-ID:
<CALR6RoS9_1=s+wykinegzetJX_BYrh8_LF=ORgkjObx=1K9Efg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
please use SRILM instead
On Aug 8, 2020 7:05 PM, <moses-support-request@mit.edu> wrote:
> Today's Topics:
>
> 1. Failed to get the language model by KenLM (Chen, Y.)
> 2. Re: Failed to get the language model by KenLM (Hieu Hoang)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 5 Aug 2020 21:08:04 +0200
> From: "Chen, Y." <y.w.chen.1@student.rug.nl>
> Subject: [Moses-support] Failed to get the language model by KenLM
> To: moses-support@mit.edu
> Message-ID:
> <CAKzNZFTC_gu_DRLAuAP0tJLQ_OujE8iZ7=Ez609MhLODUMQcbg@
> mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> To whom it may concern,
>
> I used Moses to train the language model by KenLM, however, I got the
> following error and I could not get the final model. Could you help me with
> this? Thank you in advance!
>
> The command line I entered:
> bzcat 40m_training_data.bz2 | python preprocess.py |
> ./mosesdecoder/bin/lmplz -o 3 > 40m_training_data.arpa
>
> Error message I got at the first step of Counting and sorting n-grams (out
> of 5 steps):
> terminate called after throwing an instance of 'util::FDException'
> what(): util/file.cc:220 in void util::WriteOrThrow(int, const void*,
> std::size_t) threw FDException because `ret < 1'.
> No space left on device in /tmp/lmWazsTd (deleted) while writing 732919964
> bytes
> Aborted
>
> Best,
> Yu-Wen
>
> ------------------------------
>
> Message: 2
> Date: Fri, 7 Aug 2020 19:04:06 -0700
> From: Hieu Hoang <hieuhoang@gmail.com>
> Subject: Re: [Moses-support] Failed to get the language model by KenLM
> To: "Chen, Y." <y.w.chen.1@student.rug.nl>, moses-support@mit.edu
> Message-ID: <169b5dcf-0cbd-8f64-f177-c8f52150a2f9@gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> use lmplz -T [path]
>
> and set path to somewhere with lots of disk space
>
> On Wed, Aug 5, 2020, 12:08 PM, Chen, Y. wrote:
> > To whom it may concern,
> >
> > I used Moses to train the language model by KenLM, however, I got the
> > following error and I could not get the final model. Could you help me
> > with this? Thank you in advance!
> >
> > The command line I entered:
> > bzcat 40m_training_data.bz2 | python preprocess.py |
> > ./mosesdecoder/bin/lmplz -o 3 > 40m_training_data.arpa
> > Error message I got at the first step of Counting and sorting n-grams
> > (out of 5 steps):
> > terminate called after throwing an instance of 'util::FDException'
> >   what():  util/file.cc:220 in void util::WriteOrThrow(int, const
> > void*, std::size_t) threw FDException because `ret < 1'.
> > No space left on device in /tmp/lmWazsTd (deleted) while writing
> > 732919964 bytes
> > Aborted
> >
> > Best,
> > Yu-Wen
> >
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Hieu Hoang
> http://statmt.org/hieu
>
>
> ------------------------------
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> End of Moses-support Digest, Vol 166, Issue 5
> *********************************************
>
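For reference, Hieu's suggestion in the quoted thread ("use lmplz -T [path]") can be applied to the original command as follows. This is only a sketch: /data/tmp is an illustrative scratch path, and you should substitute any filesystem with enough free space. The -T (temporary-file directory) and -S (memory limit) flags are part of KenLM's lmplz.

```shell
# Re-run lmplz with its temporary files redirected away from a small /tmp.
# -T sets the directory lmplz writes its sort scratch files to;
# -S caps memory so sorting spills to that directory predictably.
# /data/tmp is a placeholder -- point it at a volume with ample space.
mkdir -p /data/tmp
bzcat 40m_training_data.bz2 \
  | python preprocess.py \
  | ./mosesdecoder/bin/lmplz -o 3 -T /data/tmp -S 50% \
  > 40m_training_data.arpa
```

A quick `df -h /data/tmp` beforehand will confirm the chosen volume actually has room for the intermediate files (which can exceed the size of the compressed input several times over).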
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 166, Issue 6
*********************************************