Moses-support Digest, Vol 129, Issue 2

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. CFP: WAT2017 (The 4th Workshop on Asian Translation)
(Toshiaki Nakazawa)
2. a tool for extracting specific terms from the corpora
(Mariusz Hawryłkiewicz)
3. Re: a tool for extracting specific terms from the corpora
(Mathias Müller)
4. Re: a tool for extracting specific terms from the corpora
(Mariusz Hawryłkiewicz)

----------------------------------------------------------------------

Message: 1
Date: Mon, 03 Jul 2017 11:50:30 +0900
From: Toshiaki Nakazawa <nakazawa@nlp.ist.i.kyoto-u.ac.jp>
Subject: [Moses-support] CFP: WAT2017 (The 4th Workshop on Asian
Translation)
To: moses-support@mit.edu
Cc: nakazawa@nlp.ist.i.kyoto-u.ac.jp
Message-ID: <m21spy44jt.wl-nakazawa@nlp.ist.i.kyoto-u.ac.jp>
Content-Type: text/plain; charset=US-ASCII

Dear all MT researchers/users,

I'm Toshiaki Nakazawa from JST (Japan Science and Technology Agency),
Japan. This is the call for participation in the shared tasks and for
papers at the 4th Workshop on Asian Translation (WAT2017), a workshop
of IJCNLP2017. Those who are working on machine translation, please
join us.

** Please note that the deadline for the task submission is approaching! **

IMPORTANT DATES
---------------

July 31        Translation Task Submission Deadline
August 20      Small NMT Task Submission Deadline
September 5    Research Paper and System Description Submission Deadline
September 30   Notification of Research Paper Acceptance and Review Feedback of System Description
October 10     Camera-ready Deadline
November 27    WAT2017

* All deadlines are calculated at 11:59PM UTC-7
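
For participants in other time zones, the conversion can be sketched in
Python as follows. This is only an illustration: the helper name
`deadline` is hypothetical, and the year 2017 is taken from the workshop
name.

```python
from datetime import datetime, timedelta, timezone

# "UTC-7" as stated in the call; a fixed offset, no daylight-saving rules.
DEADLINE_ZONE = timezone(timedelta(hours=-7))


def deadline(year, month, day):
    """Return 11:59PM UTC-7 on the given date as a timezone-aware datetime."""
    return datetime(year, month, day, 23, 59, tzinfo=DEADLINE_ZONE)


# Translation task submission deadline (July 31).
task_deadline = deadline(2017, 7, 31)

# The same instant expressed in UTC and in Japan Standard Time (UTC+9).
print(task_deadline.astimezone(timezone.utc).isoformat())
print(task_deadline.astimezone(timezone(timedelta(hours=9))).isoformat())
```

In other words, a deadline of "July 31" in UTC-7 falls on the morning of
August 1 in UTC and on the afternoon of August 1 in Japan.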

Best regards,

---------------------------------------------------------------------------
WAT 2017
(The 4th Workshop on Asian Translation)
in conjunction with IJCNLP2017
http://lotus.kuee.kyoto-u.ac.jp/WAT/
November 27, 2017, Taipei, Taiwan

Following the success of the previous WAT workshops (WAT2014, WAT2015,
WAT2016), WAT2017 will bring together machine translation researchers
and users to try, evaluate, share and discuss brand-new ideas about
machine translation. For the 4th WAT, we will include two new
translation subtasks:

* Japanese-English newspaper translation subtask
* Japanese-English recipe translation subtask

Also, we will include a brand-new task:

* small NMT task

The goal of this task is to build a small neural machine translation
system while keeping reasonable translation quality. There is high
demand in industry to equip smart devices with translation
capabilities. Although neural machine translation has reached the point
where such a capability is no longer a dream, it usually needs huge
resources that are not available on everyday devices. The current
solution is to run a translation engine on powerful servers and have
the device talk to them over the Internet. However, a reliable
low-latency connection is not available in most parts of the world
and will not be in the short term. If we can build a small system while
keeping reasonable translation capability, it will have a huge impact
on the application of machine translation.

Unfortunately, almost all research on neural machine translation
is biased toward improving quality, with little consideration of
computing resources at inference time. We hope this shared task
provides a common language and assets for the NLP community to open a
new research field, which will have a huge impact on cross-language
communication in our society.

The small NMT task consists of two subtasks.

* Model optimization subtask:

The participants are given pre-processed parallel data (essentially
pairs of token id sequences) and requested to build a neural machine
translation system which converts id sequences to id sequences. The
participants are required to submit the output id sequences and the
number of model parameters as well as model details.

* Full optimization subtask:

The participants are given raw parallel data and requested to build a
small neural machine translation system which converts texts to
texts. The participants are required to submit the output translations
and any relevant measurements of the size of the system as well as
system details.

In both subtasks, the participants are strongly encouraged to make the
system as small as possible. Exploration of various setups, including
extreme ones, is especially recommended.

In addition to the shared tasks, the workshop will also feature
scientific papers on topics related to machine translation,
especially for Asian languages. Topics of interest include, but are
not limited to:

- analysis of the automatic/human evaluation results in the past WAT workshops
- word-/phrase-/syntax-/semantics-/rule-based, neural and hybrid machine translation
- Asian language processing
- incorporating linguistic information into machine translation
- decoding algorithms
- system combination
- error analysis
- manual and automatic machine translation evaluation
- machine translation applications
- quality estimation
- domain adaptation
- machine translation for low resource languages
- language resources

************************* IMPORTANT NOTICE *************************
Participants of the previous workshop are also required to sign up to
WAT2017
********************************************************************

TASK
----

The task is to improve the text translation quality for scientific
papers and patent documents. Participants choose any of the subtasks
in which they would like to participate and translate the test data
using their machine translation systems. The WAT organizers will
evaluate the results submitted using automatic evaluation and human
evaluation. We will also provide a baseline machine translation.

Subtasks:
  Scientific Paper Subtasks: [Asian Scientific Paper Excerpt Corpus (ASPEC)]
    English/Chinese <--> Japanese
  Patent Subtasks: [Japan Patent Office Patent Corpus (JPC)]
    English/Chinese/Korean <--> Japanese
  Newswire Subtasks:
    English <--> Indonesian [BPPT Corpus]
    English <--> Japanese [NEW! JIJI Corpus]
  Mixed domain Subtasks: [IIT Bombay (IITB) Corpus]
    Hindi <--> English/Japanese
  Recipe Subtask: [NEW! Cookpad Comparable Corpus]
    Japanese <--> English

Dataset:

* Scientific paper Subtasks:

WAT uses ASPEC for the dataset, including training, development,
development test and test data. Participants in the scientific paper
subtasks must obtain a copy of ASPEC themselves. ASPEC consists of
approximately 3 million Japanese-English parallel sentences from paper
abstracts (ASPEC-JE) and approximately 0.7 million Japanese-Chinese
paper excerpts (ASPEC-JC).

* Patent Subtasks:

WAT uses the JPO Patent Corpus, which is constructed by the Japan Patent
Office (JPO). This corpus consists of 1 million English-Japanese
parallel sentences, 1 million Chinese-Japanese parallel sentences, and
1 million Korean-Japanese parallel sentences from patent descriptions
in four categories. Participants in the patent subtasks are required to
obtain it from the WAT2017 site for the JPO Patent Corpus.

* Newswire Subtasks (English <--> Indonesian):

WAT uses the BPPT Corpus, which is constructed by Badan Pengkajian dan
Penerapan Teknologi (BPPT). This corpus consists of 50,000
English-Indonesian parallel sentences from news articles in five
categories. Participants in this subtask are required to obtain it from
the WAT2017 site for the BPPT Corpus.

* Newswire Subtasks (English <--> Japanese):

WAT uses the JIJI Corpus, which is constructed by Jiji Press Ltd. in
collaboration with the National Institute of Information and
Communications Technology (NICT). This corpus consists of a
Japanese-English news corpus of 200K parallel sentences from Jiji
Press news in various categories. Participants in this subtask
are required to obtain it from the WAT2017 site for the JIJI Corpus.

* Mixed domain Subtask:

- Hindi <--> English
WAT uses the IITB Corpus for the dataset for training, development,
development test and test data. The training corpus is mixed domain
and contains around 1 million lines of sentences and phrases. In order
to access the corpus, participants should sign the following agreement,
scan it and send it to the address mentioned in it. The development and
test sets are from the News domain and are exactly the same as the ones
in WMT 2014.

- Hindi <--> Japanese Pivot Language Task

For the first time we are introducing a pivot language task. For this task, participants can use the following corpora.

1. A parallel corpus (created using openly available corpora) which is
located at
http://lotus.kuee.kyoto-u.ac.jp/WAT/Hindi-corpus/WAT2017-Ja-Hi.zip.

2. The Hindi-English (IITB) task corpus and the English-Japanese
(ASPEC) task corpus for pivoting. For triangulation of the
source-pivot and pivot-target phrase tables they can use the scripts
provided by: MultiMT (https://github.com/tamhd/MultiMT).

The objective of this task is to compare the performance of a baseline
system constructed only on a mixed domain parallel corpus with a
system that uses additional mixed domain corpus by means of pivoting.
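
As a rough illustration of what triangulation means here, the sketch
below combines a source-pivot table and a pivot-target table by summing
over shared pivot phrases, p(t|s) = sum over p of p(p|s) * p(t|p). It is
a minimal Python sketch under stated assumptions, not the MultiMT
scripts: the function name `triangulate`, the nested-dict table layout,
and the toy romanized phrases and probabilities are all invented for
illustration. Real Moses phrase tables carry several scores per phrase
pair, and this sketch handles only one.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine p(pivot | source) and p(target | pivot) into p(target | source).

    Both tables are nested dicts: table[left_phrase][right_phrase] = probability.
    Only pivot phrases that occur in both tables contribute.
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                # Marginalize over the pivot phrase.
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Toy example (invented numbers): Hindi -> English -> Japanese.
hindi_english = {"ghar": {"house": 0.7, "home": 0.3}}
english_japanese = {
    "house": {"ie": 0.9, "uchi": 0.1},
    "home": {"ie": 0.4, "uchi": 0.6},
}

hindi_japanese = triangulate(hindi_english, english_japanese)
for tgt, prob in sorted(hindi_japanese["ghar"].items()):
    print(f"ghar -> {tgt}: {prob:.2f}")
```

With these toy numbers, "ie" receives 0.7 * 0.9 + 0.3 * 0.4 = 0.75 and
"uchi" receives 0.7 * 0.1 + 0.3 * 0.6 = 0.25.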

* Recipe Subtasks:

WAT uses the Recipe Corpus, which is constructed by Cookpad Inc. This
corpus consists of 16,282 Japanese-English parallel sentences from
recipes. Participants in the recipe subtask are required to obtain it
from the WAT2017 site for the Recipe Corpus.

EVALUATION
----------

Automatic evaluation:
We are providing an automatic evaluation server. It is free for
everyone, but you need to create an account to submit for evaluation.
Viewing the list of evaluation results does not require an account.

Sign-up: http://lotus.kuee.kyoto-u.ac.jp/WAT/registration/index.html
Eval. result: http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/index.html

Human evaluation:
Both crowdsourcing evaluation and JPO adequacy evaluation will be
carried out for selected subtasks and selected submitted systems (the
details will be announced later). Participants can submit one
translation result for each subtask.

INVITED TALK
------------

TBA

ORGANIZERS
----------

Toshiaki Nakazawa, Japan Science and Technology Agency (JST), Japan
Hideya Mino, National Institute of Information and Communications Technology (NICT), Japan
Chenchen Ding, National Institute of Information and Communications Technology (NICT), Japan
Shohei Higashiyama, National Institute of Information and Communications Technology (NICT), Japan
Isao Goto, Japan Broadcasting Corporation (NHK), Japan
Graham Neubig, Nara Institute of Science and Technology (NAIST), Japan
Hideo Kazawa, Google, Japan
Yusuke Oda, Nara Institute of Science and Technology (NAIST), Japan
Jun Harashima, Cookpad Inc., Japan
Sadao Kurohashi, Kyoto University, Japan
Ir. Hammam Riza, Agency for the Assessment and Application of Technology (BPPT), Indonesia
Pushpak Bhattacharyya, Indian Institute of Technology Bombay (IIT), India

CONTACT
-------

wat@nlp.ist.i.kyoto-u.ac.jp

------------------------------

Message: 2
Date: Tue, 4 Jul 2017 09:39:55 +0200
From: Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com>
Subject: [Moses-support] a tool for extracting specific terms from the
corpora
To: moses-support@mit.edu
Message-ID:
<CAO0ECUtq79D0wL12ePxDYTf=sYPPEPGrjtk5tyF5HDKh+qDfEQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear all,

I have been searching for the most efficient way to extract untranslatable
content that always begins with a capital letter (product names etc.) from
corpora. The problem is that every segment begins with a capital letter
and, obviously, a sentence may also begin with untranslatable content
(a product name) :-).

I want to avoid using common dictionaries to eliminate common words.

Would you have any other suggestions?

Thank you very much!
Mariusz

------------------------------

Message: 3
Date: Tue, 4 Jul 2017 10:02:14 +0200
From: Mathias Müller <mmueller@ifi.uzh.ch>
Subject: Re: [Moses-support] a tool for extracting specific terms from
the corpora
To: Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com>
Cc: moses-support@mit.edu
Message-ID: <755A5BB2-95C8-4ACE-A3C2-D49665258C17@ifi.uzh.ch>
Content-Type: text/plain; charset="utf-8"

Hi Mariusz

What do you mean by "extracting" this content? What do you need the list of proper names for? What are the languages involved?

Regards,
Mathias

Mathias Müller
AND-2-20
Institute of Computational Linguistics
University of Zurich
Switzerland
+41 44 635 75 81
mmueller@cl.uzh.ch


------------------------------

Message: 4
Date: Tue, 4 Jul 2017 10:41:28 +0200
From: Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com>
Subject: Re: [Moses-support] a tool for extracting specific terms from
the corpora
To: Mathias Müller <mmueller@ifi.uzh.ch>
Cc: moses-support@mit.edu
Message-ID:
<CAO0ECUvWqUFPmc474CrMuTbMfGXrh+fSw+kgVF2+UobKqKKs+w@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Mathias, thank you for getting back - let me give you an example from a
monolingual EN corpus:

*Acoustic* measurement precision and uncertainty.
Each press of the *Acoustic Output* key decreases the transmission power
setting (TX) displayed in the monitor display.

In the first sentence the word Acoustic should not be exported. In the
second sentence Acoustic Output should be.
I have written a program in Java that exports every term or group of
terms that starts with a capital letter, but this obviously also includes
words like the one in the first example, which it should not.

The goal is that only the proper names are exported to a separate
file.

Best regards
Mariusz
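
One dictionary-free heuristic that fits the example above can be
sketched in Python as follows. It is an illustration only: it is not
something proposed in this thread and not the Java program mentioned
above, and the function names are hypothetical. The assumption is one
segment per line. A run of capitalized tokens counts as a term only if
it has been seen at least once away from the start of a segment, since
the first token of a segment is capitalized regardless of what it is.

```python
import re
from collections import Counter


def capitalized_runs(segment):
    """Yield (start_index, run) for each maximal run of capitalized tokens."""
    tokens = re.findall(r"\w+", segment)
    run, start = [], 0
    for i, tok in enumerate(tokens):
        if tok[0].isupper():
            if not run:
                start = i
            run.append(tok)
        elif run:
            yield start, " ".join(run)
            run = []
    if run:
        yield start, " ".join(run)


def extract_terms(segments):
    """Return {term: count} for runs confirmed by a non-initial occurrence."""
    mid, initial = Counter(), Counter()
    for segment in segments:
        for start, run in capitalized_runs(segment):
            # Segment-initial capitalization is ambiguous, so track it apart.
            (initial if start == 0 else mid)[run] += 1
    # Initial occurrences are counted only for terms also seen mid-segment.
    return {term: mid[term] + initial[term] for term in mid}


corpus = [
    "Acoustic measurement precision and uncertainty.",
    "Each press of the Acoustic Output key decreases the transmission "
    "power setting (TX) displayed in the monitor display.",
]
print(extract_terms(corpus))
```

On the two example segments this keeps "Acoustic Output" and "TX" and
drops the segment-initial "Acoustic" and "Each". The obvious limits are
that a product name which only ever appears at the start of a segment is
missed, and that title-case headings produce false positives.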


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 129, Issue 2
*********************************************
