Moses-support Digest, Vol 110, Issue 36

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Chinese & Arabic Tokenizers (Dingyuan Wang)
2. Re: Doubts on Multiple Decoding Paths (Philipp Koehn)
3. Re: Chinese & Arabic Tokenizers (Matthias Huck)
4. 1st CFP: LREC 9th Workshop and Shared Task on Building and
Using Comparable Corpora (Reinhard Rapp)


----------------------------------------------------------------------

Message: 1
Date: Sat, 19 Dec 2015 01:19:49 +0800
From: Dingyuan Wang <abcdoyle888@gmail.com>
Subject: Re: [Moses-support] Chinese & Arabic Tokenizers
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAFt8H75PZ_t9J6cPcX8EsHxzrs88mTLhzd7GP2EBrMRtqMrApA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Tom,

As far as I know, the following are widely-used and open-source Chinese
tokenizers:

* https://github.com/fxsjy/jieba
* http://sourceforge.net/projects/zpar/
* https://github.com/NLPchina/ansj_seg

And this proprietary one:

* http://ictclas.nlpir.org/

(Disclaimer: I am one of the developers of jieba, and I personally use
this.)
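
For anyone curious what dictionary-based segmentation looks like in principle, here is a toy forward-maximum-matching sketch. This is not jieba's actual algorithm (jieba builds a DAG over a prefix dictionary and uses an HMM for unknown words); the mini dictionary below is made up purely for illustration:

```python
# Toy forward-maximum-matching Chinese segmenter.
# Real tokenizers (jieba, ZPar, ansj_seg) use statistical models;
# this only illustrates the basic dictionary-lookup idea.

# Hypothetical mini dictionary covering the example sentence.
DICT = {"我", "来到", "北京", "清华", "清华大学", "大学"}
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(text):
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT or j == i + 1:  # fall back to single char
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(fmm_segment("我来到清华大学"))  # ['我', '来到', '清华大学']
```

Greedy maximum matching fails on ambiguous strings where a shorter word leads to a better overall segmentation, which is exactly why the real tools score whole segmentation paths instead.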

--
Dingyuan Wang
On 19 Dec 2015 at 00:51, "Tom Hoar" <tahoar@precisiontranslationtools.com> wrote:

> I'm looking for Chinese and Arabic tokenizers. We've been using
> Stanford's for a while but it has drawbacks. The Chinese mode loads its
> statistical models very slowly. The Arabic mode stems the resulting
> tokens. The coup de grace is that their latest jar update (9 days ago)
> was compiled to run only with Java 1.8.
>
> So, with the exception of Stanford, what choices are available for
> Chinese and Arabic that you're finding worthwhile?
>
> Thanks!
> Tom
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151218/9c3e749c/attachment-0001.html

------------------------------

Message: 2
Date: Fri, 18 Dec 2015 13:08:47 -0500
From: Philipp Koehn <phi@jhu.edu>
Subject: Re: [Moses-support] Doubts on Multiple Decoding Paths
To: Anoop <anoop.kunchukuttan@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDBm0dodHBh8OLqANxceUy+bmgZBP4qVonGrnLJvXLHrpw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

that sounds right.

The "union" option is fairly new, developed by Michael Denkowski.
I am not aware of any empirical study of the different methods,
so I'd be curious to see what you find.
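
To make the *union* semantics concrete, here is a toy sketch of how I read the documentation: each phrase pair gets one score per table, with missing entries filled by 0, or by the average over the tables that do contain the pair when default-average-others=true. The scores and the helper are hypothetical, not Moses code (Moses works on log-probabilities with several feature components per table):

```python
# Toy illustration of "union" scoring over multiple phrase tables.
# Hypothetical single scores per table, for illustration only.

tables = [
    {("casa", "house"): 0.7, ("perro", "dog"): 0.9},  # table 0
    {("casa", "house"): 0.5},                         # table 1: no "perro"
]

def union_scores(pair, tables, default_average_others=False):
    """One score per table; a missing entry gets 0.0, or the average of
    the tables that do contain the pair when default_average_others is set."""
    known = [t[pair] for t in tables if pair in t]
    fallback = sum(known) / len(known) if (default_average_others and known) else 0.0
    return [t.get(pair, fallback) for t in tables]

print(union_scores(("perro", "dog"), tables))                              # [0.9, 0.0]
print(union_scores(("perro", "dog"), tables, default_average_others=True))  # [0.9, 0.9]
```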

-phi

On Fri, Dec 18, 2015 at 1:35 AM, Anoop <anoop.kunchukuttan@gmail.com> wrote:

> Hi,
>
> I am trying to understand the multiple decoding paths feature in Moses.
>
> The documentation (http://www.statmt.org/moses/?n=Advanced.Models#ntoc7)
> describes 3 methods: both, either and union
>
> The following is my understanding of the options. Please let me know if it
> is correct:
>
>
> - With the *both* option, the constituent phrases of the target hypothesis
> come from both tables (since they are shared) and are scored with both
> tables.
> - With the *either* option, all the constituent phrases of a target
> hypothesis come from a single table, but different hypotheses can use
> different tables. Each hypothesis is scored using one table only. I did not
> understand the "additional options are collected from the other tables"
> bit in the documentation.
> - With the *union* option, the constituent phrases of a target hypothesis
> come from different tables and are scored using scores from all the tables.
> A score of 0 is used if a phrase pair doesn't appear in some table, unless
> the *default-average-others=true* option is used.
>
>
> Regards,
> Anoop.
>
> --
> I claim to be a simple individual liable to err like any other fellow
> mortal. I own, however, that I have humility enough to confess my errors
> and to retrace my steps.
>
> http://flightsofthought.blogspot.com
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151218/151fdfd6/attachment-0001.html

------------------------------

Message: 3
Date: Fri, 18 Dec 2015 18:08:32 +0000
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Chinese & Arabic Tokenizers
To: Dingyuan Wang <abcdoyle888@gmail.com>
Cc: Tom Hoar <tahoar@precisiontranslationtools.com>, moses-support
<moses-support@mit.edu>
Message-ID: <1450462112.22442.51.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"

Hi Tom,

There used to be a freely available Chinese word segmenter provided by
the LDC as well. Unfortunately, things keep disappearing from the web.
https://web.archive.org/web/20130907032401/http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm

For Arabic, I think that many academic research groups used to work with
MADA. But it seems like you'll need a special license for commercial
use.
http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
https://secure.nouvant.com/columbia/technology/cu14012/license/492

Or you could try MorphTagger/Segmenter, a segmentation tool for Arabic SMT.
http://www.hltpr.rwth-aachen.de/~mansour/MorphSegmenter/
It may not be maintained any more. You can contact Saab Mansour to ask
about it.

Saab has published a couple of papers about this, some of which report
comparisons of different Arabic segmentation strategies for SMT.
http://www.hltpr.rwth-aachen.de/publications/download/687/Mansour-IWSLT-2010.pdf
http://www.hltpr.rwth-aachen.de/publications/download/808/Mansour-LREC-2012.pdf
http://link.springer.com/article/10.1007%2Fs10590-011-9102-0

Cheers,
Matthias


On Sat, 2015-12-19 at 01:19 +0800, Dingyuan Wang wrote:
> Hi Tom,
>
> As far as I know, the following are widely-used and open-source Chinese
> tokenizers:
>
> * https://github.com/fxsjy/jieba
> * http://sourceforge.net/projects/zpar/
> * https://github.com/NLPchina/ansj_seg
>
> And this proprietary one:
>
> * http://ictclas.nlpir.org/
>
> (Disclaimer: I am one of the developers of jieba, and I personally use
> this.)
>
> --
> Dingyuan Wang
> On 19 Dec 2015 at 00:51, "Tom Hoar" <tahoar@precisiontranslationtools.com> wrote:
>
> > I'm looking for Chinese and Arabic tokenizers. We've been using
> > Stanford's for a while but it has drawbacks. The Chinese mode loads its
> > statistical models very slowly. The Arabic mode stems the resulting
> > tokens. The coup de grace is that their latest jar update (9 days ago)
> > was compiled to run only with Java 1.8.
> >
> > So, with the exception of Stanford, what choices are available for
> > Chinese and Arabic that you're finding worthwhile?
> >
> > Thanks!
> > Tom
> > _______________________________________________
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

Message: 4
Date: Fri, 18 Dec 2015 21:37:04 +0100
From: "Reinhard Rapp" <reinhardrapp@gmx.de>
Subject: [Moses-support] 1st CFP: LREC 9th Workshop and Shared Task on
Building and Using Comparable Corpora
To: <lr_egroup@mail.iiit.ac.in>, <moses-support@mit.edu>,
<news@multilingual.com>
Message-ID: <01E7FD1551D045078DA604C4B95810B1@ASUSPC>
Content-Type: text/plain; charset="windows-1252"

============================================================

Call for Papers

9th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA

Special Topic: Continuous Vector Space Models and Comparable Corpora

Shared Task: Identifying Parallel Sentences in Comparable Corpora

https://comparable.limsi.fr/bucc2016/

Monday, May 23, 2016

Co-located with LREC 2016, Portorož, Slovenia

DEADLINE FOR PAPERS: February 10, 2016

============================================================

MOTIVATION

In the language engineering and the linguistics communities, research
on comparable corpora is driven by two main motivations. In
language engineering, on the one hand, it is chiefly motivated by the
need to use comparable corpora as training data for statistical
Natural Language Processing applications such as statistical machine
translation or cross-lingual retrieval. In linguistics, on the other
hand, comparable corpora are of interest in themselves by making
possible inter-linguistic discoveries and comparisons. It is generally
accepted in both communities that comparable corpora are documents in
one or several languages that are comparable in content and form in
various degrees and dimensions. We believe that the linguistic
definitions and observations related to comparable corpora can improve
methods to mine such corpora for applications of statistical NLP. As
such, it is of great interest to bring together builders and users of
such corpora.


SHARED TASK

There will be a shared task on "Identifying Parallel Sentences in
Comparable Corpora" whose details will be described on the
workshop website (URL see above).


TOPICS

Beyond this year's special topic "Continuous Vector Space Models and
Comparable Corpora" and the shared task on "Identifying Parallel
Sentences in Comparable Corpora", we solicit contributions including
but not limited to the following topics:

Building comparable corpora:

* Human translations
* Automatic and semi-automatic methods
* Methods to mine parallel and non-parallel corpora from the Web
* Tools and criteria to evaluate the comparability of corpora
* Parallel vs non-parallel corpora, monolingual corpora
* Rare and minority languages, across language families
* Multi-media/multi-modal comparable corpora

Applications of comparable corpora:

* Human translations
* Language learning
* Cross-language information retrieval & document categorization
* Bilingual projections
* Machine translation
* Writing assistance

Mining from comparable corpora:

* Cross-language distributional semantics
* Extraction of parallel segments or paraphrases from comparable corpora
* Extraction of translations of single words and multi-word expressions,
proper names, named entities, etc.


IMPORTANT DATES

February 10, 2016 Deadline for submission of full papers
March 10, 2016 Notification of acceptance
March 25, 2016 Camera-ready papers due
May 23, 2016 Workshop date


SUBMISSION INFORMATION

Papers should follow the LREC main conference formatting details (to be
announced on the conference website http://lrec2016.lrec-conf.org/en/ )
and should be submitted as a PDF-file via the START workshop manager at

https://www.softconf.com/lrec2016/BUCC2016/

Contributions can be short or long papers. Short paper submission must
describe original and unpublished work without exceeding six (6)
pages. Characteristics of short papers include: a small, focused
contribution; work in progress; a negative result; an opinion piece;
an interesting application nugget. Long paper submissions must
describe substantial, original, completed and unpublished work without
exceeding ten (10) pages.

Reviewing will be double blind, so the papers should not reveal the
authors' identity. Accepted papers will be published in the workshop
proceedings.

Double submission policy: Parallel submission to other meetings or
publications is possible, but the workshop organizers must be notified
of it immediately.

Please also observe the following two paragraphs which are applicable
to all LREC workshops as well as to the main conference:

Describing your LRs in the LRE Map is now a normal practice in the
submission procedure of LREC (introduced in 2010 and adopted by other
conferences). To continue the efforts initiated at LREC 2014 about
"Sharing LRs" (data, tools, web-services, etc.), authors will have
the possibility, when submitting a paper, to upload LRs in a special
LREC repository. This effort of sharing LRs, linked to the LRE Map
for their description, may become a new "regular" feature for conferences
in our field, thus contributing to creating a common repository where
everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so
as to allow the community to understand the whole context and also
replicate the experiments conducted by other researchers, LREC 2016
endorses the need to uniquely Identify LRs through the use of the
International Standard Language Resource Number (ISLRN, www.islrn.org),
a Persistent Unique Identifier to be assigned to each Language Resource.
The assignment of ISLRNs to LRs cited in LREC papers will be offered at
submission time.


ORGANISERS

Reinhard Rapp, University of Mainz (Germany)
Pierre Zweigenbaum, LIMSI, CNRS, Orsay (France)
Serge Sharoff, University of Leeds (UK)


FURTHER INFORMATION

Reinhard Rapp: reinhardrapp (at) gmx (dot) de


SCIENTIFIC COMMITTEE

* Ahmet Aker, University of Sheffield (UK)
* Hervé Déjean (Xerox Research Centre Europe, Grenoble, France)
* Éric Gaussier (Université Joseph Fourier, Grenoble, France)
* Gregory Grefenstette (INRIA, Saclay, France)
* Silvia Hansen-Schirra (University of Mainz, Germany)
* Hitoshi Isahara (Toyohashi University of Technology, Japan)
* Kyo Kageura (University of Tokyo, Japan)
* Philippe Langlais (Université de Montréal, Canada)
* Michael Mohler (Language Computer Corp., US)
* Emmanuel Morin (Université de Nantes, France)
* Lene Offersgaard (University of Copenhagen, Denmark)
* Dragos Stefan Munteanu (Language Weaver, Inc., US)
* Ted Pedersen (University of Minnesota, Duluth, US)
* Reinhard Rapp (University of Mainz, Germany)
* Serge Sharoff (University of Leeds, UK)
* Michel Simard (National Research Council Canada)
* Pierre Zweigenbaum (LIMSI-CNRS, Orsay, France)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151218/e3fb6ee4/attachment.html

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 110, Issue 36
**********************************************
