Moses-support Digest, Vol 92, Issue 39

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Tokenizer script (Cyrine NASRI)
2. CfP: Journal of Natural Language Engineering - Special Issue
on "Machine Translation Using Comparable Corpora" (Reinhard Rapp)
3. Re: Creating a 2-gram language model (Philipp Koehn)
4. Re: Tokenizer script (Tom Hoar)

----------------------------------------------------------------------

Message: 1
Date: Fri, 20 Jun 2014 11:45:02 +0200
From: Cyrine NASRI <cyrine.nasri@univ-lorraine.fr>
Subject: Re: [Moses-support] Tokenizer script
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAPg_V0gWee_uhN3BBXpSBiu01dBQjX2ZrH-0QvXY8ue7oFLM5g@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thank you Tom for your reply.
So I keep these "&quot;" & co. in the language model too?

Bests

2014-06-18 16:04 GMT+02:00 Cyrine NASRI <cyrine.nasri@univ-lorraine.fr>:

> Hello
> I have a concern about the tokenizer script.
>
> When I do the tokenization, I get some "&quot;" and "&apos;". When I leave
> them in for the training process, I think it damages the translation quality.
> So should I really leave them in, or transform them to " and ' after training?
>
> Thank you in advance for your reply
>
> Best Cyrine
>

------------------------------

Message: 2
Date: Fri, 20 Jun 2014 12:30:33 +0200
From: "Reinhard Rapp" <reinhardrapp@gmx.de>
Subject: [Moses-support] CfP: Journal of Natural Language Engineering
- Special Issue on "Machine Translation Using Comparable Corpora"
To: <IRList@lists.shef.ac.uk>, <listmaster@loria.fr>, <ln@cines.fr>,
<lr_egroup@mail.iiit.ac.in>, <moses-support@mit.edu>,
<news@multilingual.com>
Message-ID: <45AAADB6E08D4114A33C67C5B869DAD2@ASUSPC>
Content-Type: text/plain; charset="windows-1252"

***** Journal of Natural Language Engineering - Special Issue on "Machine Translation Using Comparable Corpora" *****

CALL FOR PAPERS

Statistical machine translation based on parallel corpora has been very successful. The major search engines' translation systems, which are used by millions of people, primarily use this approach, and it has been possible to support new language pairs in a fraction of the time that would be required when using more traditional rule-based methods.

In contrast, research on comparable corpora is still at an earlier stage. Comparable corpora can be defined as monolingual corpora covering roughly the same subject area in different languages but without being exact translations of each other.

However, despite its tremendous success, the use of parallel corpora in MT has a number of drawbacks:

1) It has been shown that translated language is somewhat different from original language; for example, Klebanov & Flor showed that "associative texture" is lost in translation.

2) As they require translation, parallel corpora will always be a far scarcer resource than comparable corpora. This is a severe drawback for a number of reasons:

a) Of the roughly 7000 languages in the world, of which 600 have a written form, the vast majority are of the "low resource" type.

b) The number of possible language pairs increases with the square of the number of languages. When using parallel corpora, one bitext is needed for each language pair. When using comparable corpora, one monolingual corpus per language suffices.

c) For improved translation quality, translation systems specialized for particular genres and domains are desirable. But it is far more difficult to acquire appropriate parallel training corpora than comparable ones.

d) As language evolves over time, the training corpora should be updated on a regular basis. Again, this is more difficult in the parallel case.
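
The scaling argument in point b) can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that one bitext serves both translation directions (so pairs are counted as unordered); the function names are hypothetical and chosen only for illustration.

```python
def bitexts_needed(n_languages: int) -> int:
    """Number of unordered language pairs, i.e. one bitext per pair.

    Assumption: a single bitext covers both translation directions.
    """
    return n_languages * (n_languages - 1) // 2


def monolingual_corpora_needed(n_languages: int) -> int:
    """With comparable corpora, one monolingual corpus per language suffices."""
    return n_languages


if __name__ == "__main__":
    # Compare the quadratic growth of bitexts with the linear growth
    # of monolingual corpora for a few language counts.
    for n in (10, 100, 600):
        print(n, bitexts_needed(n), monolingual_corpora_needed(n))
```

For the 600 written languages mentioned in point a), this gives 179,700 bitexts against 600 monolingual corpora.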

For such reasons it would be a big step forward if it were possible to base statistical machine translation on comparable rather than on parallel corpora: The acquisition of training data would be far easier, and the unnatural "translation bias" (source language shining through) within the training data could be avoided.

But is there any evidence that this is possible? Motivation for using comparable corpora in MT research comes from a cognitive perspective: Experience tells us that persons who have learned a second language completely independently from their mother tongue can nevertheless translate between the languages. That is, human performance shows that there must be a way to bridge the gap between languages which does not rely on parallel data. Using parallel data for MT is of course a nice shortcut. But avoiding this shortcut by doing MT based on comparable corpora may well be a key to a better understanding of human translation, and to better MT quality.

Work on comparable corpora in the context of MT has been ongoing for almost 20 years. It has turned out that this is a very hard problem to solve, but as it is among the grand challenges in multilingual NLP, interest has steadily increased. Apart from the increase in publications, this can be seen from the considerable number of research projects (such as ACCURAT and TTC) which are fully or partially devoted to MT using comparable corpora. Given also the success of the workshop series on "Building and Using Comparable Corpora" (BUCC), which is now in its seventh year, and following the publication of a related book (http://www.springer.com/computer/ai/book/978-3-642-20127-1), we think that it is now time to devote a journal special issue to this field. It is meant to bundle the latest top-class research, make it available to everybody working in the field, and at the same time give an overview of the state of the art to all interested researchers.

TOPICS OF INTEREST

We solicit contributions including but not limited to the following topics:

- Comparable corpora based MT systems (CCMTs)
- Architectures for CCMTs
- CCMTs for less-resourced languages
- CCMTs for less-resourced domains
- CCMTs dealing with morphologically rich languages
- CCMTs for spoken translation
- Applications of CCMTs
- CCMT evaluation
- Open source CCMT systems
- Hybrid systems combining SMT and CCMT
- Hybrid systems combining rule-based MT and CCMT
- Enhancing phrase-based SMT using comparable corpora
- Expanding phrase tables using comparable corpora
- Comparable corpora based processing tools/kits for MT
- Methods for mining comparable corpora from the Web
- Applying Harris' distributional hypothesis to comparable corpora
- Induction of morphological, grammatical, and translation rules from comparable corpora
- Machine learning techniques using comparable corpora
- Parallel corpora vs. pairs of non-parallel monolingual corpora
- Extraction of parallel segments or paraphrases from comparable corpora
- Extraction of bilingual and multilingual translations of single words and multi-word expressions, proper names, and named entities from comparable corpora

IMPORTANT DATES

December 1, 2014: Paper submission deadline
February 1, 2015: Notification
May 1, 2015: Deadline for revised papers
July 1, 2015: Final notification
September 1, 2015: Final paper due

GUEST EDITORS

Reinhard Rapp, Universities of Aix Marseille (France) and Mainz (Germany)
Serge Sharoff, University of Leeds (UK)
Pierre Zweigenbaum, LIMSI, CNRS (France)

FURTHER INFORMATION

Please use the following e-mail address to contact the guest editors: jnle.bucc (at) limsi (dot) fr

Further details on paper submission will be made available in due course at the BUCC website: http://comparable.limsi.fr/bucc2014/bucc-introduction.html


------------------------------

Message: 3
Date: Fri, 20 Jun 2014 07:58:15 -0400
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Creating a 2-gram language model
To: Rajkiran Rajkumar <rajkiran2507@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAAFADDD+pHv7jFjSM=j-V=i-peyXVQvDZzFNiZWzq6P+r=sYcQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Hi,

> And, in general, which is more efficient for a bilingual corpus of 160,000
> sentences? 2-gram or 3-gram?

Even with just 160,000 sentences you should build at least a 3-gram model
- for quality reasons not efficiency reasons (smaller models will be faster).

-phi
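
The size trade-off mentioned above (a lower-order model is smaller and therefore faster) can be illustrated with a toy count. This is only a sketch under stated assumptions: it counts distinct n-grams with sentence-boundary padding and applies no smoothing or pruning, so it is not how any particular language model toolkit builds its models. The function names and the three-sentence corpus are made up for illustration.

```python
from collections import Counter


def ngram_counts(sentences, order):
    """Count n-grams of the given order, padding with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def model_entries(sentences, max_order):
    """Distinct n-grams an order-`max_order` model would store.

    An n-gram model keeps entries for every order from 1 up to max_order,
    so the total is the sum over all those orders.
    """
    return sum(len(ngram_counts(sentences, k)) for k in range(1, max_order + 1))


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran", "a cat sat"]  # toy data
    print("2-gram model entries:", model_entries(corpus, 2))
    print("3-gram model entries:", model_entries(corpus, 3))
```

On real data the gap between the two totals grows quickly, which is the efficiency cost that the extra context of a 3-gram model is traded against.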

------------------------------

Message: 4
Date: Fri, 20 Jun 2014 19:03:31 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Tokenizer script
To: moses-support@mit.edu
Message-ID: <53A42313.4020407@precisiontranslationtools.com>
Content-Type: text/plain; charset="iso-8859-1"

Yes, Cyrine. You need to prepare the language model corpus the same as
the target half of your parallel corpus.
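
As a minimal sketch of what "prepared the same" means for the escaped characters discussed in this thread: the mapping below mirrors the special-character escaping that the Moses tokenizer is understood to apply, but it is an illustration only, and the function names are hypothetical. In practice you would run the same tokenizer script on both files rather than reimplement it.

```python
# Illustrative escape table; "&" must be handled first so that the
# entities produced by later replacements are not escaped again.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape(line: str) -> str:
    """Escape special characters the same way for every corpus."""
    for raw, entity in ESCAPES:
        line = line.replace(raw, entity)
    return line


def deescape(line: str) -> str:
    """Undo the escaping on translated output ("&amp;" is restored last)."""
    for raw, entity in reversed(ESCAPES):
        line = line.replace(entity, raw)
    return line


if __name__ == "__main__":
    target_side = ['He said "don\'t"']        # target half of the parallel corpus
    lm_corpus = ['She said "it\'s fine"']     # extra monolingual LM data
    # Apply the identical preparation step to both.
    print([escape(s) for s in target_side])
    print([escape(s) for s in lm_corpus])
    # After decoding, the output can be turned back into plain quotes.
    print(deescape(escape(target_side[0])))
```

The point of the sketch is that the language model and the translation model must see the same token forms; converting back to " and ' is a post-processing step on the decoder output, not something done to the training data.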

On 06/20/2014 04:45 PM, Cyrine NASRI wrote:
> Thank you Tom for your reply.
> So I keep these "&quot;" & co. in the language model too?
>
> Bests
>
>
> 2014-06-18 16:04 GMT+02:00 Cyrine NASRI <cyrine.nasri@univ-lorraine.fr
> <mailto:cyrine.nasri@univ-lorraine.fr>>:
>
> Hello
> I have a concern about the tokenizer script.
>
> When I do the tokenization, I get some "&quot;" and "&apos;".
> When I leave them in for the training process, I think it damages the
> translation quality.
> So should I really leave them in, or transform them to " and ' after
> training?
>
> Thank you in advance for your reply
>
> Best Cyrine
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 92, Issue 39
*********************************************
