Moses-support Digest, Vol 85, Issue 15

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. test sets (Read, James C)
2. Re: 10 years of OPUS (Read, James C)
3. Re: getting WER metrics (Felipe Sánchez Martínez)
4. scores.size() == indexes.second - indexes.first failed
(Arththika Paramanathan)
5. Re: 10 years of OPUS (joerg)

----------------------------------------------------------------------

Message: 1
Date: Wed, 6 Nov 2013 07:34:40 +0000
From: "Read, James C" <jcread@essex.ac.uk>
Subject: [Moses-support] test sets
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<F00840E41983C645928E21E3C35F4EB1012CF35451@mbx1-node2.essex.ac.uk>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

Does anybody know of efforts to create large test sets covering a wide range of genres, so that a researcher could get a general idea of how their system performs in the open domain (e.g. speech-to-speech or webpage translation)?

thanks,
James

------------------------------

Message: 2
Date: Wed, 6 Nov 2013 07:48:20 +0000
From: "Read, James C" <jcread@essex.ac.uk>
Subject: Re: [Moses-support] 10 years of OPUS
To: Jorg Tiedemann <tiedeman@gmail.com>, "moses-support@MIT.EDU"
<moses-support@mit.edu>
Message-ID:
<F00840E41983C645928E21E3C35F4EB1012CF35468@mbx1-node2.essex.ac.uk>
Content-Type: text/plain; charset="iso-8859-1"

I wonder if there would be demand for ready-made phrase tables generated from the data.
________________________________________
From: moses-support-bounces@mit.edu [moses-support-bounces@mit.edu] on behalf of Jorg Tiedemann [tiedeman@gmail.com]
Sent: 02 November 2013 18:54
To: moses-support@MIT.EDU
Subject: [Moses-support] 10 years of OPUS

After attending the 20-years-of-bitext workshop at EMNLP I suddenly realized that OPUS (http://opus.lingfil.uu.se) also has its 10-year anniversary this year (send me some champagne if you like). I will celebrate this anniversary by sending out this e-mail with some recent news and highlights.

OPUS is a growing collection of parallel corpora for many languages and various domains. The collection has become quite large and includes a variety of data sets and tools that are useful not only for statistical machine translation. OPUS has been extended a lot since its first appearance in 2003. Actually, the best birthday present would be for someone to start a mirror of OPUS. Let me know if you are interested.

Here are some of the highlights:

- over 150 languages and language variants
- over 5 billion aligned translation units
- downloads in XML/XCES, plain text (Moses/SMT) and TMX
- raw, tokenized and machine-annotated data
- monolingual data sets (for language modeling)
- search interfaces

Some recent news and data sets:

- EUbookshop: a large but noisy corpus (converted from PDF)
- Tatoeba: a small but clean corpus with many languages
- OpenSubtitles2012: an improved version of the 2011 version
- coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
- UN, MultiUN, Europarl v7: aligned for all language combinations
- word alignments and phrase tables for the majority of bitexts

The Web Site: http://opus.lingfil.uu.se
More information: http://opus.lingfil.uu.se/trac/wiki

Feedback is very welcome!
And, be nice to our server!

Jörg Tiedemann
tiedeman@gmail.com

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 3
Date: Wed, 06 Nov 2013 10:46:21 +0100
From: Felipe Sánchez Martínez <fsanchez@dlsi.ua.es>
Subject: Re: [Moses-support] getting WER metrics
To: Andrew Shin <ravenyj@hotmail.com>, moses-support
<moses-support@mit.edu>
Message-ID: <527A0FED.8070000@dlsi.ua.es>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi Andrew,

While porting the WER scorer I implemented in my old version of Moses to
a fresh, up-to-date version, I found that WER has already been added to
Moses: there is a translation scorer (CDER) that can also compute WER.
This is good news for you.

To run MERT using WER (1-WER, actually), you just need to add
--mertargs="\"--sctype WER\"" to the command line.
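
For reference, WER is usually defined as the word-level edit distance
(substitutions, insertions and deletions) between hypothesis and
reference, divided by the number of reference words; tuning on 1-WER
turns it into a score to maximize. The following is a minimal Python
sketch of that definition, not the Moses scorer itself. The function
name is made up for illustration, and the sketch assumes a single
sentence pair with whitespace tokenization (a tuning scorer would
normally aggregate over a whole corpus).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), not part of Moses.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER = {wer:.3f}, 1-WER = {1 - wer:.3f}")
```

Note that WER can exceed 1 when the hypothesis is much longer than the
reference, so 1-WER can become negative.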

Cheers
--
Felipe

On 06/11/13 02:58, Andrew Shin wrote:
> thank you very much for your reply.
>
> I would very much like to try your source files.
> I greatly appreciate your help.
>
> > Date: Tue, 5 Nov 2013 16:01:31 +0100
> > From: fsanchez@dlsi.ua.es
> > To: moses-support@mit.edu
> > CC: ravenyj@hotmail.com
> > Subject: Re: [Moses-support] getting WER metrics
> >
> > Hi Andrew,
> >
> > I recently implemented WER for tuning (MERT) with Moses on an old
> > version of Moses I am using. Contributing my code is on my TODO list; I
> > have not done it yet because I want to be sure that it does not break
> > anything in the up-to-date version. If you want to try yourself I can
> > send you the source files.
> >
> > Cheers
> > --
> > Felipe
> >
> > On 28/10/13 16:37, Francis Tyers wrote:
> > > There is a package in Apertium which is a simple perl script which
> > > calculates WER and PER:
> > >
> > > https://svn.code.sf.net/p/apertium/svn/trunk/apertium-eval-translator
> > >
> > > http://wiki.apertium.org/wiki/Evaluation#Using_apertium-eval-translator_for_WER_and_PER
> > >
> > > Fran
> > >
> > > On Mon, 28 Oct 2013 at 11:33 -0400, Philipp Koehn wrote:
> > >> Hi,
> > >>
> > >> Moses currently does not include a tool to measure WER.
> > >> It should be simple to write, so I would encourage you to
> > >> implement it and contribute it back.
> > >>
> > >> -phi
> > >>
> > >> On Sun, Oct 27, 2013 at 11:11 PM, Andrew Shin <ravenyj@hotmail.com> wrote:
> > >>> Hello,
> > >>> sorry to ask another question..
> > >>>
> > >>> I've computed BLEU scores in the past following the baseline tutorial,
> > >>> but is there a way to also get WER given a reference text?
> > >>>
> > >>> Thank you very much for your help.
> > >
> > >
> >
> > --
> > Felipe Sánchez Martínez
> > Dep. de Llenguatges i Sistemes Informàtics
> > Universitat d'Alacant, E-03071 Alacant (Spain)
> > Tel.: +34 965 903 400, ext: 2966 Fax: +34 965 909 326
> > http://www.dlsi.ua.es/~fsanchez

--
Felipe Sánchez Martínez
Dep. de Llenguatges i Sistemes Informàtics
Universitat d'Alacant, E-03071 Alacant (Spain)
Tel.: +34 965 903 400, ext: 2966 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez

------------------------------

Message: 4
Date: Wed, 6 Nov 2013 17:28:50 +0530
From: Arththika Paramanathan <arthiparamanathan@gmail.com>
Subject: [Moses-support] scores.size() == indexes.second -
indexes.first failed
To: moses-support@mit.edu
Message-ID:
<CAJSfqEx6u1BikRx6cRJdZDi8vgTcpgi1YgVMeGn2QLmjjG-XGQ@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,
I tried the Experiment Management System (EMS) and hit the issue shown below.
There are no errors in the TUNING_apply-weights.26.STDERR file.

/home/arththika/Desktop/arthi/mosesdecoder/bin
line=UnknownWordPenalty
WEIGHT UnknownWordPenalty0=
line=WordPenalty
WEIGHT WordPenalty0=
Check scores.size() == indexes.second - indexes.first failed in
./moses/ScoreComponentCollection.h:235
Aborted (core dumped)

--
regards,
P.Arththika

------------------------------

Message: 5
Date: Wed, 6 Nov 2013 14:05:20 +0100
From: joerg <tiedeman@gmail.com>
Subject: Re: [Moses-support] 10 years of OPUS
To: "Read, James C" <jcread@essex.ac.uk>
Cc: "moses-support@MIT.EDU" <moses-support@mit.edu>
Message-ID: <D288F8F4-5582-483F-8F15-C2B6507FAFA2@gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

This is a good question. I was also wondering about this and, as I wrote, I have started providing word alignments and even phrase tables for the data collected in OPUS. The idea is, of course, that people can skip running GIZA++ with standard settings over and over again on Europarl data and other standard sets. For example, you can now download word alignments for all Europarl v7 language pairs from OPUS:
http://opus.lingfil.uu.se/Europarl/wordalign/

This, of course, assumes that you're happy with my tokenization and other kinds of pre-processing that may have influenced the data. In some cases you may also want to do other things like compound splitting and preordering which would require a different alignment model.

Those folders also contain phrase tables, extracted with standard settings from the grow-diag-final-and alignments. This is maybe less interesting, as it assumes that you work with phrase-based SMT, that you work with the truecased data I provide, etc. The truecaser models are, of course, also available (for example, for Europarl in http://opus.lingfil.uu.se/Europarl/wordalign/truecaser/). Monolingual data files with the same tokenization are also available (Europarl: http://opus.lingfil.uu.se/Europarl/mono/).
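
For anyone who wants to inspect such a phrase table before deciding whether it is useful, here is a minimal Python sketch that reads one entry. It assumes the usual Moses text layout, with fields separated by "|||" in the order source phrase, target phrase, scores, and optionally word alignment and counts; whether every file on the server follows exactly this layout is an assumption worth checking. The names PhraseEntry and parse_phrase_table_line and the sample line are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    """One phrase-table entry (illustrative type, not a Moses API)."""
    source: str
    target: str
    scores: List[float]
    alignment: List[Tuple[int, int]]


def parse_phrase_table_line(line: str) -> PhraseEntry:
    """Parse one line, assuming 'src ||| tgt ||| scores ||| align ||| counts'."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")

    scores = [float(value) for value in fields[2].split()]

    # The alignment field is optional; each item looks like "0-1",
    # meaning source word 0 is aligned to target word 1.
    alignment = []
    if len(fields) > 3 and fields[3]:
        for pair in fields[3].split():
            src_idx, tgt_idx = pair.split("-")
            alignment.append((int(src_idx), int(tgt_idx)))

    return PhraseEntry(fields[0], fields[1], scores, alignment)


if __name__ == "__main__":
    # Invented sample line, not taken from any OPUS file.
    sample = "das Haus ||| the house ||| 0.8 0.5 0.7 0.4 ||| 0-0 1-1 ||| 10 12 8"
    print(parse_phrase_table_line(sample))
```

Phrase tables are typically distributed gzip-compressed, so in practice the lines would be read with gzip.open(path, "rt", encoding="utf-8") rather than from a string.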

As all of this takes a lot of space, I would like to know whether
- I should keep those files online
- I could remove some of them (for example, the phrase tables, which take most of the disk space)
- anyone is actually interested in using those models and alignments (and which ones)

Thank you for your feedback!

Jörg

**********************************************************************************
Jörg Tiedemann http://stp.lingfil.uu.se/~joerg/

On Nov 6, 2013, at 8:48 AM, Read, James C wrote:

> I wonder if there would be demand for ready-made phrase tables generated from the data.


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 85, Issue 15
*********************************************
