Moses-support Digest, Vol 111, Issue 68

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Multilingually Sentence-Aligned Corpora (Graham Neubig)
2. Re: Multilingually Sentence-Aligned Corpora (Jörg Tiedemann)


----------------------------------------------------------------------

Message: 1
Date: Fri, 22 Jan 2016 13:32:33 -0500
From: Graham Neubig <neubig@is.naist.jp>
Subject: Re: [Moses-support] Multilingually Sentence-Aligned Corpora
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: "<moses-support@mit.edu>" <moses-support@mit.edu>
Message-ID:
<CADkjOCN5AQe4wSjNiMWSUaXB44bcMLUKGb=cqgsmpE4uLeXkHQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Marcin,

Wow, that would be really excellent. I'm looking forward to it!

Graham

On Fri, Jan 22, 2016 at 10:36 AM, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:

> Hi Graham,
> At the UN we are now working to release an official version of our data.
> As a bonus to the pair-wise alignment, it will contain a 6-way fully
> aligned subcorpus for English, French, Spanish, Russian, Chinese, Arabic;
> about 13M segments per language. We are waiting for some LREC feedback and
> the official greenlight from UN officials, but that should be a matter of a
> couple of weeks now (maybe one, maybe two, maybe four). Once it is ready I
> can make an announcement here.
> Best,
> Marcin
>
> On 22.01.2016 at 16:26, Graham Neubig wrote:
>
> Dear Moses Mailing List,
>
> This is not directly related to Moses, but I was wondering if there are
> any high-quality, multi-lingually sentence aligned corpora available (i.e.
> 3 or more languages with aligned sentences). We're aware of the Europarl
> and Bible corpora, but Europarl only covers European languages, and the
> Bible corpus is quite small in MT terms.
>
> TED and MULTI-UN are options, but as far as I know the data is only
> bilingually aligned at the moment, and it can be a bit hard to get a clean
> multi-lingual corpus from them. If anyone has any experience with this, or
> resource available, I'd love some info.
>
> Thanks in advance,
> Graham
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 2
Date: Fri, 22 Jan 2016 18:50:14 +0000
From: Jörg Tiedemann <Jorg.Tiedemann@lingfil.uu.se>
Subject: Re: [Moses-support] Multilingually Sentence-Aligned Corpora
To: "neubig@is.naist.jp" <neubig@is.naist.jp>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <2C9E7271-170F-4201-B8C2-AE014A071FAB@lingfil.uu.se>
Content-Type: text/plain; charset="utf-8"


The DGT translation memories are truly multilingually aligned:
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Otherwise, OPUS has several multilingual corpora, although they are all bilingually aligned. In most cases it is quite straightforward to combine the sentence alignments to extract multilingual corpora. I recommend using the standoff annotation of the sentence alignments in OPUS (the ces files). I have a simple script that does that:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2multi
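
A minimal sketch of the idea of combining bilingual sentence alignments through a shared pivot language. This is not the opus2multi script and may differ from what it does. It assumes each link is given as an xtargets-style string ("source ids;target ids", ids separated by spaces), that every bitext has already been oriented so the pivot language is on the source side, and that only pivot units segmented identically in all bitexts are kept. The names parse_xtargets and combine_via_pivot are hypothetical.

```python
from typing import Dict, List, Tuple

Ids = Tuple[str, ...]
Link = Tuple[Ids, Ids]


def parse_xtargets(value: str) -> Link:
    """Split a value such as "s1 s2;t1" into (source ids, target ids)."""
    source, _, target = value.partition(";")
    return tuple(source.split()), tuple(target.split())


def combine_via_pivot(links_by_lang: Dict[str, List[Link]]) -> List[Dict[str, Ids]]:
    """Join several pivot-to-X bitexts on identical pivot-side units."""
    if not links_by_lang:
        return []
    # Index every bitext by its pivot-side ids, skipping empty (unaligned) links.
    indexed = {
        lang: {pivot: target for pivot, target in links if pivot and target}
        for lang, links in links_by_lang.items()
    }
    first = next(iter(indexed.values()))
    rows = []
    for pivot in first:  # dicts keep insertion order, so document order is kept
        if all(pivot in index for index in indexed.values()):
            row = {"pivot": pivot}
            row.update({lang: index[pivot] for lang, index in indexed.items()})
            rows.append(row)
    return rows


# Toy data (assumed): English as pivot, aligned to French and to German.
en_fr = [parse_xtargets(x) for x in ["1;1", "2 3;2", "4;3", "5;"]]
en_de = [parse_xtargets(x) for x in ["1;1 2", "2;3", "4;4", "5;5"]]
multi = combine_via_pivot({"fr": en_fr, "de": en_de})
```

Requiring identical pivot units is the strictest choice: the English unit "2 3" is dropped here because the German bitext aligns sentence 2 on its own. A more permissive tool could merge overlapping units instead, at the cost of longer segments.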

After that you could use this script to convert to Moses format if you like:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2moses
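
For illustration, a sketch of the plain-text layout Moses works with: one file per language, one segment per line, with line N of every file belonging to the same aligned unit. This is not the opus2moses script. The function name write_line_aligned, the file stem "corpus" and the toy sentences are assumptions, and every row is assumed to contain the same set of languages.

```python
import os
import tempfile
from contextlib import ExitStack
from typing import Dict, List


def write_line_aligned(rows: List[Dict[str, str]], out_dir: str,
                       stem: str = "corpus") -> Dict[str, str]:
    """Write one file per language; line N of each file is the same unit."""
    langs = sorted(rows[0]) if rows else []
    paths = {lang: os.path.join(out_dir, f"{stem}.{lang}") for lang in langs}
    with ExitStack() as stack:
        handles = {
            lang: stack.enter_context(open(path, "w", encoding="utf-8"))
            for lang, path in paths.items()
        }
        for row in rows:
            for lang in langs:
                # Collapse whitespace and newlines so a segment stays on one line.
                handles[lang].write(" ".join(row[lang].split()) + "\n")
    return paths


# Small demonstration in a temporary directory (toy data, assumed).
rows = [
    {"en": "Hello  world .", "fr": "Bonjour le monde ."},
    {"en": "Good bye .", "fr": "Au\nrevoir ."},
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_line_aligned(rows, tmp)
    lines = {}
    for lang, path in paths.items():
        with open(path, encoding="utf-8") as handle:
            lines[lang] = handle.read().splitlines()
```

Collapsing internal newlines matters because a stray line break inside one segment would shift every following line in that file and silently break the alignment.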

The biggest multilingual corpus would be OpenSubtitles:
http://opus.lingfil.uu.se/OpenSubtitles2016.php

But here the problem is that it's not always the same subtitle version that is aligned for each language pair. I have recently compiled intra-lingual alignments between different subtitle variants. Those links could be used to map transitively to other languages, but for this you would have to implement your own tool. The intra-lingual links are here:
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php
(in the column "all")

Coverage varies a lot between languages. You should select languages for which you expect good coverage of the same movies and their subtitles. It's a bit complicated but should be doable.
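
A rough sketch of the transitive mapping described above, assuming the simplest case of one-to-one links stored as dictionaries. The French file is aligned to one English subtitle variant (A), the German file to another (B), and the intra-lingual links connect A and B. All ids and the names invert and compose are made up for illustration; a real tool would also have to handle many-to-many links and time-based mismatches.

```python
from typing import Dict


def invert(links: Dict[str, str]) -> Dict[str, str]:
    """Reverse a one-to-one alignment map."""
    return {target: source for source, target in links.items()}


def compose(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Chain two alignment maps, keeping only units that link all the way."""
    return {a: second[b] for a, b in first.items() if b in second}


# Toy alignments (assumed): English variant A to French, variant B to German,
# plus intra-lingual links between the two English variants.
en_a_to_fr = {"enA1": "fr1", "enA2": "fr2", "enA3": "fr3"}
en_a_to_en_b = {"enA1": "enB1", "enA3": "enB2"}
en_b_to_de = {"enB1": "de1", "enB2": "de5", "enB3": "de6"}

# French -> English A -> English B -> German
fr_to_de = compose(compose(invert(en_a_to_fr), en_a_to_en_b), en_b_to_de)
```

Every hop can only lose units, so the number of surviving links shrinks with each composition. This is why the choice of languages with good coverage of the same movies matters.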


All the best and good luck,
Jörg

**********************************************************************************
Jörg Tiedemann
Department of Modern Languages http://www.helsinki.fi/~tiedeman/
University of Helsinki

On 22 Jan 2016, at 17:55, Lane Schwartz <dowobeha@gmail.com> wrote:

Marcin,

That sounds great! Yes, please do make an announcement. I would definitely make use of such a multi-aligned corpus.

Lane



--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 111, Issue 68
**********************************************
