Moses-support Digest, Vol 111, Issue 69

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Multilingually Sentence-Aligned Corpora (Jorg Tiedemann)
2. Re: Moses-support post from jasneet.sabharwal@sfu.ca requires
approval (Jasneet Sabharwal)


----------------------------------------------------------------------

Message: 1
Date: Fri, 22 Jan 2016 20:57:52 +0200
From: Jorg Tiedemann <tiedeman@gmail.com>
Subject: Re: [Moses-support] Multilingually Sentence-Aligned Corpora
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <7303E912-943F-4E28-B0E0-1DE656296412@gmail.com>
Content-Type: text/plain; charset="utf-8"


The DGT translation memories are truly multilingually aligned:
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Otherwise, OPUS has several multilingual corpora, although they are all bilingually aligned. In most cases it is quite straightforward to combine the sentence alignments to extract multilingual corpora. I would recommend using the standoff annotation of the sentence alignments in OPUS (ces files). I have a simple script that does that:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2multi

After that you could use this script to convert to Moses format if you like:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2moses
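
The idea of combining bilingual sentence alignments can be sketched roughly as follows. This is only an illustration, not the opus2multi script: it assumes the alignments have already been read into lists of (pivot sentence id, other sentence id) pairs, and all ids and function names are hypothetical. It keeps only 1:1 links and intersects them on a shared pivot language.

```python
from collections import Counter


def one_to_one(links):
    """Keep only links whose source and target ids each occur exactly once."""
    src_counts = Counter(s for s, _ in links)
    tgt_counts = Counter(t for _, t in links)
    return {s: t for s, t in links if src_counts[s] == 1 and tgt_counts[t] == 1}


def combine_via_pivot(*bilingual_links):
    """Intersect several pivot-to-X link lists on the pivot sentence id.

    Expects at least one link list; returns tuples of the form
    (pivot_id, id_in_language_1, id_in_language_2, ...).
    """
    maps = [one_to_one(links) for links in bilingual_links]
    shared = set(maps[0])
    for m in maps[1:]:
        shared &= set(m)
    return [(p, *[m[p] for m in maps]) for p in sorted(shared)]


# Hypothetical sentence ids; English is the pivot language here.
en_fr = [("en1", "fr1"), ("en2", "fr2"), ("en3", "fr3"), ("en3", "fr4")]
en_de = [("en1", "de1"), ("en2", "de3"), ("en4", "de4")]

multi = combine_via_pivot(en_fr, en_de)
for row in multi:
    print(row)
```

Dropping everything that is not a 1:1 link loses data, but it keeps the resulting multilingual rows unambiguous.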

The biggest multilingual corpus would be OpenSubtitles:
http://opus.lingfil.uu.se/OpenSubtitles2016.php

But here the problem is that it's not always the same subtitle version that is aligned for each language pair. I have just recently compiled the intra-lingual alignments for different subtitle variants. Those links could be used to transitively map to other languages again, but for this you have to implement your own tool. Intra-lingual links are here:
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php
(in the column "all")
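
A transitive mapping of this kind could look roughly like the sketch below. It is not an existing OPUS tool: it assumes the links have been reduced to 1:1 dictionaries of sentence ids, and every id and name in it is made up for illustration.

```python
def compose(first, second):
    """Compose two 1:1 link maps: a -> b and b -> c give a -> c."""
    return {a: second[b] for a, b in first.items() if b in second}


# Hypothetical links: French is aligned to English subtitle version A,
# German is aligned to English subtitle version B, and the intra-lingual
# links connect the two English versions.
fr_to_en_a = {"fr1": "enA1", "fr2": "enA2", "fr3": "enA3"}
en_a_to_en_b = {"enA1": "enB1", "enA2": "enB4"}
en_b_to_de = {"enB1": "de1", "enB4": "de5", "enB7": "de9"}

fr_to_de = compose(compose(fr_to_en_a, en_a_to_en_b), en_b_to_de)
print(fr_to_de)
```

Sentences without a link at any step simply drop out, which is one reason the coverage differs so much between languages.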

The coverage is very different for various languages. You should select languages for which you would expect to have a good coverage of the same movies and their subtitles. It's a bit complicated but should be doable.


All the best and good luck,
Jörg

**********************************************************************************
Jörg Tiedemann
Department of Modern Languages http://www.helsinki.fi/~tiedeman/
University of Helsinki

> On 22 Jan 2016, at 17:55, Lane Schwartz <dowobeha@gmail.com> wrote:
>
> Marcin,
>
> That sounds great! Yes, please do make an announcement. I would definitely make use of such a multi-aligned corpus.
>
> Lane
>
>
> On Fri, Jan 22, 2016 at 9:36 AM, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:
> Hi Graham,
> At the UN we are now working to release an official version of our data. As a bonus to the pair-wise alignment, it will contain a 6-way fully aligned subcorpus for English, French, Spanish, Russian, Chinese, Arabic; about 13M segments per language. We are waiting for some LREC feedback and the official greenlight from UN officials, but that should be a matter of a couple of weeks now (maybe one, maybe two, maybe four). Once it is ready I can make an announcement here.
> Best,
> Marcin
>
> On 22.01.2016 at 16:26, Graham Neubig wrote:
>> Dear Moses Mailing List,
>>
>> This is not directly related to Moses, but I was wondering if there are any high-quality, multi-lingually sentence aligned corpora available (i.e. 3 or more languages with aligned sentences). We're aware of the Europarl and Bible corpora, but Europarl only covers European languages, and the Bible corpus is quite small in MT terms.
>>
>> TED and MULTI-UN are options, but as far as I know the data is only bilingually aligned at the moment, and it can be a bit hard to get a clean multi-lingual corpus from them. If anyone has any experience with this, or resource available, I'd love some info.
>>
>> Thanks in advance,
>> Graham
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"

All the best,
Jörg


Jörg Tiedemann
tiedeman@gmail.com


------------------------------

Message: 2
Date: Fri, 22 Jan 2016 21:39:07 -0800
From: Jasneet Sabharwal <jasneet.sabharwal@sfu.ca>
Subject: Re: [Moses-support] Moses-support post from
jasneet.sabharwal@sfu.ca requires approval
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <6A6DBF2C-FFF2-43E0-AF38-BC84105DB3BA@sfu.ca>
Content-Type: text/plain; charset="windows-1252"

Thanks Hieu.

I'm using the Eclipse project for development. I followed your video to set it up and I have linked the srilm and irstlm installations in the root directory of mosesdecoder. I first tried to compile the project, but neither the SRILM nor the IRSTLM LM cpp files get compiled. So, I added LM_IRST and included the "${workspace_loc}/../../irstlm/include" path in the C/C++ Build settings of the project. But I still cannot compile IRST.cpp.

The reason I'm not using the included KenLM is that my new feature function requires an 8-gram language model with Witten-Bell smoothing, which is provided by SRILM. Since IRSTLM can use SRILM-generated language models, I decided to call IRSTLM code inside my feature function to get the score for a phrase.

Any pointers on how I can debug the Eclipse project with IRSTLM/SRILM?

Best,
Jasneet

PS: When I compile the whole project using "./bjam -j4 --with-boost=<absolute path to boost> --with-cmph=<absolute path to cmph> --with-irstlm=<absolute path to irstlm>", it compiles successfully without any errors.


> On Jan 19, 2016, at 4:39 PM, Hieu Hoang <hieuhoang@gmail.com> wrote:
>
> I believe Nadir Durrani's OSM uses KenLM inside it. You can look in
> moses/FF/OSM-Feature
> for tips
>
> On 20/01/16 00:31, Jasneet Sabharwal wrote:
>> Thanks Hieu.
>>
>> One last question. What do you think is the best way to load the SRILM language model inside my custom feature function and to get a score for a string that my feature function created?
>>
>> Best,
>> Jasneet
>>> On Jan 17, 2016, at 3:45 AM, Hieu Hoang <hieuhoang@gmail.com> wrote:
>>>
>>>
>>>
>>> On 17/01/16 04:05, Jasneet Sabharwal wrote:
>>>> Thanks Hieu,
>>>>
>>>> I had subscribed to the mailing list and I'm getting the digest, but I'm not sure why my email went for your approval. When I get the alignments from GetAlignTerm(), is the index of the source word relative to the phrase? To get the index in the source sentence, I'm assuming that I would need to get the starting position of the source words from CurrSourceWordsRange().GetStartPos() of the current hypothesis and offset the source alignment index by that value?
>>> yep. And to get the index in the target sentence, use GetCurrTargetWordsRange().GetStartPos()
>>>>
>>>> Regards,
>>>> Jasneet
>>>>> On Jan 15, 2016, at 3:43 AM, Hieu Hoang <hieuhoang@gmail.com> wrote:
>>>>>
>>>>> please subscribe to the Moses mailing list before posting to it. You can subscribe here:
>>>>> http://mailman.mit.edu/mailman/admin/moses-support
>>>>> To answer your question - the target phrase has a method called
>>>>> GetAlignTerm()
>>>>> that contains the alignment for terminals. This comes from the phrase-table, and ultimately from the word alignment.
>>>>>
>>>>> -------- Forwarded Message --------
>>>>> Subject: Moses-support post from jasneet.sabharwal@sfu.ca requires approval
>>>>> Date: Wed, 13 Jan 2016 23:36:50 -0500
>>>>> From: moses-support-owner@mit.edu
>>>>> To: moses-support-owner@mit.edu
>>>>>
>>>>> As list administrator, your authorization is requested for the
>>>>> following mailing list posting:
>>>>>
>>>>> List: Moses-support@mit.edu
>>>>> From: jasneet.sabharwal@sfu.ca
>>>>> Subject: Getting alignments for current hypothesis in phrase based model
>>>>> Reason: Post by non-member to a members-only list
>>>>>
>>>>> At your convenience, visit:
>>>>>
>>>>> http://mailman.mit.edu/mailman/admindb/moses-support
>>>>>
>>>>> to approve or deny the request.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Hieu Hoang
>>> http://www.hoang.co.uk/hieu
>
> --
> Hieu Hoang
> http://www.hoang.co.uk/hieu

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 111, Issue 69
**********************************************
