Moses-support Digest, Vol 85, Issue 17

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: 10 years of OPUS (Jörg Tiedemann)
2. Re: 10 years of OPUS (Kenneth Heafield)


----------------------------------------------------------------------

Message: 1
Date: Wed, 6 Nov 2013 18:09:50 +0100
From: Jörg Tiedemann <jorg.tiedemann@lingfil.uu.se>
Subject: Re: [Moses-support] 10 years of OPUS
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: "<moses-support@mit.edu>" <moses-support@mit.edu>
Message-ID: <C8B180DF-2866-436C-A3A8-45241E9C9783@lingfil.uu.se>
Content-Type: text/plain; charset="utf-8"


Extracting lexicalized reordering tables shouldn't be a problem. Data files and word alignment are given. Note also that phrase tables exist only for one direction. For the other direction you will need to swap phrases, scores and word alignments. But this is easy as well.
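
As a rough illustration of the swap described above, here is a minimal Python sketch that inverts one phrase-table line. It assumes Moses-style ` ||| `-separated fields with four scores ordered as inverse phrase probability, inverse lexical weight, direct phrase probability, direct lexical weight; that the optional fourth field holds `i-j` alignment points; and that the first two numbers in an optional counts field are per-side counts. The function name `swap_direction` is hypothetical, and the layout should be checked against the actual OPUS files before relying on it.

```python
def swap_direction(line: str) -> str:
    """Invert one phrase-table line (assumed Moses-style ' ||| ' fields)."""
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    src, tgt, scores = fields[0], fields[1], fields[2].split()

    # Assumed score layout: p(f|e) lex(f|e) p(e|f) lex(e|f) [extras...].
    # Changing direction exchanges the inverse pair with the direct pair.
    swapped = scores[2:4] + scores[0:2] + scores[4:]
    out = [tgt, src, " ".join(swapped)]

    if len(fields) > 3:
        # Alignment points "i-j" become "j-i", re-sorted by the new source index.
        points = [tuple(int(x) for x in p.split("-")) for p in fields[3].split()]
        flipped = sorted((j, i) for i, j in points)
        out.append(" ".join(f"{i}-{j}" for i, j in flipped))

    if len(fields) > 4:
        # Assumption: the first two counts are per-side and swap; the rest stay.
        counts = fields[4].split()
        out.append(" ".join(counts[:2][::-1] + counts[2:]))

    return " ||| ".join(out)


if __name__ == "__main__":
    print(swap_direction("a b ||| x y z ||| 0.5 0.4 0.6 0.3 ||| 0-0 1-2 ||| 10 12 8"))
```

After swapping every line, the table would usually also need to be re-sorted by the new source phrase before further processing.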

Tokenization is a bit more tricky. In some cases I use TreeTagger for pre-processing to get better tagging results. Otherwise, it's mainly the Moses tokenizer with some additional exceptions and smaller changes. Good point: I should document this more carefully.

The question remains: should I leave the alignment files online, or are they not useful to anyone anyway?

Jörg




On 6 nov 2013, at 16:40, Tom Hoar <tahoar@precisiontranslationtools.com> wrote:

> You make some good points, Jörg.
>
> I think the word alignments and phrase tables would be more useful if you documented which tokenizers and other pre/post-processing tools/chains, and which versions, were used for each pair. People who wish to augment the alignments and/or tables with their own corpora, for incremental training or new language models, will need to re-create the preparation toolchains. Tools change from time to time, even the Moses tokenizer.perl script.
>
> It looks like the reordering tables are missing from your collection. How useful are the phrase tables without them? For basic phrase-based mode, don't users still need to run the step to create the reordering tables?
>
>
>
> On 11/06/2013 08:05 PM, joerg wrote:
>>
>> This is a good question. I was also wondering about this and, as I wrote, I started providing word alignments and even phrase tables for the data collected in OPUS. The idea is, of course, that people could skip running GIZA++ with standard settings over and over again on Europarl data and other standard sets. For example, you can now download word alignments for all language pairs for Europarl v7 from OPUS:
>> http://opus.lingfil.uu.se/Europarl/wordalign/
>>
>> This, of course, assumes that you're happy with my tokenization and other kinds of pre-processing that may have influenced the data. In some cases you may also want to do other things like compound splitting and preordering which would require a different alignment model.
>>
>> Those folders also contain phrase tables extracted with standard settings from the grow-diag-final-and alignments. This is maybe less interesting, as it assumes that you work with phrase-based SMT and with the truecased data I provide, etc. The truecaser models are, of course, also available (for example, for Europarl in http://opus.lingfil.uu.se/Europarl/wordalign/truecaser/). Monolingual data files with the same tokenization are also available (Europarl: http://opus.lingfil.uu.se/Europarl/mono/).
>>
>>
>> As all of this takes a lot of space, I would actually like to know whether
>> - I should keep those files online
>> - I could remove some of them (for example, phrase tables, which take up most of the disk space)
>> - anyone is actually interested in using those models and alignments (and which ones)
>>
>>
>> Thank you for your feedback!
>>
>> Jörg
>>
>>
>> **********************************************************************************
>> Jörg Tiedemann http://stp.lingfil.uu.se/~joerg/
>>
>>
>>
>> On Nov 6, 2013, at 8:48 AM, Read, James C wrote:
>>
>>> I wonder if there would be a demand for ready made phrase tables generated from the data.
>>> ________________________________________
>>> From: moses-support-bounces@mit.edu [moses-support-bounces@mit.edu] on behalf of Jorg Tiedemann [tiedeman@gmail.com]
>>> Sent: 02 November 2013 18:54
>>> To: moses-support@MIT.EDU
>>> Subject: [Moses-support] 10 years of OPUS
>>>
>>> After attending the 20-years-of-bitext workshop at EMNLP I suddenly realized that OPUS (http://opus.lingfil.uu.se) also has its 10-year anniversary this year (send me some champagne if you like). I will celebrate this anniversary by sending out this e-mail with some recent news and highlights.
>>>
>>> OPUS is a growing collection of parallel corpora for many languages and various domains. The collection has become pretty big and includes a variety of data sets and tools that are useful not only for statistical machine translation. OPUS has been extended a lot since its first appearance in 2003. Actually, the best birthday present would be if someone decided to start a mirror of OPUS. Let me know if you are interested.
>>>
>>>
>>> Here are some of the highlights:
>>>
>>> - over 150 languages and language variants
>>> - over 5 billion aligned translation units
>>> - downloads in XML/XCES, plain text (Moses/SMT) and TMX
>>> - raw, tokenized and machine-annotated data
>>> - monolingual data sets (for language modeling)
>>> - search interfaces
>>>
>>>
>>> Some recent news and data sets:
>>>
>>> - EUbookshop: a large but noisy corpus (converted from PDF)
>>> - Tatoeba: a small but clean corpus with many languages
>>> - OpenSubtitles2012: an improved version of the 2011 version
>>> - coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
>>> - UN, MultiUN, Europarl v7: aligned for all language combinations
>>> - word alignments and phrase tables for the majority of bitexts
>>>
>>>
>>> The Web Site: http://opus.lingfil.uu.se
>>> More information: http://opus.lingfil.uu.se/trac/wiki
>>>
>>> Feedback is very welcome!
>>> And, be nice to our server!
>>>
>>>
>>> Jörg Tiedemann
>>> tiedeman@gmail.com
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 2
Date: Wed, 06 Nov 2013 09:35:48 -0800
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] 10 years of OPUS
To: moses-support@MIT.EDU
Message-ID: <527A7DF4.8000905@kheafield.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi,

Multiple columns won't really work because the set of phrase pairs will
be different. You could of course take the union of phrase pairs and
just have null values for inapplicable phrases, but it's not clear how
much compression you'd get.
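
To make the trade-off concrete, here is a minimal Python sketch of the union approach: each table is modeled as a plain dict from a (source, target) phrase pair to a single score, the merged table gets one column per input table, and missing entries become `None`. The names `union_tables` and `null_fraction` and the toy scores are hypothetical; real phrase tables carry several scores per pair and would not normally fit in memory like this.

```python
from typing import Dict, List, Optional, Tuple

PhrasePair = Tuple[str, str]
Table = Dict[PhrasePair, float]
Merged = Dict[PhrasePair, List[Optional[float]]]


def union_tables(tables: List[Table]) -> Merged:
    """Merge tables into one column per table, with None for missing pairs."""
    all_pairs = sorted(set().union(*(t.keys() for t in tables)))
    return {pair: [t.get(pair) for t in tables] for pair in all_pairs}


def null_fraction(merged: Merged) -> float:
    """Share of cells in the merged table that are null (0.0 if empty)."""
    cells = sum(len(row) for row in merged.values())
    nulls = sum(value is None for row in merged.values() for value in row)
    return nulls / cells if cells else 0.0


if __name__ == "__main__":
    # Toy tables standing in for two domains (illustrative values only).
    news = {("haus", "house"): 0.7, ("haus", "home"): 0.3}
    subtitles = {("haus", "house"): 0.9}
    merged = union_tables([news, subtitles])
    for pair, row in merged.items():
        print(pair, row)
    print("null fraction:", null_fraction(merged))
```

The larger the null fraction, the less the tables overlap and the less a single multi-column table would save over keeping the tables separate.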

Kenneth

On 11/06/13 06:21, Read, James C wrote:
> So here's a random crazy idea I had lately. A phrase table could have multiple columns giving different scores for different probabilities from different alignments, different corpora, different domains etc. Recent work at Edinburgh, Cambridge and Sheffield has had some emphasis on adaptation of models for speech recognition purposes. I guess a similar principle could be applied to SMT. Given a text from some unknown domain the engine could perform some automated recognition test to guess which translation model best fits the text to be translated. A primitive form of automatic domain recognition and adaptation if you like.
>
> I guess even making available multiple forms of a phrase table or a single compact version with multiple columns for scoring could even have some demand in the future.
>
> James


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 85, Issue 17
*********************************************
