Moses-support Digest, Vol 120, Issue 6

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: News monolingual corpus question (Barry Haddow)
2. Re: News monolingual corpus question (Vincent Nguyen)
3. Re: News monolingual corpus question (Barry Haddow)
4. Re: News monolingual corpus question (Vincent Nguyen)
5. Age feature (Marwa Refaie)


----------------------------------------------------------------------

Message: 1
Date: Tue, 4 Oct 2016 20:46:28 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] News monolingual corpus question
To: Vincent Nguyen <vnguyen@neuf.fr>, moses-support
<moses-support@mit.edu>
Message-ID: <c698de25-2b6d-e859-4d6e-c31ee33920f5@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Vincent

Are you comparing compressed with uncompressed files?

cheers - Barry

On 04/10/16 14:40, Vincent Nguyen wrote:
> Hi,
>
> On this page:
>
> http://www.statmt.org/wmt11/translation-task.html
>
> in the download section for monolingual data, there is
>
> one big file: http://www.statmt.org/wmt11/training-monolingual.tgz
>
> and separate files, including news crawls per year.
>
> However, when you take a single file for a specific year, it is not the
> same size as the file of the same name in the big download.
>
> Expanded size for the English corpus:
>
> news2008: 4.3GB vs 1.6GB for single download
> news2009: 5.3GB vs 1.8GB for single download
>
> etc...
>
> Can someone please explain the difference?
>
> thanks
>
> Vincent.
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



------------------------------

Message: 2
Date: Tue, 4 Oct 2016 22:20:45 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] News monolingual corpus question
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>, moses-support
<moses-support@mit.edu>
Message-ID: <fb0dc45e-7d00-2125-6e89-bd2878568963@neuf.fr>
Content-Type: text/plain; charset=windows-1252; format=flowed


No, but my mistake: I was comparing with the per-year files from this
link: http://www.statmt.org/wmt15/translation-task.html

What is the difference from the WMT11 files?



On 04/10/2016 at 21:46, Barry Haddow wrote:
> Hi Vincent
>
> Are you comparing compressed with uncompressed files?
>
> cheers - Barry



------------------------------

Message: 3
Date: Tue, 4 Oct 2016 21:24:17 +0100
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] News monolingual corpus question
To: Vincent Nguyen <vnguyen@neuf.fr>, moses-support
<moses-support@mit.edu>
Message-ID: <77b27c0d-0911-fa5c-6bf6-d6c78465dfc4@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Vincent

Could you say exactly which files you are comparing?

cheers - Barry

On 04/10/16 21:20, Vincent Nguyen wrote:
>
> No, but my mistake: I was comparing with the per-year files from this
> link: http://www.statmt.org/wmt15/translation-task.html
>
> What is the difference from the WMT11 files?





------------------------------

Message: 4
Date: Wed, 5 Oct 2016 10:54:58 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] News monolingual corpus question
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>, moses-support
<moses-support@mit.edu>
Message-ID: <4ef26f50-5395-c1aa-52c8-c52a4c5c5927@neuf.fr>
Content-Type: text/plain; charset=windows-1252; format=flowed

Thanks Barry,

Actually I was trying to 1) replicate the 1 billion word benchmark
language model and 2) update these results with more recent data.

So technically this is not going to be very easy with the most recent
version of the data, but, as you say, the WMT11 files were not
deduplicated.

Anyway, I'll figure something out; I asked for clarification because my
word counts were way off.

Thanks.


On 05/10/2016 at 10:46, Barry Haddow wrote:
> Hi Vincent
>
> I think at some point we re-extracted all previous years. One possible
> reason for the difference is that now we are de-duping, and before we
> didn't.
>
> I would say if you want to compare to recent WMT experiments, take the
> most recent version of the data,
>
> cheers - Barry
>
> On 04/10/16 21:34, Vincent Nguyen wrote:
>>
>> OK.
>> This one: http://www.statmt.org/wmt11/training-monolingual.tgz
>> includes (I think)
>> http://www.statmt.org/wmt11/training-monolingual-news-2010.tgz
>> but if I extract news.2010.en.shuffled, it is 2051344 KB unzipped
>> (all of the above is from the WMT11 page).
>>
>> At this link:
>> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.en.shuffled.gz
>>
>> (from the WMT15 page)
>> the unzipped file is 807761 KB.
>>
>> 2010 is just an example; the sizes differ for every year.
>>

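One way to check how much of the size gap between two versions of a file comes from duplicate lines, as suggested in the thread, is to count total and distinct lines (and their words) in each version. The following is a minimal Python sketch under stated assumptions: one sentence per line, UTF-8 text, whitespace tokenization. The names `corpus_stats` and `open_corpus` are illustrative only and are not part of Moses or the WMT tooling.

```python
import gzip
import hashlib
from typing import Iterable, NamedTuple


class CorpusStats(NamedTuple):
    lines: int              # non-empty lines
    unique_lines: int       # distinct non-empty lines
    words: int              # whitespace-separated tokens over all lines
    unique_line_words: int  # tokens counted over distinct lines only


def corpus_stats(lines: Iterable[str]) -> CorpusStats:
    """Count lines and words, with and without exact-duplicate lines."""
    seen = set()
    n_lines = n_unique = n_words = n_unique_words = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue
        tokens = len(text.split())
        n_lines += 1
        n_words += tokens
        # Store a 16-byte digest instead of the line itself to save memory.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            n_unique += 1
            n_unique_words += tokens
    return CorpusStats(n_lines, n_unique, n_words, n_unique_words)


def open_corpus(path: str):
    """Open a plain or gzip-compressed corpus file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")
```

For example, `with open_corpus("news.2010.en.shuffled.gz") as f: print(corpus_stats(f))` could be run on both the WMT11 and the later copy of a file. If the `unique_lines` figure of the older copy is close to the `lines` figure of the newer one, deduplication would account for most of the difference. This only detects exact duplicates; any other change made during re-extraction would not show up.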


------------------------------

Message: 5
Date: Wed, 5 Oct 2016 13:45:37 +0000
From: Marwa Refaie <basmallah@hotmail.com>
Subject: [Moses-support] Age feature
To: Moses <moses-support@mit.edu>
Message-ID:
<HE1PR01MB1164102446CBAE9E65855597BAC40@HE1PR01MB1164.eurprd01.prod.exchangelabs.com>

Content-Type: text/plain; charset="iso-8859-1"

Hi All


Is the "age" feature in the CBMT, implemented in the Bitextsampling (using a suffix array), equivalent to the "Post-edit support" introduced by Denkowski in cdec?


Any info, please!



Marwa N. Refaie

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 120, Issue 6
*********************************************
