Moses-support Digest, Vol 120, Issue 5

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: differences between moses and moses2 output (Hieu Hoang)
2. News monolingual corpus question (Vincent Nguyen)


----------------------------------------------------------------------

Message: 1
Date: Tue, 4 Oct 2016 10:44:15 +0100
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] differences between moses and moses2
output
To: Vito Mandorino <vito.mandorino@linguacustodia.com>, moses-support
<moses-support@mit.edu>
Message-ID: <e4a17115-b9bd-931f-6848-2d28ed0b1fdd@gmail.com>
Content-Type: text/plain; charset="utf-8"

Yes, the script expects the files to be gzipped.

It runs OK for me. I executed this:

MOSES_DIR=~/workspace/github/mosesdecoder.perf

$MOSES_DIR/scripts/generic/binarize4moses2.perl \
    --phrase-table=phrase-table.gz \
    --lex-ro=reordering-table.wbe-msd-bidirectional-fe.gz \
    --output-dir=integrated_phrase-reordering/ --num-lex-scores=6

Got this:

Executing: gzip -dc phrase-table.gz |
/home/hieu/workspace/github/mosesdecoder.perf/scripts/generic/../../contrib/sigtest-filter/filter-pt
-n 0 | gzip -c > ./tmp.14373/pt.gz
...
Reading phrase table finished, writing remaining files to disk.

$ ll integrated_phrase-reordering/
total 24688
drwxrwxr-x 2 hieu hieu 4096 Oct 4 10:38 ./
drwxrwxr-x 5 hieu hieu 4096 Oct 4 10:42 ../
-rw-rw-r-- 1 hieu hieu 917861 Oct 4 10:42 Alignments.dat
-rw-rw-r-- 1 hieu hieu 2267885 Oct 4 10:42 cache
-rw-rw-r-- 1 hieu hieu 76 Oct 4 10:42 config
-rw-rw-r-- 1 hieu hieu 3146720 Oct 4 10:42 probing_hash.dat
-rw-rw-r-- 1 hieu hieu 333856 Oct 4 10:42 source_vocabids
-rw-rw-r-- 1 hieu hieu 18429920 Oct 4 10:42 TargetColl.dat
-rw-rw-r-- 1 hieu hieu 121401 Oct 4 10:42 TargetVocab.dat
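
For tables that are still uncompressed, as in the quoted command below, a minimal sketch of the fix is to gzip them first and pass the .gz paths to the script. The helper name ensure_gzipped is hypothetical and not part of Moses; the paths in the commented usage are the ones from the quoted message.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moses): print the path of a gzipped copy
# of the given table, compressing it first if it is not already a .gz file.
# The original uncompressed file is left in place.
ensure_gzipped() {
    case "$1" in
        *.gz) printf '%s\n' "$1" ;;
        *)    gzip -c "$1" > "$1.gz" && printf '%s\n' "$1.gz" ;;
    esac
}

# Example use with the paths from the quoted message (commented out, since
# it needs a Moses checkout and the real tables):
# PT=$(ensure_gzipped /home/vito/phrase-table.sorted)
# RO=$(ensure_gzipped /home/vito/reordering-table.sorted)
# perl /home/Moses/mosesdecoder/scripts/generic/binarize4moses2.perl \
#     --phrase-table="$PT" --lex-ro="$RO" \
#     --output-dir=/home/vito/integrated_phrase-reordering/ --num-lex-scores=6
```

The helper only checks the file extension, so a file that is named .gz but is not actually compressed would still be passed through unchanged.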


On 04/10/2016 09:06, Vito Mandorino wrote:
> The command was
>
> perl /home/Moses/mosesdecoder/scripts/generic/binarize4moses2.perl
> --phrase-table=/home/vito/phrase-table.sorted
> --lex-ro=/home/vito/reordering-table.sorted
> --output-dir=/home/vito/integrated_phrase-reordering/ --num-lex-scores=6
>
> The tables in the command are sorted with LC_ALL . I attach them in
> .gz format. Should one use the .gz format also in the command above?
>
> Vito


------------------------------

Message: 2
Date: Tue, 4 Oct 2016 15:40:52 +0200
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: [Moses-support] News monolingual corpus question
To: moses-support <moses-support@mit.edu>
Message-ID: <f446965d-bab8-21b6-3b41-e8b1a4698155@neuf.fr>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi,

On this page:

http://www.statmt.org/wmt11/translation-task.html

the download section for monolingual data offers one big file:

http://www.statmt.org/wmt11/training-monolingual.tgz

and separate files, including the news crawls per year.

However, a single file for a specific year is not the same size as the
file with the same name in the big download.

Expanded sizes for the English corpus (big download vs single download):

news2008: 4.3GB vs 1.6GB for single download
news2009: 5.3GB vs 1.8GB for single download

etc...

Can someone please explain the difference?

Thanks,

Vincent.
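
One way to narrow down a size difference like this is to compare the total and distinct line counts of the two expanded copies. This is only a diagnostic sketch, not an explanation from the list: the helper name line_stats is hypothetical, and the file names in the commented usage are placeholders for wherever the two copies were unpacked.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: print "<total lines> <distinct lines>"
# for a plain-text (already expanded) corpus file.
line_stats() {
    total=$(wc -l < "$1" | tr -d ' ')
    # LC_ALL=C gives a byte-wise sort, which is faster and locale-independent.
    distinct=$(LC_ALL=C sort -u "$1" | wc -l | tr -d ' ')
    printf '%s %s\n' "$total" "$distinct"
}

# Placeholder paths; adjust to the actual unpacked locations:
# line_stats big-download/news.2008.en
# line_stats single-download/news.2008.en
```

If the two copies report similar distinct counts but very different totals, the difference is mostly repeated lines; if both counts differ, the files contain different text.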




------------------------------



End of Moses-support Digest, Vol 120, Issue 5
*********************************************
