Moses-support Digest, Vol 88, Issue 67

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Moses training performance (Andrzej Zydron)
2. Re: Binarising the phrase table (Massinissa Ahmim)
3. Tokenisation issue - following the implementation baseline
(Janez Kadivec)

----------------------------------------------------------------------

Message: 1
Date: Fri, 28 Feb 2014 08:38:58 +0000
From: Andrzej Zydron <azydron@xtm-intl.com>
Subject: Re: [Moses-support] Moses training performance
To: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <53104B22.902@xtm-intl.com>
Content-Type: text/plain; charset=ISO-8859-2; format=flowed

Hi Barry,

Many thanks for your reply. All the operations on the Xeon are on
RAMDISK, including the corpus, training and temp directories. The Xeon
is totally dedicated to Moses: there is absolutely nothing else running
on the server. It looks like the i7 3720QM has a better overall
architecture. It certainly is very fast for a laptop. After a year of
use of my MacBook Pro Retina I am still amazed at how good it is.

Tuning, though, takes twice as long on the Mac: here the extra
cores/threads on the Xeon come into their own. Admittedly my training set
is on the small side at 9,300 segments, and with much bigger sets the
small amount of RAM on the MacBook Pro will certainly slow things down
a lot, but for development and quick turnaround on small test data sets
it certainly is good.

Best Regards,

Andrzej Zydroń

---------------------------------------

CTO

*XTM International Ltd.*

PO Box 2167, Gerrards Cross, SL9 8XF, UK

email: azydron@xtm-intl.com <mailto:azydron@xtm-intl.com>

Tel: +44 (0) 1753 480 479

Mob: +44 (0) 7966 477 181

skype: Zydron

www.xtm-intl.com <http://www.xtm-intl.com/>

On 27/02/2014 20:43, Barry Haddow wrote:
> Hi Andrzej
>
> Whilst mgiza is the time hog in the training, I find it surprising
> that score takes 10 seconds on the mac and nearly 2 minutes on the
> xeon. Most of its work is sorting and reading and writing compressed
> files. I wonder if there is some difference in the sort? Is it using
> disk on the xeon, and doing everything in ram on the mac? Is it using
> a temporary directory outside the ram disk - although I think it
> should put its tmp directory inside the Moses training directory.
>
> cheers - Barry
>
>
> On 27/02/14 20:05, Andrzej Zydron wrote:
>> Hi Hieu, Barry and Marcin,
>>
>> Thank you for your replies and suggestions.
>>
>> The Xeon server is completely dedicated to Moses and is running
>> absolutely nothing else, as opposed to my Mac which is running the usual
>> laptop background stuff like mail etc., as well as having Eclipse doing
>> various Java stuff in the background.
>>
>> I re-ran the tests as Barry advised with only 4 cores and the results
>> were
>>
>> training: 41:56
>> tuning: 28:16
>> decoding: 01:36
>>
>> Total 01:08:17
>>
>> Therefore 18 minutes slower than the best time on the Xeon with 6 cores
>> (50:01 minutes).
>>
>> Regarding Marcin's suggestion, here are the individual moses-training
>> process' timings:
>>
>> MacBook Pro, 4 threads, i7 3720QM, 8GB RAM, SSD
>>
>> Step                Start     End       Elapsed
>> mkls                17:39:16  17:39:49  00:00:33
>> snt2cooc.out        17:39:50  17:39:52  00:00:02
>> mgiza               17:39:52  17:45:47  00:05:55
>> snt2cooc.out        17:45:47  17:45:50  00:00:03
>> mgiza               17:45:50  17:53:20  00:07:30
>> giza2bal.pl         17:53:21  17:53:23  00:00:02
>> extract             17:53:25  17:53:31  00:00:06
>> score               17:53:31  17:53:41  00:00:10
>> lexical-reordering  17:53:41  17:53:45  00:00:04
>>
>> Total                                   00:14:25
>>
>> Xeon E5-1650v2, 6 threads, 128GB RAM, SATA, using 28GB RAMDISK
>>
>> Step                Start     End       Elapsed
>> mkls                19:31:07  19:35:43  00:04:36
>> snt2cooc.out        19:35:43  19:36:05  00:00:22
>> mgiza               19:36:05  19:49:47  00:13:42
>> snt2cooc.out        19:49:47  19:50:12  00:00:25
>> mgiza               19:50:12  20:04:09  00:13:57
>> giza2bal.pl         20:04:09  20:04:31  00:00:22
>> extract             20:04:31  20:05:19  00:00:48
>> score               20:05:19  20:07:00  00:01:41
>> lexical-reordering  20:07:00  20:07:01  00:00:01
>>
>> Total                                   00:35:54
>>
>> As you can see, the culprits are mkls and mgiza.
>
>
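
A minimal sketch for re-checking the per-step totals quoted above. It assumes only that each elapsed time is written as HH:MM:SS; the helper name sum_durations is hypothetical and is not part of Moses.

```shell
# Sum HH:MM:SS durations (one per line on stdin) and print the total as HH:MM:SS.
# sum_durations is a hypothetical helper written for this illustration.
sum_durations() {
  awk -F: '{ total += $1 * 3600 + $2 * 60 + $3 }
           END { printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60 }'
}

# Elapsed times copied from the MacBook Pro table.
mac_total=$(printf '%s\n' 00:00:33 00:00:02 00:05:55 00:00:03 00:07:30 \
                          00:00:02 00:00:06 00:00:10 00:00:04 | sum_durations)

# Elapsed times copied from the Xeon table.
xeon_total=$(printf '%s\n' 00:04:36 00:00:22 00:13:42 00:00:25 00:13:57 \
                           00:00:22 00:00:48 00:01:41 00:00:01 | sum_durations)

echo "MacBook Pro total: $mac_total"
echo "Xeon total:        $xeon_total"
```

Both sums agree with the totals reported in the tables (00:14:25 and 00:35:54).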

------------------------------

Message: 2
Date: Fri, 28 Feb 2014 09:55:14 +0100
From: Massinissa Ahmim <massinissa.ahmim@linguacustodia.com>
Subject: Re: [Moses-support] Binarising the phrase table
To: Per Tunedal <per.tunedal@operamail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CANN0mWbMXJ600Jcg-9LuvK4UB73MX9s9Rpxqf-KKo_iZgohZ1Q@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi Per,

Could you please paste here the command you ran to binarise the
phrase table?

The paths in your moses.ini should point to the output of
processPhraseTable. For instance, if the output was
/home/per/working/train/model/whatever/phrase-table/
then replace your current path with this one, making sure to use
the stem (taking the .gz off, as I did above) and to switch the first
parameter on the line (from 0 to 1), as follows:

1 0 0 5 /home/per/working/train/model/whatever/phrase-table/

Same thing with the reordering table
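
A minimal sketch of the edit described above, applied to a throwaway copy rather than a real moses.ini. It assumes the older [ttable-file] layout quoted in this thread (five space-separated fields, with the path last) and reuses the example path from this message. The variable names ini and bin_prefix are hypothetical.

```shell
# Throwaway file mirroring the [ttable-file] entry quoted in this thread.
ini=$(mktemp)
cat > "$ini" <<'EOF'
[ttable-file]
0 0 0 5 /home/per/working/train/model/phrase-table.gz
EOF

# Example path from this message; replace it with your own binarised output.
bin_prefix=/home/per/working/train/model/whatever/phrase-table/

# Inside the [ttable-file] section only, set the first field to 1 and
# swap the last field for the binarised path. Other lines pass through.
awk -v prefix="$bin_prefix" '
  /^\[/ { in_ttable = ($0 == "[ttable-file]") }
  in_ttable && NF == 5 { $1 = 1; $5 = prefix }
  { print }
' "$ini" > "$ini.new"

cat "$ini.new"
```

Writing to a separate file keeps the original moses.ini intact until the result has been checked by eye.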

Regards

Massinissa

2014-02-28 7:47 GMT+01:00 Per Tunedal <per.tunedal@operamail.com>:

>
> Hi,
> I tried to binarise the phrase table and got into trouble.
>
> 1. Error messages, as below. What's that?
> distinct source phrases: 439511
> distinct first words of source phrases: 67600
> number of phrase pairs (line count): 940438
> Count of lines with missing alignments: 0/940438
> WARNING: there are src voc entries with no phrase translation: count 2168
> There exists phrase translations for 65432 entries
>
> 2. Modify the moses.ini file. I've found this on the Moses/Baseline
> page:
> 1. Change PhraseDictionaryMemory to PhraseDictionaryBinary
> 2. Set the path of the PhraseDictionary feature to point to
> $HOME/working/train/binarised-model/Kryptering1.sv-fr.phrase-table
> 3. Set the path of the LexicalReordering feature to point to
> $HOME/working/train/binarised-model/Kryptering1.sv-fr.reordering-table
>
> But I cannot find any such entries in my moses.ini - maybe because I'm
> running a somewhat older version of Moses. I've found e.g. the following
> lines in my ini-file:
>
> [ttable-file]
> 0 0 0 5 /home/per/working/train/model/phrase-table.gz
>
> # distortion (reordering) files
> [distortion-file]
> 0-0 wbe-msd-bidirectional-fe-allff 6
> /home/per/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz
>
> How should I change the entries to use my binarised model?
>
> Yours,
> Per Tunedal

--

*The Translation Trustee*

*1, Place Charles de Gaulle*

*78180 Montigny-le-Bretonneux*

*Tel : +33 1 30 44 04 23 Mobile : +33 7 61 44 40 84*

*Email :* *massinissa.ahmim@linguacustodia.com
<massinissa.ahmim@linguacustodia.com>*

*Website :* *www.linguacustodia.com <http://www.linguacustodia.com/> -
www.thetranslationtrustee.com <http://www.thetranslationtrustee.com>*

------------------------------

Message: 3
Date: Fri, 28 Feb 2014 10:09:37 +0100
From: Janez Kadivec <jankad@zop-cr.com>
Subject: [Moses-support] Tokenisation issue - following the
implementation baseline
To: moses-support@mit.edu
Message-ID:
<CA+viJscmeiPCjtUK+dGY5hQ1z2nKfJ_CuAwJXxhL3SB1UJT=Pg@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hello,

I have long experience in translation, but I'm new to Moses machine
translation. I'd like to test the functionality of statistical
machine translation with Moses.

We have a working 32-bit Ubuntu 10.04 virtual machine and would like to
test Moses by creating a simple model to start with.
-------------------------------------------

We are following the instructions at
http://www.statmt.org/moses/?n=Moses.Baseline.

We installed working Moses decoder binaries (we didn't compile them from
the source files, because there were too many errors during compilation),
which worked fine on a prepared sample model. We installed the prebuilt
binaries from your website.

OK: We downloaded the sample models and extracted them into our working
directory by entering the following commands:

cd ~/mosesdecoder
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models

We ran the decoder

cd ~/mosesdecoder/sample-models
~/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

OK: Everything worked out right, so the sentence "das ist ein kleines haus"
(in the file in) was translated as "it is a small house" (in the file out).

....................
We wanted to move to the next step in the baseline by downloading and
preparing the sample corpus in different languages.

We did the Corpus Preparation

OK: To train a translation system we need parallel data (text translated
into two different languages) which is aligned at the sentence level.
Luckily there's plenty of this data freely available, and for this system
I'm going to use a small (only 130,000 sentences!) data set released for
the 2013 Workshop in Machine Translation. To get the data we want, we have
to download the tarball and unpack it (into a corpus directory in our home
directory) as follows

cd
mkdir corpus
cd corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz

OK: The files are there.

The next step is tokenisation. We entered the following command for
tokenisation of the English file:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en \
> ~/corpus/news-commentary-v8.fr-en.tok.en

NOK: The result on the screen was the following:

Tokenizer version: 1.1
Language: en
Number of threads: en

If I understand correctly, the tokenizer should create another file in the
corpus folder. But there is no such file.

We also entered the command for the French file:

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr \
> ~/corpus/news-commentary-v8.fr-en.tok.fr

The execution of this command "hangs" with no results. It doesn't display
"Tokenizer version: 1.1" and the other information as it did for the
English language. This seems pretty unusual to me.
This command should also create a tokenised file in the corpus folder,
but it doesn't.

What are we doing wrong? Why is the output file not in the corpus folder?
Why doesn't the tokenisation of the French file finish?

Thank you for your help and support.
Janez
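
A minimal sketch of one way to sanity-check the redirection used in the tokenisation commands above. It does not run the tokenizer: tr is only a stand-in filter, and the temporary file names are hypothetical. The point it illustrates is general shell behaviour: the `>` target is created before the command starts, and a backslash only continues a command when it is the very last character on the line.

```shell
# Stand-in demo: tr plays the role of tokenizer.perl. The paths are temporary
# and hypothetical, not the ones from the message.
workdir=$(mktemp -d)
printf 'Hello, world.\nSecond line here.\n' > "$workdir/sample.en"

# Same layout as the tokenizer call: input via <, output via >.
# Nothing, not even a space, may follow each trailing backslash.
tr '[:upper:]' '[:lower:]' \
  < "$workdir/sample.en" \
  > "$workdir/sample.tok.en"

# The shell creates the > target before the command starts, so the file
# should exist even if the filter were to write nothing.
ls -l "$workdir/sample.tok.en"

# A line-by-line filter should leave the line count unchanged.
in_lines=$(( $(wc -l < "$workdir/sample.en") ))
out_lines=$(( $(wc -l < "$workdir/sample.tok.en") ))
echo "input lines: $in_lines, output lines: $out_lines"
```

The same two checks (ls -l on the output path and comparing wc -l for input and output) can be run on the real files in ~/corpus once the tokenizer has finished.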

------------------------------

End of Moses-support Digest, Vol 88, Issue 67
*********************************************