Moses-support Digest, Vol 99, Issue 34

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Legacy tokenizer.perl functionality. (Ondrej Bojar)
2. Fuzzy Match Rule segfaults (Jon Olds)
3. MGIZA is slower than GIZA (Li Xiang)
4. Re: Legacy tokenizer.perl functionality. (Barry Haddow)


----------------------------------------------------------------------

Message: 1
Date: Fri, 16 Jan 2015 16:07:04 +0100 (CET)
From: Ondrej Bojar <bojar@ufal.mff.cuni.cz>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Christian Hardmeier <ch@rax.ch>
Cc: Tom Hoar <tahoar@precisiontranslationtools.com>, moses-support
support <moses-support@mit.edu>
Message-ID:
<1548998594.866465.1421420824896.JavaMail.zimbra@ufal.mff.cuni.cz>
Content-Type: text/plain; charset=utf-8

Hi, Christian,

when the scripts directory of moses was first created back in 2006, we had the same issues with versioning. At that point, I created the (ugly) need to 'install' the scripts, mainly to provide all of them with a version number. Fortunately, we have now got rid of this and the scripts are meant to be used right away after checkout.

I'm saying this just to point out that there is probably no ideal way of keeping up to date and yet ensuring compatibility for existing models with toolkits as complex as moses is.

For this, I use my eman, an experiment manager where even the moses toolkit itself is timestamped. So I have a couple of moses checkouts, timestamped, and my models depend on one of them. Moving to a fresher moses checkout is easy (a new timestamped directory gets created), but requires redoing all the models (well, eman does this for me, so it's just a waste of computer space and time, not mine).

Cheers, O.

----- Original Message -----
> From: "Christian Hardmeier" <ch@rax.ch>
> To: "Hieu Hoang" <hieuhoang@gmail.com>
> Cc: "Tom Hoar" <tahoar@precisiontranslationtools.com>, "moses-support support" <moses-support@mit.edu>
> Sent: Friday, 16 January, 2015 15:26:15
> Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.

> On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
>
>> i think it's too difficult to police.
>
> You'd probably need a regression test that checks if the tokenised output is
> still the same so changes don't go unnoticed. But of course it's still some
> extra work.
>
>> Another idea is to get the script to md5 its own source code, and the non-prefix
>> files it uses.
>
> That would definitely be better than nothing, even though it would raise false
> alarms from time to time.
>
>>
>> On 16/01/15 11:12, Christian Hardmeier wrote:
>>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
>>>
>>>> I agree with versioning. Could be added to the command line.
>>>>
>>>> Also agree that this proposed change qualifies as a version change.
>>>>
>>>> How do you propose managing the issue of output changes due to
>>>> command-line switches, like -no-escape?
>>> Very good question. To be consistent, you'd probably have to increment the
>>> version number even if the change only applies when you use a certain
>>> command-line switch. But not if it doesn't affect the input, and maybe not if
>>> you just add a new command-line switch that is off by default. What do you
>>> think?
>>>
>>>
>>>
>>>>
>>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
>>>>> I'd like to suggest that there should be a version number in the tokeniser that
>>>>> is incremented whenever the output changes, even if the change is minor and
>>>>> even if it's just a bugfix. Otherwise when you pull a new version of moses you
>>>>> don't know if the output of tokenizer.perl is still compatible with your
>>>>> existing models. (Moving functionality from tokenizer.perl to
>>>>> normalize-punctuation.perl would count as a change from my point of view. I
>>>>> don't always use normalize-punctuation.)
>>>>>
>>>>> /Christian
>>>>>
>>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
>>>>>
>>>>>> it's probably a good idea to make this change. If you've done it
>>>>>> already, please send me the updated scripts and I'll check it in. If
>>>>>> not, I'll do it myself
>>>>>>
>>>>>> there's hopefully a fast, C++ tokenizer replacement coming soon.
>>>>>> Highlighting these issues now is useful to understanding exactly how the
>>>>>> tokenizer works/should work
>>>>>>
>>>>>> On 15/01/15 01:52, Tom Hoar wrote:
>>>>>>> This is a separate issue from the parallel "Tokenization problem" thread...
>>>>>>>
>>>>>>> The tokenizer.perl has had one line that transforms the grave accent (`)
>>>>>>> to apostrophe and another that transforms double apostrophe ('') to
>>>>>>> single quote. I suspect these have been in the script since the
>>>>>>> beginning. However, they recently "bit" me on a recent project. Easy
>>>>>>> enough to work around.
>>>>>>>
>>>>>>> Still, I'm wondering. Do they still belong in the tokenizer.perl script?
>>>>>>> Or, should they be moved into one of the other scripts? The
>>>>>>> normalize-punctuation.perl script seems to be a good candidate.
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> Moses-support@mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>

--
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
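
The md5 idea quoted above (the script hashing its own source and the data files it uses) could look roughly like the sketch below. This is only an illustration in Python, not code from Moses: the function names fingerprint and is_compatible are made up here, and which files belong in the list is left to the caller.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return one MD5 hex digest covering the given files, in sorted order."""
    digest = hashlib.md5()
    for path in sorted(Path(p) for p in paths):
        # Include the file name so that renaming a file also changes the result.
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def is_compatible(paths, recorded):
    """True if the files still match the fingerprint recorded with a model."""
    return fingerprint(paths) == recorded
```

A model directory would store the digest at training time and compare it again before decoding. As noted in the thread, this raises false alarms: editing a comment in the script changes the digest even though the tokenised output stays the same.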


------------------------------

Message: 2
Date: Fri, 16 Jan 2015 15:51:00 +0000
From: Jon Olds <joft_uk@yahoo.co.uk>
Subject: [Moses-support] Fuzzy Match Rule segfaults
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <54B93364.7010809@yahoo.co.uk>
Content-Type: text/plain; charset=utf-8; format=flowed

Hi,

I've been taking another look at the fuzzy match rule for hierarchical
models. I am really not sure how to set it up but seem to have got some
response by using the moses.ini below.

Unfortunately, it then segfaults every time after the following output.

I'm probably doing something very stupid, but any assistance would be
much appreciated.

Cheers,

Jon

(6.6) consolidating the two halves @ Fri Jan 16 15:21:28 UTC 2015
Executing: /home/ubuntu/tools/mosesdecoder/scripts/../bin/consolidate
/tmp/moses.gpitcm/fuzzyMatchFile.pt.half.f2e.gz
/tmp/moses.gpitcm/fuzzyMatchFile.pt.half.e2f.gz /dev/stdout
--Hierarchical | gzip -c > /tmp/moses.gpitcm/fuzzyMatchFile.pt.gz
Consolidate v2.0 written by Philipp Koehn
consolidating direct and indirect rule tables
processing hierarchical rules
Executing: rm -f /tmp/moses.gpitcm/fuzzyMatchFile.pt.half.*
Start loading fuzzy-match phrase model : [41.633] seconds
Line 0: Initialize search took 0.116 seconds total
Translating: <s> ce véhicule est à ce jour le plus important dédié à l'
immobilier tertiaire en Ile - de - France . </s> ||| [0,0]=X (1) [0,1]=X
(1) [0,2]=X (1) [0,3]=X (1) [0,4]=X (1) [0,5]=X (1) [0,6]=X (1) [0,7]=X
(1) [0,8]=X (1) [0,9]=X (1) [0,10]=X (1) [0,11]=X (1) [0,12]=X (1)
[0,13]=X (1) [0,14]=X (1) [0,15]=X (1) [0,16]=X (1) [0,17]=X (1)
[0,18]=X (1) [0,19]=X (1) [0,20]=X (1) [0,21]=X (1) [0,22]=X (1) [1,1]=X
(1) [1,2]=X (1) [1,3]=X (1) [1,4]=X (1) [1,5]=X (1) [1,6]=X (1) [1,7]=X
(1) [1,8]=X (1) [1,9]=X (1) [1,10]=X (1) [1,11]=X (1) [1,12]=X (1)
[1,13]=X (1) [1,14]=X (1) [1,15]=X (1) [1,16]=X (1) [1,17]=X (1)
[1,18]=X (1) [1,19]=X (1) [1,20]=X (1) [1,21]=X (1) [1,22]=X (1) [2,2]=X
(1) [2,3]=X (1) [2,4]=X (1) [2,5]=X (1) [2,6]=X (1) [2,7]=X (1) [2,8]=X
(1) [2,9]=X (1) [2,10]=X (1) [2,11]=X (1) [2,12]=X (1) [2,13]=X (1)
[2,14]=X (1) [2,15]=X (1) [2,16]=X (1) [2,17]=X (1) [2,18]=X (1)
[2,19]=X (1) [2,20]=X (1) [2,21]=X (1) [2,22]=X (1) [3,3]=X (1) [3,4]=X
(1) [3,5]=X (1) [3,6]=X (1) [3,7]=X (1) [3,8]=X (1) [3,9]=X (1) [3,10]=X
(1) [3,11]=X (1) [3,12]=X (1) [3,13]=X (1) [3,14]=X (1) [3,15]=X (1)
[3,16]=X (1) [3,17]=X (1) [3,18]=X (1) [3,19]=X (1) [3,20]=X (1)
[3,21]=X (1) [3,22]=X (1) [4,4]=X (1) [4,5]=X (1) [4,6]=X (1) [4,7]=X
(1) [4,8]=X (1) [4,9]=X (1) [4,10]=X (1) [4,11]=X (1) [4,12]=X (1)
[4,13]=X (1) [4,14]=X (1) [4,15]=X (1) [4,16]=X (1) [4,17]=X (1)
[4,18]=X (1) [4,19]=X (1) [4,20]=X (1) [4,21]=X (1) [4,22]=X (1) [5,5]=X
(1) [5,6]=X (1) [5,7]=X (1) [5,8]=X (1) [5,9]=X (1) [5,10]=X (1)
[5,11]=X (1) [5,12]=X (1) [5,13]=X (1) [5,14]=X (1) [5,15]=X (1)
[5,16]=X (1) [5,17]=X (1) [5,18]=X (1) [5,19]=X (1) [5,20]=X (1)
[5,21]=X (1) [5,22]=X (1) [6,6]=X (1) [6,7]=X (1) [6,8]=X (1) [6,9]=X
(1) [6,10]=X (1) [6,11]=X (1) [6,12]=X (1) [6,13]=X (1) [6,14]=X (1)
[6,15]=X (1) [6,16]=X (1) [6,17]=X (1) [6,18]=X (1) [6,19]=X (1)
[6,20]=X (1) [6,21]=X (1) [6,22]=X (1) [7,7]=X (1) [7,8]=X (1) [7,9]=X
(1) [7,10]=X (1) [7,11]=X (1) [7,12]=X (1) [7,13]=X (1) [7,14]=X (1)
[7,15]=X (1) [7,16]=X (1) [7,17]=X (1) [7,18]=X (1) [7,19]=X (1)
[7,20]=X (1) [7,21]=X (1) [7,22]=X (1) [8,8]=X (1) [8,9]=X (1) [8,10]=X
(1) [8,11]=X (1) [8,12]=X (1) [8,13]=X (1) [8,14]=X (1) [8,15]=X (1)
[8,16]=X (1) [8,17]=X (1) [8,18]=X (1) [8,19]=X (1) [8,20]=X (1)
[8,21]=X (1) [8,22]=X (1) [9,9]=X (1) [9,10]=X (1) [9,11]=X (1) [9,12]=X
(1) [9,13]=X (1) [9,14]=X (1) [9,15]=X (1) [9,16]=X (1) [9,17]=X (1)
[9,18]=X (1) [9,19]=X (1) [9,20]=X (1) [9,21]=X (1) [9,22]=X (1)
[10,10]=X (1) [10,11]=X (1) [10,12]=X (1) [10,13]=X (1) [10,14]=X (1)
[10,15]=X (1) [10,16]=X (1) [10,17]=X (1) [10,18]=X (1) [10,19]=X (1)
[10,20]=X (1) [10,21]=X (1) [10,22]=X (1) [11,11]=X (1) [11,12]=X (1)
[11,13]=X (1) [11,14]=X (1) [11,15]=X (1) [11,16]=X (1) [11,17]=X (1)
[11,18]=X (1) [11,19]=X (1) [11,20]=X (1) [11,21]=X (1) [11,22]=X (1)
[12,12]=X (1) [12,13]=X (1) [12,14]=X (1) [12,15]=X (1) [12,16]=X (1)
[12,17]=X (1) [12,18]=X (1) [12,19]=X (1) [12,20]=X (1) [12,21]=X (1)
[12,22]=X (1) [13,13]=X (1) [13,14]=X (1) [13,15]=X (1) [13,16]=X (1)
[13,17]=X (1) [13,18]=X (1) [13,19]=X (1) [13,20]=X (1) [13,21]=X (1)
[13,22]=X (1) [14,14]=X (1) [14,15]=X (1) [14,16]=X (1) [14,17]=X (1)
[14,18]=X (1) [14,19]=X (1) [14,20]=X (1) [14,21]=X (1) [14,22]=X (1)
[15,15]=X (1) [15,16]=X (1) [15,17]=X (1) [15,18]=X (1) [15,19]=X (1)
[15,20]=X (1) [15,21]=X (1) [15,22]=X (1) [16,16]=X (1) [16,17]=X (1)
[16,18]=X (1) [16,19]=X (1) [16,20]=X (1) [16,21]=X (1) [16,22]=X (1)
[17,17]=X (1) [17,18]=X (1) [17,19]=X (1) [17,20]=X (1) [17,21]=X (1)
[17,22]=X (1) [18,18]=X (1) [18,19]=X (1) [18,20]=X (1) [18,21]=X (1)
[18,22]=X (1) [19,19]=X (1) [19,20]=X (1) [19,21]=X (1) [19,22]=X (1)
[20,20]=X (1) [20,21]=X (1) [20,22]=X (1) [21,21]=X (1) [21,22]=X (1)
[22,22]=X (1)


### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[cube-pruning-pop-limit]
1000

[non-terminals]
X

[search-algorithm]
3

[inputtype]
3

[max-chart-span]
20

# feature functions
[feature]
PhraseDictionaryFuzzyMatch source=/home/ubuntu/data/tok/base.clean.fr target=/home/ubuntu/data/tok/base.clean.en alignment=/home/ubuntu/train/model/aligned.grow-diag-final-and num-features=0
KENLM lazyken=0 name=LM0 factor=0 path=/home/ubuntu/train/lm/base.blm.en order=3

# dense weights for feature functions


[weight]
LM0= 0.0866615
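
For readers who want to inspect a configuration like the one above programmatically, here is a minimal sketch of a reader for the [section] layout. It is written in Python for illustration and is not the parser Moses itself uses; the function name parse_moses_ini is an assumption, and the sample reuses a few values from the file above. It does not diagnose the segfault.

```python
def parse_moses_ini(text):
    """Group the non-comment lines of a moses.ini-style file by [section]."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


sample = """
# mapping steps
[mapping]
0 T 0

[max-chart-span]
20

[weight]
LM0= 0.0866615
"""

config = parse_moses_ini(sample)
```

Each section maps to the list of lines under it, so config["max-chart-span"] holds the single entry "20" as a string.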



------------------------------

Message: 3
Date: Fri, 16 Jan 2015 23:53:10 +0800
From: Li Xiang <lixiang.ict@gmail.com>
Subject: [Moses-support] MGIZA is slower than GIZA
To: moses-support <moses-support@mit.edu>
Message-ID: <6711AF35-114A-4A14-BBC0-34CB5FB5EDBD@gmail.com>
Content-Type: text/plain; charset=us-ascii

Hi all,

I trained the alignment model on the same data with the same parameters using GIZA and MGIZA respectively. The training corpus contains 200K sentences. My server has an Intel Quad CPU i4790K with 4 cores and 2 threads per core. GIZA took 2905 seconds, but MGIZA with 3 threads took 5259 seconds. I expected MGIZA to be much faster than GIZA, but I got the opposite result. I do not know whether the cause is the way I compiled it or something else.

Does anyone have relevant experience? Thanks.

The following is the training command for MGIZA. The training data is the FBIS zh-en data, but I cannot publish it because of copyright.


${mosesScript}/training/train-model.perl \
--external-bin-dir "${binDir}" \
--root-dir "${trainDir}" \
--corpus train \
--f src \
--e ref \
--alignment grow-diag-final-and \
--parallel \
--first-step 1 \
--last-step 3 \
--mgiza --mgiza-cpus 3
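
To put the two timings reported above side by side, a small Python calculation follows. The two run times and the thread count come from the message; the "ideal" figure assumes perfect linear scaling across threads, which is a hypothetical best case and not something MGIZA promises.

```python
giza_seconds = 2905    # reported GIZA run time
mgiza_seconds = 5259   # reported MGIZA run time
mgiza_threads = 3      # value passed to --mgiza-cpus

# How many times longer the MGIZA run took than the GIZA run.
slowdown = mgiza_seconds / giza_seconds

# Hypothetical best case: GIZA's time divided evenly over the threads.
ideal_seconds = giza_seconds / mgiza_threads

# How far the observed MGIZA time is from that best case.
gap = mgiza_seconds / ideal_seconds

print(f"MGIZA took {slowdown:.2f}x as long as GIZA")
print(f"Ideal time with {mgiza_threads} threads: {ideal_seconds:.0f} s")
print(f"Observed time is {gap:.2f}x the ideal")
```

The MGIZA run took roughly 1.8 times as long as the GIZA run, which is the opposite of what adding threads would be expected to do.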


------------------------------

Message: 4
Date: Fri, 16 Jan 2015 16:13:15 +0000
From: Barry Haddow <bhaddow@staffmail.ed.ac.uk>
Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
To: Ondrej Bojar <bojar@ufal.mff.cuni.cz>, Christian Hardmeier
<ch@rax.ch>
Cc: Tom Hoar <tahoar@precisiontranslationtools.com>, moses-support
support <moses-support@mit.edu>
Message-ID: <54B9389B.5080603@staffmail.ed.ac.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi

Yes, the EMS (experiment management system) included with Moses will
also deal with this by checking timestamps on the tokeniser scripts.

If you use your models outside the EMS (or Eman etc) however then
there's no easy way to ensure compatibility between tokeniser and model.
I agree that the tokeniser shouldn't be doing text normalisation, but it
was, and fixing it could cause more pain than leaving things as they are.

cheers - Barry



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
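
The regression test suggested earlier in this thread (checking that tokenised output is still the same) can be sketched as below. This is a Python illustration only: changed_lines is a made-up helper, and toy_v1 and toy_v2 are stand-in tokenisers invented for the example, not the behaviour of tokenizer.perl. A real check would call the actual script on a fixed sample file.

```python
def changed_lines(tokenize, samples, expected):
    """Return (input, expected, actual) for every sample whose output differs."""
    mismatches = []
    for line, want in zip(samples, expected):
        got = tokenize(line)
        if got != want:
            mismatches.append((line, want, got))
    return mismatches


# Stand-in tokenisers for illustration only.
def toy_v1(line):
    # Split off full stops.
    return line.replace(".", " .")


def toy_v2(line):
    # Same as toy_v1, but additionally rewrites the grave accent.
    return toy_v1(line).replace("`", "'")


samples = ["Hello world.", "a `quoted` word."]
# Reference output recorded from the version the models were built with.
golden = [toy_v1(s) for s in samples]

unchanged = changed_lines(toy_v1, samples, golden)
differences = changed_lines(toy_v2, samples, golden)
```

An empty result means the sampled output is unchanged. Any entry in the list is a signal that existing models may need retraining or that the version number should be incremented.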



------------------------------


End of Moses-support Digest, Vol 99, Issue 34
*********************************************
