Moses-support Digest, Vol 99, Issue 25

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Deprecating the Binary phrase-table (Christian Hardmeier)
2. Re: Deprecating the Binary phrase-table (Raj Dabre)
3. Re: Floating point exception in processPhraseTableMin
(Kenneth Heafield)
4. Re: Tokenization problem (Ihab Ramadan)

----------------------------------------------------------------------

Message: 1
Date: Tue, 13 Jan 2015 16:42:35 +0100
From: Christian Hardmeier <ch@rax.ch>
Subject: Re: [Moses-support] Deprecating the Binary phrase-table
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <1AA300AC-6642-491D-B5C8-A058957F72E6@rax.ch>
Content-Type: text/plain; charset=us-ascii

Hi,

> If people want binary phrase-tables, there's now a gluttony of choice.
> 1. Marcin's compact phrase-table is pretty awesome - it's fast and
> small.
> 2. Nikolay's Probing PT, built on KenLM's data structures.
> 3. Uli's dynamic suffix array.
> 4. My OnDisk PT. Supports both phrase-based and syntax.

Are any of these available as a stand-alone library?
If Moses gives up the old binary phrase table, I'll have to look around for a new phrase table implementation for my document-level decoder, Docent. It would be easier to link against a phrase table library than against a whole decoder.

Thanks for any hints!

Christian

------------------------------

Message: 2
Date: Wed, 14 Jan 2015 00:54:47 +0900
From: Raj Dabre <prajdabre@gmail.com>
Subject: Re: [Moses-support] Deprecating the Binary phrase-table
To: Christian Hardmeier <ch@rax.ch>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAB3gfjDhO=SrwCPYBiC4Qc+R24-5x9MQnYHhBzd_EGujYf2tCg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hey,
Nikolay's Probing PT is standalone.
I have used it.
It is pretty fast and awesome.
Regards.

On Wed, Jan 14, 2015 at 12:42 AM, Christian Hardmeier <ch@rax.ch> wrote:

> Hi,
>
> > If people want binary phrase-tables, there's now a gluttony of choice.
> > 1. Marcin's compact phrase-table is pretty awesome - it's fast and
> > small.
> > 2. Nikolay's Probing PT, built on KenLM's data structures.
> > 3. Uli's dynamic suffix array.
> > 4. My OnDisk PT. Supports both phrase-based and syntax.
>
> Is any of these available as a stand-alone library?
> If moses gives up the old binary phrase table, I'll have to look around
> for a new phrase table implementation for my document-level decoder,
> Docent. It would be easier to link against a phrase table library instead
> of a whole decoder.
>
> Thanks for any hints!
>
> Christian
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014

------------------------------

Message: 3
Date: Tue, 13 Jan 2015 22:37:48 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Floating point exception in
processPhraseTableMin
To: moses-support@mit.edu
Message-ID: <54B5E48C.8020208@kheafield.com>
Content-Type: text/plain; charset=windows-1252

Hi,

Now with a backtrace. Third time it's failed with 16 cores on the same
phrase table. All runs already had "-encoding None."

#0 0x0000000000421c5d in
Moses::Simple9::Encode<__gnu_cxx::__normal_iterator<unsigned int*,
std::vector<unsigned int, std::allocator<unsigned int> > >,
std::back_insert_iterator<std::vector<unsigned int,
std::allocator<unsigned int> > > > (it=..., end=..., outIt=...,
outIt@entry=...) at moses/TranslationModel/CompactPT/ListCoders.h:339
#1 0x00000000004222b4 in Moses::MonotonicVector<unsigned long, unsigned
int, 32ul, std::allocator>::push_back (this=this@entry=0xbebe3258,
i=3540308603) at moses/TranslationModel/CompactPT/MonotonicVector.h:109
#2 0x000000000042d344 in Moses::StringVector<unsigned char, unsigned
long, Moses::MmapAllocator>::push_back<std::string> (this=0xbebe3240,
s=...) at moses/TranslationModel/CompactPT/StringVector.h:386
#3 0x00000000004179a2 in FlushCompressedQueue (force=false,
this=0x7fffffffc550) at
moses/TranslationModel/CompactPT/PhraseTableCreator.cpp:986
#4 Moses::CompressionTask::operator() (this=0xbebe5378) at
moses/TranslationModel/CompactPT/PhraseTableCreator.cpp:1230
#5 0x00000000004678ea in thread_proxy ()
#6 0x0000003a03007851 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003a024e890d in clone () from /lib64/libc.so.6

Looking at the code:

    double log2 = log(2);
    while(j < 9 && lastpos < 28 && (i+lastpos) < end) {
      if(lastpos >= parts[j])
        j++;

      buffer[lastpos] = *(i + lastpos);

      uint reqbit = ceil(log(buffer[lastpos]+1)/log2);
      assert(reqbit <= 28);

      // CRASH HERE
      uint bit = 28/floor(28/reqbit);
      if(lastbit < bit)
        lastbit = bit;

      if(parts[j] > 28/lastbit)
        break;
      else if(lastpos == parts[j]-1)
        lastyes = lastpos;

      lastpos++;
    }

reqbit is 0, so 28/reqbit triggers an integer divide by zero. Yes,
"floating point exception" is a misnomer: the signal covers both kinds
of arithmetic error, but since NaNs are usually set to non-signaling, in
practice it usually means an integer divide by zero.

What is the problematic line "uint bit = 28/floor(28/reqbit);" trying to
do? Currently:

1. Integer division 28/reqbit, returning an integer.
2. Cast that integer to a float.
3. Call floor which should do nothing at this small scale.
4. Floating point divide 28.0 by the result.
5. Convert to integer, rounding down. If the floating-point operation
is imprecise, you'll get something lower than 28/(28/reqbit).
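
A minimal integer-only sketch of what that line appears to compute, assuming the intent is to round reqbit up to the next slot width that fits evenly into Simple-9's 28 data bits. The name slot_width, the choice to treat a width of 0 as 1 bit, and the 0 return for oversized widths are assumptions for illustration, not Moses code:

```cpp
#include <cstdint>

// Hypothetical helper, not part of Moses.
// Rounds a required bit count up to the slot width that 28 / (28 / reqbit)
// yields: 1, 2, 3, 4, 5, 7, 9, 14 or 28.
// Assumption: a required width of 0 (the value 0) is stored in a 1-bit slot.
// Assumption: widths above 28 cannot be encoded and are reported as 0.
inline std::uint32_t slot_width(std::uint32_t reqbit) {
    if (reqbit == 0) reqbit = 1;   // avoids the 28 / 0 that raised the signal
    if (reqbit > 28) return 0;     // 28 / reqbit would be 0, so bail out first
    const std::uint32_t values_per_word = 28u / reqbit;  // how many values fit
    return 28u / values_per_word;                         // bits per value
}
```

For example, slot_width(6) is 7 (four 7-bit values per word) and slot_width(10) is 14. With integer arithmetic there is no cast, no floor and no rounding concern. Whether zeros should reach the encoder at all is a separate question.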

Moreover, it looks like there's some floating-point arithmetic to do
integer log2.

uint reqbit = ceil(log(buffer[lastpos]+1)/log2);

How about gcc's builtin, which is one asm instruction (if gcc is the
compiler)?

int __builtin_clz (unsigned int x)

But anyway buffer[lastpos] == 0 so the above integer log2 code is
correctly returning 0 == log2(0 + 1)
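
For reference, a sketch of the integer version using the builtin. It assumes gcc or clang; bits_needed is a hypothetical name, and 0 is special-cased because the result of __builtin_clz(0) is undefined:

```cpp
#include <limits>

// Hypothetical helper, not part of Moses.
// Number of bits needed to store x, which matches ceil(log2(x + 1))
// when that expression is evaluated exactly.
// Assumes gcc or clang, since __builtin_clz is a compiler builtin.
inline unsigned bits_needed(unsigned int x) {
    if (x == 0) return 0;  // __builtin_clz(0) is undefined, so handle it here
    const unsigned width = std::numeric_limits<unsigned int>::digits;
    return width - static_cast<unsigned>(__builtin_clz(x));
}
```

This still returns 0 for an input of 0, so on its own it does not remove the divide by zero; it only replaces the floating-point log.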

Tracing back a bit more, the function is attempting to encode a vector
containing the following integers: 0 118 128 72 63 71 64 114 41 74 46
375 374 425 112 502 496 485 474 493 106 110 104 110 115 296 287 105 113
0 0 . It's barfing on the 0th entry in that vector, which is a zero.

Some Simple-9 implementations don't expect 0s, since the scheme is
usually applied to delta-encoded posting lists and the like. Is the bug
that 0s are being passed, or that the encoding scheme isn't handling
this case?

Kenneth

On 01/13/2015 02:25 AM, Marcin Junczys-Dowmunt wrote:
> Hi Kenneth.
> Recently I have been encountering an increased number of crashes, too.
> I guess there are some heisenbugs in the binarization that manifest,
> maybe due to a new boost version or something. A workaround is usually
> to use fewer threads, only one or up to 4 (it's actually not much faster
> with 16 anyway). If it still crashes, try -encoding None . I am planning
> to write a new binarization tool from scratch; this one is giving me too
> much of a headache.
>
> On 13.01.2015 at 04:20, Kenneth Heafield wrote:
>> Dear Moses/Marcin,
>>
>> I'm getting a Floating point exception in processPhraseTableMin from
>> Moses d0807c.
>>
>> Arguments, minus the absolute paths, are:
>>
>> processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4
>> -threads 16 -T /tmp -encoding None
>>
>> The phrase table is rather large and it runs for several hours before
>> crashing. Log output is below.
>>
>> Used options:
>> Text phrase table will be read from: phrase-table.gz
>> Output phrase table will be written to: phrase-table.minphr
>> Step size for source landmark phrases: 2^10=1024
>> Source phrase fingerprint size: 16 bits / P(fp)=1.52588e-05
>> Selected target phrase encoding: Huffman
>> Number of score components in phrase table: 4
>> Single Huffman code set for score components: no
>> Using score quantization: no
>> Explicitly included alignment information: yes
>> Running with 16 threads
>>
>> Pass 1/2: Creating source phrase index + Encoding target phrases
>> ..................................................[5000000]
>> ..................................................[10000000]
>> ..................................................[15000000]
>> ..................................................[20000000]
>> ..................................................[25000000]
>> ..................................................[30000000]
>> ..................................................[35000000]
>> ..................................................[40000000]
>> ..................................................[45000000]
>> ..................................................[50000000]
>> ..................................................[55000000]
>> ..................................................[60000000]
>> ..................................................[65000000]
>> ..................................................[70000000]
>> ..................................................[75000000]
>> ..................................................[80000000]
>> ..................................................[85000000]
>> ..................................................[90000000]
>> ..................................................[95000000]
>> ..................................................[100000000]
>> ..................................................[105000000]
>> ..................................................[110000000]
>> ..................................................[115000000]
>> ..................................................[120000000]
>> ..................................................[125000000]
>> ..................................................[130000000]
>> ..................................................[135000000]
>> ..................................................[140000000]
>> ..................................................[145000000]
>> ..................................................[150000000]
>> ..................................................[155000000]
>> ..................................................[160000000]
>> ..................................................[165000000]
>> ..................................................[170000000]
>> ..................................................[175000000]
>> ..................................................[180000000]
>> ..............................................
>>
>> Intermezzo: Calculating Huffman code sets
>> Creating Huffman codes for 624564 target phrase symbols
>> Creating Huffman codes for 551381 scores
>> Creating Huffman codes for 15296482 scores
>> Creating Huffman codes for 582875 scores
>> Creating Huffman codes for 15806633 scores
>> Creating Huffman codes for 50 alignment points
>>
>> Pass 2/2: Compressing target phrases
>> ..................................................[5000000]
>> ..................................................[10000000]
>>
>> Kenneth

------------------------------

Message: 4
Date: Wed, 14 Jan 2015 11:13:46 +0200
From: "Ihab Ramadan" <i.ramadan@saudisoft.com>
Subject: Re: [Moses-support] Tokenization problem
To: <moses-support@mit.edu>
Message-ID: <004901d02fda$67ec4da0$37c4e8e0$@saudisoft.com>
Content-Type: text/plain; charset="iso-8859-1"

Dears,

I still have this problem. To avoid confusing the decoder, I used the
"-no-escape" parameter of the tokenizer.perl script, but the tokenizer
still adds an extra space after the apostrophe when tokenizing files.
When tokenizing a single segment, the output comes without the extra
space.

For example:

In the file:

"which will guide you through connecting and configuring your printer's
wireless connection." -> "which will guide you through connecting and
configuring your printer ' s wireless connection ."

As a segment:

"which will guide you through connecting and configuring your printer's
wireless connection." -> "which will guide you through connecting and
configuring your printer 's wireless connection ."

If it is the same script, why does it generate two different outputs?

I have no experience in Perl, so I could not find the line of code that
behaves differently depending on whether the segment is in a file or is
passed to the script as a single parameter.

Please help.

From: Ihab Ramadan [mailto:i.ramadan@saudisoft.com]
Sent: Monday, January 5, 2015 10:09 AM
To: moses-support@mit.edu
Subject: Tokenization problem

Dears,

Using the tokenizer on the training files replaces the apostrophes with
"' s" (with a space), but if I use the same script to tokenize a single
sentence, it renders the apostrophes as "'s" (without a space).

This problem confuses the decoder during translation.

How can I solve this problem?

Thanks

Best Regards

Ihab Ramadan | Senior Developer | Saudisoft - Egypt <http://www.saudisoft.com/>
Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 99, Issue 25
*********************************************
