Moses-support Digest, Vol 99, Issue 29

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Tokenization problem (Tom Hoar)
2. Legacy tokenizer.perl functionality. (Tom Hoar)
3. Sparse features and overfitting (HOANG Cong Duy Vu)
4. Re: Tokenization problem (Ihab Ramadan)


----------------------------------------------------------------------

Message: 1
Date: Thu, 15 Jan 2015 08:44:29 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Tokenization problem
To: moses-support@mit.edu
Message-ID: <54B71B7D.50802@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

Good catch, Ken. I see your point. For example, considering the likely
language pair (EN-AR), there could be some non-printing characters in
the text file that the copy/paste clipboard drops.
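
A quick way to check for such characters is to list every non-ASCII or
control/format code point in a line. The following is only a minimal Python
sketch and not part of Moses; the function name suspicious_chars and the
sample string are made up for illustration.

```python
import unicodedata


def suspicious_chars(line):
    """Return (index, code point, Unicode name) for every character that is
    non-ASCII or a control/format character (tabs are ignored)."""
    found = []
    for i, ch in enumerate(line.rstrip("\n")):
        category = unicodedata.category(ch)
        if ord(ch) > 127 or (category in ("Cc", "Cf") and ch != "\t"):
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, "U+%04X" % ord(ch), name))
    return found


if __name__ == "__main__":
    # Hypothetical sample: a typographic apostrophe plus a zero-width space.
    sample = "printer\u2019s wireless\u200b connection."
    for hit in suspicious_chars(sample):
        print(hit)
```

Running this over each line of the raw file (read as UTF-8) would show
whether the file and the pasted segment really contain the same characters.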


On 01/15/2015 08:39 AM, Kenneth Heafield wrote:
> I'll inject that it is plausible there is some weird Unicode going on
> there and copy-paste on Linux sometimes canonicalizes graphemes. Whilst
> I'm inclined to side with Tom, the only way to sort this out is with the
> raw file from Ihab as e.g. a gzipped attachment.
>
> Kenneth
>
> On 01/14/2015 08:33 PM, Tom Hoar wrote:
>> I just ran the same sentence through the newest github clone (today).
>>
>> corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
>> ./tokenizer.perl -no-escape -q -l en < test.txt
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>>
>> This is not a Perl script problem. What shell and command line are you
>> using for your "in the file" results? You'll find the problem in either
>> your shell or your custom tool chain(s) before you run tokenizer.perl.
>>
>>
>>
>> On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>>> Dear all,
>>>
>>> I still have this problem. To avoid confusing the decoder I used the
>>> "-no-escape" parameter of the tokenizer.perl script, but an extra space
>>> is still added after the apostrophe when tokenizing files, whereas
>>> tokenizing a single segment gives no extra space.
>>>
>>> For example:
>>>
>>> In the file
>>>
>>> "which will guide you through connecting and configuring your
>>> printer's wireless connection." -> "which will guide you through
>>> connecting and configuring your printer ' s wireless connection ."
>>>
>>> As a segment
>>>
>>> "which will guide you through connecting and configuring your
>>> printer's wireless connection." -> "which will guide you through
>>> connecting and configuring your printer 's wireless connection ."
>>>
>>> If it is the same script, why does it generate two different outputs?
>>>
>>> I have no experience in Perl, so I could not find the line of code that
>>> behaves differently for a segment in a file versus a single segment
>>> passed as a parameter to the script.
>>>
>>> Please help.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:*Ihab Ramadan [mailto:i.ramadan@saudisoft.com]
>>> *Sent:* Monday, January 5, 2015 10:09 AM
>>> *To:* moses-support@mit.edu
>>> *Subject:* Tokenization problem
>>>
>>>
>>>
>>> Dear all,
>>>
>>> Using the tokenizer on the training files replaces apostrophes
>>> with "&apos; s" (with a space), but if I use the same script to tokenize
>>> a single sentence, the apostrophe becomes "&apos;s" (without a space).
>>>
>>> This problem confuses the decoder during translation.
>>>
>>> How can I solve this problem?
>>>
>>> Thanks
>>>
>>>
>>>
>>> Best Regards
>>>
>>> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>>> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
>>> Fax +20233032036
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support



------------------------------

Message: 2
Date: Thu, 15 Jan 2015 08:52:47 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: [Moses-support] Legacy tokenizer.perl functionality.
To: moses-support@mit.edu
Message-ID: <54B71D6F.60205@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8; format=flowed

This is a separate issue from the parallel "Tokenization problem" thread...

The tokenizer.perl script has had one line that transforms the grave
accent (`) to an apostrophe and another that transforms a double
apostrophe ('') to a single quote. I suspect these have been in the script
since the beginning. However, they "bit" me on a recent project. Easy
enough to work around.

Still, I'm wondering: do they still belong in the tokenizer.perl script?
Or should they be moved into one of the other scripts? The
normalize-punctuation.perl script seems to be a good candidate.
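
To make the effect of the two rules concrete, here is a rough Python sketch
of the behaviour described above. It is not the Perl code from
tokenizer.perl. It assumes that "single quote" means one double-quote
character ("), and the exact replacement strings in the script (for example
any surrounding spaces) may differ. The function name legacy_quote_rewrite
is made up.

```python
def legacy_quote_rewrite(text):
    """Apply the two legacy substitutions in order (assumed behaviour)."""
    # Rule 1: a grave accent becomes an apostrophe.
    text = text.replace("`", "'")
    # Rule 2: a double apostrophe becomes one double-quote character.
    # Because rule 1 runs first, a pair of grave accents is caught here too.
    text = text.replace("''", '"')
    return text


if __name__ == "__main__":
    print(legacy_quote_rewrite("``quoted'' text and it`s fine"))
```

Text that uses a literal grave accent or an intentional double apostrophe
is silently changed by rules like these, which is how they can "bite".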


------------------------------

Message: 3
Date: Thu, 15 Jan 2015 13:54:13 +0800
From: HOANG Cong Duy Vu <duyvuleo@gmail.com>
Subject: [Moses-support] Sparse features and overfitting
To: moses-support <moses-support@mit.edu>
Message-ID:
<CAPRaJX3wRF8ZXoB7OMyWRUy7F6g-5UZr3DKd6TrS5gdUVD4kdw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi,

I am working on applying sparse features to a *phrase-based* system in the
*conversational* domain (e.g. SMS, chat).

I used sparse features such as: TargetWordInsertionFeature,
SourceWordDeletionFeature, WordTranslationFeature, PhraseLengthFeature.
Sparse features are used only for the top source and target words (100, 150,
200, 250, ...).

My parallel data include: train(201K); tune(6214); test(641).
My system configuration: tuning with MIRA, 5-gram LM with KenLM, others by
default.

Here is the result:

System              BLEU    NIST    METEOR
Baseline            0.2009  5.2175  0.2603
Baseline + SP100    0.2021  5.135   0.2645
Baseline + SP150    0.2048  5.1804  0.2653
Baseline + SP200    0.2093  5.2272  0.2671
Baseline + SP250    0.2148  5.2603  0.2680
Baseline + SP300    0.2146  5.2631  0.2680
(SP: sparse features)

Although I got a significantly improved result with SP250, I believe it was
due to over-fitting.
So I studied the overlap between the train, tune and test data sets.
The overlap information is as follows:
- *train & test*:
  *(based on source)*
    size of test set = 625 (641 with duplicates)
    size of overlap set = 65
    proportion of train set inside test set = 6394 / 201301
  *(based on target)*
    size of test set = 621 (641 with duplicates)
    size of overlap set = 69
    proportion of train set inside test set = 13808 / 201301

- *tune & test*:
  *(based on source)*
    size of test set = 625 (641 with duplicates)
    size of overlap set = 624
    proportion of tune set inside test set = 939 / 6214
  *(based on target)*
    size of test set = 621 (641 with duplicates)
    size of overlap set = 386
    proportion of tune set inside test set = 706 / 6214

(tune & test overlap heavily on the source side, but about half of the
overlapping source sentences have different target sentences)
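
Figures like the ones above can be computed roughly as in the following
Python sketch. It assumes exact string matching after stripping whitespace,
and it is one possible reading of the three quantities (unique test size,
overlap set size, number of train/tune sentences found in test). The
function name overlap_stats is made up for illustration.

```python
def overlap_stats(reference_lines, test_lines):
    """Compare one side (source or target) of train/tune against test."""
    test_unique = {line.strip() for line in test_lines}
    reference_unique = {line.strip() for line in reference_lines}
    # Distinct test sentences that also occur in the reference set.
    overlap = test_unique & reference_unique
    # Reference sentences (duplicates counted) that occur in the test set.
    inside = sum(1 for line in reference_lines if line.strip() in test_unique)
    return {
        "test_unique": len(test_unique),
        "test_total": len(test_lines),
        "overlap": len(overlap),
        "reference_inside_test": inside,
        "reference_total": len(reference_lines),
    }


if __name__ == "__main__":
    # Toy data, not the real corpus.
    tune = ["how r u", "ok", "how r u", "see you"]
    test = ["how r u", "how r u", "bye"]
    print(overlap_stats(tune, test))
```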

After filtering the overlapping parts (based on source sentences) out of
train and tune with respect to test, my resulting parallel data are:
train(194K); tune(5274); test(641).
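
The filtering step can be sketched as follows in Python. It assumes the
parallel data are held as (source, target) pairs and that a pair is dropped
whenever its source sentence appears, after stripping whitespace, anywhere
in the test source side. The name filter_by_test_source is hypothetical.

```python
def filter_by_test_source(pairs, test_sources):
    """Drop every (source, target) pair whose source occurs in the test set."""
    blocked = {source.strip() for source in test_sources}
    return [(src, tgt) for src, tgt in pairs if src.strip() not in blocked]


if __name__ == "__main__":
    # Toy data, not the real corpus.
    tune_pairs = [("how r u", "how are you"), ("ok", "okay"), ("how r u", "how are you ?")]
    test_sources = ["how r u", "bye"]
    print(filter_by_test_source(tune_pairs, test_sources))
```

Filtering on the source side only, as done here, removes a pair even when
its target differs from the test reference.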

And here are the results:

System              BLEU    NIST    METEOR
Baseline            0.1990  5.1764  0.2589
Baseline + SP250    0.1967  5.0109  0.2606

Only METEOR improved slightly; BLEU and NIST both dropped.

Is there any way to prevent over-fitting when applying sparse features?
Or, in this case, will sparse features simply not generalize well to
"unseen" data?
I would appreciate your advice.

Thanks so much!

--
Cheers,
Vu

------------------------------

Message: 4
Date: Thu, 15 Jan 2015 10:09:36 +0200
From: "Ihab Ramadan" <i.ramadan@saudisoft.com>
Subject: Re: [Moses-support] Tokenization problem
To: <moses-support@mit.edu>
Message-ID: <007601d0309a$9b6be260$d243a720$@saudisoft.com>
Content-Type: text/plain; charset="us-ascii"

Many thanks to all of you.
As you mentioned, the problem was not in the script; it was in the text sent
to the terminal from my web app. I found that some characters were not passed
through unchanged and arrived as weird Unicode.
Thanks, everybody.
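
For anyone hitting the same thing, here is a minimal Python sketch of
cleaning text before it reaches tokenizer.perl. It assumes the culprits are
decomposed graphemes and invisible control/format characters. It is not part
of Moses, and clean_for_tokenizer is a hypothetical helper name. Dropping
format characters also removes zero-width joiners and direction marks, which
can matter for some scripts, so the category list may need adjusting.

```python
import unicodedata


def clean_for_tokenizer(text):
    """Normalize to NFC and drop control/format characters (keep tab, newline)."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        if ch in "\t\n":
            kept.append(ch)
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Invisible character: skip it.
            continue
        else:
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    # Hypothetical input containing a zero-width space.
    print(clean_for_tokenizer("your printer\u200b's wireless connection."))
```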

-----Original Message-----
From: moses-support-bounces@mit.edu [mailto:moses-support-bounces@mit.edu]
On Behalf Of moses-support-request@mit.edu
Sent: Thursday, January 15, 2015 3:39 AM
To: moses-support@mit.edu
Subject: Moses-support Digest, Vol 99, Issue 28


Today's Topics:

1. how to align some new parallel sentences using a trained
model (iamzcy_hit iamzcy_hit)
2. Re: Tokenization problem (Tom Hoar)
3. Re: Tokenization problem (Kenneth Heafield)


----------------------------------------------------------------------

Message: 1
Date: Thu, 15 Jan 2015 08:54:06 +0800
From: iamzcy_hit iamzcy_hit <iamzcyhit@gmail.com>
Subject: [Moses-support] how to align some new parallel sentences
using a trained model
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAGLowvLWHXb_J+=vZqMeOVCOD7Z=Uzyz_Sn=yjv+PTsfSyvn3A@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi all,
If I've trained an alignment model on a huge parallel corpus with the
help of GIZA++, MGIZA or fast_align, and I am now given some new sentence
pairs and want to align the words in them, how should I do that?
Best regards



------------------------------

Message: 2
Date: Thu, 15 Jan 2015 08:33:17 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Tokenization problem
To: moses-support@mit.edu
Message-ID: <54B718DD.4030109@precisiontranslationtools.com>
Content-Type: text/plain; charset="windows-1252"

I just ran the same sentence through the newest github clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .

This is not a Perl script problem. What shell and command line are you using
for your "in the file" results? You'll find the problem in either your shell
or your custom tool chain(s) before you run tokenizer.perl.




------------------------------

Message: 3
Date: Wed, 14 Jan 2015 20:39:14 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Tokenization problem
To: moses-support@mit.edu
Message-ID: <54B71A42.7040703@kheafield.com>
Content-Type: text/plain; charset=windows-1252

I'll inject that it is plausible there is some weird Unicode going on there
and copy-paste on Linux sometimes canonicalizes graphemes. Whilst I'm
inclined to side with Tom, the only way to sort this out is with the raw
file from Ihab as e.g. a gzipped attachment.

Kenneth



------------------------------

End of Moses-support Digest, Vol 99, Issue 28
*********************************************




------------------------------

End of Moses-support Digest, Vol 99, Issue 29
*********************************************
