Moses-support Digest, Vol 100, Issue 59

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. TOKENIZER.PERL (doc)
2. Re: TOKENIZER.PERL (Tom Hoar)
3. Number of enough segments (Ihab Ramadan)


----------------------------------------------------------------------

Message: 1
Date: Tue, 17 Feb 2015 06:54:39 +0530
From: doc <raymond.doctor@gmail.com>
Subject: [Moses-support] TOKENIZER.PERL
To: moses-support@mit.edu
Message-ID:
<CAJxcEy_RBkTfdF86iuYZGRdJL9YRcwFamFeDjbqForAZTr7xJg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,
I am using the tokenizer.perl script, which I found at
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
I have tried to make it work for Indic languages, which use the same punctuation
markers except for the full stop, which is
। U+0964 DEVANAGARI DANDA
My main issue is that Hindi and other languages using this character also
use the full stop as an abbreviation marker. How do I keep both characters
as tokenising elements? I would really appreciate it if someone could take
the time to propose modifications to the Perl script so that it handles the
Devanagari danda as well as the full stop. I work in C, hence the issue.
I am appending a small sample of Hindi <raw.txt> for testing.
Many thanks for your help
Best regards,

Raymond
-------------- next part --------------
[Hindi sample text <raw.txt>, dated 13 ????? 2015 (09:23 IST): the Devanagari characters were replaced by "?" in the archive and cannot be recovered.]

------------------------------

Message: 2
Date: Tue, 17 Feb 2015 09:08:18 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] TOKENIZER.PERL
To: moses-support@mit.edu
Message-ID: <54E2A292.5090006@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"

I think you'll find what you're looking for in the manual.pdf. Search
for "nonbreaking_prefixes" files. You'll also find more information in
the existing scripts/share/nonbreaking_prefixes language files.

For example, the files' header says:
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
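
To illustrate the idea behind the nonbreaking-prefix files, here is a minimal Python sketch. It is not the logic of tokenizer.perl itself: the function name `tokenize`, the prefix set, and the splitting rules are assumptions made for illustration only. The danda is always split off as its own token, while a trailing full stop is kept attached when the word before it is a listed abbreviation.

```python
DANDA = "\u0964"  # DEVANAGARI DANDA, the Hindi sentence terminator

# Hypothetical abbreviation list for illustration; in Moses these would come
# from a nonbreaking_prefixes file for the language.
NONBREAKING_PREFIXES = {
    "\u0921\u0949",  # Devanagari abbreviation for "Dr"
    "Dr",
    "Mr",
}


def tokenize(line, prefixes=NONBREAKING_PREFIXES):
    """Split a line into tokens, treating the danda and full stop differently."""
    # The danda is never an abbreviation marker, so always detach it.
    line = line.replace(DANDA, " " + DANDA + " ")
    words = line.split()
    tokens = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            is_last = i == len(words) - 1
            if stem in prefixes and not is_last:
                # Known abbreviation in mid-sentence: keep the full stop attached.
                tokens.append(word)
            else:
                # Otherwise treat the full stop as a separate token.
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return tokens
```

With this sketch, a full stop after a listed abbreviation stays attached to the word, a full stop at the end of the line becomes its own token, and every danda becomes its own token. The real script applies further rules (for example the numeric-only prefixes mentioned in the header) that are not modelled here.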






------------------------------

Message: 3
Date: Tue, 17 Feb 2015 11:00:26 +0200
From: "Ihab Ramadan" <i.ramadan@saudisoft.com>
Subject: [Moses-support] Number of enough segments
To: <moses-support@mit.edu>
Message-ID: <000001d04a90$33b12da0$9b1388e0$@saudisoft.com>
Content-Type: text/plain; charset="us-ascii"

Dear all,

How much data do I need before I can say I have enough to build
a good-quality MT system?

For example, if I have 2 million segments in the parallel files, is that
enough?

Thanks



Regards,
Ihab Ramadan | Senior Developer | Saudisoft-Egypt | Tel: +2 023 303 2037 -
ext 128 | M +2 01007570826 | Fax +2 023 303 2036





------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 100, Issue 59
**********************************************
