Moses-support Digest, Vol 87, Issue 6

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: problem in tokenization (Renu Kumar)


----------------------------------------------------------------------

Message: 1
Date: Sat, 4 Jan 2014 06:35:45 +0530
From: Renu Kumar <renu17775@gmail.com>
Subject: Re: [Moses-support] problem in tokenization
To: arthiparamanathan@gmail.com, Hieu.Hoang@ed.ac.uk
Cc: moses-support@mit.edu
Message-ID:
<CAGOzkqRb=XrEs11RxYJC1L69hsSvMqBnS2edA=D67Ux-y0oNWw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Sorry, please find the attachment.

Regards
Renu


2014/1/4 Renu Kumar <renu17775@gmail.com>

> Hi,
>
> I faced a similar problem with Hindi. At the time I skipped the
> tokenization step and moved ahead, but I would also like to solve this
> problem and add whatever changes are needed for Hindi.
>
> The character we see in the output is generally called a "golu"
> character. It appears for vowel signs that combine with a consonant to
> form a single character in Hindi (and possibly in Tamil too; I do not know
> Tamil, but I think this is the case for most Indian languages).
>
> In Hindi, two or sometimes more characters are joined to form, and in fact
> represent, a single character. When we run the tokenizer script, these
> characters are all split apart, and so the golu character appears. That is
> in fact how these characters are shown in the Unicode character chart;
> they play no role as independent characters.
>
> Any suggestions?
> I am also attaching the Unicode character chart for Hindi.
>
> Regards
> Renu
>
>
> ---------- Original Message ----------
> From: Arththika Paramanathan <arthiparamanathan@gmail.com>
> To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
> Cc: moses-support <moses-support@mit.edu>
> Date: January 3, 2014 at 11:33 PM
> Subject: Re: [Moses-support] problem in tokenization
> Hi,
>
> 1) This is an untokenized sentence:
> ???????? ??????? ?????? ??? ???????,????? ???????? ????? ?????? ????????
> ??????? ????????????? ?????.????????? ???????????? ??????
> ??????????????????? ,??????? ???????? ???????????,????????? ???????
> ?????????? ????????.
>
> 2) The command I ran is:
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ta <
> ~/corpus/training/squirrelmail.ta-en.ta >
> ~/corpus/squirrelmail.ta-en.tok.ta
>
> 3) The output is:
> ??? ? ??? ? ???? ? ?? ?? ? ??? ??? ???? ? ?? , ???? ? ??? ? ??? ? ?? ? ??
> ?????? ???? ? ?? ? ???? ? ?? ??? ? ?? ? ????? ? ???? ? .?? ? ?????? ????? ?
> ?????? ??? ? ?? ???? ? ????? ? ????? ? ?? , ??? ? ?? ? ???? ? ??? ??? ? ? ?
> ???? ? , ??? ? ???? ? ?? ? ??? ? ???? ? ???? ? ??????? ? .
>
> 4) The preferred output is:
> ???????? ??????? ?????? ??? ??????? , ????? ???????? ????? ?????? ????????
> ??????? ????????????? ????? . ????????? ???????????? ??????
> ??????????????????? , ??????? ???????? ??????????? , ????????? ???????
> ?????????? ???????? .
> I have also attached the non-breaking prefix file; I want to add more
> abbreviations to it.
>
>
>> --
>> regards,
>> P.Arththika
>
>
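The behaviour described in the quoted messages (vowel signs separated from
their consonants, then displayed on a placeholder circle) can be reproduced
with a small sketch. It is not the Moses tokenizer.perl code: naive_tokenize
is a hypothetical stand-in that assumes a tokenizer which pads every
character that is not a letter or a digit with spaces. Under that assumption,
Devanagari and Tamil vowel signs and viramas, which belong to the Unicode
"Mark" categories (Mn, Mc) rather than "Letter", are treated like
punctuation. Passing keep_marks=True shows one possible remedy: treat
combining marks as part of the word.

```python
import unicodedata


def is_word_char(ch, keep_marks):
    """Return True if ch should stay attached to its neighbours."""
    major = unicodedata.category(ch)[0]  # "L" letter, "N" number, "M" mark, ...
    if major in ("L", "N"):
        return True
    # Assumption for illustration: the fix is to accept combining marks too.
    return keep_marks and major == "M"


def naive_tokenize(text, keep_marks=False):
    """Hypothetical simplified tokenizer, not the real tokenizer.perl logic."""
    pieces = []
    for ch in text:
        if ch.isspace() or is_word_char(ch, keep_marks):
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")  # pad anything else, as for punctuation
    return " ".join("".join(pieces).split())  # collapse repeated spaces


# Hindi "Hindi": HA + VOWEL SIGN I + ANUSVARA + DA + VOWEL SIGN II
hindi = "\u0939\u093f\u0902\u0926\u0940"
# Tamil "Tamil": TA + MA + VOWEL SIGN I + LLLA + VIRAMA (pulli)
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

if __name__ == "__main__":
    for word in (hindi, tamil):
        # Show which code points are marks (categories starting with "M").
        print([(f"U+{ord(c):04X}", unicodedata.category(c)) for c in word])
        print("marks split:", naive_tokenize(word))
        print("marks kept :", naive_tokenize(word, keep_marks=True))
```

A combining mark printed on its own has no base consonant to attach to, so
most renderers draw it on a dotted circle (U+25CC), which is presumably the
"golu" character mentioned above. Whether the installed tokenizer.perl
already keeps marks attached depends on its version, so it is worth checking
the script's regular expressions before patching anything.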
-------------- next part --------------
A non-text attachment was scrubbed...
Name: code_chart_hindi_U0900.pdf
Type: application/pdf
Size: 123032 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20140104/20eac663/attachment.pdf

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 87, Issue 6
********************************************
