Moses-support Digest, Vol 87, Issue 7

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: problem in tokenization (Tom Hoar)

----------------------------------------------------------------------

Message: 1
Date: Sat, 04 Jan 2014 10:10:10 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] problem in tokenization
To: moses-support@mit.edu
Message-ID: <52C77B92.4000001@precisiontranslationtools.com>
Content-Type: text/plain; charset="iso-8859-1"

Happy New Year all.

Renu and Arththika, have you tried using the character-separated corpus
in an SMT model? It might actually be helpful, because SMT doesn't care
whether a token represents a single "word" or "concept" in a given
language. What matters is how the groupings of tokens (not words) in one
language are matched with groupings of tokens in the other language.

For example, with German, SMT models usually perform better when the
compound words are broken into their components. This 'segmentation'
gives the word alignment greater resolution when matching the groupings
between the two languages. German, however, does not use connecting
characters like yours, so segmenting its compound words can be
cumbersome. In your case, the work is already done.

You might want to create two corpora (character-separated and not) and
then evaluate the results. For the "character-separated" version, you'll
need a custom script to remove the spaces surrounding these characters,
because the Moses detokenizer doesn't. For the "preferred" case, I think
the simplest approach is a custom script that searches for and replaces
only the desired punctuation.
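
As a rough illustration of the two custom scripts described above, here
is a minimal Python sketch. It is not part of the Moses distribution.
The function names, the PUNCT set, and the rule "glue a lone combining
mark to both of its neighbours" are assumptions made for this example,
so adjust them to your corpus.

```python
import re
import unicodedata

# Assumption: the only punctuation we want split off. Extend as needed.
PUNCT = ",.!?;:"
_PUNCT_RE = re.compile("([" + re.escape(PUNCT) + "])")


def tokenize_punct(line):
    """Put spaces around the chosen punctuation and touch nothing else.

    Combining vowel signs stay attached to their consonants because the
    pattern only matches the characters listed in PUNCT.
    """
    spaced = _PUNCT_RE.sub(r" \1 ", line)
    # Collapse the runs of whitespace created by the substitution.
    return " ".join(spaced.split())


def rejoin_marks(line):
    """Undo a character-separated tokenization.

    A token made only of Unicode combining marks (categories Mn, Mc, Me)
    is glued to the token before it, and the token that follows is glued
    on as well, i.e. the spaces on both sides of the mark are removed.
    """
    out = []
    glue_next = False
    for tok in line.split():
        is_mark = all(unicodedata.category(c).startswith("M") for c in tok)
        if out and (is_mark or glue_next):
            out[-1] += tok
        else:
            out.append(tok)
        glue_next = is_mark
    return " ".join(out)


if __name__ == "__main__":
    import sys
    for raw in sys.stdin:
        print(tokenize_punct(raw))
```

Two limitations to keep in mind. Because "." is in PUNCT, the first
function also splits abbreviations and decimal numbers, which is what
the non-breaking prefix file guards against in the Moses tokenizer. The
second function cannot tell whether a mark originally ended a word, so
it will merge two words whenever the first one ended in a combining
mark.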

Tom

On 01/04/2014 08:05 AM, Renu Kumar wrote:
> Sorry, please find the attachment.
>
> Regards
> Renu
>
>
> 2014/1/4 Renu Kumar <renu17775@gmail.com <mailto:renu17775@gmail.com>>
>
> Hi,
>
> I faced a similar problem for Hindi. I skipped the tokenization
> step at the time and moved ahead, but I would also like to sort
> this problem out and add any changes needed for the Hindi language.
>
> This is generally termed a golu character. We see it in the
> output, and it comes up for vowel characters that are used with
> a consonant to form a single character of Hindi (or maybe Tamil
> also; I do not know Tamil, but I think that will be the case
> for most Indian languages).
>
> Two, and in some cases more than two, characters are joined to
> form and in fact represent a single character in Hindi. So when
> we use the tokenizer script, all the characters are broken up
> individually and hence the golu character appears, which is in
> fact the actual representation of these characters if we look at
> the Unicode character chart; they do not play any role as
> independent characters.
>
> Any suggestions?
> I am also attaching the Unicode character chart for Hindi.
>
> Regards
> Renu
>
>
> ---------- Original Message ----------
> From: Arththika Paramanathan <arthiparamanathan@gmail.com
> <mailto:arthiparamanathan@gmail.com>>
> To: Hieu Hoang <Hieu.Hoang@ed.ac.uk <mailto:Hieu.Hoang@ed.ac.uk>>
> Cc: moses-support <moses-support@mit.edu
> <mailto:moses-support@mit.edu>>
> Date: January 3, 2014 at 11:33 PM
> Subject: Re: [Moses-support] problem in tokenization
> Hi,
>
> 1) This is an untokenized sentence:
> ???????? ??????? ?????? ??? ???????,????? ???????? ????? ??????
> ???????? ??????? ????????????? ?????.????????? ???????????? ??????
> ??????????????????? ,??????? ???????? ???????????,?????????
> ??????? ?????????? ????????.
>
> 2) The command I gave is:
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ta <
> ~/corpus/training/squirrelmail.ta-en.ta >
> ~/corpus/squirrelmail.ta-en.tok.ta
>
> 3) The output is:
> ??? ? ??? ? ???? ? ?? ?? ? ??? ??? ???? ? ?? , ???? ? ??? ? ??? ?
> ?? ? ?? ?????? ???? ? ?? ? ???? ? ?? ??? ? ?? ? ????? ? ???? ? .??
> ? ?????? ????? ? ?????? ??? ? ?? ???? ? ????? ? ????? ? ?? , ??? ?
> ?? ? ???? ? ??? ??? ? ? ? ???? ? , ??? ? ???? ? ?? ? ??? ? ???? ?
> ???? ? ??????? ? .
>
> 4) The preferred output is:
> ???????? ??????? ?????? ??? ??????? , ????? ???????? ????? ??????
> ???????? ??????? ????????????? ????? . ????????? ????????????
> ?????? ??????????????????? , ??????? ???????? ??????????? ,
> ????????? ??????? ?????????? ???????? .
> I have also attached the non-breaking prefix file; I want to add
> more abbreviations to it.
>
> --
> regards,
> P.Arththika
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 87, Issue 7
********************************************
