Moses-support Digest, Vol 128, Issue 1

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Chinese tokenizer / detokenizer (segmenter / unsegmenter)
(Tom Hoar)
2. Error during tuning (huda noor)
3. Do we have a detokenizer for German? (dai xin)


----------------------------------------------------------------------

Message: 1
Date: Wed, 31 May 2017 23:30:47 +0700
From: Tom Hoar <tahoar@pttools.net>
Subject: Re: [Moses-support] Chinese tokenizer / detokenizer
(segmenter / unsegmenter)
To: moses-support@mit.edu
Message-ID: <2ce5518d-e07c-f355-d4ba-bc47bcfe2fd9@pttools.net>
Content-Type: text/plain; charset="utf-8"

Slate Desktop integrates the jieba Python tokenizer and Python 2.7.13 for
Chinese.

Slate Desktop uses proprietary data preparation to make a
language-independent SMT model for detokenizing. Our tests show, and
customers confirm, that it restores rendering with natural casing and
spacing at 98% character-level accuracy for any language, including
intermixed character sets in one segment. So, Latin brands and terms
within Chinese, Japanese, Korean and even Thai target languages are
restored properly. What's more, the final rendering typically shows a
3-6% boost in BLEU scores over the tokenized/lowercased tuning set when
comparing identical dev and test sets. We traced the improvements to
corrections in spacing and token ordering, again for any target language.

Tom



On 5/31/2017 11:00 PM, moses-support-request@mit.edu wrote:
> Date: Wed, 31 May 2017 14:40:53 +0800
> From: Dingyuan Wang<abcdoyle888@gmail.com>
> Subject: Re: [Moses-support] Chinese tokenizer / detokenizer
> (segmenter / unsegmenter)
> To: Vincent Nguyen<vnguyen@neuf.fr>, moses-support
> <moses-support@mit.edu>
>
> Hi,
>
> I personally use the jieba tokenizer (https://github.com/fxsjy/jieba).
> Install the python package and use `python -mjieba -d ' '`.
>
> For detokenizer, I wrote my own script
> (https://github.com/The-Orizon/nlputils/blob/master/detokenizer.py).
> Install the `pangu` python package, and use `python3 detokenizer.py`. The
> idea is to remove spaces between CJK/fullwidth characters using a regex.
>
> The above can't handle numbers, abbreviations with dots, contractions like "n't", etc., though.
>
> 2017-05-30 02:31, Vincent Nguyen:
>> Hello team,
>>
>> I have read many posts and it looks like most people tend to use the
>> Stanford segmenter.
>>
>> Do you have some good experience with other tools ?
>>
>> Also, what "detokenizer" do you actually use? It seems that it is not
>> just a question of removing spaces, especially when the Chinese target
>> contains some non-Chinese words / symbols.
>>
>> Thanks for your insight,
>>
>> Vincent.
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
> -- Dingyuan Wang
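For readers who want a feel for the regex-based detokenization Dingyuan describes, here is a minimal, self-contained sketch. The function name and the particular Unicode ranges are illustrative only (the real script linked above uses `pangu` and covers more ranges); it removes spaces between adjacent CJK/fullwidth characters while leaving spaces around Latin tokens intact:

```python
import re

# Illustrative (not exhaustive) CJK/fullwidth ranges:
# CJK Unified Ideographs, CJK punctuation, fullwidth forms.
CJK = '[\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]'

def detokenize(text):
    # Repeatedly drop a single space flanked by CJK characters, so that
    # runs like "A B C" collapse fully despite non-overlapping matching.
    prev = None
    while prev != text:
        prev = text
        text = re.sub('({0}) ({0})'.format(CJK), r'\1\2', text)
    return text

print(detokenize("你 好 ， world 世 界"))  # → 你好， world 世界
```

Note that the space before and after "world" survives, which matches the behaviour the thread asks for: Latin brands and terms embedded in a Chinese sentence keep their spacing.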

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20170531/4d52c6ff/attachment-0001.html

------------------------------

Message: 2
Date: Wed, 31 May 2017 22:48:02 +0500
From: huda noor <hudanoor36@gmail.com>
Subject: [Moses-support] Error during tuning
To: moses-support@mit.edu
Message-ID:
<CA+rzbK7e==hc0kLJ1ZiVX2eZK7=YezJXkOW6aiqpOn4xAUJdqA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Dear Moses experts,
We are building a statistical machine translation system for the
English-Urdu language pair, and we get an error during tuning, i.e. no
moses.ini file is built.

exec: /home/huda/mosesdecoder/working/mert-work/extractor.sh
Executing: /home/huda/mosesdecoder/working/mert-work/extractor.sh >
extract.out 2> extract.err
Exit code: 127
ERROR: Failed to run
'/home/huda/mosesdecoder/working/mert-work/extractor.sh'. at
../scripts/training/mert-moses.pl line 1748.
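Exit code 127 is the shell's "command not found" status, so the first thing to check here is whether the binaries that extractor.sh invokes (the mert extractor built under mosesdecoder) exist, are executable, and are on the expected path; extract.err usually names the missing command. The exit code itself is easy to reproduce:

```shell
# 127 is what the shell returns when a command cannot be found:
bash -c 'no-such-command-here' 2>/dev/null
echo "exit code: $?"
```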
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20170531/6cec80d2/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extract.err
Type: application/octet-stream
Size: 107 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20170531/6cec80d2/attachment-0001.obj

------------------------------

Message: 3
Date: Thu, 1 Jun 2017 15:53:13 +0200
From: dai xin <wingsuestc@gmail.com>
Subject: [Moses-support] Do we have a detokenizer for German?
To: moses-support@mit.edu
Message-ID:
<CADEDxC9c4U9LvYzfH83jgAkE+45T-YH6Vvti9WzgT9Lbij7XQA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi MT experts,

I tokenized a German corpus using

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de

Do we have a corresponding detokenizer for German as well?

It would be great if you could suggest some tools for German
detokenization.
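A standard Moses checkout does include a companion detokenizer script alongside tokenizer.perl; assuming the same layout as the tokenizer invocation above, usage might look like the commented line below. The sed pipeline is shown purely to illustrate one kind of rule such a detokenizer applies (reattaching punctuation to the preceding token), not the script's actual implementation:

```shell
# Assumed path, mirroring the tokenizer call above:
#   ~/mosesdecoder/scripts/tokenizer/detokenizer.perl -l de < in.tok.de > out.de
# Illustration of one detokenization rule:
echo "Das ist ein Test , oder ?" | sed -E 's/ ([,.!?;:])/\1/g'
```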

Thanks in advance; hoping for a reply.

Best regards,

Xin Dai
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20170601/3ccd5445/attachment-0001.html

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 128, Issue 1
*********************************************
