Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: (no subject) (Thomas Meyer)
2. Re: (no subject) (cyrine.nasri@univ-lorraine.fr)
3. Re: tokenizer script , special characters (Tom Hoar)
4. Adding exceptions using xml-input (Massinissa Ahmim)
----------------------------------------------------------------------
Message: 1
Date: Fri, 21 Feb 2014 14:55:04 +0100
From: Thomas Meyer <ithurtstom@gmail.com>
Subject: Re: [Moses-support] (no subject)
To: "cyrine.nasri@univ-lorraine.fr" <cyrine.nasri@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CADt3zzbM4EVmX-YWhvLm01tv92SQu71_bncdyPpuPWaWHHzPZg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
Ah, in that case it can actually cause problems: your training data should
always be formatted in the same way as your dev/test data.
There are two possibilities:
- re-tokenize the training data with the current tokenizer script so that it
has the same mark-up (then retrain your system)
- re-tokenize your dev/test data with the same (possibly older) tokenizer
script that was used for your training data (then run tuning/decoding)
HTH,
Thomas
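A quick way to spot the mismatch described above is to measure how many lines of each corpus contain escaped entities. The following is only a rough Python sketch: the entity list is an assumption about what the tokenizer emits, and the function names are hypothetical, not part of Moses.

```python
import re

# Entities the Moses tokenizer is assumed to emit when escaping is enabled
# (an assumption for this sketch; check the tokenizer.pl you actually run).
ENTITY_RE = re.compile(r"&(?:amp|lt|gt|apos|quot|#124|#91|#93);")


def entity_rate(lines):
    """Return the fraction of non-empty lines containing an escaped entity."""
    total = 0
    with_entity = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if ENTITY_RE.search(line):
            with_entity += 1
    return with_entity / total if total else 0.0


def looks_inconsistent(train_lines, test_lines):
    """Flag the case where exactly one of the two corpora contains entities."""
    return (entity_rate(train_lines) == 0.0) != (entity_rate(test_lines) == 0.0)
```

This is a heuristic: a small corpus with no apostrophes, quotes or brackets would report a rate of zero even if it was tokenized with escaping, so a flagged mismatch is a hint to inspect the files, not proof.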
On 21 February 2014 14:49, cyrine.nasri@univ-lorraine.fr <
cyrine.nasri@gmail.com> wrote:
> Thank you Thomas,
>
> So if I keep the text with these special characters, it will not cause
> problems? I ask because the training corpus is without these characters; only
> the development and test corpora contain them.
>
> Thank you :)
>
> Best
>
>
> 2014-02-21 14:40 GMT+01:00 Thomas Meyer <ithurtstom@gmail.com>:
>
>>
>>
>> Hi,
>>
>> That is not a 'problem' but XML entity mark-up
>> (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
>> for special characters. You don't have to worry about this, as the
>> tokenizer script does it for all characters in a consistent way.
>>
>> Best,
>> Thomas
>>
>>
>> On 21 February 2014 14:20, cyrine.nasri@univ-lorraine.fr <
>> cyrine.nasri@gmail.com> wrote:
>>
>>>
>>> Hello all,
>>>
>>> I have a problem with the tokenizer.pl script. As a result I get a text
>>> with some special punctuation, like this for example:
>>>
>>> EU &apos;s Luxembourg-based statistical office reported
>>>
>>> The input file is a .txt file
>>>
>>> Is there any solution for this problem?
>>>
>>> Thank you in advance
>>>
>>>
>>> Bests
>>> --
>>> *Cyrine*
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> --
>
> *Cyrine NASRI, Ph.D. Student in Computer Science*
>
------------------------------
Message: 2
Date: Fri, 21 Feb 2014 14:57:33 +0100
From: "cyrine.nasri@univ-lorraine.fr" <cyrine.nasri@gmail.com>
Subject: Re: [Moses-support] (no subject)
To: Thomas Meyer <ithurtstom@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAPg_V0hPA3GpXGa7FoerUWe-iDQAE9HKjpSVTQt5jPgSY_TQHA@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Ok,
Thank you
Bests
Cyrine
--
*Cyrine NASRI, Ph.D. Student in Computer Science*
------------------------------
Message: 3
Date: Fri, 21 Feb 2014 21:13:37 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] tokenizer script , special characters
To: moses-support@MIT.EDU
Message-ID: <53075F11.7070600@precisiontranslationtools.com>
Content-Type: text/plain; charset="iso-8859-1"
Cyrine
This is not a problem. It's the design. The tokenizer.pl script escapes
characters that Moses reserves for its own use. The detokenizer.pl script
unescapes these characters after translation.
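The round trip can be illustrated with a minimal Python sketch. The replacement table is an assumption modelled on the escaping usually seen in Moses-tokenized text; it is not copied from tokenizer.pl, and the function names are hypothetical.

```python
# Replacement table assumed for this sketch. "&" comes first so that the
# ampersands introduced by the later replacements are not escaped again.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_moses(text):
    """Replace reserved characters with entities, as a tokenizer would."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_moses(text):
    """Restore the original characters, as a detokenizer would."""
    # Undo in reverse order so that "&amp;" is restored last.
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text
```

With this table, the apostrophe in the example sentence becomes `&apos;`, which matches the kind of output reported in the original question.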
------------------------------
Message: 4
Date: Fri, 21 Feb 2014 15:20:11 +0100
From: Massinissa Ahmim <massinissa.ahmim@linguacustodia.com>
Subject: [Moses-support] Adding exceptions using xml-input
To: moses-support@mit.edu
Message-ID:
<CANN0mWa6y5+JqkRRODLHUGMjSyaWB3j4wm9b7yX705u7S8BfDw@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Dear all,
I am currently considering using xml-input to force Moses to translate
certain parts of the text (titles, etc.) in a specific way.
So far I have been using the command line, for instance:
echo '<np translation="This document is cool."> Ce document est cool</np> ' |
/mosesdecoder/bin/moses -xml-input exclusive -f moses.ini -t
This might be fine for a relatively small number of exceptions, but as I want to
implement a large number of exceptions, the command might grow quickly. I was
wondering whether there is already a way in Moses to store a set of
exceptions in a file that the decoder will check before starting the
translation process?
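One workaround, independent of whatever the decoder itself offers, is to keep the exceptions in a file and add the mark-up in a preprocessing step. The Python sketch below assumes a hypothetical tab-separated file format (source phrase, then forced translation) and hypothetical function names; none of this is a Moses feature.

```python
import re
from xml.sax.saxutils import quoteattr


def load_exceptions(lines):
    """Parse 'source<TAB>translation' lines into a dict (hypothetical format)."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or "\t" not in line:
            continue
        source, target = line.split("\t", 1)
        table[source.strip()] = target.strip()
    return table


def annotate(sentence, table, tag="np"):
    """Wrap each known source phrase in mark-up carrying its translation."""
    if not table:
        return sentence
    # Longest phrases first, so a longer exception wins over a shorter one
    # that it contains.
    phrases = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))

    def wrap(match):
        phrase = match.group(0)
        # quoteattr adds the surrounding quotes and escapes the value.
        attr = quoteattr(table[phrase])
        return "<%s translation=%s>%s</%s>" % (tag, attr, phrase, tag)

    return pattern.sub(wrap, sentence)
```

The annotated lines could then be piped to the decoder with `-xml-input exclusive`, as in the command above. Note that this sketch matches plain substrings, ignores word boundaries, and applies standard XML attribute quoting; whether the decoder accepts every escaped value is an assumption that would need testing.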
Many thanks
Regards
Massinissa
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 88, Issue 45
*********************************************