Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Moses-support Digest, Vol 91, Issue 52
(Marcin Junczys-Dowmunt)
2. Re: Moses-support Digest, Vol 91, Issue 52 (Lane Schwartz)
3. Re: Moses-support Digest, Vol 91, Issue 52 (Lane Schwartz)
----------------------------------------------------------------------
Message: 1
Date: Fri, 30 May 2014 18:07:03 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
To: moses-support@mit.edu
Message-ID: <5388BAB7.80405@amu.edu.pl>
Content-Type: text/plain; charset="iso-8859-1"
How's this?
cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_="$_\n"'
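For readers who prefer Python, the same idea can be sketched roughly as below. This is an illustrative equivalent, not the script used in the thread: the function names are invented here, and Python's `unicodedata` tables may follow a different Unicode version than the one bundled with a given Perl. The second function mimics a byte-level ASCII filter applied to UTF-8 input, assuming the filter sees raw bytes rather than decoded characters, to show why that approach damages accented letters.

```python
import unicodedata


def replace_other_chars(line: str, replacement: str = " ") -> str:
    """Replace every character whose general category starts with 'C'
    (control, format, surrogate, private use, unassigned) with `replacement`.
    The trailing newline is stripped first and re-added, like chomp in Perl."""
    body = line.rstrip("\n")
    cleaned = "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in body
    )
    return cleaned + "\n"


def ascii_only_bytes(line: str) -> str:
    """Hypothetical byte-level filter: keep bytes 0x20-0x7E, blank the rest.
    Each byte of a multi-byte UTF-8 sequence is replaced separately."""
    raw = line.rstrip("\n").encode("utf-8")
    kept = bytes(b if 0x20 <= b <= 0x7E else 0x20 for b in raw)
    return kept.decode("ascii") + "\n"
```

With these definitions, `replace_other_chars` leaves a line such as `gonz\u00e1lez` untouched, because U+00E1 is a lowercase letter, while `ascii_only_bytes` blanks both bytes of its UTF-8 encoding.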
On 30.05.2014 18:01, Hieu Hoang wrote:
> in the attached file, there are 2 or more non-printing chars on the
> 1st line, between the words 'place' and 'binding'. They should be
> removed/replaced with a space. Those chars are deleted by parsers,
> making the word alignments incorrect and crashing extract
>
> The 2nd line is perfectly good utf8. It shouldn't be touched.
>
> just another friday nlp malaise
>
>
>
> On 30 May 2014 17:51, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>
> it is trivial to change it to say a ? mark.
>
> but I'm not sure what you want as output now. the original request
> was for removing non-printable characters, which the Perl does,
>
> Miles
>
> On 30 May 2014 12:43, Hieu Hoang <Hieu.Hoang@ed.ac.uk> wrote:
> > forgot to say. The input is utf8. The snippet turns
> > gonzález
> > to
> > gonz lez
> >
> >
> > On 30 May 2014 17:22, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> >>
> >> this perl snippet:
> >>
> >> $line =~ tr/\040-\176/ /c;
> >>
> >> On 30 May 2014 12:17, <moses-support-request@mit.edu> wrote:
> >> > Message: 1
> >> > Date: Fri, 30 May 2014 16:24:30 +0100
> >> > From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
> >> > Subject: [Moses-support] removing non-printing character
> >> > To: moses-support <moses-support@mit.edu>
> >> > Message-ID:
> >> > <CAEKMkbj4tEDZYVGeAStmg51+w-5SYE5YGRmibcYPC2j8YbKGfg@mail.gmail.com>
> >> > Content-Type: text/plain; charset="utf-8"
> >> >
> >> > does anyone have a script/program that can remove all non-printing
> >> > characters?
> >> >
> >> > I don't care if it's fast or slow, as long as it ABSOLUTELY removes
> >> > all non-printing chars
> >> >
> >> > --
> >> > Hieu Hoang
> >> > Research Associate
> >> > University of Edinburgh
> >> > http://www.hoang.co.uk/hieu
> >>
> >>
> >>
> >> --
> >> The University of Edinburgh is a charitable body, registered in
> >> Scotland, with registration number SC005336.
------------------------------
Message: 2
Date: Fri, 30 May 2014 13:21:37 -0400
From: Lane Schwartz <dowobeha@gmail.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Miles Osborne
<miles@inf.ed.ac.uk>
Message-ID:
<CABv3vZnaFCTS0nNP-FqnSx5_7o-d7LxGu9b=iv_MhuqgYkhCfw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
As far as I know, no such general purpose tool exists. We wrote a
custom in-house script that removes many, but not all, possible
non-printing Unicode characters as part of our WMT submission.
I am interested in writing one, though.
I think the right way to do this would be to parse the Unicode
character database for all characters of certain classes, and build
the tool from that data.
Lane
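The approach described above, building the filter from the Unicode character database, might look something like the following sketch. It is not the in-house script mentioned in the message. It assumes the semicolon-separated `UnicodeData.txt` layout (code point, name, general category, ...), uses a short embedded excerpt in that format instead of the full file, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Set

# Abbreviated excerpt in the UnicodeData.txt format, for illustration only.
SAMPLE_UCD = """\
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
"""


def code_points_in_categories(ucd_lines: Iterable[str],
                              categories: Set[str]) -> Set[int]:
    """Collect code points whose general category is in `categories`.
    Entries named '<..., First>' and '<..., Last>' delimit a range."""
    selected: Set[int] = set()
    range_start = None
    for raw in ucd_lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, cat = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp
            continue
        if name.endswith(", Last>") and range_start is not None:
            if cat in categories:
                selected.update(range(range_start, cp + 1))
            range_start = None
            continue
        if cat in categories:
            selected.add(cp)
    return selected


def build_table(code_points: Set[int], replacement: str = " ",
                keep: str = "\n") -> Dict[int, str]:
    """Build a str.translate table, leaving characters in `keep` untouched."""
    protected = {ord(ch) for ch in keep}
    return {cp: replacement for cp in code_points if cp not in protected}


CODE_POINTS = code_points_in_categories(SAMPLE_UCD.splitlines(),
                                        {"Cc", "Cf", "Co"})
TABLE = build_table(CODE_POINTS)


def clean(text: str) -> str:
    """Replace every code point listed in TABLE with a space."""
    return text.translate(TABLE)
```

With the full `UnicodeData.txt` in place of the excerpt, the same functions would cover every assigned character in the chosen categories. Unassigned code points are not listed in that file, so they would need separate handling.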
--
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
------------------------------
Message: 3
Date: Fri, 30 May 2014 13:23:13 -0400
From: Lane Schwartz <dowobeha@gmail.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
To: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>, Miles Osborne
<miles@inf.ed.ac.uk>
Message-ID:
<CABv3vZnx5w66tv1ZRGkaciuK+xdTfQwisGOD_1udxfB1P2hcHw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
We also used charlint. It might do what you want.
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 91, Issue 54
*********************************************