Moses-support Digest, Vol 86, Issue 77

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Warning during tokenizing Urdu Corpus (Hieu Hoang)
2. Re: Warning during tokenizing Urdu Corpus (John D. Burger)
3. Re: Moses-support Digest, Vol 86, Issue 76
(Arththika Paramanathan)

----------------------------------------------------------------------

Message: 1
Date: Fri, 27 Dec 2013 16:07:40 +0000
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Warning during tokenizing Urdu Corpus
To: "Asad A.Malik" <asad_12204@yahoo.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAEKMkbiAUSoqrj-qeLKTs-b7gwTGkqnGHwVwacfZ70ECod=F1w@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

The output will be tokenized, but probably very badly. If you know Urdu and
can create a better tokenizer, please add it to Moses.

You can start by looking at the configuration file for the English
tokenizer:
scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
You can copy that file and adapt it for Urdu.
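
As a rough illustration of that copy step, the Python sketch below duplicates the English prefix list under an Urdu name so that it can then be edited by hand. It is not part of Moses: the function name `bootstrap_prefix_file` is made up for this example, and the target file name `nonbreaking_prefix.ur` is an assumption based on how the English file is named.

```python
import shutil
from pathlib import Path

# Relative path taken from the message above.
PREFIX_DIR = Path("scripts/share/nonbreaking_prefixes")


def bootstrap_prefix_file(moses_root, src_lang="en", dst_lang="ur"):
    """Copy the prefix list of src_lang to a new file for dst_lang.

    Assumption: prefix files are named 'nonbreaking_prefix.<lang>'.
    Refuses to overwrite an existing target file.
    """
    prefix_dir = Path(moses_root) / PREFIX_DIR
    src = prefix_dir / f"nonbreaking_prefix.{src_lang}"
    dst = prefix_dir / f"nonbreaking_prefix.{dst_lang}"
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    shutil.copyfile(src, dst)
    return dst
```

With the directory layout from the original question, this might be called as `bootstrap_prefix_file(Path.home() / "SMT" / "mosesdecoder")`. The copied file still lists English abbreviations until it is edited for Urdu.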

On 26 December 2013 16:35, Asad A.Malik <asad_12204@yahoo.com> wrote:

> Hi All,
>
> I am trying to develop an Urdu SMT system using Moses. I have an Urdu
> parallel corpus, and the first step in the manual is to tokenize the
> corpus, but when I enter the following command:
>
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur <
> ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur
>
>
> it gives me this warning:
>
> WARNING: No known abbreviations for language 'ur', attempting fall-back to
> English version...
>
> It also generates the output file, but I don't know whether this output is
> tokenized or not.
>
>
> Regards
>
> Asad A.Malik
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------

Message: 2
Date: Fri, 27 Dec 2013 11:06:58 -0500
From: "John D. Burger" <john@mitre.org>
Subject: Re: [Moses-support] Warning during tokenizing Urdu Corpus
To: "Asad A.Malik" <asad_12204@yahoo.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <CA44EEB2-98F1-48AA-8917-3B2891C8BC7A@mitre.org>
Content-Type: text/plain; charset="iso-8859-1"

The default tokenizer script only has specific rules for a few languages. The fallback (English) rules may suffice for your purposes: they do the obvious thing with spaces and English punctuation, and also handle some special cases for abbreviations like "Mr." and "Mrs.".
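
To make that concrete, here is a toy Python sketch of this kind of rule: split on whitespace, detach punctuation, but leave the period on a known abbreviation. It is only an illustration and not the logic of tokenizer.perl; the function `toy_tokenize` and the two-entry abbreviation set are made up for the example.

```python
import re

# Hypothetical abbreviation list for illustration only.
NONBREAKING = {"Mr", "Mrs"}


def toy_tokenize(line, nonbreaking=NONBREAKING):
    """Split on whitespace and detach punctuation, except 'Mr.'-style periods."""
    tokens = []
    for word in line.split():
        # Keep the trailing period attached to a known abbreviation.
        if word.endswith(".") and word[:-1] in nonbreaking:
            tokens.append(word)
            continue
        # Otherwise split into runs of word characters and single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", word))
    return " ".join(tokens)
```

For example, `toy_tokenize("Mr. Smith arrived, finally.")` gives `Mr. Smith arrived , finally .`: the abbreviation keeps its period, while the comma and the sentence-final period become separate tokens.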

I'd suggest you eyeball the output and see if the result is OK for you. If not, you could try editing the tokenizer and adding any abbreviations you would like to tokenize differently, or finding and using an Urdu-specific tokenizer.
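
One possible way to do that eyeballing is sketched below in Python: it pairs each raw line with its tokenized counterpart and returns the first few pairs that differ. The helper name `preview_tokenization` is hypothetical, and the sketch assumes both files are UTF-8 and line-aligned.

```python
def preview_tokenization(raw_path, tok_path, limit=5):
    """Return up to `limit` (raw, tokenized) line pairs that differ.

    Assumes both files are UTF-8 encoded and have one sentence per line,
    with line N of the tokenized file corresponding to line N of the raw file.
    """
    pairs = []
    with open(raw_path, encoding="utf-8") as raw, \
            open(tok_path, encoding="utf-8") as tok:
        for raw_line, tok_line in zip(raw, tok):
            raw_line = raw_line.rstrip("\n")
            tok_line = tok_line.rstrip("\n")
            if raw_line != tok_line:
                pairs.append((raw_line, tok_line))
                if len(pairs) >= limit:
                    break
    return pairs
```

Printing the returned pairs for the raw and tokenized corpus files shows where the tokenizer changed anything. An empty result would suggest the fallback rules did not touch the text at all.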

As an aside for those who can edit the web site, this seems like a good candidate for the FAQ.

- John Burger
MITRE

On Dec 26, 2013, at 11:35 , Asad A.Malik wrote:

> Hi All,
>
> I am trying to develop an Urdu SMT system using Moses. I have an Urdu parallel corpus, and the first step in the manual is to tokenize the corpus, but when I enter the following command:
>
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur < ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur
>
> it gives me this warning:
>
> WARNING: No known abbreviations for language 'ur', attempting fall-back to English version...
>
> It also generates the output file, but I don't know whether this output is tokenized or not.
>
>
> Regards
>
> Asad A.Malik

------------------------------

Message: 3
Date: Fri, 27 Dec 2013 22:17:22 +0530
From: Arththika Paramanathan <arthiparamanathan@gmail.com>
Subject: Re: [Moses-support] Moses-support Digest, Vol 86, Issue 76
To: moses-support <moses-support@mit.edu>, asad_12204@yahoo.com
Message-ID:
<CAJSfqEw9uENaEXNSZq11A2W8OuUrx7XsQQEVYX0uabLFiCBrAA@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,
It gave you the warning because Moses does not have an Urdu tokenizer. I think
you will need to edit the tokenizer.perl file for your language.

On Fri, Dec 27, 2013 at 9:53 PM, <moses-support-request@mit.edu> wrote:

> Today's Topics:
>
> 1. Warning during tokenizing Urdu Corpus (Asad A.Malik)
> 2. Does Moses support C++11 compilation? (Li Xiang)
> 3. Call for Papers: 9th SaLTMiL workshop on “Free/open-source
> language resources for the machine translation of less-resourced
> languages” at LREC 2014. (Mikel Forcada)
> 4. 1st CfP: LREC 2014 Workshop on Free/Open-Source Arabic
> Corpora and Corpora Processing Tools (OSACT)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 26 Dec 2013 08:35:59 -0800 (PST)
> From: "Asad A.Malik" <asad_12204@yahoo.com>
> Subject: [Moses-support] Warning during tokenizing Urdu Corpus
> To: "moses-support@mit.edu" <moses-support@MIT.EDU>
> Message-ID:
> <1388075759.10596.YahooMailNeo@web122204.mail.ne1.yahoo.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi All,
>
> I am trying to develop an Urdu SMT system using Moses. I have an Urdu
> parallel corpus, and the first step in the manual is to tokenize the
> corpus, but when I enter the following command:
>
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur <
> ~/SMT/corpus/training/mycorpus.ur-en.ur >
> ~/SMT/corpus/mycorpus.ur-en.tok.ur
>
>
> it gives me this warning:
>
> WARNING: No known abbreviations for language 'ur', attempting fall-back to
> English version...
>
> It also generates the output file, but I don't know whether this output is
> tokenized or not.
>
>
> Regards
>
>
> Asad A.Malik
>
> ------------------------------
>
> Message: 2
> Date: Fri, 27 Dec 2013 17:46:44 +0800
> From: Li Xiang <lixiang.ict@gmail.com>
> Subject: [Moses-support] Does Moses support C++11 compilation?
> To: moses-support <moses-support@mit.edu>
> Message-ID:
> <CA+fVw+7L5oZfu1vT=
> 5uGqmcKX5XDYjJy8pDrHjXGnr_-0z5kgw@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
>
> Does Moses support C++11 compilation?
> I want to integrate my code, which is based on C++11, into Moses.
> How can I modify the bjam config file to compile Moses with C++11?
> Thanks.
>
> --
> Xiang Li
>
> ------------------------------
>
> Message: 3
> Date: Fri, 27 Dec 2013 13:39:01 +0100
> From: Mikel Forcada <mlf@dlsi.ua.es>
> Subject: [Moses-support] Call for Papers: 9th SaLTMiL workshop on
> “Free/open-source language resources for the machine translation of
> less-resourced languages” at LREC 2014.
> To: moses-support@mit.edu
> Message-ID: <52BD74E5.6030601@dlsi.ua.es>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> Call for Papers: 9th SaLTMiL workshop on “Free/open-source language
> resources for the machine translation of less-resourced languages” at
> LREC 2014.
>
> A full-day workshop at LREC 2014
> Tuesday, 27 May 2014.
> Reykjavik (Iceland)
>
> SALTMIL: http://ixa2.si.ehu.es/saltmil/
> LREC 2014: http://lrec2014.lrec-conf.org/en/
> Website: http://ixa2.si.ehu.es/saltmil/
> Paper submission: https://www.softconf.com/lrec2014/SaLTMiL/
>
> The 9th International Workshop of the Special Interest Group on Speech
> and Language Technology for Minority Languages (SaLTMiL) will be held in
> Reykjavík, Iceland, on May 27, 2014, as part of the 2014 International
> Language Resources and Evaluation Conference (LREC) (for SALTMIL, see
> http://ixa2.si.ehu.es/saltmil/); it is also framed as one of the
> activities of the European project Abu-Matran (http://www.abumatran.eu).
> Entitled "Free/open-source language resources for the machine
> translation of less-resourced languages", the workshop is intended to
> continue the series of SALTMIL/LREC workshops on computational language
> resources for minority languages, held in Granada (1998), Athens (2000),
> Las Palmas de Gran Canaria (2002), Lisbon (2004), Genoa (2006),
> Marrakech (2008), Valletta (2010) and Istanbul (2012), and is also
> expected to attract the audience of Free Rule-Based Machine Translation
> workshops (2009, 2011, 2012). The workshop aims to share information on
> language resources, tools and best practice, to save isolated
> researchers from starting from scratch when building machine translation
> for a less-resourced language. An important aspect will be the
> strengthening of the free/open-source language resources community,
> which can minimize duplication of effort and optimize development and
> adoption, in line with the LREC 2014 hot topic “LRs in the Collaborative
> Age” (http://is.gd/LREChot).
>
> The whole-day workshop will consist of short oral papers, a poster
> session preceded by a poster-boaster session (2 minutes, 2 slides per
> poster), and a round table.
>
> Papers are invited that describe research and development in the
> following areas:
>
> FOS LR for rule-based machine translation (dictionaries, rule sets)
> FOS LR for statistical machine translation (corpora)
> FOS tools to annotate, clean, preprocess, convert, etc. LRs for machine
> translation
> Machine translation as a tool for creating or enriching FOS LRs for
> less-resourced languages
>
> Position papers and (web based) demonstrations will also be considered
> for presentation.
>
> The best papers, as evaluated by the programme committee, will be
> presented orally and the remaining papers will be presented in poster
> format.
>
> We expect short papers of max 6,000 words (up to 6 pages) describing
> research addressing one of the above topics, to be submitted as PDF
> documents by using the LREC 2014 START conference management system
> (https://www.softconf.com/lrec2014/SaLTMiL/).
>
> Submissions should be anonymized. When submitting a paper through the
> START page, authors will be kindly asked to share the resources that
> have been used for the work described in their paper or that are the
> outcome of their research. For further information on this initiative,
> please refer to
>
> http://lrec2014.lrec-conf.org/en/calls-for-papers/lrec-2014-special-highlight/
> .
>
>
> Submissions of papers should follow the same style as the papers for the
> main LREC conference (an Author's Kit made of specific guidelines and
> downloadable templates will be published on the conference web site in
> due time). All contributions will be included in the workshop
> proceedings (CD). They will also be published on the SALTMIL website.
>
> The registration fees will be duly announced at the LREC 2014 site.
> Registration for the workshop will include a coffee break and the
> Proceedings of the Workshop. Registration will be handled by the LREC
> 2014 Secretariat.
>
>
> Important dates
>
> Deadline for paper submission: February 10, 2014
> Notification of acceptance sent: March 3, 2014
> Camera-ready paper due: March 21, 2014
>
>
> Organizing committee
>
> Joint e-mail address: saltmil2014@dlsi.ua.es
>
> (1) Dr Francis M Tyers
> Institutt for språkvitskap
> Det humanistiske fakultet,
> N-9037 Universitetet i Tromsø
> ftyers@prompsit.com
>
> (2) Dr Kepa Sarasola
> Computer Science Faculty
> Dept. of Computer Languages
> The University of the Basque Country
> P.K. 649 20080 DONOSTIA
> Basque Country, Spain
> Tel: +34 943 01 81 54
> Fax: +34 943 21 93 06
> ksarasola@ehu.es
> http://ixa.si.ehu.es
>
> (3) Prof Mikel L. Forcada
> Dept. Llenguatges i Sistemes Informàtics
> Universitat d'Alacant
> E-03071 Alacant (Spain)
> Tel: +34 96 590 9776
> Fax: +34 96 590 9326
> mlf@ua.es
> http://www.dlsi.ua.es/~mlf
>
>
> Programme Committee
>
> Iñaki Alegria, Euskal Herriko Unibertsitatea, Spain
> Lars Borin, Göteborgs Universitet, Sweden
> Elaine Uí Dhonnchadha, Trinity College Dublin, Ireland
> Mikel L. Forcada, Universitat d'Alacant, Spain
> Michael Gasser, Indiana University, USA
> Måns Huldén, Helsingin Yliopisto, Finland
> Krister Lindén, Helsingin Yliopisto, Finland
> Nikola Ljubešić, Sveučilište u Zagrebu, Croatia
> Lluís Padró, Universitat Politècnica de Catalunya, Spain
> Juan Antonio Pérez-Ortiz, Universitat d'Alacant, Spain
> Felipe Sánchez-Martínez, Universitat d'Alacant
> Kepa Sarasola, Euskal Herriko Unibertsitatea, Spain
> Kevin P. Scannell, Saint Louis University, USA
> Antonio Toral, Dublin City University, Ireland
> Trond Trosterud, Universitet i Tromsø, Norway
> Francis M. Tyers, Universitet i Tromsø, Norway
>
> --
> Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
> Departament de Llenguatges i Sistemes Informàtics
> Universitat d'Alacant
> E-03071 Alacant, Spain
> Phone: +34 96 590 9776
> Fax: +34 96 590 9326
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 26 Dec 2013 20:05:38 +0000
> From: OSACT <OSACT@kacst.edu.sa>
> Subject: [Moses-support] 1st CfP: LREC 2014 Workshop on
> Free/Open-Source Arabic Corpora and Corpora Processing Tools
> To: "IRList@lists.shef.ac.uk" <IRList@lists.shef.ac.uk>,
> "listmaster@loria.fr" <listmaster@loria.fr>, "ln@cines.fr"
> <ln@cines.fr>, "lr_egroup@mail.iiit.ac.in"
> <lr_egroup@mail.iiit.ac.in>, "moses-support@mit.edu"
> <moses-support@mit.edu>, "news@multilingual.com"
> <news@multilingual.com>
> Message-ID: <76B661A7930AFC49A0DC6418253B1FA2C6AC8E@ex-itu01-002>
> Content-Type: text/plain; charset="utf-8"
>
> We apologize for multiple postings. Please distribute to interested
> colleagues.
>
> ----------------------------------------------
>
> 1st Call for Papers
>
> WORKSHOP ON Free/Open-Source Arabic Corpora and Corpora Processing Tools
>
> http://www.kacstac.org.sa/osact/index.html
>
> May 27, 2014
> Co-located with LREC 2014
> Harpa Conference Centre, Reykjavik (Iceland)
>
> DEADLINE FOR PAPERS: February 10, 2014
> https://www.softconf.com/lrec2014/OSACT/
>
> ============================================================
> Workshop description
>
> For the Natural Language Processing (NLP) and Computational Linguistics (CL)
> communities, Arabic has long been known as a resource-poor language. This
> situation was thought to be the reason for the lack of corpus-based
> studies in Arabic. However, recent years have witnessed the emergence of a
> considerable number of new free Arabic corpora and, to a lesser extent,
> Arabic corpora processing tools.
>
> Freely available Arabic corpora can be divided into two groups. The first
> group contains large Arabic corpora, which are designed and constructed
> primarily for Arabic linguistics research and activities, and possibly for
> Arabic NLP. These corpora are diverse in the genres they cover, and their
> sizes range from one million words to 700 million words. The second group
> contains corpora that were designed primarily for Arabic text
> classification and clustering; they mainly contain newspaper articles.
> They range from less than 1 million words to 11 million words.
>
> Some Arabic corpora, mainly the large ones, can be explored on the web
> using different tools, while other corpora are only available for
> download. For the corpora that are available for download, the user may
> need to use standalone corpus processing tools. These tools provide
> functions such as word frequency, concordance, collocation, etc.
> Yet even with the availability of large and diverse Arabic corpora, the
> situation has not changed: there is still a lack of Arabic corpus-based
> studies. Is this because of the representativeness of these corpora? The
> available functions and tools associated with these corpora? Or is it
> because they are not well enough known in the Arabic linguistics community?
>
>
> Motivation and topics of interest
>
> This half-day workshop aims to encourage researchers and developers to
> foster the use of freely available Arabic corpora and open-source
> Arabic corpora processing tools, to help highlight the drawbacks of
> these resources, and to discuss techniques and approaches for improving
> them. The workshop topics include, but are not limited to:
>
>
> * Surveying and criticizing the design of freely available Arabic
> corpora, their associated tools and stand alone Arabic corpora processing
> tools.
> * The applications and uses of freely available Arabic language
> resources in fields such as Arabic language education e.g. L1 and L2.
> * Arabic language modeling.
> * Corpus-based Arabic lexicography.
> * Lexical semantics and word sense.
> * Corpus-based Arabic syntax.
> * Corpus-based Arabic morphology.
> * Development of Arabic mobile applications based on the available
> Arabic language resources.
> * Evaluation and assessment of Arabic Corpora and Corpora
> Processing Tools.
> * Future directions of Free/Open Arabic Corpora and Corpora
> Processing Tools.
>
>
> Organising Committee
>
>
> * Hend Al-Khalifa, King Saud University, KSA
> * Abdulmohsen Al-Thubaity, King Abdul Aziz City for Science and
> Technology, KSA
>
> Program Committee
>
>
> * Eric Atwell, University of Leeds, UK
> * Khaled Shaalan, The British University in Dubai (BUiD), UAE
> * Dilworth Parkinson, Brigham Young University, USA
> * Nizar Habash, Columbia University, USA
> * Khurshid Ahmad, Trinity College Dublin, Ireland
> * Abdulmalik AlSalman, King Saud University, KSA
> * Maha Alrabiah, King Saud University, KSA
> * Saleh Alosaimi, Imam University, KSA
> * Sultan almujaiwel, King Saud University, KSA
> * Adam Kilgarriff, Lexical Computing Ltd, UK
> * Amal AlSaif, Imam University, KSA
> * Maha AlYahya, King Saud University, KSA
> * Auhood AlFaries, King Saud University, KSA
> * Salwa Hamada, Taibah University, KSA
> * Mansour Algamdi, King Abdul Aziz City for Science and
> Technology, KSA
> * Abdullah Alfaifi, University of Leeds, UK
>
>
> Important Dates
>
>
> * Submission deadline: 10 February 2014
> * Notification of acceptance: 10 March 2014
> * Final submission of manuscripts: 21 March 2014
> * Workshop date: 27 May 2014 (morning session)
>
> Submissions
>
> The language of the workshop is English, and submissions should follow
> the LREC 2014 paper submission instructions. All papers will be peer
> reviewed, possibly by three independent referees. Papers must be submitted
> electronically in PDF format to the START system
> <https://www.softconf.com/lrec2014/OSACT/>. When submitting a paper from
> the START page, authors will be asked to provide essential information
> about resources (in a broad sense, i.e. also technologies, standards,
> evaluation kits, etc.) that have been used for the work described in the
> paper or are a new result of their research. Moreover, ELRA encourages all
> LREC authors to share the described LRs (data, tools, services, etc.), to
> enable their reuse, replicability of experiments, including evaluation
> ones, etc.
> Warning: This message and its attachment, if any, are confidential and may
> contain information protected by law. If you are not the intended
> recipient, please contact the sender immediately and delete the message and
> its attachment, if any. You should not copy the message and its attachment,
> if any, or disclose its contents to any other person or use it for any
> purpose. Statements and opinions expressed in this e-mail and its
> attachment, if any, are those of the sender, and do not necessarily reflect
> those of King Abdulaziz city for Science and Technology (KACST) in the
> Kingdom of Saudi Arabia. KACST accepts no liability for any damage caused
> by this email.
>
>
> ------------------------------
>
>
> End of Moses-support Digest, Vol 86, Issue 76
> *********************************************
>

--
regards,
P.Arththika

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 86, Issue 77
*********************************************
