Moses-support Digest, Vol 97, Issue 77

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. SMT resources for Indian languages (Anoop)
2. Re: (no subject) (Hieu Hoang)
3. CFP EAMT 2015: 18th Annual Conference of the European
Association for Machine Translation (Felipe Sánchez Martínez)
4. Re: Too large language models - how to handle that? (Hoang Cuong)


----------------------------------------------------------------------

Message: 1
Date: Tue, 25 Nov 2014 07:59:46 +0530
From: Anoop <anoop.kunchukuttan@gmail.com>
Subject: [Moses-support] SMT resources for Indian languages
To: moses-support@mit.edu
Message-ID:
<CADXxMYdi98xs8kz6w8c0oEVZyGb9_FaxVB02bL9+-Wto9zzDgA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Sharing a few SMT resources for Indian languages.

The Center for Indian Language Technology <http://www.cfilt.iitb.ac.in>, IIT
Bombay, hosts Shata-Anuvaadak (100 Translators), a Statistical Machine
Translation system for Indian languages. It currently supports translation
between 11 languages (10 Indian languages and English):


- Indo-Aryan languages: Hindi, Urdu, Bengali, Gujarati, Punjabi,
Marathi, Konkani
- Dravidian languages: Tamil, Telugu, Malayalam
- English


It is a Phrase-Based MT system with pre-processing and post-processing
extensions. The pre-processing includes source-side reordering for English
to Indian language translation. The post-processing includes
transliteration between Indian languages for OOV words. The system can be
accessed at:

http://www.cfilt.iitb.ac.in/indic-translator

For more details, see the following publication:

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak
Bhattacharyya. 2014. *Shata-Anuvadak: Tackling Multiway Translation of
Indian Languages*. Language Resources and Evaluation Conference (LREC
2014).

We are also making available software and resources developed in the Center
for the system and for ongoing research. These are available under an open
source license for research use. These include:

*Software*

- Indian language NLP tools: common NLP tools for Indian languages that
are useful for machine translation, including Unicode normalizers,
tokenizers, morphological analysers and transliteration systems.
- Source-side reordering system for SMT
- A simple experiment management system for Moses

*Resources*

- Translation models for phrase-based SMT systems for all language pairs in
Shata-Anuvaadak
- Language models for all languages in Shata-Anuvaadak
- Transliteration models for some language pairs (Moses-based)

You can access these resources at:

http://www.cfilt.iitb.ac.in/static/download.html

Regards,
Anoop.

http://www.cse.iitb.ac.in/~anoopk

------------------------------

Message: 2
Date: Tue, 25 Nov 2014 09:10:06 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] (no subject)
To: Daramola Olaife <d3ripleo@gmail.com>, moses-support@mit.edu,
user-irstlm@list.fbk.eu
Message-ID: <5474476E.5090702@gmail.com>
Content-Type: text/plain; charset="windows-1252"

I'm getting a different error when compiling irstlm5.80.06 with the
latest moses from github.
moses/LM/IRST.cpp:60:21: error: invalid use of incomplete type
'class lmContainer'
if (m_lmtb) m_lmtb->reset_mmap();

Using irstlm5.80.03 works fine
http://sourceforge.net/projects/irstlm/files/irstlm/irstlm-5.80/
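
A minimal sketch of the workaround described above, assuming IRSTLM 5.80.03 has already been built and installed in a directory of your choosing. The IRSTLM_DIR default and the bjam_command helper are illustrative names, not part of Moses; only the --with-irstlm and -j8 flags come from the original report. The script prints the command instead of running it, so it can be checked before use.

```shell
#!/bin/sh
# Sketch: rebuild Moses against IRSTLM 5.80.03 instead of 5.80.06.
# IRSTLM_DIR is a hypothetical install location; adjust it to your system.
IRSTLM_DIR="${IRSTLM_DIR:-$HOME/irstlm-5.80.03}"

# Compose the bjam invocation; the flags are the ones quoted in the report.
bjam_command() {
    printf './bjam --with-irstlm=%s -j8\n' "$1"
}

# Print the command for review; run it from the Moses source root afterwards.
bjam_command "$IRSTLM_DIR"
```

Whether later IRSTLM releases need additional changes on the Moses side is not settled in this thread, so treat the version pin as a stopgap.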


On 24/11/14 12:50, Daramola Olaife wrote:
> After installing irstlm, I tried linking it to moses with
> ./bjam --with-irstlm=/home/olaife/irstlm-5.80.06 -j8
> but it gave me an error.
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


------------------------------

Message: 3
Date: Tue, 25 Nov 2014 10:12:27 +0100
From: Felipe Sánchez Martínez <fsanchez@dlsi.ua.es>
Subject: [Moses-support] CFP EAMT 2015: 18th Annual Conference of the
European Association for Machine Translation
To: mt-list@eamt.org, moses-support <moses-support@mit.edu>,
corpora@uib.no, elsnet-list@elsnet.org
Cc: "awa >> Andy Way" <away@computing.dcu.ie>, "Mikel L. Forcada"
<mlf@dlsi.ua.es>
Message-ID: <547447FB.5060209@dlsi.ua.es>
Content-Type: text/plain; charset=utf-8; format=flowed


Apologies for cross-posting.
-----------------------------------------------------------

*18th Annual Conference of the European Association for Machine
Translation (EAMT 2015; Antalya, Turkey)*

The European Association for Machine Translation
(EAMT, http://www.eamt.org) invites everyone interested in machine
translation, translation-related tools and resources to participate in
this conference - developers, researchers, users, professional
translators and translation/localisation managers: anyone who has a
stake in the vision of an information world in which language barriers
and issues become less visible to the information consumer. We
especially invite researchers to describe the state of the art and
demonstrate their cutting-edge results, and professional MT users to
share their experiences.

EAMT 2015, the 18th Annual Conference of the European Association for
Machine Translation, will be held in Antalya, Turkey from 11 to 13 May
2015.

We expect to receive manuscripts in these three categories:

------------------------------------
Research papers
------------------------------------
Long-paper submissions (8 pages) are invited for reports of significant
research results in any aspect of machine translation and related areas.
Such reports should include a substantial evaluation component, or have
a strong theoretical and/or methodological contribution where results
and in-depth evaluations may not be appropriate. Papers are welcome on
all topics in the area of Machine Translation or translation-related
technologies, including:

* Speech translation: speech to text, speech to speech
* Translation aids (translation memory, terminology databases, etc.)
* Translation environments (workflow, support tools, conversion tools
for lexica, etc.)
* Practical MT systems (MT for professionals, MT for multilingual
eCommerce, MT for localization, etc.)
* MT in multilingual public service (eGovernment etc.)
* MT for the web
* MT embedded in other services
* MT evaluation techniques and evaluation results
* Dictionaries and lexica for MT
* Text and speech corpora for MT
* Standards in text and lexicon encoding for MT
* Human factors in MT and user interfaces
* Related multilingual technologies (natural language generation,
information retrieval, text categorization, text summarization,
information extraction, etc.)

Papers should describe original work. They should emphasize completed
work rather than intended work, and should indicate clearly the state of
completion of the reported results. Where appropriate, concrete
evaluation results should be included.

------------------------------------
User studies
------------------------------------
Short-paper submissions (2-4 pages) are invited for reports on users'
experiences with MT, whether in a small or medium-sized business (SMB),
enterprise, government, or NGO. Contributions are welcome on:

* Integrating MT and computer-assisted translation into a translation
production workflow (e.g. transforming terminology glossaries into MT
resources, optimizing TM/MT thresholds, mixing online and offline tools,
using interactive MT, dealing with MT confidence scores);
* Use of MT to improve translation or localization workflows (e.g.
reducing turnaround times, improving translation consistency, increasing
the scope of globalization projects);
* Managing change when implementing and using MT (e.g. switching between
multiple MT systems, limiting degradations when updating or upgrading an
MT system);
* Implementing open-source MT in the SMB or enterprise (e.g. strategies
to get support, reports on taking pilot results into full deployment,
examples of advanced customisation sought and obtained thanks to the
open-source paradigm, collaboration within open-source MT projects);
* Evaluation of MT in a real-world setting (e.g. error detection
strategies employed, metrics used, productivity or translation quality
gains achieved);
* Post-editing strategies and tools (e.g. limitations of traditional
translation quality assurance tools, challenges associated with
post-editing guidelines);
* Legal issues associated with MT, especially MT in the cloud (e.g.
copyright, privacy);
* Use of MT in social networking or real-time communication (e.g.
enterprise support chat, multilingual content for social media);
* Use of MT to process multilingual content for assimilation purposes
(e.g. cross-lingual information retrieval, MT for e-discovery or spam
detection, MT for highly dynamic content);
* Use of standards for MT.

Papers should highlight problems and solutions and not merely describe
the MT integration process or project settings. Where solutions do not seem
to exist, suggestions for MT researchers and developers should be
clearly emphasized. For user papers produced by academics, we require
co-authorship with the actual users.

------------------------------------
Project/Product description
------------------------------------
Abstract submissions (1 page) are invited to report new, interesting:

* Tools for machine translation, computer aided translation, and the
like (including commercial products and open-source software). The
authors should be ready to present the tools in the form of demos or
posters during the conference.
* Research projects related to machine translation. The authors should
be ready to present the projects in the form of posters during the
conference. This follows on from the successful "project villages" held
at the last two EAMT conferences.

------------------------------------
Programme
------------------------------------
The programme will include oral presentations and poster sessions.
Accepted papers may be assigned to an oral or poster session, but no
differentiation will be made in the conference proceedings.

------------------------------------
Important Dates
------------------------------------
* Paper submission: February 5, 2015
* Notification to authors: March 12, 2015
* Camera-ready deadline: April 2, 2015
* Conference: May 11-13, 2015

------------------------------------
Conference website
------------------------------------
http://www.eamt2015.org/

For further information about this call for papers please contact the
track chairs at eamt2015@dlsi.ua.es and put in the title "[user]" or
"[research]" depending on which track your question is related to. For
questions about the organisation (venue, registration, accommodation,
etc.) please contact the local organisers at secretariat@eamt2015.org.

Kind regards
--
Gema Ramírez-Sánchez, Fred Hollowood and Felipe Sánchez-Martínez
on behalf of the EAMT 2015 Organising Committee


------------------------------

Message: 4
Date: Tue, 25 Nov 2014 12:02:32 +0100
From: Hoang Cuong <hoangcuong2011@gmail.com>
Subject: Re: [Moses-support] Too large language models - how to handle
that?
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: moses-support@mit.edu
Message-ID:
<CAG1fz7d=J22g1SG1iemAtN9-MvptaXir7eoehZzzSC1oVigFFw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Raj, Tom and Marcin,
I binarized the ARPA file last night, following your suggestion. In the
end, it resulted in a binarized LM file of roughly *100GB* (@Marcin - it is
not the 20-30GB you suggested; is this size okay?)
Fortunately, the infrastructure at my university allows me to run
experiments with that.
Thanks a lot for your help.
It is so great to play with such huge LMs :))
Best,


On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
wrote:

> The command
>
> moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm
>
> will build a compressed binarized model with quantization. You can run
>
> moses/bin/build_binary lm.arpa
>
> without any parameters to get size estimates for different parameter
> settings. I would guess you will get a binarized LM of roughly 20 to 30 GB,
> which is manageable (provided the size you gave us is that of an
> uncompressed text file). You can also use lmplz to build pruned models in
> the first place; these will be much smaller.
>
> On 2014-11-24 15:11, Tom Hoar wrote:
>
> After binarizing such a large ARPA file with KenLM, you'll need to
> configure your moses.ini file to "lazily load the model using mmap." This
> involves using lmodel-file code "9" vs code "8." More details here:
> https://kheafield.com/code/kenlm/moses/
>
> Performance improves significantly if you store the binarized file on an
> SSD.
>
>
>
>
> On 11/24/2014 07:00 PM, Raj Dabre wrote:
>
> Hey Hoang,
> You should binarize the arpa file.
> The readme of the LM tool (KenLM or IRSTLM or SRILM) will tell you how.
> Regards.
>
> On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <hoangcuong2011@gmail.com>
> wrote:
>
>> Hi all,
>> I have trained an (unpruned) 5-gram language model on a large corpus of
>> 5 billion words, resulting in an ARPA-format file of roughly 300GB (is this
>> a normal LM size for that much monolingual data?). This is obviously too
>> big for running an SMT system.
>> I have read several papers whose systems use language models trained on
>> similar monolingual corpora. Could you give me some advice on how to handle
>> this, to make it feasible to run SMT systems?
>> I appreciate your help a lot,
>> Best,
>> --
>> Best Regards,
>> Hoang Cuong
>> SMTNerd
>
>
> --
> Raj Dabre.
> Research Student,
> Graduate School of Informatics,
> Kyoto University.
> CSE MTech, IITB., 2011-2014


--

*Best Regards,
Hoang Cuong
SMTNerd*
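
The advice in this thread (binarize with build_binary, then load the model lazily via mmap) can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested recipe: the file names are placeholders, the run, binarize_lm and lmodel_entry helpers are made-up names, and the four-field [lmodel-file] layout (type, factor, order, path) is assumed from the legacy moses.ini format and should be checked against the KenLM page linked in the thread. The build_binary parameters and the type code 9 are the ones quoted in the messages. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch of the binarize-then-lazy-load workflow discussed in this thread.
# All paths are placeholders; with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"
BUILD_BINARY="${BUILD_BINARY:-moses/bin/build_binary}"

# Echo a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build a quantized, compressed trie model with the parameters quoted above.
# $1 = input ARPA file, $2 = output binary file.
binarize_lm() {
    run "$BUILD_BINARY" trie -a 22 -b 8 -q 8 "$1" "$2"
}

# Emit a legacy [lmodel-file] entry for moses.ini.
# Type 9 = lazily mmap'ed KenLM (per the thread); factor 0 is assumed.
# $1 = n-gram order, $2 = path to the binarized model.
lmodel_entry() {
    printf '[lmodel-file]\n9 0 %s %s\n' "$1" "$2"
}

binarize_lm lm.arpa lm.kenlm
lmodel_entry 5 "$PWD/lm.kenlm"
```

Running build_binary on the ARPA file with no other arguments, as mentioned in the thread, prints size estimates and is a cheap way to decide on parameters before committing to a long build.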

------------------------------



End of Moses-support Digest, Vol 97, Issue 77
*********************************************
