Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Convert parallel text files to sgm format (Roee Aharoni)
2. Re: Computing Perplexity with KenLM (Python API)
(Kenneth Heafield)
3. EUROPHRAS 2017 (THIRD CALL FOR PAPERS) (Evans, Richard J.)
----------------------------------------------------------------------
Message: 1
Date: Mon, 08 May 2017 06:26:37 +0000
From: Roee Aharoni <roee.aharoni@gmail.com>
Subject: [Moses-support] Convert parallel text files to sgm format
To: moses-support@mit.edu
Message-ID:
<CAAFz8fXuMnjGf+scQAjQQHwFpPP-Sm6caPcZ_agLzznrCMrkXw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
Is there a script in moses that converts simple parallel text files into
the .sgm format required by the mteval-v13a.pl?
I only found: /scripts/ems/support/wrap-xml.perl which wraps the target
file, but requires the source file to be in sgm format.
Thanks,
Roee
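The wrapping asked about above can be sketched in a few lines of Python. This is a minimal sketch, not a script shipped with Moses: the function name `wrap_sgm`, its parameters, and the default `sysid`/`docid` values are hypothetical, and the attribute set (`setid`, `srclang`, `trglang`, `sysid`, `docid`, `seg id`) is an assumption based on the usual NIST srcset/refset/tstset layout, so it may need adjusting for a particular mteval version.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_sgm(lines, set_type, setid, srclang, trglang, sysid="ref", docid="doc1"):
    """Wrap one-sentence-per-line text in a NIST-style SGML set (assumed layout)."""
    if set_type not in ("srcset", "refset", "tstset"):
        raise ValueError("set_type must be srcset, refset or tstset")
    set_attrs = f"setid={quoteattr(setid)} srclang={quoteattr(srclang)}"
    doc_attrs = f"docid={quoteattr(docid)}"
    if set_type != "srcset":
        # Reference and system sets also carry the target language and a system id.
        set_attrs += f" trglang={quoteattr(trglang)}"
        doc_attrs = f"sysid={quoteattr(sysid)} " + doc_attrs
    out = [f"<{set_type} {set_attrs}>", f"<doc {doc_attrs}>"]
    for seg_id, line in enumerate(lines, start=1):
        # Escape &, < and > so the segment text stays valid markup.
        out.append(f'<seg id="{seg_id}">{escape(line.strip())}</seg>')
    out += ["</doc>", f"</{set_type}>"]
    return "\n".join(out) + "\n"


# Hypothetical two-sentence source file, already read into a list.
source = ["Das ist ein Test .", "Katzen & Hunde"]
print(wrap_sgm(source, "srcset", "demo", "de", "en"))
```

The same function would be called once per file (source, reference, system output), keeping `setid` and `docid` identical across the three so the segments line up.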
------------------------------
Message: 2
Date: Mon, 8 May 2017 10:15:52 +0100
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Computing Perplexity with KenLM (Python
API)
To: moses-support@mit.edu
Message-ID: <56f2dafa-e297-ba36-64e5-ee47b5f3462d@kheafield.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Hi Liling,
You can test your program matches bin/query.
None of these is correct.
You want math.pow(10.0, sum_inv_logs / n)
Kenneth
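The formula above can be checked without loading a model. The sketch below is an illustration only: the helper name `perplexity_from_log10` is hypothetical, the scores are the log10 values printed by `full_scores` for the first sentence in the quoted message, and taking n as the number of scored tokens (the ten words plus the end-of-sentence token) is an assumption, since the reply does not say which n to use.

```python
import math


def perplexity_from_log10(log10_scores):
    """Perplexity from per-token log10 probabilities: 10 ** (-sum / n)."""
    n = len(log10_scores)
    if n == 0:
        raise ValueError("need at least one score")
    sum_inv_logs = -sum(log10_scores)
    return math.pow(10.0, sum_inv_logs / n)


# log10 scores copied from the full_scores() output in the quoted message.
scores = [
    -0.8502398729324341, -3.0185394287109375, -0.3004383146762848,
    -1.0249041318893433, -0.6545327305793762, -0.29304179549217224,
    -0.4497605562210083, -0.49850910902023315, -0.3856896460056305,
    -0.3572353720664978, -1.7523181438446045,
]
print(perplexity_from_log10(scores))
```

With a loaded model, the scores would come from `[score for score, _, _ in m.full_scores(s)]`; comparing the result against bin/query, as suggested, is the way to confirm the choice of n.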
On 05/08/2017 07:37 AM, liling tan wrote:
> Dear Moses Community,
>
> Does anyone know how to compute sentence perplexity with a KenLM model?
>
> Let's say we build a model on this:
>
> $ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
> $ lmplz -o 5 < something.txt > something.arpa
>
>
> From the perplexity formula
> (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf)
>
> Applying the sum of inverse log formula to get the inner variable and
> then taking the nth root, the perplexity number is unusually small:
>
> >>> import math
> >>> import kenlm
> >>> m = kenlm.Model('something.arpa')
>
> # Sentence seen in data.
> >>> s = 'The development of a forward-looking and comprehensive European migration policy,'
> >>> list(m.full_scores(s))
> [(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False), (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False), (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False), (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False), (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False), (-1.7523181438446045, 1, False)]
> >>> n = len(s.split())
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.2536033936438895
>
>
> Trying again with a sentence not found in the data:
>
> # Sentence not seen in data.
> >>> s = 'The European developement of a forward-looking and comphrensive society is doh.'
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> sum_inv_logs
> 35.59524390101433
> >>> n = len(s.split())
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.383679905428275
>
>
> And trying again with totally out of domain data:
>
> >>> s = """On the evening of 5 May 2017, just before the French Presidential
> Election on 7 May, it was reported that nine gigabytes of Macron's
> campaign emails had been anonymously posted to Pastebin, a
> document-sharing site. In a statement on the same evening, Macron's
> political movement, En Marche!, said: "The En Marche! Movement has been
> the victim of a massive and co-ordinated hack this evening which has
> given rise to the diffusion on social media of various internal
> information"""
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> sum_inv_logs
> 282.61719834804535
> >>> n = len(list(m.full_scores(s)))
> >>> n
> 79
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.0740582373271952
>
>
>
> Although the longer sentence is expected to have lower perplexity,
> it's strange that the values differ by less than 1.0 and only in the
> decimal places.
>
> Is the above the right way to compute perplexity with KenLM? If not,
> does anyone know how to compute perplexity with KenLM through the
> Python API?
>
> Thanks in advance for the help!
>
> Regards,
> Liling
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
------------------------------
Message: 3
Date: Mon, 8 May 2017 13:30:56 +0000
From: "Evans, Richard J." <R.J.Evans@wlv.ac.uk>
Subject: [Moses-support] EUROPHRAS 2017 (THIRD CALL FOR PAPERS)
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<B8C5913BF277A34B9C638833F0B065EC011F6449AB@Exchmbx10I02.unv.wlv.ac.uk>
Content-Type: text/plain; charset="utf-8"
[Apologies for cross-posting]
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
International Conference "Computational and Corpus-based Phraseology"
Recent advances and interdisciplinary approaches
London, 13-14 November 2017
THIRD CALL FOR PAPERS
The forthcoming international conference "Computational and Corpus-based Phraseology - recent advances and interdisciplinary approaches" will take place in London on 13 and 14 November 2017 (http://rgcl.wlv.ac.uk/europhras2017/).
Conference topics
The conference will focus on interdisciplinary approaches to phraseology and invites submissions on a wide range of topics, including, but not limited to: computational, corpus-based, psycholinguistic and cognitive approaches to the study of phraseology, and practical applications in computational linguistics, translation, lexicography and language learning, teaching and assessment.
These topics cover but are not limited to the following:
Computational approaches to the study of multiword expressions, e.g. automatic detection, classification and extraction of multiword expressions; automatic translation of multiword expressions; computational treatment of proper names; multiword expressions in NLP tasks and applications such as parsing, machine translation, text summarisation, term extraction, web search
Corpus-based approaches to phraseology, e.g. corpus-based empirical studies of phraseology; task-orientated typologies of phraseological units (e.g. for annotation, lexicographic representation, etc.); annotation schemes; applications in applied linguistics and more specifically translation, interpreting, lexicography, terminology, language learning, teaching and assessment (see also below)
Phraseology in mono- and bilingual lexicography and terminography, e.g. new forms of presenting phraseological units in dictionaries and other lexical resources based on corpus-based and corpus-driven approaches; domain-specific terminology
Phraseology in translation and cross-linguistic studies, e.g. use of parallel and comparable corpora for translating phraseological units; phraseological units in computer-aided translation; study of phraseology across languages
Phraseology in specialised languages and language dialects, e.g. phraseology of specialised languages; study of phraseological use in different dialects or varieties of a specific language
Phraseology in language learning, teaching and assessment, e.g. second language/bilingual processing of phraseological units and formulaic language; phraseological units in learner language
Theoretical and descriptive approaches to phraseology, e.g. phraseological units and the lexis-grammar interface; the relevance of phraseology for theoretical models of grammar; the representation of phraseological units in constituency and dependency theories; phraseology and its interaction with semantics
Cognitive and psycholinguistic approaches, e.g. cognitive models of phraseological unit comprehension and production; on-line measures of phraseological unit processing (e.g. eye tracking, event-related potentials, self-paced reading); phraseology and language disorders; phraseology and text readability
As mentioned earlier, the above list is indicative and not exhaustive. Any submission presenting a study related to the alternative terms of phraseological units, multiword expressions, multiword units, formulaic language or polylexical expressions, will be considered.
The official language of the conference is English.
Keynote Speakers
- Ken Church
- Dmitrij Dobrovol'skij
- Patrick Hanks
- Gloria Corpas Pastor
- Miloš Jakubíček
Submissions and Publication
The conference invites submissions reporting original unpublished work.
EUROPHRAS 2017 invites three types of submissions:
Regular papers: these papers must be between 12 and 15 pages long, including references. The accepted regular papers will be published in a Springer LNAI volume which will be available at the time of the conference
Short papers: these papers will not exceed 7 pages (excluding references) and will be published by Tradulex as conference e-proceedings with ISBN. The proceedings will be available at the time of the conference
Poster presentations: these papers will not exceed 4 pages (excluding references) and will be included in the conference e-proceedings along with the short papers
Submission is electronic, using the Softconf START conference management system. For further instructions please follow the submission guidelines (http://rgcl.wlv.ac.uk/europhras2017/submission/) at the conference website.
Authors of accepted papers will receive guidelines regarding how to produce camera-ready versions of their papers for inclusion in the proceedings.
The conference will not consider the submission and evaluation of abstracts only.
All published papers will have a DOI assigned.
Each submission will be reviewed by at least 3 reviewers who will be either members of the Programme Committee or reviewers proposed by Programme Committee members.
Schedule
29 May 2017 - deadline for submitting papers
17 July 2017 - all authors notified of decisions
5 September 2017 - deadline for final version of all types of papers
13-14 November 2017 - conference takes place in London
Programme Committee
The Programme Committee features experts in different aspects of corpus-based and computational phraseology and includes:
Douglas Biber, Northern Arizona University
Nicoletta Calzolari, Institute for Computational Linguistics
María Luisa Carrió-Pastor, Universitat Politècnica de València
Sheila Castilho, Dublin City University
Ken Church, IBM Research
Jean-Pierre Colson, Université Catholique de Louvain
Gloria Corpas, University of Malaga
František Čermák, Charles University in Prague
Anna Čermáková, Charles University
Dimitrij Dobrovolskij, Russian Academy of Sciences, Russian Language Institute
Jesse Egbert, Northern Arizona University
Thierry Fontenelle, Translation Centre for the Bodies of the European Union
Kleanthes K. Grohmann, University of Cyprus
Patrick Hanks, University of Wolverhampton
Ulrich Heid, University of Hildesheim
Miloš Jakubíček, Lexical Computing and Masaryk University
Kyo Kageura, University of Tokyo
Valia Kordoni, Humboldt University of Berlin
Simon Krek, University of Ljubljana
Pedro Mogorrón Huerta, University of Alicante
Johanna Monti, Università degli Studi di Napoli "L'Orientale"
Sara Moze, University of Wolverhampton
Preslav Nakov, Qatar Computing Research Institute, HBKU
Michael Oakes, University of Wolverhampton
Marija Omazić, University of Osijek
Petya Osenova, Sofia University
Magali Paquot, Université catholique de Louvain
Giovanni Parodi Sweis, Pontifical Catholic University of Valparaíso
Alain Polguère, Université de Lorraine
Carlos Ramisch, Laboratoire d'Informatique Fondamentale de Marseille
Ute Römer, Georgia State University
Agata Savary, François Rabelais University
Barbara Schlücker, The University of Bonn
Violeta Seretan, University of Geneva
Yvonne Skalban, University of Wolverhampton
Kathrin Steyer, Institute of German Language
Yukio Tono, Tokyo University of Foreign Studies
Cornelia Tschichold, Swansea University
Benjamin Tsou, City University of Hong Kong
Agnès Tutin, University of Grenoble
Aline Villavicencio, Federal University of Rio Grande do Sul
Eveline Wandl-Vogt, Austrian Academy of Sciences
Tom Wasow, Stanford University
Eric Wehrli, University of Geneva
Stefanie Wulff, University of Florida
Michael Zock, Laboratoire d'Informatique Fondamentale de Marseille
Conference Chair
Ruslan Mitkov, University of Wolverhampton.
Conference Workshops
EUROPHRAS 2017 will be accompanied by the following two workshops:
- MUMTTT: The 3rd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2017) will take place on Tuesday 14th November (http://rgcl.wlv.ac.uk/europhras2017/mumttt-2017/). The invited speaker is Carlos Ramisch, Aix-Marseille University, France.
- Student Research Workshop: This workshop will take place on Monday 13th November (http://rgcl.wlv.ac.uk/europhras2017/student-workshop/). The invited speaker is Jean-Pierre Colson, Université Catholique de Louvain.
Organisation and sponsors
The forthcoming international conference "Computational and Corpus-based Phraseology - Recent advances and interdisciplinary approaches" is jointly organised by the European Association for Phraseology EUROPHRAS, the University of Wolverhampton (Research Institute of Information and Language Processing) and the Association for Computational Linguistics - Bulgaria.
EUROPHRAS and Sketch Engine are the official sponsors of the conference.
Further information and contact details
Registration is now open: http://213.191.204.62/europhras2017/reg20170504.php
The conference website (http://rgcl.wlv.ac.uk/europhras2017/) will be updated on a regular basis. For further information, please email europhras2017@wlv.ac.uk.
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 127, Issue 10
**********************************************