Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. PhD studentship in Translation Technology (University of
Wolverhampton) (Evans, Richard J.)
2. Re: M2 Scorer in EMS for Grammatical Error Correction
(Kelly Marchisio)
----------------------------------------------------------------------
Message: 1
Date: Thu, 25 Jan 2018 15:24:05 +0000
From: "Evans, Richard J." <R.J.Evans@wlv.ac.uk>
Subject: [Moses-support] PhD studentship in Translation Technology
(University of Wolverhampton)
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<B8C5913BF277A34B9C638833F0B065EC019F5D0291@EXCHMBX10I04.unv.wlv.ac.uk>
Content-Type: text/plain; charset="Windows-1252"
[Apologies for cross-posting]
One PhD studentship in Translation Technology
Closing date 26th February 2018
The Research Group in Computational Linguistics (http://rgcl.wlv.ac.uk)
at the University of Wolverhampton invites applications for ONE 3-year
PhD studentship in the area of translation technology. This PhD
studentship is part of a larger university investment, which includes
other PhD students and members of staff, with the aim of strengthening the
group's existing research in this area.
This funded bursary consists of a stipend towards living
expenses (£14,500 per year) and remission of fees.
We invite applications in the area of translation technology defined in
the broadest sense possible and ranging from advanced methods in
machine translation to user studies which involve the use of technology
in the translation process. We welcome proposals focusing on Natural
Language Processing techniques for translation memory systems and
translation tools in general. Given the current research interests of
the group and its focus on computational approaches, we would be
interested in topics including but not limited to:
- Enhancing retrieval and matching from translation memories with
linguistic information
- The use of deep learning (and, more generally, statistical) techniques in
translation memories
- (Machine) translation of user generated content
- The use of machine translation in cross-lingual applications (with
particular interest in sentiment analysis, automatic summarisation and
question answering)
- Phraseology and computational treatment of multi-word expressions in
machine translation and translation memory systems
- Quality estimation for translation professionals
Other topics will also be considered as long as they align with the
interests of the group. The appointed student is expected to work on a
project that has a significant computational component. For this reason,
we expect the successful candidate to have a good background in
computer science and programming.
The application deadline is 26th February 2018 and Skype interviews
with the shortlisted candidates will be organised shortly after the
deadline. The starting date of the PhD position is as soon as possible
after the offer is made.
The successful applicant must have:
- A good honours degree or equivalent in Computational Linguistics,
Computer Science, Translation Studies or Linguistics
- A strong background in programming and statistics/mathematics, or in
closely related areas (if relevant to the proposed topic)
- Experience in Computational Linguistics / Natural Language Processing,
including statistical, machine learning and deep learning applications
to Natural Language Processing
- Experience with translation technology
- Experience with programming languages such as Python, Java or R is a
plus
- An IELTS certificate with a score of 6.5 is required from candidates
whose native language is not English. If a certificate is not available
at the time of application, the successful candidate must be able to
obtain it within one month from the offer being made.
Candidates from both the UK/EU and non-EU countries can apply. We encourage
applications from female candidates.
Applications must include:
1. A curriculum vitae indicating degrees obtained, courses covered,
publications, relevant work experience and the names of two referees who
can be contacted if necessary
2. A research statement which outlines the topics of interest. More
information about the expected structure of the research statement can
be found at
https://www.wlv.ac.uk/media/departments/star-office/documents/Guidelines-for-completion-of-Research-Statement.doc
These documents must be sent by email before the deadline to
Amanda Bloore (A.Bloore@wlv.ac.uk). Informal enquiries can be sent to
Constantin Orasan (C.Orasan@wlv.ac.uk).
The shortlisted applicants will be interviewed by phone/Skype shortly
after the application deadline.
Established by Prof Mitkov in 1998, the Research Group in Computational
Linguistics delivers cutting-edge research in a number of NLP areas. The
results of the latest Research Excellence Framework confirm the group as
one of the top performers in UK research, with its research rated as
"internationally leading, internationally excellent and internationally
recognised". The group has recently successfully completed the coordination
of the EXPERT project, an EC Marie Curie Initial Training Network promoting
research, development and use of data-driven technologies in machine
translation and translation technology (http://expert-itn.eu).
------------------------------
Message: 2
Date: Thu, 25 Jan 2018 16:18:21 +0000
From: Kelly Marchisio <kellymarchisio@gmail.com>
Subject: Re: [Moses-support] M2 Scorer in EMS for Grammatical Error
Correction
To: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CA++YajnacQss1pSMXW5myS3rcB0LKApQh58y+2nbwe_PucUcQg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Quick FYI - there are licence restrictions on the data I sent, and we'd be
grateful if all would abide by them. They can be found here:
https://ilexir.co.uk/datasets/index.html
On Mon, Jan 15, 2018 at 5:42 PM, Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
wrote:
> Seems like all I need is there. Will take a look today and report back.
>
>
>
> *From: *Kelly Marchisio <kellymarchisio@gmail.com>
> *Sent: *Monday, January 15, 2018 9:28 AM
>
> *To: *Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
> *Cc: *moses-support <moses-support@mit.edu>
> *Subject: *Re: [Moses-support] M2 Scorer in EMS for Grammatical Error
> Correction
>
>
>
> Sure - it's on my computer locally, but you can access the Google Drive
> folder here:
> https://drive.google.com/drive/folders/1K9Cq9fprMmMo3KmySJw4ApuxBeK4qgNK
>
>
>
> You can see the results of 3 attempts in folders tmp.1-3. Tokenized and
> lowercased input and reference files are there, along with all other files
> automatically created by Moses. Please let me know if this is enough
> information or if you'd like more.
>
>
>
> On Mon, Jan 15, 2018 at 4:52 PM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
> Hm, not really. Any chance you could give me access to the tuning folder? I
> could try to run the scorer manually and see if I can reproduce the error.
> This looks like some debugging is needed.
>
>
>
> *From: *Kelly Marchisio <kellymarchisio@gmail.com>
> *Sent: *Monday, January 15, 2018 6:57 AM
>
>
> *To: *Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
> *Cc: *moses-support <moses-support@mit.edu>
> *Subject: *Re: [Moses-support] M2 Scorer in EMS for Grammatical Error
> Correction
>
>
>
> I've sentence-split all my data, but tuning with M2SCORER and mert fails
> again. In extract.err, I see:
>
> Binary write mode is NOT selected
>
> Scorer type: M2SCORER
>
> name: case value: true
>
> Data::m_score_type M2Scorer
>
> Data::Scorer type from Scorer: M2Scorer
>
> loading nbest from run1.best100.out.gz
>
> Levenshtein distance is greater than source size.
>
> Exception: vector
>
>
>
> On a previous run, I saw the std::bad_alloc error again.
>
>
>
> mert seems to get through one round, then dies right after finishing
> translating the document once. I know this because run1.out contains the
> fully translated document, after which it crashes.
>
>
>
> To simplify things, my training data has 4998 sentence-split lines, and
> tuning has 500. (Training originally had 5000; the tokenizer/cleaner must
> have removed 2 lines.) Sentences appear well aligned, judging by
> training/giza.1.
>
>
>
> Any ideas on this one?
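>
> For reference, a quick way to double-check that the cleaned files still line
> up is something like the following (a minimal Python sketch; the file names
> are placeholders, not the actual EMS paths):
>
> # Sanity check: both sides of the parallel corpus must have the same
> # number of lines, and no line should be empty on only one side.
> src_path, trg_path = "corpus/train.clean.src", "corpus/train.clean.trg"  # placeholders
> with open(src_path, encoding="utf-8") as f:
>     src = [line.strip() for line in f]
> with open(trg_path, encoding="utf-8") as f:
>     trg = [line.strip() for line in f]
> print("source lines:", len(src), "target lines:", len(trg))
> assert len(src) == len(trg), "line counts differ - corpus is misaligned"
> for i, (s, t) in enumerate(zip(src, trg), start=1):
>     if bool(s) != bool(t):
>         print("line", i, "is empty on one side only")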
>
>
>
> On Sun, Jan 14, 2018 at 4:43 AM, Kelly Marchisio <kellymarchisio@gmail.com>
> wrote:
>
> Yes, these errors happened during tuning with data like that. By the
> original Python implementation, do you mean the one from the CoNLL 2014
> shared task? (http://www.comp.nus.edu.sg/~nlp/conll14st.html, under
> "Official Scorer")
>
>
>
> Thanks so much for the advice! I'll fix up my data tomorrow and give this
> another go. Thank you :)
>
>
>
> On Sat, Jan 13, 2018 at 11:32 PM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
> Are you tuning and testing on data like that? If yes, this could be part of
> the problem. The M2 scorer in Moses is not really tested and probably not
> well suited for heavy-duty use (the original Python implementation is even
> worse). So it would definitely be better to make sure that not too much
> weird stuff is going on in the data.
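>
> Something as simple as the following is usually enough to catch the worst
> offenders (just a rough sketch; the thresholds are arbitrary placeholders,
> not anything Moses or the M2 scorer enforces):
>
> # Keep only "reasonable" sentence pairs before training/tuning.
> MAX_TOKENS = 100        # placeholder: drop very long segments
> MAX_LENGTH_RATIO = 3.0  # placeholder: drop pairs with wildly different lengths
>
> def keep_pair(src_line: str, trg_line: str) -> bool:
>     src_toks, trg_toks = src_line.split(), trg_line.split()
>     if not src_toks or not trg_toks:
>         return False
>     if len(src_toks) > MAX_TOKENS or len(trg_toks) > MAX_TOKENS:
>         return False
>     ratio = max(len(src_toks), len(trg_toks)) / min(len(src_toks), len(trg_toks))
>     return ratio <= MAX_LENGTH_RATIO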
>
>
>
> *From: *Kelly Marchisio <kellymarchisio@gmail.com>
> *Sent: *Saturday, January 13, 2018 8:28 PM
> *To: *Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
> *Cc: *moses-support <moses-support@mit.edu>
>
>
> *Subject: *Re: [Moses-support] M2 Scorer in EMS for Grammatical Error
> Correction
>
>
>
> Ah, good to know that the scorer was called successfully and that I can
> ignore the Levenshtein distance errors.
>
>
>
> As for allocating a huge piece of memory -- I realized that though my
> parallel corpus is aligned, I actually split the original corpus by
> *paragraph* instead of sentence. They're mostly short paragraphs (each max.
> ~4 sentences, probably max 80-100 tokens or so), but there are some outliers
> (the largest being ~250-300 tokens). Most paragraphs would only need a few
> edits, but the largest might need 10-15+. Could this be causing this
> problem?
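>
> (For anyone curious, a rough way to count how many edits each pair would
> need is something like this - a Python sketch using difflib as an
> approximation, with placeholder file names; it is not the alignment the M2
> scorer actually computes:)
>
> import difflib
>
> def rough_edit_count(src_line: str, trg_line: str) -> int:
>     # Count the non-matching opcode blocks between the token sequences.
>     sm = difflib.SequenceMatcher(a=src_line.split(), b=trg_line.split())
>     return sum(1 for op in sm.get_opcodes() if op[0] != "equal")
>
> with open("corpus/train.src", encoding="utf-8") as fs, \
>      open("corpus/train.trg", encoding="utf-8") as ft:  # placeholders
>     counts = [rough_edit_count(s, t) for s, t in zip(fs, ft)]
> print("max edits in one segment:", max(counts))
> print("segments needing 10+ edits:", sum(1 for c in counts if c >= 10))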
>
>
>
> On Sat, Jan 13, 2018 at 11:08 PM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
> There seem to be multiple issues here.
>
>
>
> As I said, I have null experience with EMS, so maybe someone else can help
> with that.
>
>
>
> The message in extract.err actually seems to mean that you were
> successful in calling the M2 scorer in EMS; the only problem is that it then
> dies. The Levenshtein message is part of a failsafe that is meant to avoid
> exponentially long searches: it does not calculate the M2 metric for a
> sentence pair where there would be excessively many edits (these are
> usually wrong). These messages by themselves should not be a cause for
> concern.
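>
> Conceptually the check is something like the following (a Python sketch of
> the idea only, not the actual C++ code; here "source size" is taken to mean
> the number of source tokens):
>
> def levenshtein(a, b):
>     # Standard dynamic-programming edit distance over token lists.
>     prev = list(range(len(b) + 1))
>     for i, x in enumerate(a, start=1):
>         cur = [i]
>         for j, y in enumerate(b, start=1):
>             cur.append(min(prev[j] + 1,               # deletion
>                            cur[j - 1] + 1,            # insertion
>                            prev[j - 1] + (x != y)))   # substitution
>         prev = cur
>     return prev[-1]
>
> def should_score(source: str, hypothesis: str) -> bool:
>     # Skip pairs whose edit distance exceeds the source length; these
>     # usually indicate misaligned or otherwise broken data.
>     src_toks, hyp_toks = source.split(), hypothesis.split()
>     return levenshtein(src_toks, hyp_toks) <= len(src_toks)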
>
>
>
> The std::bad_alloc, on the other hand, is not good. It seems the scorer
> tries to allocate some huge piece of memory, probably because of a negative
> index somewhere, and then dies. I have not seen this before. Is it possible
> that your system is creating a lot of superfluous edits and the graph
> algorithm in M2 is going crazy because of that?
>
>
>
> *From: *Kelly Marchisio <kellymarchisio@gmail.com>
> *Sent: *Saturday, January 13, 2018 7:46 PM
> *To: *Marcin Junczys-Dowmunt <junczys@amu.edu.pl>; moses-support
> <moses-support@mit.edu>
> *Subject: *Re: [Moses-support] M2 Scorer in EMS for Grammatical Error
> Correction
>
>
>
> Looping the mailing list back in and copying my message :)
>
>
>
> Thanks so much for the response, Marcin!
>
>
>
> I did see your original repo, thanks for sending it along. I'd love to get
> this going with EMS because it looks like I can just pass in the M2 scorer
> with:
>
> tuning-settings = "-mertdir $moses-bin-dir -mertargs='--sctype M2SCORER' -threads $cores"
>
> However it fails with:
>
> ERROR: Failed to run '/Users/kellymarchisio/L101Final/experiments/tuning/tmp.1/extractor.sh'.
> at /Users/kellymarchisio/L101Final/programs/mosesdecoder/scripts/training/mert-moses.pl line 1775.
> cp: /Users/kellymarchisio/L101Final/experiments/tuning/tmp.1/moses.ini: No such file or directory
>
> There may be an error in the mert-moses.pl script itself when used with M2,
> because moses.ini was never created within tmp.1.
>
>
>
> Additionally, in extract.err, I see:
>
> Binary write mode is NOT selected
> Scorer type: M2SCORER
> name: case value: true
> Data::m_score_type M2Scorer
> Data::Scorer type from Scorer: M2Scorer
> loading nbest from run1.best100.out.gz
> Levenshtein distance is greater than source size.
> Levenshtein distance is greater than source size.
> extractor(67381,0x7fffde7dd3c0) malloc: *** mach_vm_map(size=3368542481395712) failed (error code=3)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> Exception: std::bad_alloc
>
>
>
> I'm curious if you've come across these issues (I'm interested in why I'm
> seeing "Levenshtein distance is greater than source size.") and if you have
> any pointers for how I can get mert-moses.pl to work for me with
> M2Scorer.
>
>
>
> Best,
>
> Kelly
>
>
>
> On Fri, Jan 12, 2018 at 9:53 PM, Marcin Junczys-Dowmunt <
> junczys@amu.edu.pl> wrote:
>
> Hi,
>
> We never really used it with EMS, so I do not think anyone can help you
> here. Did you have a look at the original repo
> (https://github.com/grammatical/baselines-emnlp2016)? Otherwise we can
> probably take this off-list and try to help you personally.
>
>
>
> *From: *Kelly Marchisio <kellymarchisio@gmail.com>
> *Sent: *Friday, January 12, 2018 6:20 PM
> *To: *moses-support <moses-support@mit.edu>
> *Subject: *[Moses-support] M2 Scorer in EMS for Grammatical Error
> Correction
>
>
>
> Does anyone have experience using the M2 scorer for grammatical error
> correction with EMS for tuning and evaluation? Junczys-Dowmunt &
> Grundkiewicz (2016) use M2
> (https://github.com/grammatical/baselines-emnlp2016/tree/c4fbcc09b45a46c7c46bdda2ba10484fa16e8f82),
> but I see no examples of using it with EMS.
>
>
>
> Does anyone have experience or advice on how I can use the M2 scorer for
> GEC in my project? I'm having trouble figuring out how to incorporate it
> without an example (for instance, how best to set up experiment.meta and
> the config file to incorporate it).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20180125/08f9f0d6/attachment.html
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 135, Issue 28
**********************************************