Moses-support Digest, Vol 129, Issue 19

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: a tool for extracting specific terms from the corpora
(Mathias Müller)

----------------------------------------------------------------------

Message: 1
Date: Mon, 31 Jul 2017 10:43:38 +0200
From: Mathias Müller <mmueller@ifi.uzh.ch>
Subject: Re: [Moses-support] a tool for extracting specific terms from
the corpora
To: Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <4C354D14-72AE-4B9C-8A4A-135FF4CA1193@ifi.uzh.ch>
Content-Type: text/plain; charset="utf-8"

Hi Mariusz

Sorry for the delay.

If your problem is so dynamic that it cannot be described with rules, then you cannot extract such a list of terms automatically.

A semi-automatic method would be: define rules with low precision and high recall. This gets you an overly long list of terms that includes false positives. Then go through the list manually, e.g. by looking at each term in its sentence context. Inspecting the data in this way might even suggest patterns you did not see before.
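
A minimal sketch of that high-recall step in Python, under stated assumptions: the regular expression (any run of capitalized words) and the function name candidates_with_context are illustrative choices, not a fixed recipe, and the corpus is assumed to be already split into sentences. The second example sentence is the one quoted further down, with the unreadable character left out.

```python
import re
from collections import defaultdict

# Hypothetical high-recall rule: any run of one or more capitalized words.
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z0-9-]*(?:\s+[A-Z][A-Za-z0-9-]*)*")

def candidates_with_context(sentences):
    """Map each candidate term to the sentences it occurs in, for manual review."""
    contexts = defaultdict(list)
    for sentence in sentences:
        for match in CANDIDATE.finditer(sentence):
            contexts[match.group(0)].append(sentence)
    return dict(contexts)

corpus = [
    "Acoustic measurement precision and uncertainty.",
    "Each press of the Acoustic Output key decreases the transmission "
    "power setting (TX) displayed in the monitor display.",
]

# Print each candidate next to one sentence it came from.
for term, sents in sorted(candidates_with_context(corpus).items()):
    print(term, "|", sents[0])
```

The list deliberately contains false positives such as "Each" and the sentence-initial "Acoustic"; the point is that a reviewer sees each candidate together with its context.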

Another option is to keep the extraction fully automatic, with rules that work most of the time (probably more precision-oriented ones), and live with the margin of error.
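
One possible precision-oriented rule, sketched in Python: ignore the sentence-initial position, where capitalization says nothing, and keep only capitalized runs that appear later in a sentence. The pattern and the name mid_sentence_terms are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a run of one or more capitalized words.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z0-9-]*(?:\s+[A-Z][A-Za-z0-9-]*)*")

def mid_sentence_terms(sentences):
    """Collect capitalized runs that do not start a sentence."""
    terms = set()
    for sentence in sentences:
        for match in CAPITALIZED_RUN.finditer(sentence.strip()):
            if match.start() > 0:  # skip the ambiguous sentence-initial position
                terms.add(match.group(0))
    return terms
```

The margin of error shows up in two ways: a product name that only ever occurs at the start of a sentence is missed, and mid-sentence abbreviations such as "TX" are still picked up.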

If the terms to be extracted are a finite set (i.e. one that can be enumerated) that changes infrequently, consider taking the time to simply list all of the terms, for highest precision.
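
A sketch of the enumerated-list approach in Python. The entries in KNOWN_TERMS are placeholders taken from the example in this thread; in practice the list would be maintained by hand, for instance in a text file. The function name find_known_terms is likewise hypothetical.

```python
import re

# Hypothetical hand-maintained term list.
KNOWN_TERMS = ["Acoustic Output", "TX"]

def find_known_terms(text, terms=KNOWN_TERMS):
    """Return listed terms occurring in text as whole words, case-sensitively."""
    if not terms:
        return []
    # Longest first, so multi-word terms win over shorter terms they contain.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.findall(text)
```

Because matching is exact, the sentence-initial "Acoustic" in "Acoustic measurement precision and uncertainty." is not reported, while "Acoustic Output" is.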

(We still don't know what you will use the exported list for. Intended use also dictates the approach to a certain extent.)

Regards
Mathias

> On 4 Jul 2017, at 10:41, Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com> wrote:
>
> Hi Mathias, thank you for getting back to me. Let me give you an example from a monolingual EN corpus:
>
> Acoustic measurement precision and uncertainty.
> Each press of the Acoustic Output ? key decreases the transmission power setting (TX) displayed in the monitor display.
>
> In the first sentence the word Acoustic should not be exported. In the second sentence Acoustic Output should.
> I have written a Java program that exports every term or group of terms that starts with a capital letter, but this obviously also includes words like the one in the first example, which it should not.
>
> The goal is to export only the proper names to a separate file.
>
> Best regards
> Mariusz
>
>
>
> 2017-07-04 10:02 GMT+02:00 Mathias Müller <mmueller@ifi.uzh.ch>:
> Hi Mariusz
>
> What do you mean by "extracting" this content? What do you need the list of proper names for? What are the languages involved?
>
> Regards,
> Mathias
>
> --
>
> Mathias Müller
> AND-2-20
> Institute of Computational Linguistics
> University of Zurich
> Switzerland
> +41 44 635 75 81
> mmueller@cl.uzh.ch
>> On 4 Jul 2017, at 09:39, Mariusz Hawryłkiewicz <mariusz.hawrylkiewicz@gmail.com> wrote:
>>
>> Dear all,
>>
>> I have been searching for the most efficient way to extract untranslatable content (product names etc.) from corpora. Such content always begins with a capital letter. The problem is that every segment also begins with a capital letter, and of course a sentence may itself begin with untranslatable content (a product name) :-).
>>
>> I want to avoid using common dictionaries to eliminate common words.
>>
>> Would you have any other suggestions?
>>
>> Thank you very much!
>> Mariusz
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>

--

Mathias Müller
AND-2-20
Institute of Computational Linguistics
University of Zurich
Switzerland
+41 44 635 75 81
mathias.mueller@uzh.ch

------------------------------

End of Moses-support Digest, Vol 129, Issue 19
**********************************************
