Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. GIZA++ word alignment with paragraphs (John Thompson)
----------------------------------------------------------------------
Message: 1
Date: Wed, 8 Jul 2020 18:12:26 -0700
From: John Thompson <john.thompson.jtsoftware@gmail.com>
Subject: [Moses-support] GIZA++ word alignment with paragraphs
To: moses-support@mit.edu
Message-ID:
<CAApr1KbVtoxete0SxbibRzmCgj6AEBuONyvw8neDUiURhJb_Ew@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi,
I'm using a 7162-line paragraph-aligned corpus. Unfortunately the
translations within a paragraph sometimes don't have their sentences
aligned, i.e. in one language a passage could be one long sentence, and
in the other language it could have clauses broken up into multiple
sentences, hence I'm running GIZA++ on paragraphs.
It works partially, but the alignment of words is often wrong or it's
missing matches that should have been made.
I set the "maxsentencelength" configuration file parameter to 350, though
most of the paragraphs are around 100 or fewer words.
Q1: What difference do you estimate I should expect between using
paragraphs vs. sentences?
Q2: Are there GIZA++ parameters I could tune to improve the alignment?
Q3: If I concatenated multiple corpora, would the alignment output likely
improve?
I could preprocess the corpus, breaking up the paragraphs where the number
of sentences matches, but there may be some cases where the sentences still
don't align: multiple sentences within the paragraph may have been joined or
split differently, such that the sentence count of the paragraph is the same
but the sentences don't correspond.
Q4: How big of an effect would these bad sentence alignments have on the rest
of the alignments?
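The preprocessing idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: the sentence splitter is a naive regex that assumes both languages end sentences with ".", "!" or "?", and the names split_sentences, split_pair and max_ratio are hypothetical. The length-ratio check is an added heuristic (not part of the original plan) that falls back to the whole paragraph when the counts match but a sentence pair looks lopsided.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# A real corpus would likely need a language-aware splitter.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Split a paragraph into sentences with the naive rule above."""
    return [s for s in _SENT_END.split(paragraph.strip()) if s]


def split_pair(src_par, tgt_par, max_ratio=2.5):
    """Return sentence pairs if the paragraphs look sentence-aligned,
    otherwise return the original paragraph pair unchanged.

    max_ratio is a hypothetical threshold: if one sentence has more than
    max_ratio times as many tokens as its counterpart, the pair is treated
    as suspicious and the whole paragraph is kept intact.
    """
    whole = [(src_par.strip(), tgt_par.strip())]
    src = split_sentences(src_par)
    tgt = split_sentences(tgt_par)
    if len(src) != len(tgt):
        return whole
    for s, t in zip(src, tgt):
        ls, lt = len(s.split()), len(t.split())
        if max(ls, lt) > max_ratio * max(1, min(ls, lt)):
            return whole
    return list(zip(src, tgt))
```

Paragraphs that fail either check stay as single long lines, so nothing is dropped from the corpus; only the pairs that pass are broken into shorter units.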
Q5: Any ideas for how to get better word alignment with these corpora that
I have, either with GIZA++ or a different tool?
I'm using the word alignment in a language study tool. For example, I have
the text for a book in both English and Marshallese, but language resources
for Marshallese are scarce. In my tool I associate the alignment
information with the text, and also generate a dictionary using the
alignment output (or optionally the *.dict.actual.ti.final dictionary list
output). On one page I show the aligned sentences, and on another you can
click on words to get both the alignment definition and the dictionary
definitions. For each source word dictionary entry, I sort the target
definitions by descending frequency (or probability if using the
*.dict.actual.ti.final dictionary list output), and then chop off the list
after a certain number, as otherwise there will be a lot of bad or spurious
definitions included.
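The sort-and-truncate step described above might look something like this. It is a minimal sketch that assumes each line of the dictionary list holds three whitespace-separated fields (source token, target token, probability); check that against the actual file before relying on it. The function name top_translations and the parameters top_n and min_prob are hypothetical, and skipping entries that involve the NULL token is an assumption about what is wanted in a dictionary.

```python
from collections import defaultdict


def top_translations(lines, top_n=5, min_prob=0.0):
    """Build {source_word: [(target_word, prob), ...]} keeping only the
    top_n most probable targets for each source word.

    Assumed line format: "<source> <target> <probability>".
    Lines that do not match this shape are skipped.
    """
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        src, tgt = parts[0], parts[1]
        try:
            prob = float(parts[2])
        except ValueError:
            continue
        # Assumption: NULL-token entries are not useful as definitions.
        if src == "NULL" or tgt == "NULL":
            continue
        if prob >= min_prob:
            table[src].append((tgt, prob))
    # Sort by descending probability, then chop off the tail.
    return {
        src: sorted(cands, key=lambda pair: pair[1], reverse=True)[:top_n]
        for src, cands in table.items()
    }
```

With a file on disk this could be called as top_translations(open(path, encoding="utf-8")), where path is whatever the dictionary list is named locally. A min_prob cutoff is one alternative to a fixed list length, since rare words may have only one or two trustworthy candidates.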
Thanks!
-John
--
John Thompson
John.Thompson.JTSoftware@gmail.com
https://www.jtlanguage.com
1-909-283-4364 (home)
1-909-283-5642 (cell)
------------------------------
End of Moses-support Digest, Vol 165, Issue 3
*********************************************