Moses-support Digest, Vol 165, Issue 4

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: GIZA++ word alignment with paragraphs (Hieu Hoang)


----------------------------------------------------------------------

Message: 1
Date: Thu, 9 Jul 2020 11:00:02 -0700
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] GIZA++ word alignment with paragraphs
To: John Thompson <john.thompson.jtsoftware@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbjtDN2Q-MVHWz7wsawkELVy2SRw9Zs=aCjMGvbrGYcbJw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi John

I'm afraid word alignment tools like GIZA++ aren't really designed to
be run against paragraph-length input. That's probably one reason why
you're getting bad alignments.

I don't know if tweaking the parameters, or using another word alignment
tool, would make it any better.

On Wed, Jul 8, 2020, 6:19 PM John Thompson <
john.thompson.jtsoftware@gmail.com> wrote:

> Hi,
>
> I'm using a 7162-line paragraph-aligned corpus. Unfortunately the
> translations within a paragraph sometimes don't have their sentences
> aligned, i.e. in one language a thought could be one long sentence, and
> in the other language it could have its clauses broken up into multiple
> sentences, hence I'm running GIZA++ on paragraphs.
>
> It works partially, but the alignment of words is often wrong or it's
> missing matches that should have been made.
>
> I set the "maxsentencelength" configuration file parameter to 350, though
> most of the paragraphs are around 100 or fewer words.
>
> Q1: What difference do you estimate I should expect between using
> paragraphs vs. sentences?
>
> Q2: Are there GIZA++ parameters I could tune to improve the alignment?
>
> Q3: If I concatenated multiple corpora, would the alignment output likely
> improve?
>
> I could preprocess the corpus, breaking up the paragraphs where the number
> of sentences match, but there may be some cases where the sentences don't
> align, where multiple sentences within the paragraph were joined or split
> differently, such that the sentence count of the paragraph is the same, but
> the sentences don't align.
>
> Q4: How big an effect would these bad sentence alignments have on the rest
> of the alignments?
>
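
The preprocessing idea in the quoted paragraph above (split a paragraph
pair into sentence pairs only when the sentence counts match) could be
sketched roughly as follows. This is a minimal illustration, not a tested
pipeline: the names split_sentences, split_pair and max_ratio are
hypothetical, the regex splitter is naive (it will mis-split abbreviations),
and the word-count ratio check is only an assumed heuristic for catching
paragraphs whose sentences were joined or split differently.

```python
import re

# Naive sentence boundary (assumption): ., ! or ? followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Return the non-empty sentences of a paragraph."""
    return [s.strip() for s in _SENT_END.split(paragraph.strip()) if s.strip()]


def split_pair(src_par, tgt_par, max_ratio=3.0):
    """Split a paragraph pair into sentence pairs when it looks safe.

    Falls back to the whole paragraph pair when the sentence counts differ,
    or when any sentence pair has a word-count ratio above max_ratio
    (a hypothetical threshold, to be tuned per language pair).
    """
    whole = [(src_par.strip(), tgt_par.strip())]
    src = split_sentences(src_par)
    tgt = split_sentences(tgt_par)
    if not src or len(src) != len(tgt):
        return whole
    for s, t in zip(src, tgt):
        ns, nt = len(s.split()), len(t.split())
        if max(ns, nt) > max_ratio * max(1, min(ns, nt)):
            return whole
    return list(zip(src, tgt))
```

Paragraph pairs that fail either check stay as one long line, so nothing is
dropped from the corpus; only the pairs that look safe are shortened.
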
> Q5: Any ideas for how to get better word alignment with these corpora that
> I have, either with GIZA++ or a different tool?
>
> I'm using the word alignment in a language study tool. For example, I have
> the text for a book in both English and Marshallese, but language resources
> for Marshallese are scarce. In my tool I associate the alignment
> information with the text, and also generate a dictionary using the
> alignment output (or optionally the *.dict.actual.ti.final dictionary list
> output). In one page I show the aligned sentences, and in another you can
> click on words to get both the alignment definition, and the dictionary
> definitions. For each source word dictionary entry, I sort the target
> definitions by descending frequency (or probability if using the
> *.dict.actual.ti.final dictionary list output), and then chop off the list
> after a certain number, as otherwise there will be a lot of bad or spurious
> definitions included.
>
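
The dictionary step described in the quoted paragraph above (sort target
words by descending frequency, then cut the list off) might look something
like this in Python. It is only a sketch: build_dictionary and top_n are
hypothetical names, and the input is assumed to be a plain list of
(source_word, target_word) pairs already extracted from the alignment
output, since the sketch does not parse any GIZA++ file format.

```python
from collections import Counter, defaultdict


def build_dictionary(aligned_pairs, top_n=5):
    """Map each source word to its top_n most frequently aligned target words."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    # most_common() returns entries sorted by descending count.
    return {
        src: [tgt for tgt, _ in targets.most_common(top_n)]
        for src, targets in counts.items()
    }
```

A count-based cutoff like top_n is the simplest option; a minimum-count
threshold would be another way to drop spurious one-off alignments.
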
> Thanks!
>
> -John
>
> --
> John Thompson
> John.Thompson.JTSoftware@gmail.com
> https://www.jtlanguage.com
> 1-909-283-4364 (home)
> 1-909-283-5642 (cell)
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

------------------------------



End of Moses-support Digest, Vol 165, Issue 4
*********************************************