Moses-support Digest, Vol 98, Issue 27

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: cleaning the corpus prunes entire dataset (Hieu Hoang)
2. Re: cleaning the corpus prunes entire dataset (Jaya Kumaran)
3. string of Words + states in feature functions (amir haghighi)
4. Re: string of Words + states in feature functions
(HOANG Cong Duy Vu)


----------------------------------------------------------------------

Message: 1
Date: Wed, 10 Dec 2014 01:03:13 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] cleaning the corpus prunes entire dataset
To: Jaya Kumaran <jayakarayil@gmail.com>, moses-support@mit.edu
Message-ID: <54879BD1.20604@gmail.com>
Content-Type: text/plain; charset="windows-1252"

The 9-1 ratio is required because GIZA++ becomes very inefficient if the
parallel sentences are badly mismatched in length.

If the sentences are that mismatched in length, are you sure they are
actually translations? It may be that the parallel corpus is not clean,
or that the tokenization is not good for the language pair you are
working with. Can you provide a few examples of parallel sentences that
violate the 9-1 ratio?
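
As a rough illustration of the 9-1 check discussed here: a sentence pair is
dropped when one side has more than nine times as many tokens as the other.
The sketch below is not the code of clean-corpus-n.perl. CountTokens and
KeepPair are hypothetical helper names, and the default bounds (1, 1000, 9)
are assumptions chosen to match the numbers mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Count whitespace-separated tokens in one line of a tokenized corpus.
std::size_t CountTokens(const std::string& line) {
  std::istringstream in(line);
  std::string token;
  std::size_t n = 0;
  while (in >> token) ++n;
  return n;
}

// Return true if the sentence pair survives the length and ratio filters.
// Hypothetical helper, not part of Moses: max_ratio = 9 mirrors the "9-1"
// rule, min_len and max_len mirror the bounds given to the cleaning script.
bool KeepPair(const std::string& src, const std::string& tgt,
              std::size_t min_len = 1, std::size_t max_len = 1000,
              std::size_t max_ratio = 9) {
  assert(max_ratio >= 1);
  const std::size_t s = CountTokens(src);
  const std::size_t t = CountTokens(tgt);
  if (s < min_len || t < min_len) return false;  // blank or too-short side
  if (s > max_len || t > max_len) return false;  // overlong side
  // Multiply instead of dividing: no division by zero, no rounding.
  if (s > max_ratio * t || t > max_ratio * s) return false;
  return true;
}
```

If most of a corpus fails a check like this, the likely cause is misaligned
lines or broken tokenization rather than the threshold itself.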

On 05/12/14 05:54, Jaya Kumaran wrote:
> Hi,
>
> When I run clean-corpus-n.perl with max-1000 on a dataset of 14k lines
> (a tourism corpus), I get only 2.5k lines as the clean corpus.
>
> I see that, in addition to removing blank lines and lines longer than
> 1000 (max) words, the script removes lines that violate the 9-1
> sentence ratio of Giza. I don't understand the 9-1 sentence ratio.
>
> How do I increase my clean corpus size?
>
> Thanks,
> Jaya


------------------------------

Message: 2
Date: Wed, 10 Dec 2014 11:10:35 +0530
From: Jaya Kumaran <jayakarayil@gmail.com>
Subject: Re: [Moses-support] cleaning the corpus prunes entire dataset
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: moses-support@mit.edu
Message-ID:
<CAPwTSQQtu=Rbj2SKiUGqEhFeOoCsnRzOAm7DFjauVY7NvjB0ug@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks Hieu for the reply.

True, the corpus was not clean. There were partially aligned sentences,
translation mismatches, and lots of stray '\n' and ^M characters.
I have now cleaned the corpus and I'm able to create a baseline.

Thanks,
Jaya K


------------------------------

Message: 3
Date: Wed, 10 Dec 2014 11:41:10 +0330
From: amir haghighi <amir.haghighi.64@gmail.com>
Subject: [Moses-support] string of Words + states in feature functions
To: moses-support <moses-support@mit.edu>
Message-ID:
<CA+UVbEgyUMivfiOaAWzk8dtihJMXByNFk3btiG=0zKVVToCCqQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi everyone

I'm implementing a feature function in moses-chart. I need the source words
as strings and also their indexes in the source sentence. I've written a
function that gets the source words, but I don't know how to extract the
string from a word.
Could anyone guide me on how to do that? As far as I know, each word is
implemented as an array of factors; which of them is its string?

I also have some questions about the states in stateful features:
what kind of variables should be stored in each state? Only those that
are used in the compare function, or any variable from the previous
hypothesis that we use in our feature?

Thanks in advance!

Cheers
Amir

------------------------------

Message: 4
Date: Wed, 10 Dec 2014 17:26:58 +0800
From: HOANG Cong Duy Vu <duyvuleo@gmail.com>
Subject: Re: [Moses-support] string of Words + states in feature
functions
To: amir haghighi <amir.haghighi.64@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAPRaJX1X30Xh0zm6LPvv=Snd5sK_gEsmSWFx=dDsgVTb_vS=ew@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Amir,

> I'm implementing a feature function in moses-chart. I need the source words
> as strings and also their indexes in the source sentence. I've written a
> function that gets the source words, but I don't know how to extract the
> string from a word.
> Could anyone guide me on how to do that? As far as I know, each word is
> implemented as an array of factors; which of them is its string?


You can utilize some of the following functions to get the source
information:

// target phrase and source range covered by the current hypothesis
const TargetPhrase& currTargetPhrase = cur_hypo.GetCurrTargetPhrase();
const WordsRange& sourceWordRange = cur_hypo.GetCurrSourceWordsRange();

// source sentence
const Manager& manager = cur_hypo.GetManager();
const Sentence& source_sent =
    static_cast<const Sentence&>(manager.GetSource());

// word alignment of the terminals in the current target phrase
const AlignmentInfo& alignments = currTargetPhrase.GetAlignTerm();
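
On the question of which factor holds the string: by convention factor 0 is
the surface form. The sketch below shows the idea with stand-in types only.
ToyWord, SurfaceOf and SourceWordsInRange are hypothetical names, not Moses
classes. In Moses itself the analogous calls should be Word::GetFactor(0)
followed by Factor::GetString(), with the span taken from
WordsRange::GetStartPos() and GetEndPos(); please check these against your
version of the code before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a factored word: one string per factor
// (index 0 = surface form, then e.g. lemma, part of speech).
struct ToyWord {
  std::vector<std::string> factors;
};

// Return the string of the requested factor; factor 0 is the surface form.
const std::string& SurfaceOf(const ToyWord& w, std::size_t factor = 0) {
  assert(factor < w.factors.size());
  return w.factors[factor];
}

// Collect (source index, surface string) pairs for the inclusive
// span [start, end] of the source sentence.
std::vector<std::pair<std::size_t, std::string>> SourceWordsInRange(
    const std::vector<ToyWord>& sentence, std::size_t start,
    std::size_t end) {
  assert(start <= end && end < sentence.size());
  std::vector<std::pair<std::size_t, std::string>> out;
  for (std::size_t i = start; i <= end; ++i) {
    out.emplace_back(i, SurfaceOf(sentence[i]));
  }
  return out;
}
```

Returning the index together with the string keeps both pieces of
information the feature needs in one pass over the span.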

> I also have some questions about the states in stateful features:
> what kind of variables should be stored in each state? Only those that
> are used in the compare function, or any variable from the previous
> hypothesis that we use in our feature?


Normally, the state of a stateful feature function stores what it needs
from the previous hypothesis, for instance the previous target words.
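
A minimal sketch of such a state, assuming the feature only looks at the
last few target words. ToyState and Extend are hypothetical names and do
not derive from the Moses state base class. The point is that the state
holds exactly the information that can change future scores, and the
compare function looks at that information and nothing else, so that
hypotheses with equal states can be treated as interchangeable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy state for a stateful feature: remembers the last target words.
class ToyState {
 public:
  explicit ToyState(std::vector<std::string> last_words)
      : last_words_(std::move(last_words)) {}

  // Three-way comparison. 0 means the two states are equivalent for all
  // future scoring; negative or positive gives a consistent ordering.
  int Compare(const ToyState& other) const {
    if (last_words_ < other.last_words_) return -1;
    if (other.last_words_ < last_words_) return 1;
    return 0;
  }

  const std::vector<std::string>& last_words() const { return last_words_; }

 private:
  std::vector<std::string> last_words_;
};

// Build the successor state: append the newly produced target words and
// keep only the last `order` of them.
ToyState Extend(const ToyState& prev,
                const std::vector<std::string>& new_words,
                std::size_t order) {
  assert(order >= 1);  // with no history every state would be identical
  std::vector<std::string> words = prev.last_words();
  words.insert(words.end(), new_words.begin(), new_words.end());
  if (words.size() > order) {
    words.erase(words.begin(),
                words.end() - static_cast<std::ptrdiff_t>(order));
  }
  return ToyState(std::move(words));
}
```

Anything that is only needed to compute the current score, and cannot
affect later scores, does not have to be kept in the state.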


--
Cheers,
Vu


------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 98, Issue 27
*********************************************
