Moses-support Digest, Vol 97, Issue 10

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. 3-grams removed from language model using SRILM ngram-count
(valluri saikiran)
2. Re: 3-grams removed from language model using SRILM
ngram-count (Kenneth Heafield)
3. Re: Factored LM / <s></s> (Marwa Refaie)
4. Re: Factored LM / <s></s> (Ondrej Bojar)
5. Re: Factored LM / <s></s> (Marwa Refaie)

----------------------------------------------------------------------

Message: 1
Date: Sat, 8 Nov 2014 10:36:24 +0530
From: valluri saikiran <saikiran730@gmail.com>
Subject: [Moses-support] 3-grams removed from language model using
SRILM ngram-count
To: moses-support@mit.edu
Message-ID:
<CABVhwAmmyNMU2ssf3KarALeTfehZSTNZU9VUoPzFi32MoJvQMg@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,
I am using the ngram-count tool of the SRILM toolkit to generate a language
model in ARPA format. The corpus has ~20 million trigrams.
I want to include all the 3-grams in the LM. I am getting a 3-gram count of
~5.7 million even after setting the prune threshold to 0 (by giving -prune 0
in the ngram-count arguments). Can anyone please help?

Thanks,
saikiran

------------------------------

Message: 2
Date: Sat, 08 Nov 2014 00:59:35 -0500
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] 3-grams removed from language model using
SRILM ngram-count
To: moses-support@mit.edu
Message-ID: <545DB147.6070905@kheafield.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi,

You're looking for -gt3min 1. As to why a Good-Turing parameter
impacts Kneser-Ney smoothing, that's a question for Stolcke.

Kenneth
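
A toy illustration of why the trigram count shrinks: -prune sets an entropy-based pruning threshold, while -gt3min sets the minimum count a trigram needs to be kept. As far as the SRILM documentation describes it, that minimum defaults to 2 for trigrams, so trigrams seen only once are dropped unless -gt3min 1 is given. The sketch below is not SRILM code; the function names and the four-sentence corpus are made up for illustration, and it only counts which trigrams would survive each cutoff.

```python
from collections import Counter


def trigram_counts(sentences):
    """Count trigrams, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def apply_min_count(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Hypothetical toy corpus; most trigrams occur only once, as in real text.
sentences = ["the cat sat", "the cat sat", "the cat ran", "a dog ran"]
counts = trigram_counts(sentences)

kept_min2 = apply_min_count(counts, 2)  # mimics a cutoff of 2
kept_min1 = apply_min_count(counts, 1)  # mimics -gt3min 1

print(len(counts), len(kept_min2), len(kept_min1))
```

With SRILM itself, the corresponding call would look something like ngram-count -order 3 -text corpus.txt -lm lm.arpa -kndiscount -interpolate -gt3min 1, where corpus.txt and lm.arpa are placeholder file names.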

------------------------------

Message: 3
Date: Sat, 8 Nov 2014 09:20:08 +0000
From: Marwa Refaie <basmallah@hotmail.com>
Subject: Re: [Moses-support] Factored LM / <s></s>
To: Ondrej Bojar <bojar@ufal.mff.cuni.cz>, "moses-support@mit.edu"
<moses-support@mit.edu>
Message-ID: <DUB118-W23004226955DA400E2A090BA820@phx.gbl>
Content-Type: text/plain; charset="windows-1256"

Hi all,
Why am I still stuck on the same error? I cut down the data and made sure it is valid.
Then I ran c:/irstlm/bin/add-start-end.sh <unH.en > unHse.en
before creating the LMs (POS and surface) using SRILM. I can't apply add-start-end to the corpus file, as Moses shows an error during training: "<s> has no second factor"
Please help, I need to resolve this quickly.
Thanks

Marwa N. Refaie

> Subject: Re: [Moses-support] Factored LM / <s></s>
> From: bojar@ufal.mff.cuni.cz
> Date: Thu, 9 Oct 2014 07:40:28 +0200
> To: basmallah@hotmail.com; moses-support@mit.edu
>
> Dear Marwa,
>
> Try cutting the bad data in half and then in half again, etc., to get a very small input that still suffers from the error. Then you'll probably realize what the problem is, or you can at least send it to the mailing list.
>
> Cheers, O.
>
>
> On October 9, 2014 2:10:12 AM CEST, Marwa Refaie <basmallah@hotmail.com> wrote:
> >How should I fix this error? Tokenizing made no difference. How do I
> >normalize the data or set sentence boundaries?
> >
> >Start loading text SCFG phrase table. Moses format : [1.000] seconds
> >Reading /cygdrive/c/mosesdecoder-master/try/ai/sep/fsmt/work/model/phrase-table.0,1-0,1.gz
> >----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
> >Either your data contains <s> in a position other than the first word or
> >your language model is missing <s>. Did you build your ARPA using IRSTLM
> >and forget to run add-start-end.sh?
> >
> >Marwa N. Refaie
> >
> >
> >
> >------------------------------------------------------------------------
> >
> >_______________________________________________
> >Moses-support mailing list
> >Moses-support@mit.edu
> >http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
> http://www.cuni.cz/~obo
>
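
The halving suggestion quoted above can be sketched as a small loop. This is only an illustration under stated assumptions: shrink_failing_input and has_misplaced_boundary are hypothetical names, and the check is a stand-in for actually running training or decoding on the reduced file. It also assumes the failure is caused by individual lines, so that at least one half keeps failing.

```python
def shrink_failing_input(lines, fails):
    """Repeatedly halve `lines`, keeping a half that still triggers the failure."""
    current = list(lines)
    if not fails(current):
        raise ValueError("the full input does not trigger the failure")
    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        if fails(first):
            current = first
        elif fails(second):
            current = second
        else:
            # The failure needs lines from both halves; stop shrinking here.
            break
    return current


def has_misplaced_boundary(lines):
    """Stand-in check: is <s> present anywhere other than the first token?"""
    return any("<s>" in line.split()[1:] for line in lines)


# Hypothetical four-line corpus with one bad line.
corpus = ["the cat sat", "a dog <s> barked", "birds fly", "fish swim"]
minimal = shrink_failing_input(corpus, has_misplaced_boundary)
print(minimal)
```

In practice the fails callback would write the candidate lines to a file and rerun the failing step, which is slow, so halving keeps the number of reruns roughly logarithmic in the corpus size.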

------------------------------

Message: 4
Date: Sat, 08 Nov 2014 11:28:48 +0100
From: Ondrej Bojar <bojar@ufal.mff.cuni.cz>
Subject: Re: [Moses-support] Factored LM / <s></s>
To: Marwa Refaie <basmallah@hotmail.com>, "moses-support@mit.edu"
<moses-support@mit.edu>
Message-ID: <6d5d33b4-2bda-48ba-a3d0-a7b59229c29e@email.android.com>
Content-Type: text/plain; charset=UTF-8

Hi,

It seems you are stepping somewhere where nobody really tested it. So the tools are not quite ready for factors and explicit sentence boundaries.

You may try using <s>|<s> and </s>|</s> as the sentence boundaries. This, when split into the two factors, will get you the correct sentence boundaries for LM training. However, if you want to base your LM on both of the factors at once, this trick won't work, since IRSTLM won't recognize them as sentence boundaries...

I am afraid you have to prepare the files manually, separately for LM creation and for grammar extraction (no sentence boundaries there).

Try getting the system trained in a simpler setup, no factors, and if it works, mimic it for each of the factors.

Best, O.

--
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz)
http://www.cuni.cz/~obo
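
A minimal sketch of the <s>|<s> and </s>|</s> trick described in this message, assuming two factors separated by "|" (surface form and POS tag). The function names and the sample sentence are hypothetical, and this is not part of Moses, SRILM, or IRSTLM; it only shows that splitting the factored line yields plain <s> and </s> in each per-factor stream.

```python
def add_factored_boundaries(line, num_factors=2, sep="|"):
    """Wrap a factored sentence in boundary tokens that carry every factor."""
    start = sep.join(["<s>"] * num_factors)
    end = sep.join(["</s>"] * num_factors)
    return f"{start} {line.strip()} {end}"


def split_factor(line, index, sep="|"):
    """Extract a single factor (0 = surface, 1 = POS) from each token."""
    return " ".join(token.split(sep)[index] for token in line.split())


# Hypothetical factored sentence: surface|POS
factored = "the|DT cat|NN sleeps|VBZ"
bounded = add_factored_boundaries(factored)

surface = split_factor(bounded, 0)  # input for the surface LM
pos = split_factor(bounded, 1)      # input for the POS LM

print(bounded)
print(surface)
print(pos)
```

Following the advice above, the boundary-wrapped files would be used only for LM creation; the corpus used for grammar extraction stays without sentence boundaries.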

------------------------------

Message: 5
Date: Sat, 8 Nov 2014 16:34:42 +0200
From: Marwa Refaie <basmallah@hotmail.com>
Subject: Re: [Moses-support] Factored LM / <s></s>
To: Ondrej Bojar <bojar@ufal.mff.cuni.cz>, <moses-support@mit.edu>
Message-ID: <DUB406-EAS764826ECF7BF32F9109D92BA820@phx.gbl>
Content-Type: text/plain; charset="utf-8"

Dear Ondrej
You are great. Thanks for your help.
It fixed everything and I have translations now, so I can move forward.

Regards

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 97, Issue 10
*********************************************
