Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: phrase table (John D Burger)
2. Re: phrase table (Matthias Huck)
3. Re: Sparse features and overfitting (Matthias Huck)
4. Re: Sparse features and overfitting (Matthias Huck)
5. Re: Sparse features and overfitting (HOANG Cong Duy Vu)
6. Re: nplm (Marwa Refaie)
----------------------------------------------------------------------
Message: 1
Date: Thu, 15 Jan 2015 13:57:23 -0500
From: John D Burger <john@mitre.org>
Subject: Re: [Moses-support] phrase table
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <6788866D-392A-4290-8B77-0BC7EB440449@mitre.org>
Content-Type: text/plain; charset=iso-8859-1
I've observed this as well. It seems to me there are several competing pressures affecting the number of ngram types in a corpus. On the one hand, as the size of the corpus increases, so does the vocabulary. This obviously increases the number of unigram types (which equals the vocabulary size), but it also increases the counts for all of the other ngram sizes. On the other hand, language is hugely constrained by context, and the longer the context (i.e. the longer the ngram), the less freedom there is in what one can reasonably say next. If I say "the big", there are lots of reasonable choices for the third word, but if I say "I was frightened by the barking of the big", there are very few sensible completions.
You could quantify this by computing perplexity at various ngram sizes, but that's just another way of measuring the same effect you see with your ngram counts.
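Both pressures show up even in a tiny corpus. The sketch below (illustrative Python, not tied to any Moses tooling) counts distinct n-gram types per order; the counts rise at first and then fall as longer contexts become unique and the windows run out:

```python
def ngram_types(sentences, n):
    """Count distinct n-gram types of order n in a sentence-segmented corpus."""
    types = set()
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - n + 1):
            types.add(tuple(tokens[i:i + n]))
    return len(types)

corpus = [
    "the big dog barked at the big cat",
    "i was frightened by the barking of the big dog",
]
# Types rise with n at first (more combinations), then fall once long
# contexts become unique and the sliding windows run out.
print([ngram_types(corpus, n) for n in range(1, 6)])  # [12, 13, 13, 12, 10]
```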
Of course this could be complete nonsense - I'm eager to hear what other people think.
- John Burger
MITRE
On Jan 15, 2015, at 11:39 , Read, James C <jcread@essex.ac.uk> wrote:
> Hi,
>
> I just ran a count of different sized n-grams in the source side of my phrase table and this is what I got.
>
> unigrams 85,233
> bigrams 991,701
> trigrams 2,697,341
> 4-grams 3,876,180
> 5-grams 4,209,094
> 6-grams 3,702,813
> 7-grams 2,560,251
> 8-grams 0
>
> So, up to the 5-grams the results are what I expected: the number keeps increasing. But then it drops for the 6-grams and drops again for the 7-grams.
>
> Does anybody know why?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
------------------------------
Message: 2
Date: Thu, 15 Jan 2015 21:56:52 +0000
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] phrase table
To: "Read, James C" <jcread@essex.ac.uk>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <1421359012.2192.78.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"
Hi,
The data is sentence-segmented.
Assume you train your model with a training corpus which contains a
single parallel sentence pair. Your training sentence has length L on
both source and target side, and it's aligned along the diagonal.
If n > L, you cannot extract any phrase of length n from this training
corpus. If n <= L, you can extract L - n + 1 phrases of length n.
Example: for L = 5 you can extract five phrases of length n = 1, four of
length n = 2, ... , one of length n = 5, and none of length n > 5.
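The span-counting argument can be checked with a couple of lines (a toy sketch, not Moses code):

```python
def phrase_spans(L, n):
    """Number of contiguous spans of length n in a sentence of L tokens."""
    return max(L - n + 1, 0)

# For L = 5: five spans of length 1, four of length 2, ...,
# one of length 5, and none of length 6 or more.
print([phrase_spans(5, n) for n in range(1, 7)])  # [5, 4, 3, 2, 1, 0]
```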
Also, bilingual blocks are valid (i.e., extractable) phrases only if they are consistent with respect to the word alignment, and larger blocks are more likely to be inconsistent.
Of course you should consider some more aspects, e.g.:
- training settings
(there won't be any 8-grams if you set the max. phrase length to 7;
long phrases will be affected more by a count cutoff because of sparsity)
- vocabulary sizes limit the amount of possible combinations
- n-gram entropy of the language
[http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf]
Analyzing such things in detail is surely a fun pastime. You can start
with vocabulary sizes, number of running words of your corpus,
histograms of source-side training sentence lengths, number of distinct
n-grams that appear in the source side of the corpus vs. number of
distinct n-grams that are source sides of valid phrases, number of
distinct n-grams that appear in the source side of the corpus if you
undo the sentence segmentation (replace all line breaks by spaces), etc.
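A rough Python sketch of those diagnostics (the function and field names here are my own invention, not part of Moses), taking a tokenized corpus as a list of token lists:

```python
from collections import Counter

def corpus_diagnostics(sentences, max_n=3):
    """Basic diagnostics for a tokenized corpus (list of token lists)."""
    running_words = sum(len(s) for s in sentences)
    vocab_size = len({tok for s in sentences for tok in s})
    length_histogram = Counter(len(s) for s in sentences)
    # Distinct n-grams, respecting sentence boundaries...
    segmented = {
        n: len({tuple(s[i:i + n]) for s in sentences for i in range(len(s) - n + 1)})
        for n in range(1, max_n + 1)
    }
    # ...and with the segmentation undone (line breaks treated as spaces).
    flat = [tok for s in sentences for tok in s]
    unsegmented = {
        n: len({tuple(flat[i:i + n]) for i in range(len(flat) - n + 1)})
        for n in range(1, max_n + 1)
    }
    return {
        "running_words": running_words,
        "vocab_size": vocab_size,
        "length_histogram": dict(length_histogram),
        "distinct_ngrams_segmented": segmented,
        "distinct_ngrams_unsegmented": unsegmented,
    }

stats = corpus_diagnostics([s.split() for s in ["a b c", "c a b"]], max_n=2)
print(stats["distinct_ngrams_segmented"])    # {1: 3, 2: 3}
print(stats["distinct_ngrams_unsegmented"])  # {1: 3, 2: 4}  ("c c" crosses the boundary)
```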
Cheers,
Matthias
On Thu, 2015-01-15 at 16:39 +0000, Read, James C wrote:
> Hi,
>
> I just ran a count of different sized n-grams in the source side of my
> phrase table and this is what I got.
>
> unigrams 85,233
> bigrams 991,701
> trigrams 2,697,341
> 4-grams 3,876,180
> 5-grams 4,209,094
> 6-grams 3,702,813
> 7-grams 2,560,251
> 8-grams 0
>
> So, up until the 5-grams the results are what I expected the number is
> increasing. But then it drops for the 6-grams and drops again for the
> 7-grams.
>
> Does anybody know why?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
------------------------------
Message: 3
Date: Thu, 15 Jan 2015 22:17:22 +0000
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Sparse features and overfitting
To: duyvuleo@gmail.com
Cc: moses-support <moses-support@mit.edu>
Message-ID: <1421360242.2192.86.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"
We typically try to increase the tuning set in order to obtain more
reliable sparse feature weights. But in your case it's rather the test
set that seems a bit small for trusting the BLEU scores.
Do the sparse features give you any large improvement on the tuning set?
On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:
> I used sparse features such as: TargetWordInsertionFeature,
> SourceWordDeletionFeature, WordTranslationFeature,
> PhraseLengthFeature.
> Sparse features are used only for top source and target words (100,
> 150, 200, 250, ....).
>
>
> My parallel data include: train(201K); tune(6214); test(641).
>
> Is there any way to prevent over-fitting when applying the sparse
> features? Or in this case, sparse features will not generalize well
> over "unseen" data?
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
------------------------------
Message: 4
Date: Thu, 15 Jan 2015 22:31:39 +0000
From: Matthias Huck <mhuck@inf.ed.ac.uk>
Subject: Re: [Moses-support] Sparse features and overfitting
To: duyvuleo@gmail.com
Cc: moses-support <moses-support@mit.edu>
Message-ID: <1421361099.2192.95.camel@portedgar>
Content-Type: text/plain; charset="UTF-8"
On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:
> - tune & test
> (based on source)
> size of overlap set = 624
> (based on target)
> size of overlap set = 386
>
> (tune & test have high overlapping parts based on source sentences,
> but half of them have different target sentences)
Does this mean that there are hundreds of sentences in your original
tuning and test sets that are equal on the source side but have
different references? That sounds a bit odd. Maybe it indicates that
something about your data is generally problematic.
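Overlap like this is cheap to measure before tuning. A hypothetical sketch over (source, target) sentence pairs:

```python
def overlap_counts(tune_pairs, test_pairs):
    """Count test segments whose source (or target) side also occurs in the tuning set.

    Each argument is a list of (source, target) sentence-pair strings.
    """
    tune_src = {src for src, _ in tune_pairs}
    tune_tgt = {tgt for _, tgt in tune_pairs}
    src_overlap = sum(1 for src, _ in test_pairs if src in tune_src)
    tgt_overlap = sum(1 for _, tgt in test_pairs if tgt in tune_tgt)
    return src_overlap, tgt_overlap

tune = [("ni hao", "hello"), ("zai jian", "goodbye")]
test = [("ni hao", "hi there"), ("xie xie", "thanks")]
print(overlap_counts(tune, test))  # (1, 0)
```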
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
------------------------------
Message: 5
Date: Fri, 16 Jan 2015 07:46:36 +0800
From: HOANG Cong Duy Vu <duyvuleo@gmail.com>
Subject: Re: [Moses-support] Sparse features and overfitting
To: Matthias Huck <mhuck@inf.ed.ac.uk>, prashant@fbk.eu
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAPRaJX11SQ0CMPyvraYq0QW_P4SVu_a_uHuL=s5Ugnn+bj33eA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Thanks for your replies!
Hi Prashant,
> there is definitely an option for sparse l1/l2 regularization with mira. I
> don't know how to call it through command line though.
Yes. For MIRA, we can set the *C* parameter to control its regularization.
I tried different C values (0.01, 0.001), but it didn't help in my case.
Hi Matthias,
> Do the sparse features give you any large improvement on the tuning set?
Yes. The improvement is around 2-3 BLEU points on the tuning set.
> Does this mean that there are hundreds of sentences in your original
> tuning and test sets that are equal on the source side but have
> different references? That sounds a bit odd. Maybe it indicates that
> something about your data is generally problematic.
Yes, I find it quite odd too. But this data (Chinese-to-English) was
extracted from an official competition.
I will probably have to remove the overlap before moving on to other
kinds of features.
--
Cheers,
Vu
On Fri, Jan 16, 2015 at 6:31 AM, Matthias Huck <mhuck@inf.ed.ac.uk> wrote:
> On Thu, 2015-01-15 at 13:54 +0800, HOANG Cong Duy Vu wrote:
>
> > - tune & test
> > (based on source)
> > size of overlap set = 624
> > (based on target)
> > size of overlap set = 386
> >
> > (tune & test have high overlapping parts based on source sentences,
> > but half of them have different target sentences)
>
> Does this mean that there are hundreds of sentences in your original
> tuning and test sets that are equal on the source side but have
> different references? That sounds a bit odd. Maybe it indicates that
> something about your data is generally problematic.
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150116/8db32698/attachment-0001.htm
------------------------------
Message: 6
Date: Fri, 16 Jan 2015 10:08:27 +0200
From: Marwa Refaie <basmallah@hotmail.com>
Subject: Re: [Moses-support] nplm
To: <feng.x.q.2006@gmail.com>, <nheart@gmail.com>, <avaswani@isi.edu>
Cc: moses-support@mit.edu
Message-ID: <DUB406-EAS403ABD80C94F0BE6F12199EBA4F0@phx.gbl>
Content-Type: text/plain; charset="utf-8"
Hi
Thanks for all your replies. It's working now.
I can send step-by-step notes on how I installed it, linked it to Moses, and got it working on Arabic.
The all-zeros train.ngrams problem was solved by running dos2unix.exe on the file before working with NPLM.
I'm designing my experiments now.
Thanks, all
--- Original Message ---
From: "Xiaoqiang Feng" <feng.x.q.2006@gmail.com>
Sent: 14 January 2015 04:21
To: "Marwa Refaie" <basmallah@hotmail.com>
Subject: Re: nplm
Hi,
You got train.ngrams with all 000 000 000 ...... etc., which means the
preprocessing of the training data went wrong.
I think you should first fix the preprocessing: converting the
training data to numberized n-grams.
xiaoqiang
2015-01-14 5:37 GMT+08:00 Marwa Refaie <basmallah@hotmail.com>:
> Hi,
>
> I follow all instruction as listed to compile & build the nplm , it's ok
> when I get language model for ENGLISH LANGUAGE, but when I proceed to build
> ARABIC LANGUAGE MODEL I got train.ngrams with all 000 000 000 .......etc .
>
> Even the prepare & train step finished very fast !!
>
> As logic when I used with moses decoder I have no translation !!!!!!! just
> the same English words instead of translated Arabic word.
>
> Any help please ....
>
> *Marwa N. Refaie*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150116/30bdf307/attachment.htm
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 99, Issue 31
*********************************************