Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Recasing and truecasing (Tom Hoar)
2. training tree2string models (joerg)
3. moses/LM/IRST.cpp:25:24: fatal error: dictionary.h: No such
file or directory #include "dictionary.h" (James Johnson)
4. Re: optimizing lattice InputFeature weight (Hieu Hoang)
----------------------------------------------------------------------
Message: 1
Date: Mon, 09 Feb 2015 14:24:51 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] Recasing and truecasing
To: moses-support@mit.edu
Message-ID: <54D860C3.40608@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Ken,
We have abandoned the recaser/truecaser + detokenize.perl combination
altogether. Instead, we developed a proprietary tokenization +
statistical model approach that restores both tokenization and casing to
the expected natural state (never tokenized/never lower-cased). Best of
all, it's language-independent, so there's no need for language-specific
detokenization scripts.
If you're interested, I'm happy to take the discussion off-list.
Tom
On 02/07/2015 04:34 AM, Kenneth Heafield wrote:
> Dear Moses,
>
> What are the experiences with truecasing vs. the recaser? It seems the
> recaser's default does:
>
> 1) Train a truecaser
> 2) Truecase the monolingual data
> 3) Train an LM on the truecased data
>
> There's an option to just directly go to LM training. Any thoughts on
> which is better?
>
> It just feels weird to use the truecaser, which applies a unigram
> popularity model in some cases, to filter the training data for an
> n-gram model (so it won't be able to make n-gram decisions about those
> words).
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
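As a reference point, here is a minimal sketch of the three-step recaser default Kenneth describes, using the standard Moses recaser scripts and KenLM's lmplz; the paths, file names, and LM order are illustrative assumptions, not a prescribed setup:

# 1) Train a truecaser on the monolingual data
$MOSES/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus mono.en
# 2) Truecase the monolingual data
$MOSES/scripts/recaser/truecase.perl --model truecase-model.en < mono.en > mono.tc.en
# 3) Train an LM on the truecased data
$MOSES/bin/lmplz -o 5 < mono.tc.en > lm.tc.en.arpa

The "go straight to LM training" option Kenneth mentions simply skips steps 1 and 2 and runs lmplz on the original (non-truecased) data.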
------------------------------
Message: 2
Date: Mon, 9 Feb 2015 10:37:10 +0100
From: joerg <tiedeman@gmail.com>
Subject: [Moses-support] training tree2string models
To: moses-support <moses-support@mit.edu>
Message-ID: <63129434-BA45-474D-B0DA-EC06D47860F6@gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I'm trying to train a model with source syntax (with factored tokens), but I have the problem that the train-model script removes all XML when reducing factors, so nothing is extracted in the end. Maybe it's not possible to have token factors in source-syntax models? I guess this is the same for target-syntax models as well. Would it be possible to support such models?
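To make the setting concrete, a factored source-syntax training line would look roughly like this (hypothetical sentence; factors joined with '|' inside the usual <tree> markup used for syntax training):

<tree label="NP"> <tree label="DT"> the|DT|the </tree> <tree label="NN"> house|NN|house </tree> </tree>

It is this XML markup (together with the '|' factors inside it) that appears to get stripped when train-model.perl reduces factors, so nothing is left to extract.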
Best,
Jörg
**********************************************************************************
Jörg Tiedemann http://stp.lingfil.uu.se/~joerg/
------------------------------
Message: 3
Date: Mon, 9 Feb 2015 16:01:52 +0530
From: James Johnson <jamesjohnson1097@gmail.com>
Subject: [Moses-support] moses/LM/IRST.cpp:25:24: fatal error:
dictionary.h: No such file or directory #include "dictionary.h"
To: moses-support@mit.edu
Message-ID:
<CAO3pzyjRsUpkJ04nbu58nHH64mCVvdXLXxp+668hjvsNXoSTnQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I need help; I am not able to solve this issue:
...failed gcc.compile.c++
moses/LM/bin/gcc-4.8/release/debug-symbols-on/link-static/threading-multi/IRST.o...
--
@jamesjohnson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log.gz
Type: application/x-gzip
Size: 2267 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20150209/23fc921c/attachment-0001.bin
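dictionary.h is a header that ships with IRSTLM, so an error like the one above usually means Moses was configured with an IRSTLM path that does not point at a complete IRSTLM installation (one with its include/ and lib/ directories populated). A rough sketch of the usual rebuild, with a purely illustrative path:

# build and install IRSTLM first, then point the Moses build at it
./bjam --with-irstlm=/path/to/irstlm -j4

The attached build.log should show which include path bjam actually passed to g++ for the failing IRST.o compile.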
------------------------------
Message: 4
Date: Mon, 09 Feb 2015 11:49:04 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] optimizing lattice InputFeature weight
To: Jorg Tiedemann <tiedeman@gmail.com>, Hieu Hoang
<Hieu.Hoang@ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID: <54D89EB0.3060106@gmail.com>
Content-Type: text/plain; charset="windows-1252"
I think I've resolved it and committed the change:
https://github.com/moses-smt/mosesdecoder/commit/ce80e53b30f766ab85cb58c4a2d06742b4a4f38b
The scores for the input path weren't being set. At the moment there are
different code paths depending on the type of phrase-table and input
type, so it's a little confusing. This will hopefully be resolved when
we get rid of the binary phrase-table.
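For anyone hitting the same problem, these are the moses.ini pieces that matter here, collected from this thread into one sketch (all other mandatory sections, e.g. the phrase table and LM, are omitted):

[inputtype]
2

[feature]
InputFeature num-features=1 num-input-features=1 real-word-count=0

[weight]
InputFeature0= 1

With inputtype 2 (lattice), the scores on the lattice edges are read into InputFeature0, which after the commit above should show up as non-zero in the n-best list and therefore be tunable with MERT.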
On 07/02/15 20:35, Jorg Tiedemann wrote:
>
> inputtype is set to 2. I really don't know why this doesn't work.
>
> Here an example:
>
> lattice input:
>
> ((('muutoksia',0.5,1),('Muutoksia',0.9,1),),(('yritystukikäytäntöihin',0.9,3),('yritys',0.5,1),),(('tuki',0.5,1),),(('käytäntöihin',0.5,1),),(('-',0.9,1),),(('Maakunta',0.9,2),('maa',0.5,1),),(('kunta',0.5,1),),(('-',0.9,1),),(('Alueet',0.9,1),('alueet',0.5,1),),(('-',0.9,1),),(('Uutiset',0.9,1),('uutiset',0.5,1),),(('-',0.9,1),),(('Karjalainen',0.9,1),('karjalainen',0.5,1),),)
>
> nbest-list:
> 0 ||| changes in the business practices - - - news - karjalainen |||
> Distortion0= -6 LM0= -74.8382 InputFeature0= 0 WordPenalty0= -11
> PhrasePenalty0= 11 TranslationModel0= -45.7094 -47.7159 -16.804
> -11.8221 ||| -50.4294
> 0 ||| changes in the business practices - - - news - karjalainen |||
> Distortion0= -6 LM0= -74.8382 InputFeature0= 0 WordPenalty0= -11
> PhrasePenalty0= 10 TranslationModel0= -45.7472 -47.7159 -16.5762
> -11.8221 ||| -50.5914
> ...
>
> The InputFeature0 is always 0
>
> Decoding output includes:
> 0 -- (muutoksia , , -0.6931) (Muutoksia , , -0.1051)
> 1 -- (yritystukikäytäntöihin , , -0.1053) (yritys , , -0.6931)
> 2 -- (tuki , , -0.6931)
> 3 -- (käytäntöihin , , -0.6931)
> 4 -- (- , , -0.1051)
> 5 -- (Maakunta , , -0.1052) (maa , , -0.6931)
> 6 -- (kunta , , -0.6931)
> 7 -- (- , , -0.1051)
> 8 -- (Alueet , , -0.1051) (alueet , , -0.6931)
> 9 -- (- , , -0.1051)
> 10 -- (Uutiset , , -0.1051) (uutiset , , -0.6931)
> 11 -- (- , , -0.1051)
> 12 -- (Karjalainen , , -0.1051) (karjalainen , , -0.6931)
>
> What I don't understand is why there is an empty field in the output
> above.
>
> And the config file sets the weight for InputFeature0 to 1:
>
> [feature]
> InputFeature num-features=1 num-input-features=1 real-word-count=0
> ....
> # dense weights for feature functions
> [weight]
> InputFeature0= 1
>
> Strange ...
>
> Jörg
>
>
> Jörg Tiedemann
> tiedeman@gmail.com <mailto:tiedeman@gmail.com>
>
>
>
>
> On Feb 7, 2015, at 9:21 PM, Hieu Hoang wrote:
>
>> There's no reason why it shouldn't work.
>>
>> The only thing I can think of is that the input type hasn't been set
>> to lattice. In the moses.ini, there should be something like
>> [inputtype]
>> 2
>> or on the command line
>> moses -inputtype 2
>>
>>
>> Hieu Hoang
>> Research Associate (until March 2015)
>> ** searching for interesting commercial MT position **
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>>
>> On 7 February 2015 at 20:05, Jorg Tiedemann <tiedeman@gmail.com
>> <mailto:tiedeman@gmail.com>> wrote:
>>
>>
>> I have a problem with lattice decoding and optimizing
>> input-feature weights. I have edge weights in my lattice input
>> (one per edge) and I defined one input feature that I'd like to
>> optimize using MERT. However, my input feature value is
>> always 0 in the n-best lists even though none of the input edges
>> has value 1 (or 0). What am I doing wrong?
>>
>> My initial config file includes:
>>
>> [feature]
>> InputFeature num-features=1 num-input-features=1 real-word-count=0
>> ...
>> # dense weights for feature functions
>> [weight]
>> InputFeature0= 1
>>
>> The lattice input is valid and looks like this:
>>
>> ((('word',0.8,1),('word',0.6,1), ....
>>
>> MERT tuning fails in the end, presumably because the input feature
>> cannot be set.
>> Any help is very much appreciated.
>> Thanks,
>> Jörg
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
--
Hieu Hoang
Research Associate (until March 2015)
** searching for interesting commercial MT position **
University of Edinburgh
http://www.hoang.co.uk/hieu
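As a reading aid for the lattice lines quoted above: each position in the PLF input is a list of outgoing edges, and each edge is a triple of (word, score, number of positions the edge spans, i.e. the distance to its end node). In the first column of Jörg's lattice,

(('muutoksia',0.5,1),('Muutoksia',0.9,1),)

there are two competing one-position edges with input scores 0.5 and 0.9. The decoding output shows their natural logarithms (ln 0.5 = -0.6931 and ln 0.9, roughly -0.1051), and these are exactly the values that InputFeature0 should accumulate once the fix above is applied.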
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 100, Issue 31
**********************************************