Moses-support Digest, Vol 86, Issue 45

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Question: How to parse tokens with slashes in PTB style?
(Philip Williams)
2. Interpolating LMs (Prashant Mathur)


----------------------------------------------------------------------

Message: 1
Date: Tue, 17 Dec 2013 19:42:59 +0000
From: Philip Williams <philip.williams@mac.com>
Subject: Re: [Moses-support] Question: How to parse tokens with
slashes in PTB style?
To: Martin Velez <marvelez@ucdavis.edu>
Cc: moses-support@mit.edu
Message-ID: <DF703016-6CE6-4D78-B2AF-0C9E791B4198@mac.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi Martin,

I don't think that will break the pipeline, but for word alignment and grammar extraction, separating out the slash character is probably a good idea for reducing data sparsity.

If the reason you want PTB-style tokenization is to prepare the text for parsing, then I'd recommend taking a look at the script parse-de-berkeley.perl in scripts/training/wrappers. It's a wrapper script for the Berkeley parser (it works for English as well as German). It takes the tokenized input and, if you give it the -split-slash option, joins the slash-separated tokens back together prior to parsing. After parsing, it splits them again (via a call to berkeleyparsed2mosesxml.perl), adapting the parse tree structure in the process. If you're using a different parser, then it should be reasonably simple to write a wrapper along the same lines.
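For a wrapper around a different parser, the join-then-split idea described above could be sketched roughly as follows. This is only an illustration in Python, not the logic of parse-de-berkeley.perl or berkeleyparsed2mosesxml.perl: the function names join_slashes and split_slashes are hypothetical, the sketch assumes the tokenizer emits the slash as a bare "/" token (the real scripts may use a different marker), and it works on flat token lists only, so it does not show how the parse tree is adapted after splitting.

```python
def join_slashes(tokens):
    """Merge 'a', '/', 'b' into a single token 'a/b' before parsing.

    Assumption: the slash appears as a standalone '/' token.
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Only merge when the slash has a neighbour on both sides.
        if tok == "/" and out and i + 1 < len(tokens):
            out[-1] = out[-1] + "/" + tokens[i + 1]
            i += 2
        else:
            out.append(tok)
            i += 1
    return out


def split_slashes(tokens):
    """Undo join_slashes: turn 'a/b' back into 'a', '/', 'b'."""
    out = []
    for tok in tokens:
        parts = tok.split("/")
        # Leave tokens alone if they have no slash or an empty side.
        if len(parts) > 1 and all(parts):
            for j, part in enumerate(parts):
                if j > 0:
                    out.append("/")
                out.append(part)
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    tokenized = "Resolution 55 / 100".split()
    joined = join_slashes(tokenized)
    print(joined)
    print(split_slashes(joined))
```

In a real wrapper the split step would operate on the parser's tree output rather than on a token list, which is the part the Moses scripts handle.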

Phil

On 17 Dec 2013, at 01:45, Martin Velez <marvelez@ucdavis.edu> wrote:

> I would like to tokenize tokens with forward slashes in the same way PTB does it.
>
> For example:
> Input: "Resolution 55/100"
> Output: "Resolution 55 / 100" (using default options)
> Output: "Resolution 55 %/% 100" (using "-penn" options)
> Desired Output: "Resolution 55/100"
>
> I skimmed through the code. I found the relevant commented code at line 400 of the tokenizer.perl script. If I comment it out, will I achieve my goal? Or will I break something?
>
> Saludos!
> Martin Velez
> UC Davis
> marvelez@ucdavis.edu
> http://csiflabs.cs.ucdavis.edu/~marvelez/
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support


------------------------------

Message: 2
Date: Wed, 18 Dec 2013 14:38:05 +0100
From: Prashant Mathur <prashant@fbk.eu>
Subject: [Moses-support] Interpolating LMs
To: Moses <moses-support@mit.edu>
Message-ID: <52B1A53D.5070806@fbk.eu>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I am using the script interpolate-lm.perl to interpolate 11 LMs at
once. ngram cannot interpolate more than 10 at once (at least in the old
versions), so the script builds two groups and computes the mixing
coefficients using per-word PPL (or so it seems).
In my case, one of the LMs is much bigger than the rest.
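To make the grouping idea concrete, here is a rough Python sketch of the arithmetic only. It is not how interpolate-lm.perl or SRILM's ngram is implemented: the names chunk, mix and mix_in_groups are hypothetical, the limit of 10 is taken from the description above, and each "model" is reduced to a single probability for one word, whereas the real tools merge whole ARPA models. The sketch assumes all weights are positive. It shows that mixing in two stages (within groups, then across groups) gives the same linear interpolation as mixing everything at once, provided the weights are nested consistently.

```python
def chunk(items, size=10):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def mix(probs, weights):
    """Linear interpolation: sum_i weight_i * prob_i."""
    return sum(w * p for w, p in zip(weights, probs))


def mix_in_groups(probs, weights, size=10):
    """Two-stage interpolation with groups of at most `size` models.

    Within each group the weights are renormalised to sum to 1;
    the group as a whole then receives the group's total weight.
    Assumes every group has a positive total weight.
    """
    group_probs, group_weights = [], []
    for p_grp, w_grp in zip(chunk(probs, size), chunk(weights, size)):
        total = sum(w_grp)
        group_weights.append(total)
        group_probs.append(mix(p_grp, [w / total for w in w_grp]))
    return mix(group_probs, group_weights)


if __name__ == "__main__":
    # Made-up per-model probabilities for one word, for illustration only.
    probs = [0.01 * (i + 1) for i in range(12)]
    weights = [1.0 / 12] * 12
    print(mix(probs, weights))
    print(mix_in_groups(probs, weights))
```

If the weights for each stage are instead estimated separately on the tuning set, the result can depend on which models end up in which group, so the order of the --lm list may matter.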

When I run interpolate-lm like this, it throws an error (see 1.error.log):

perl ~/mosesdecoder/scripts/ems/support/interpolate-lm.perl \
  --tuning TED/tuning/reference.lc.1 \
  --name TED-interpolate \
  --lm commoncrawl/lm/commoncrawl.lm.1,ECB/lm/ECB.lm.1,EMEA/lm/EMEA.lm.1,EUconst/lm/EUconst.lm.1,europarl.v7/lm/europarl.v7.lm.1,Ford/lm/Ford.lm.1,KDE4/lm/KDE4.lm.1,news-commentary/lm/news-commentary.lm.1,OOffice/lm/OOffice.lm.1,OpenSubs-2011/lm/OpenSubs-2011.lm.1,PHP/lm/PHP.lm.1,UN/lm/UN.lm.1 \
  --srilm $SRILM/bin/i686 \
  --tempdir $TEMPDIR


When I change the position of the commoncrawl LM in the command,
interpolation works. If I had to guess, I would say it's because the
commoncrawl LM was placed in the second group (see 1.log):

perl ~/mosesdecoder/scripts/ems/support/interpolate-lm.perl \
  --tuning TED/tuning/reference.lc.1 \
  --name TED-interpolate \
  --lm ECB/lm/ECB.lm.1,EMEA/lm/EMEA.lm.1,EUconst/lm/EUconst.lm.1,europarl.v7/lm/europarl.v7.lm.1,Ford/lm/Ford.lm.1,KDE4/lm/KDE4.lm.1,news-commentary/lm/news-commentary.lm.1,OOffice/lm/OOffice.lm.1,OpenSubs-2011/lm/OpenSubs-2011.lm.1,PHP/lm/PHP.lm.1,UN/lm/UN.lm.1,commoncrawl/lm/commoncrawl.lm.1 \
  --srilm /hltsrv1/software/srilm/srilm-1.5.10/bin/i686 \
  --tempdir $TEMPDIR

Any idea why this is happening, apart from my baseless conclusion?
I don't have any problem as long as it works; I just wanted to put it
out there.

Thanks,
--
Prashant


#SRILM #scripts
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.error.log
Type: text/x-log
Size: 6377 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20131218/cd08bebe/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1.log
Type: text/x-log
Size: 10356 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20131218/cd08bebe/attachment-0001.bin

------------------------------



End of Moses-support Digest, Vol 86, Issue 45
*********************************************
