Moses-support Digest, Vol 103, Issue 43

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. high-order-grams in kenlm and srilm (koormoosh)
2. Re: high-order-grams in kenlm and srilm (Kenneth Heafield)
3. Re: Stripping carriage returns in FilePiece? (Kenneth Heafield)
4. Re: Stripping carriage returns in FilePiece? (Jeroen Vermeulen)
5. Re: Unexpected behaviour of placeables (Hieu Hoang)
6. Re: Unexpected behaviour of placeables (Carla Parra)

----------------------------------------------------------------------

Message: 1
Date: Tue, 19 May 2015 11:48:39 +1000
From: koormoosh <koormoosh@gmail.com>
Subject: [Moses-support] high-order-grams in kenlm and srilm
To: Kenneth Heafield <moses@kheafield.com>, moses-support@mit.edu
Message-ID:
<CAN3_CDjQUCCpAGxrEzbD9kVhaQ1GeYYyx1mP3qPCYEZA3R+Wyw@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hello,

I wonder why it takes a lot of time to do language modelling with kenlm and
srilm when n goes beyond 6 (even on a relatively small dataset: 500 MB),
and is there a way to actually do high-order (6,7,8-gram) language
modelling with srilm and kenlm on a laptop (12GB RAM)? I assume there is a
flag somewhere that I need to set when creating the arpa or binary file, or
during the test (computing the perplexity etc...).

Thanks,
-K

------------------------------

Message: 2
Date: Mon, 18 May 2015 22:13:16 -0400
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] high-order-grams in kenlm and srilm
To: koormoosh <koormoosh@gmail.com>, moses-support@mit.edu
Message-ID: <555A9C3C.4010205@kheafield.com>
Content-Type: text/plain; charset=utf-8

Hi,

Higher orders mean more distinct n-grams to count and store. I'm guessing
you're running low on RAM. Are you referring to estimating or querying?

To estimate such a model from data, you simply need to use the -o
option to lmplz as you already do. lmplz already lets you specify the
memory usage.

For models above 7, to query you will need to recompile with e.g.
--max-kenlm-order=8 .

Regarding compression, take a look at
http://kheafield.com/code/kenlm/structures/

This all said, I doubt you'll get much use out of a 500 MB data set
with higher orders.

Kenneth

On 05/18/2015 09:48 PM, koormoosh wrote:
> Hello,
>
> I wonder why it takes a lot of time to do language modelling with kenlm
> and srilm when n goes beyond 6 (even on a relatively small dataset: 500
> MB), and is there a way to actually do high-order (6,7,8-gram) language
> modelling with srilm and kenlm on a laptop (12GB RAM)? I assume there is
> a flag somewhere that I need to set when creating the arpa or binary
> file, or during the test (computing the perplexity etc...).
>
> Thanks,
> -K
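
A rough illustration of why higher orders cost more: the sketch below
counts the distinct n-grams of each order in a toy token sequence. It is
not KenLM or SRILM code; the function name distinct_ngram_counts and the
sample tokens are assumptions made up for this example.

```python
def distinct_ngram_counts(tokens, max_order):
    """Return {order: number of distinct n-grams} for orders 1..max_order."""
    counts = {}
    for n in range(1, max_order + 1):
        # Slide a window of width n over the tokens and keep unique windows.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts[n] = len(grams)
    return counts


if __name__ == "__main__":
    # Toy data only; a real corpus would be read from a file.
    sample = "a b c a b d a b c a b d".split()
    for order, count in distinct_ngram_counts(sample, 3).items():
        print(order, count)
```

On real text the number of distinct n-grams typically keeps growing with
the order until nearly every position in the corpus yields its own
n-gram, and an unpruned back-off model of order n holds entries for
every order up to n.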

------------------------------

Message: 3
Date: Mon, 18 May 2015 22:22:43 -0400
From: Kenneth Heafield <moses@kheafield.com>
Subject: Re: [Moses-support] Stripping carriage returns in FilePiece?
To: moses-support@mit.edu
Message-ID: <555A9E73.5050408@kheafield.com>
Content-Type: text/plain; charset=utf-8

There are non-traditional uses like ReadLine('\0') to read
null-delimited tokens.

But I'd support Jeroen here: the default ReadLine() with no argument
should swallow \r.

In any case if you're going to change code there, can you do it upstream
in github.com/kpu/kenlm ? I just gave you commit access.

Also, how would you feel if I changed it to be FakeIFStream with
operator>> extraction, at least for integer/float types?

Kenneth

On 05/18/2015 03:41 AM, Jeroen Vermeulen wrote:
> On 18/05/15 14:02, Hieu Hoang wrote:
>> I prefer that FilePiece output a faithful representation of the file. If
>> you need to clean your data, I think it should go into the cleaning or
>> normalization scripts
>
> That could go into a lot more places and end up being more brittle though.
>
> Would it help if I made the default "do not strip carriage returns", and
> made lexical-reordering-score request the conversion explicitly?
>
> Bear in mind here that every time we fopen() a file without the "b" mode
> flag, we're really saying we want the same conversion if the runtime
> feels the need, as it would on Windows. When we call ReadLine(), at
> least it knows we really want the file interpreted as text.
>
>
> Jeroen

------------------------------

Message: 4
Date: Tue, 19 May 2015 11:45:24 +0700
From: Jeroen Vermeulen <jtv@precisiontranslationtools.com>
Subject: Re: [Moses-support] Stripping carriage returns in FilePiece?
To: moses-support@mit.edu
Message-ID: <555ABFE4.30201@precisiontranslationtools.com>
Content-Type: text/plain; charset=utf-8

On 19/05/15 09:22, Kenneth Heafield wrote:
> There are non-traditional uses like ReadLine('\0') to read
> null-delimited tokens.

While exploring this change I didn't find a single use of that parameter
in the Moses source tree!

But there may be uses outside the project of course. That's one of the
dangers of duplicating code. Is there an overview somewhere of what
code in Moses was copied in but is actually maintained elsewhere?

> But I'd support Jeroen here: the default ReadLine() with no argument
> should swallow \r.

To be clear though: with my change, FilePiece remains an exact binary
representation of the file. It's just that ReadLine() returns a
slightly shorter piece of it, just like it already swallows \n.

(Side note: I've had a quick look at StringPiece now and it looks like a
really useful abstraction for performance. And apparently that same
abstraction is going to be in the C++17 standard library as string_view.)

> In any case if you're going to change code there, can you do it upstream
> in github.com/kpu/kenlm ? I just gave you commit access.

Will do, thanks. What's the procedure for "downstreaming" that into Moses?

> Also, how would you feel if I changed it to be FakeIFStream with
> operator>> extraction, at least for integer/float types?

Sorry, I haven't looked into FakeIFStream at all yet, and I may not
fully understand the question.

Jeroen
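
The behaviour discussed in this thread can be sketched as follows: the
buffer stays byte-for-byte identical to the file, and only the slice
returned by the default line read drops a trailing carriage return, while
a read with an explicit delimiter such as NUL returns raw tokens. This is
a toy model in Python, not the FilePiece code from KenLM; the class
PieceReader, its read_line method and the strip_cr parameter are
hypothetical names.

```python
class PieceReader:
    """Toy record reader over an in-memory byte buffer (not KenLM's FilePiece)."""

    def __init__(self, data: bytes):
        self.data = data  # the buffer itself is never modified
        self.pos = 0

    def read_line(self, delim: bytes = b"\n", strip_cr: bool = True) -> bytes:
        """Return the next record without its delimiter.

        A trailing carriage return is dropped only when splitting on
        newline with strip_cr set, so reads with a NUL delimiter are raw.
        """
        if self.pos >= len(self.data):
            raise EOFError("end of buffer")
        end = self.data.find(delim, self.pos)
        if end == -1:
            end = len(self.data)  # final record has no delimiter
        line = self.data[self.pos:end]
        self.pos = end + len(delim)
        if strip_cr and delim == b"\n" and line.endswith(b"\r"):
            line = line[:-1]
        return line


if __name__ == "__main__":
    reader = PieceReader(b"first line\r\nsecond line\n")
    print(reader.read_line())
    print(reader.read_line())
```

Keeping the stripping inside the read call means callers that need the
exact bytes can still pass strip_cr=False or use a different delimiter.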

------------------------------

Message: 5
Date: Tue, 19 May 2015 11:13:44 +0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Unexpected behaviour of placeables
To: carla.parra@hermestrans.com
Cc: Moses Support <moses-support@mit.edu>
Message-ID:
<CAEKMkbjQ0eJuX-nLSh7LAPgU3=S7nOxLj6wsXcDDLC=dvn=EAA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

What is the exact command you used to decode? Can you please provide the
moses.ini file and a few lines of your input data for us to look at?

Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu

On 18 May 2015 at 15:35, Carla Parra <carla.parra@hermestrans.com> wrote:

> Dear all,
>
> we just finished some experiments using placeables, and we have observed
> several issues that may be worth sharing. I don't know if someone has
> experienced the same, or you were already aware of this, but just in
> case:
>
> (1) Special characters must be escaped in the "entity" value field.
> Otherwise, they cause XML parsing errors at tuning (not at training,
> though!), and wrong values are retrieved from the tags (e.g. we had text
> with additional quotation marks, and this caused the translation to
> stop at the first quotation mark, not yielding the complete "entity"
> value we had encoded).
>
> (2) <ne> tags are added to sentences as if they were computed as tokens
> during training (i.e. not ignored, as they just contain the
> placeables).
> As an example, the English sentence "Allow simple password" is
> translated as "Permitir simple contraseña <ne translation="@tag@"
> entity="</1>">@tag@</ne> ."
>
> While the first issue is our fault, we do not know what causes the
> second one. We have followed the instructions at the MOSES advanced
> features site and thus specified "extract-settings = "--Placeholder
> @tag@"" in training and "-placeholder-factor 1 -xml-input exclusive" in
> the decoder and evaluation. Has anyone experienced the same thing and/or
> know how to solve this issue?
>
> Thank you very much. Best regards,
>
> Carla
>
> --
> Carla Parra Escartín
> Marie Curie Experienced Researcher - EXPERT ITN
> http://expert-itn.eu/
> Hermes Traducciones

------------------------------

Message: 6
Date: Tue, 19 May 2015 09:53:57 +0200
From: Carla Parra <carla.parra@hermestrans.com>
Subject: Re: [Moses-support] Unexpected behaviour of placeables
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: Moses Support <moses-support@mit.edu>
Message-ID: <b4d057fd7d33803babd6e28bf2036506@hermestrans.com>
Content-Type: text/plain; charset="utf-8"

Dear Hieu,

thanks for your reply. I attach the config file, my moses.ini (I think
this is the one you want to get), and a few lines of our input file,
already preprocessed. If you want the RAW lines I can also send them to
you.

I don't know if this is a similar issue, but I tried the same
strategy using the forced translations (<np
translation="German">Deutsch</np>), and this morning I observed the
same thing: some tags suddenly appear in the translation.

Thank you very much for your support!

Carla

On 19.05.2015 09:13, Hieu Hoang wrote:
> What is the exact command you used to decode? Can you please provide
> the moses.ini file and a few lines of your input data for us to look
> at?
>
> Hieu Hoang
> Researcher
>
> New York University, Abu Dhabi
>
> http://www.hoang.co.uk/hieu [3]
>
> On 18 May 2015 at 15:35, Carla Parra <carla.parra@hermestrans.com>
> wrote:
>
>> Dear all,
>>
>> we just finished some experiments using placeables, and we have observed
>> several issues that may be worth sharing. I don't know if someone has
>> experienced the same, or you were already aware of this, but just in
>> case:
>>
>> (1) Special characters must be escaped in the "entity" value field.
>> Otherwise, they cause XML parsing errors at tuning (not at training,
>> though!), and wrong values are retrieved from the tags (e.g. we had text
>> with additional quotation marks, and this caused the translation to
>> stop at the first quotation mark, not yielding the complete "entity"
>> value we had encoded).
>>
>> (2) <ne> tags are added to sentences as if they were computed as tokens
>> during training (i.e. not ignored, as they just contain the
>> placeables).
>> As an example, the English sentence "Allow simple password" is
>> translated as "Permitir simple contraseña <ne translation="@tag@"
>> entity="</1>">@tag@</ne> ."
>>
>> While the first issue is our fault, we do not know what causes the
>> second one. We have followed the instructions at the MOSES advanced
>> features site and thus specified "extract-settings = "--Placeholder
>> @tag@"" in training and "-placeholder-factor 1 -xml-input exclusive" in
>> the decoder and evaluation. Has anyone experienced the same thing and/or
>> know how to solve this issue?
>>
>> Thank you very much. Best regards,
>>
>> Carla
>>
>> --
>> Carla Parra Escartín
>> Marie Curie Experienced Researcher - EXPERT ITN
>> http://expert-itn.eu/ [1]
>> Hermes Traducciones
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support [2]
>
>
>
> Links:
> ------
> [1] http://expert-itn.eu/
> [2] http://mailman.mit.edu/mailman/listinfo/moses-support
> [3] http://www.hoang.co.uk/hieu

--
Carla Parra Escartín
Marie Curie Experienced Researcher - EXPERT ITN
http://expert-itn.eu/
Hermes Traducciones
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.placeables
Type: text/x-c
Size: 19991 bytes
Desc: not available
Url : http://mailman.mit.edu/mailman/private/moses-support/attachments/20150519/ccc77c13/attachment.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: moses.ini.1
Url: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150519/ccc77c13/attachment.bat
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Sample.input.preprocessed
Url: http://mailman.mit.edu/mailman/private/moses-support/attachments/20150519/ccc77c13/attachment-0001.bat
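
For point (1) in the report quoted above, here is a minimal sketch of
escaping special characters before they go into the "entity" attribute,
using Python's standard library. The helper names placeholder_tag and
read_entity are hypothetical, and the round trip is checked with a
generic XML parser; whether the Moses XML input handling unescapes the
value in the same way is an assumption that is not confirmed here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def placeholder_tag(entity: str, placeholder: str = "@tag@") -> str:
    """Build an <ne> element whose entity attribute is XML-escaped."""
    # escape() handles &, < and >; the extra mapping covers double quotes,
    # which would otherwise end the attribute value early.
    attr = escape(entity, {'"': "&quot;"})
    return f'<ne translation="{placeholder}" entity="{attr}">{placeholder}</ne>'


def read_entity(tag: str) -> str:
    """Parse the element with a generic XML parser and return the entity value."""
    return ET.fromstring(tag).get("entity")


if __name__ == "__main__":
    tag = placeholder_tag('</1>')
    print(tag)
    print(read_entity(tag))
```

With the value escaped, a quotation mark inside the entity no longer
cuts the attribute short when the tag is parsed.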

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 103, Issue 43
**********************************************
