Moses-support Digest, Vol 92, Issue 42

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Sentence alignment for Comparable Corpora (alireza tabebordbar)
2. Re: Large parallel corpora (Hieu Hoang)
3. Re: Recasing is very slow compared to the actual translation
(Hieu Hoang)


----------------------------------------------------------------------

Message: 1
Date: Mon, 23 Jun 2014 16:29:39 +0430
From: alireza tabebordbar <ar.tabebordbar@gmail.com>
Subject: [Moses-support] Sentence alignment for Comparable Corpora
To: Moses-support@mit.edu
Message-ID:
<CAAECi_G5xB8MR-5zykgPE6fvfA+dy1KZ9GMN3AVku=Kh2E0riQ@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi all, I'm a Master's student and I have a problem with sentence alignment.
I extracted some Persian-English data that is not parallel but comparable:
each Persian sentence is paired with the 50 most similar sentences on the
English side. Now I want to find the log probability of the alignment, and
the number of aligned/unaligned words, for each English sentence paired
with a Persian sentence.
Example:
Persian Sentence A - English Sentence 1
Persian Sentence A - English Sentence 2
Persian Sentence A - English Sentence 3
.... .....
Persian Sentence A - English Sentence 50
I know GIZA++ is suitable for alignment; however, I have only used GIZA++
on parallel corpora. How can I get the alignment log probability and the
number of aligned/unaligned words?
Best Regards.
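[Editor's note: one way to approach this, once GIZA++ (or any word aligner trained on parallel data) has produced a lexical translation table, is to score each candidate pair with IBM Model 1 directly. A minimal sketch, where the toy hand-made lexicon stands in for GIZA++'s lexical table and the alignment threshold is an assumed heuristic, not part of GIZA++:]

```python
import math

# Toy lexical translation table t(e|f). In practice these probabilities
# would come from the lexical table GIZA++ writes during training.
lex = {
    ("man", "my"): 0.8,
    ("ketab", "book"): 0.9,
    ("ketab", "the"): 0.05,
}

def model1_score(src_tokens, tgt_tokens, lex, floor=1e-6, threshold=0.1):
    """IBM Model 1 log-probability of tgt_tokens given src_tokens,
    plus aligned/unaligned word counts for the target side.

    The score sums, per target word e, log of the average of t(e|f)
    over all source words f. A target word counts as "aligned" if its
    best lexical probability reaches `threshold` (a heuristic cut-off).
    """
    log_prob = 0.0
    aligned = unaligned = 0
    for e in tgt_tokens:
        probs = [lex.get((f, e), floor) for f in src_tokens]
        log_prob += math.log(sum(probs) / len(src_tokens))
        if max(probs) >= threshold:
            aligned += 1
        else:
            unaligned += 1
    return log_prob, aligned, unaligned

# Score one Persian sentence against one English candidate:
score, a, u = model1_score(["man", "ketab"], ["my", "book"], lex)
```

Ranking the 50 English candidates for each Persian sentence by this score (optionally penalizing the unaligned-word count) gives a simple comparable-corpus sentence matcher.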

------------------------------

Message: 2
Date: Mon, 23 Jun 2014 10:20:20 -0400
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Large parallel corpora
To: Tom Hoar <tahoar@precisiontranslationtools.com>
Cc: Moses-Support <moses-support@mit.edu>
Message-ID:
<CAEKMkbhUhZ-=sDQX_-aB04Kfx6CkwMadR4zA8-APf9t5RTmiNA@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I've never done it myself. There is a limit on the maximum number of shards
when parallelizing the extract and scoring steps during training.

I've raised that limit from 99,999 to 9,999,999:

https://github.com/moses-smt/mosesdecoder/commit/f95a1bb75b2add5b7dcd1e3e5c76777f2f141e21

Other than that, I can't think of any other issues.

To minimize disk space usage (and probably increase speed too), compress the
intermediate training files, and tune the sorting. These are my arguments to
train-model.perl to do both:

..../train-model.perl -sort-buffer-size 1G -sort-batch-size 253 \
    -sort-compress gzip -cores 8



On 20 June 2014 09:50, Tom Hoar <tahoar@precisiontranslationtools.com>
wrote:

> Does anyone have experience (words-of-wisdom) training the translation
> model from a parallel corpus with 2.25 trillion phrase pairs and over 45
> trillion tokens?
>
> Thanks,
> Tom
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

------------------------------

Message: 3
Date: Mon, 23 Jun 2014 11:34:35 -0400
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Recasing is very slow compared to the
actual translation
To: Stanislav Ku??k <standa.kurik@gmail.com>, moses-support@MIT.EDU
Message-ID: <53A8490B.5030204@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

The recaser should be very fast. Set the distortion limit to 0, and
binarize your phrase table and language model.
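[Editor's note: a minimal sketch of those two changes in the legacy moses.ini format quoted below. The .bin file names and paths are illustrative; the phrase table is binarized offline with Moses' processPhraseTable tool and the ARPA language model with KenLM's build_binary, and the leading 1 on the ttable-file line selects the binary phrase-table implementation in this config format:]

```
# binarize once, offline (illustrative invocations):
#   processPhraseTable -ttable 0 0 phrase-table.gz -nscores 5 -out phrase-table.bin
#   build_binary cased.srilm.gz cased.binlm

[ttable-file]
1 0 0 5 /path/to/recaser/phrase-table.bin

[distortion-limit]
0
```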


On 20/06/14 11:03, Stanislav Ku??k wrote:
> Hello,
>
> I wonder if it's normal for the recaser to take 7 seconds to process a
> single sentence when the actual translation of that sentence took 3 seconds.
>
> Below is the recaser's config file. As for the referred files, phrase-
> table.gz is about 3.5 MB while cased.srilm.gz is 33.7 MB.
>
> I tried setting the 'distortion-limit' to 0 but it did not make any
> difference.
>
> Thank you.
>
> #########################
> ### MOSES CONFIG FILE ###
> #########################
>
> # input factors
> [input-factors]
> 0
>
> # mapping steps
> [mapping]
> 0 T 0
>
> [ttable-file]
> 0 0 0 5 /storage/moses/trained-models/EN_SV-SE_2014-04-
> 22T04_00_17_105475/recaser/phrase-table.gz
>
> # no generation models, no generation-file section
>
> # language models: type(srilm/irstlm), factors, order, file
> [lmodel-file]
> 0 0 3 /storage/moses/trained-models/EN_SV-SE_2014-04-
> 22T04_00_17_105475/recaser/cased.srilm.gz
>
>
> # limit on how many phrase translations e for each phrase f are loaded
> # 0 = all elements loaded
> [ttable-limit]
> 20
>
> # distortion (reordering) weight
> [weight-d]
> 0.6
>
> # language model weights
> [weight-l]
> 0.5000
>
>
> # translation model weights
> [weight-t]
> 0.20
> 0.20
> 0.20
> 0.20
> 0.20
>
> # no generation models, no weight-generation section
>
> # word penalty
> [weight-w]
> -1
>
> [distortion-limit]
> 6
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>



------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 92, Issue 42
*********************************************
