Moses-support Digest, Vol 108, Issue 17

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Faster decoding with multiple moses instances
(Michael Denkowski)


----------------------------------------------------------------------

Message: 1
Date: Tue, 6 Oct 2015 16:39:20 -0400
From: Michael Denkowski <michael.j.denkowski@gmail.com>
Subject: Re: [Moses-support] Faster decoding with multiple moses
instances
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: Moses Support <moses-support@mit.edu>, Philipp Koehn <phi@jhu.edu>
Message-ID:
<CA+-GegKLyoz60+r_0FEkhqEemn5RX8i5psVM+CHyttJnLJDvJg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Hieu and all,

I just checked in a bug fix for the multi_moses.py script. I forgot to
override the number of threads for each moses command, so if [threads] was
specified in moses.ini, the multi-moses runs were cheating by running a
bunch of multi-threaded instances. If [threads] was only specified on the
command line, the script was already stripping the flag correctly, so those
runs are unaffected. I finished a benchmark on my system with an unpruned
compact PT (with the fixed script) and got the following:

16 threads 5.38 sent/sec
16 procs 13.51 sent/sec
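
For reference, the fix boils down to forcing every spawned instance into
single-threaded mode no matter what the config says. Below is a minimal
sketch of that kind of override (a hypothetical helper, not the actual
script code; it relies on command-line flags taking precedence over
moses.ini settings):

    def single_threaded(args):
        """Return a copy of a moses command forced to one thread.

        Drops any -threads flag already on the command line and appends
        -threads 1, which also overrides a [threads] setting inherited
        from moses.ini. Hypothetical helper, not multi_moses.py itself."""
        args = list(args)
        if '-threads' in args:
            i = args.index('-threads')
            del args[i:i + 2]
        return args + ['-threads', '1']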

This definitely used a lot more memory, though. Based on some very rough
estimates from watching free system memory, the memory-mapped suffix array
PT went from 2G to 6G with 16 processes, while the compact PT went from 3G
to 37G. For cases where everything fits into memory, I've seen significant
speedup from multi-process decoding.

For cases where things don't fit into memory, the multi-moses script could
be extended to start as many multi-threaded instances as will fit into RAM
and farm out sentences in a way that keeps all of the CPUs busy. I know
Marcin has mentioned using GNU parallel.
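
The simplest version of that idea is a static split: chunk the input, give
each chunk to its own decoder process, and concatenate the outputs in
order. Here is a rough sketch under those assumptions (hypothetical code,
with none of the dynamic sentence-farming described above):

    import subprocess
    import tempfile

    def decode_chunks(moses_cmd, lines, n_procs):
        """Decode `lines` with n_procs parallel moses processes.

        moses_cmd is a list, e.g. ['bin/moses', '-f', 'moses.ini']
        (illustrative paths). Splits the input into contiguous chunks,
        pipes each chunk through its own single-threaded instance, and
        returns the translations in the original order. A static split
        only: a sketch, not a load-balancing scheduler."""
        chunk = (len(lines) + n_procs - 1) // n_procs
        jobs = []
        for i in range(n_procs):
            part = lines[i * chunk:(i + 1) * chunk]
            if not part:
                break
            # stage this chunk in a temp file and start one decoder on it
            src = tempfile.NamedTemporaryFile('w', delete=False)
            src.writelines(part)
            src.close()
            out = tempfile.NamedTemporaryFile('w', delete=False)
            proc = subprocess.Popen(moses_cmd + ['-threads', '1'],
                                    stdin=open(src.name),
                                    stdout=out)
            jobs.append((proc, out))
        results = []
        for proc, out in jobs:
            proc.wait()
            out.close()
            results.extend(open(out.name).readlines())
        return results

With models that don't all fit in RAM, n_procs would be capped by memory
and each instance could run multi-threaded instead of single-threaded.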

Best,
Michael

On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <hieuhoang@gmail.com> wrote:

> I've just run some comparisons between the multi-threaded decoder and the
> multi_moses.py script. It's good stuff.
>
> It makes me seriously wonder whether we should abandon multi-threading
> and go all out for the multi-process approach.
>
> There's some advantage to multi-threading - e.g. where model files are
> loaded into memory rather than memory-mapped. But there are disadvantages
> too - it's more difficult to maintain and there's about a 10% overhead.
>
> What do people think?
>
> Phrase-based:
>
> (Columns are the number of threads for baseline runs and the number of
> processes for multi_moses runs.)
>
> (32) Baseline (Compact pt)
>              1          5          10         15         20         25         30
>   real   4m37.000s  1m15.391s  0m51.217s  0m48.287s  0m50.719s  0m52.027s  0m53.045s
>   user   4m21.544s  5m28.597s  6m38.227s  8m0.975s   8m21.122s  8m3.195s   8m4.663s
>   sys    0m15.451s  0m34.669s  0m53.867s  1m10.515s  1m20.746s  1m24.368s  1m23.677s
>
> (34) = (32) + multi_moses
>              1          5          10         15         20         25         30
>   real   4m49.474s  1m17.867s  0m43.096s  0m31.999s  0m26.497s  0m26.296s  killed
>   user   4m33.580s  4m40.486s  4m56.749s  5m6.692s   5m43.845s  7m34.617s
>   sys    0m15.957s  0m32.347s  0m51.016s  1m11.106s  1m44.115s  2m21.263s
>
> (38) Baseline (Probing pt)
>              1          5          10         15         20         25         30
>   real   4m46.254s  1m16.637s  0m49.711s  0m48.389s  0m49.144s  0m51.676s  0m52.472s
>   user   4m30.596s  5m32.500s  6m23.706s  7m40.791s  7m51.946s  7m52.892s  7m53.569s
>   sys    0m15.624s  0m36.169s  0m49.433s  1m6.812s   1m9.614s   1m13.108s  1m12.644s
>
> (39) = (38) + multi_moses
>              1          5          10         15         20         25         30
>   real   4m43.882s  1m17.849s  0m34.245s  0m31.318s  0m28.054s  0m24.120s  0m22.520s
>   user   4m29.212s  4m47.693s  5m5.750s   5m33.573s  6m18.847s  7m19.642s  8m38.013s
>   sys    0m15.835s  0m25.398s  0m36.716s  0m41.349s  0m48.494s  1m0.843s   1m13.215s
>
> Hiero:
>
> (3) Baseline (6/10)
>              1          5          10         15          20          25          30          32
>   real   5m33.011s  1m28.935s  0m59.470s  1m0.315s    0m55.619s   0m57.347s   0m59.191s   1m2.786s
>   user   4m53.187s  6m23.521s  8m17.170s  12m48.303s  14m45.954s  17m58.109s  20m22.891s  21m13.605s
>   sys    0m39.696s  0m51.519s  1m3.788s   1m22.125s   1m58.718s   2m51.249s   4m4.807s    4m37.691s
>
> (4) = (3) + multi_moses
>              5          10         15         20         25         30         32
>   real   1m27.215s  0m40.495s  0m36.206s  0m28.623s  0m26.631s  0m25.817s  0m25.401s
>   user   5m4.819s   5m42.070s  5m35.132s  6m46.001s  7m38.151s  9m6.500s   10m32.739s
>   sys    0m38.039s  0m45.753s  0m44.117s  0m52.285s  0m56.655s  1m6.749s   1m16.935s
>
> On 05/10/2015 16:05, Michael Denkowski wrote:
>
> Hi Philipp,
>
> Unfortunately I don't have a precise measurement. If anyone knows of a
> good way to measure the memory usage of a process tree in which many
> processes memory-map the same files, I would be glad to run it.
>
> --Michael
>
> On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <phi@jhu.edu> wrote:
>
>> Hi,
>>
>> great - that will be very useful.
>>
>> Since you just ran the comparison - do you have any numbers on "still
>> allowed everything to fit into memory", i.e., how much more memory is used
>> by running parallel instances?
>>
>> -phi
>>
>> On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski <
>> michael.j.denkowski@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Like some other Moses users, I noticed diminishing returns from running
>>> Moses with several threads. To work around this, I added a script to run
>>> multiple single-threaded instances of moses instead of one multi-threaded
>>> instance. In practice, this sped things up by about 2.5x for 16 CPUs, and
>>> using memory-mapped models still allowed everything to fit into memory.
>>>
>>> If anyone else is interested in using this, you can prefix a moses
>>> command with scripts/generic/multi_moses.py. To use multiple instances in
>>> mert-moses.pl, specify --multi-moses and control the number of parallel
>>> instances with --decoder-flags='-threads N'.
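>>>
>>> For example (the binary path and file names here are illustrative):
>>>
>>>     scripts/generic/multi_moses.py bin/moses -f moses.ini -threads 16 \
>>>         < input.txt > output.txt
>>>
>>> The -threads value is reinterpreted as the number of single-threaded
>>> moses processes to run rather than the number of threads in one process.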
>>>
>>> Below is a benchmark on WMT fr-en data (2M training sentences, 400M
>>> words mono, suffix array PT, compact reordering, 5-gram KenLM) testing
>>> default stack decoding vs cube pruning without and with the parallelization
>>> script (+multi):
>>>
>>> ---
>>> 1cpu sent/sec
>>> stack 1.04
>>> cube 2.10
>>> ---
>>> 16cpu sent/sec
>>> stack 7.63
>>> +multi 12.20
>>> cube 7.63
>>> +multi 18.18
>>> ---
>>>
>>> --Michael
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> --
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
>

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 108, Issue 17
**********************************************
