Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Re: Output unknown words in moses server (Hieu Hoang)
2. Re: Reusing alignments for 10**9 fr-en and using fast_align
(rohit dholakia)
3. Re: Reusing alignments for 10**9 fr-en and using fast_align
(Philipp Koehn)
4. Re: Reusing alignments for 10**9 fr-en and using fast_align
(Holger Schwenk)
----------------------------------------------------------------------
Message: 1
Date: Tue, 7 Jan 2014 17:24:20 +0000
From: Hieu Hoang <Hieu.Hoang@ed.ac.uk>
Subject: Re: [Moses-support] Output unknown words in moses server
To: Roee Aharoni <roee.aharoni@gmail.com>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAEKMkbi4=okgJRGW9nGLqHLtY8zMEkjdETyBH5=ccFY62uXjnw@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I don't think that's implemented for the server, and I'm not sure how
useful it would be even if it were.
There is also another option
-mark-unknown
which marks the unknown word in the output with a prefix 'UNK'. The client
application can do what it wants with this information.
I'm not sure whether this option is implemented in the server, but it's
probably easy to implement and easier to use. If you want to implement it,
I can show you how.
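As a rough illustration of what a client application could do with the marked output, here is a minimal Python sketch. It assumes the decoder glues the prefix 'UNK' directly onto each unknown token, as described above; the exact output format is an assumption, and the function name split_unknown_words is hypothetical, not part of Moses.

```python
def split_unknown_words(translation, marker="UNK"):
    """Return (clean_text, unknown_words) for a whitespace-tokenized translation.

    Assumption: unknown words appear as marker + word (e.g. "UNKchat").
    A genuine word that happens to start with the marker would be
    misclassified, so a real client may need a stricter convention.
    """
    clean_tokens = []
    unknown_words = []
    for token in translation.split():
        if token.startswith(marker) and len(token) > len(marker):
            word = token[len(marker):]  # strip the marker prefix
            unknown_words.append(word)
            clean_tokens.append(word)
        else:
            clean_tokens.append(token)
    return " ".join(clean_tokens), unknown_words
```

The client could then log the returned list of unknown words, much like the file that -output-unknown produces for the command-line decoder.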
On 7 January 2014 16:08, Roee Aharoni <roee.aharoni@gmail.com> wrote:
> Hello,
> I use the option -output-unknown in moses decoder, which outputs the
> unknown (untranslated) words to a file. Is there an equivalent output from
> the mosesserver process?
>
> Thanks!
>
>
> Roee
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
------------------------------
Message: 2
Date: Tue, 7 Jan 2014 13:26:18 -0800
From: rohit dholakia <rdholaki@sfu.ca>
Subject: Re: [Moses-support] Reusing alignments for 10**9 fr-en and
using fast_align
To: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAA==Lgsc1j6yhErEeq0j=29aHN+YGZPaWtdzvTdHZm0MNJaJyg@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
Thanks for the reply. It looks like 24 GB or even 48 GB of memory is not
enough to run fast_align on the 10**9 corpus. I have never dealt with a
corpus this size or used fast_align before, so I am not sure how much memory
to request on the cluster.
Would it be possible to divide the file into, say, 8 parts, run
fast_align on each, symmetrize them as you pointed out above, and then
finally join the 8 symmetrized parts?
Thanks again.
On Mon, Jan 6, 2014 at 5:21 PM, Philipp Koehn <pkoehn@inf.ed.ac.uk> wrote:
> Hi,
>
> using the billion word French-English corpus, mgiza runs for a week even
> with 8 cores.
> So, fast_align is a good alternative.
>
> Make sure that you create output in a way that the subsequent processing
> steps can deal with. The easiest way to do that is to use experiment.perl
> (see the example config files); otherwise, run the following command to
> create symmetrized word alignments:
>
> /path/to/moses/scripts/ems/support/symmetrize-fast-align.perl
> fast-align-output fast-align-inverse-output corpus.f corpus.e aligned
> grow-diag-final-and /path/to/moses/bin/symal
>
> -phi
>
>
>
> On Mon, Jan 6, 2014 at 9:08 PM, rohit dholakia <rdholaki@sfu.ca> wrote:
>
>> Hi,
>>
>> I have been trying to build a fr-en phrase table from the 10**9 fr-en
>> Europarl corpora. Unfortunately, 47 hours of mgiza cluster time was not
>> enough to get the alignments. If I restart, will Moses reuse the alignments
>> it already has? I used --parallel and --parts 8, so Moses has something in
>> both directions.
>>
>> Also, I have been trying to use fast_align, but it failed with a bad_alloc,
>> presumably for lack of memory. Having said that, can I use fast_align in
>> the following manner:
>>
>> 0. Run moses --last-step 1
>>
>> 1. use fast_align with -d -o -v
>>
>> 2. Resume Moses as --first-step 3 --last-step 6
>>
>> Thanks !
------------------------------
Message: 3
Date: Wed, 8 Jan 2014 00:37:19 +0000
From: Philipp Koehn <pkoehn@inf.ed.ac.uk>
Subject: Re: [Moses-support] Reusing alignments for 10**9 fr-en and
using fast_align
To: rohit dholakia <rdholaki@sfu.ca>
Cc: moses-support <moses-support@mit.edu>
Message-ID:
<CAAFADDAOGnCjLnV4qbXGFoT56M6cQqtsB7c8YJx939Sn_fH5ig@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
yes, running word alignment on parts is reasonable.
There is a concern that this affects alignment accuracy, but that may not
be a significant issue with such a large corpus.
-phi
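For illustration, splitting a parallel corpus into parts and joining the per-part results could look roughly like the Python sketch below. It assumes the two corpus files have the same number of lines and writes each part in the "source ||| target" one-pair-per-line format that fast_align reads. The function names and the part file naming are hypothetical. Because alignment output has one line per sentence pair, the per-part outputs can be concatenated in part order.

```python
import os


def split_for_fast_align(src_path, tgt_path, out_dir, parts=8):
    """Write up to `parts` files of 'source ||| target' lines; return their paths.

    Assumption: src_path and tgt_path are line-aligned and equally long.
    Empty sentences are not filtered here.
    """
    with open(src_path, encoding="utf-8") as f:
        total = sum(1 for _ in f)
    chunk = max(1, -(-total // parts))  # ceiling division: lines per part
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for i, (s, t) in enumerate(zip(src, tgt)):
            if i % chunk == 0:  # start a new part file
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part%02d.txt" % (i // chunk))
                paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write("%s ||| %s\n" % (s.strip(), t.strip()))
    if out is not None:
        out.close()
    return paths


def join_parts(part_paths, joined_path):
    """Concatenate per-part files (e.g. alignment outputs) in part order."""
    with open(joined_path, "w", encoding="utf-8") as out:
        for path in part_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)
```

Each part would be aligned in both directions and symmetrized separately; join_parts would then be applied to the symmetrized files, keeping the same part order as the split.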
------------------------------
Message: 4
Date: Wed, 08 Jan 2014 02:57:49 +0100
From: Holger Schwenk <holger.schwenk@lium.univ-lemans.fr>
Subject: Re: [Moses-support] Reusing alignments for 10**9 fr-en and
using fast_align
To: moses-support@mit.edu
Message-ID: <52CCB09D.50405@lium.univ-lemans.fr>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
you may also consider data selection techniques to extract a relevant
subset of all that training data.
This will allow you to train faster and fit into memory, and you
usually get better results since you discard out-of-domain data, which
may have a negative impact on the estimation of the translation
probabilities.
Popular techniques are
- Moore and Lewis, Intelligent Selection of Language Model Training
Data, ACL 2010
(using either the source or target language only)
- Axelrod, He and Gao, Domain Adaptation via Pseudo In-Domain Data
Selection, EMNLP 2011
(using target and source simultaneously)
Both are implemented in the open-source tool XenC
(https://github.com/rousseau-lium/XenC)
- Holger
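To make the first technique concrete, here is a toy Python sketch of cross-entropy difference scoring in the spirit of Moore and Lewis: each candidate sentence is scored by its per-word cross-entropy under an in-domain model minus that under a general-domain model, and the lowest-scoring sentences are kept. As a simplifying assumption it uses add-one-smoothed unigram models rather than proper n-gram language models. It does not reflect how XenC is implemented, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for whitespace-tokenized text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_in_domain(candidates, in_domain, general, keep):
    """Keep the `keep` candidates with the lowest cross-entropy difference."""
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size, plus one slot for unseen words.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1

    def score(s):
        return (cross_entropy(s, in_model, vocab_size)
                - cross_entropy(s, gen_model, vocab_size))

    non_empty = [s for s in candidates if s.split()]  # skip empty sentences
    return sorted(non_empty, key=score)[:keep]
```

For a parallel corpus, the score would be computed on one side (source or target) and the matching sentence pairs kept together; the bilingual variant of Axelrod et al. sums the differences over both sides.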
--
Holger Schwenk
membre IUF sénior
professeur en Informatique
LIUM - Université du Maine
email : schwenk@lium.univ-lemans.fr
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 87, Issue 16
*********************************************