Moses-support Digest, Vol 87, Issue 53

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."

Today's Topics:

1. Re: Language modelling (Hieu Hoang)
2. Re: word alignment-words' indexes and sentences' length
(amir haghighi)
3. Re: word alignment-words' indexes and sentences' length (Tom Hoar)

----------------------------------------------------------------------

Message: 1
Date: Thu, 23 Jan 2014 22:50:49 +0000
From: Hieu Hoang <hieuhoang@gmail.com>
Subject: Re: [Moses-support] Language modelling
To: moses-support@mit.edu
Message-ID: <52E19CC9.8060307@gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

There's no assumption that English is the target language.

In some Moses programs, -f originally referred to French and -e to
English. However, these names are only historical.

The options should be read as
-f = source
-e = target
but no one has had the time to change the variable names.
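
For an English-to-Tamil system, that means only the extensions passed to
-f and -e swap. A minimal sketch, assuming a hypothetical corpus stem
(corpus/train) and hypothetical file extensions (en, ta); the helper name
is made up, while -corpus, -f and -e are train-model.perl options:

```python
def train_model_args(corpus_stem, source_ext, target_ext):
    """Build an argument list for train-model.perl.

    -f takes the source-language file extension and -e the target one,
    whatever the actual languages are.
    """
    return [
        "train-model.perl",
        "-corpus", corpus_stem,  # expects <stem>.<ext> for both languages
        "-f", source_ext,        # source side ("French"/"foreign" is historical)
        "-e", target_ext,        # target side ("English" is historical)
    ]


# Hypothetical corpus files: corpus/train.en and corpus/train.ta
print(" ".join(train_model_args("corpus/train", "en", "ta")))
```

A real run needs further options (for example the language model), which
are omitted here.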

On 22/01/2014 03:15, Arththika Paramanathan wrote:
> Moses seems to assume that English is the target language and the other
> language is the source (foreign) language, so that we can translate a
> foreign language into English (in my case, Tamil-English). I want to
> translate English-Tamil. What do I need to change
> (in the train-model.perl file)?
>
>
> On Wed, Jan 22, 2014 at 8:37 AM, Arththika Paramanathan
> <arthiparamanathan@gmail.com <mailto:arthiparamanathan@gmail.com>> wrote:
>
> Hi Nicola,
> Thank you for your response.
>
> I think building an LM with IRSTLM takes 4 or 5 steps.
> In step 1, it splits the corpus into 1-grams with their frequency
> counts (there is no sorting here).
> In step 2, it splits this dictionary into 3 dictionaries (balanced
> n-gram lists). Here, the threshold is approximately the total
> number of words divided by 3. Is that correct?
> In step 3, it collects n-grams for each dictionary, i.e. for each word
> in each split dictionary, it searches for 3-grams and puts them in a
> separate file.
> I don't understand the next step (the ARPA file).
> How are these values calculated?
> -3.72202 <s> -0.598275
> -3.17795 illegal -0.60206
> -2.42099 folder -0.500602
> -2.53169 name -0.723104
>
> Can you please explain how these values are calculated?
>
>
> On Tue, Jan 21, 2014 at 10:46 PM, Nicola Bertoldi <bertoldi@fbk.eu
> <mailto:bertoldi@fbk.eu>> wrote:
>
> Hi Arththika,
>
>
> (1) In language modelling,
> how does IRSTLM split the dictionary extracted from the
> corpus into 3 dictionaries?
> how are the n-gram counts calculated?
>
>
>
> I would like to answer your first question
> as the person responsible for the IRSTLM toolkit.
>
> If anything is unclear, please reply privately to me only.
>
>
> I suppose you are using the build-lm.sh script from IRSTLM.
>
> The script splits the dictionary, sorted by 1-gram
> frequency,
> in such a way that the global frequency of each part is balanced.
>
> In this way the corresponding partitions of the n-grams are
> balanced as well.
> The n-gram partitions are built according to the first token
> of each n-gram.
>
> I am not sure what you mean by the second part of the question.
>
> best regards,
> Nicola
>
>
>
>
> On Jan 20, 2014, at 7:34 PM, Arththika Paramanathan wrote:
>
> Hi,
>
> (2) And, If English is the foreign language, what I want to
> change, (in train-model.perl file)
>
> (3) can anyone tell me that how to use a perl module? I want
> to use this module named Locale-Maketext-Lexicon-0.97 to
> extract translatable strings from po files.
>
>
>
> --
> regards,
> P.Arththika
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
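
Nicola's description of the dictionary split, quoted above, can be
illustrated with a short Python sketch. This is not the code of
build-lm.sh: the greedy "lightest part first" rule, the function names and
the toy counts are all assumptions, chosen only to show what "balanced
global frequency" and "partition by first token" mean.

```python
def split_dictionary(freqs, parts=3):
    """Split a word->count dictionary into `parts` word lists whose total
    counts are similar (greedy heuristic, assumed for illustration)."""
    buckets = [[] for _ in range(parts)]
    totals = [0] * parts
    # Visit words from most to least frequent; ties broken alphabetically.
    for word, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))  # part with the smallest total
        buckets[lightest].append(word)
        totals[lightest] += count
    return buckets, totals


def partition_ngrams(ngrams, buckets):
    """Put each n-gram in the part whose word list holds its first token."""
    owner = {word: i for i, words in enumerate(buckets) for word in words}
    parts = [[] for _ in buckets]
    for ngram in ngrams:
        parts[owner[ngram[0]]].append(ngram)
    return parts


# Toy 1-gram counts (made up).
freqs = {"the": 6, "of": 4, "a": 3, "cat": 2, "dog": 1}
buckets, totals = split_dictionary(freqs)
print(buckets, totals)
print(partition_ngrams([("the", "cat"), ("a", "dog"), ("of", "the")], buckets))
```

Because every n-gram follows its first token, parts with similar 1-gram
mass also end up with similar numbers of n-grams.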

------------------------------

Message: 2
Date: Fri, 24 Jan 2014 09:06:53 +0330
From: amir haghighi <amir.haghighi.64@gmail.com>
Subject: Re: [Moses-support] word alignment-words' indexes and
sentences' length
To: moses-support@mit.edu
Message-ID:
<CA+UVbEhQ5GmPv_1sVe3WM2Z6ZbH8tnrPWMwT0pNCpr63AhN-pA@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

I removed all of the double spaces from the corpus, but there are still
some double spaces in the tokenised file.
My source language is Persian and I have half-spaces in my corpus. I
noticed that after the tokenisation step, these half-spaces are converted to
double spaces. This conversion disturbs the sentence length and the
alignment.
How can I prevent this conversion?

Thank you again
Amir
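
To show why a doubled space can change the reported sentence length, here
is a minimal Python sketch. It is not GIZA++ or Moses code, and the sample
line and the helper name are made up; it only compares splitting on single
spaces with splitting on runs of whitespace.

```python
line = "a b  c"  # two spaces between "b" and "c"

# Splitting on a single space yields an empty "word" for the doubled space.
naive_tokens = line.split(" ")

# Splitting on runs of whitespace ignores the doubled space.
robust_tokens = line.split()

print(len(naive_tokens), naive_tokens)    # 4 ['a', 'b', '', 'c']
print(len(robust_tokens), robust_tokens)  # 3 ['a', 'b', 'c']


def collapse_spaces(text):
    """Rewrite a line so that tokens are separated by exactly one space."""
    return " ".join(text.split())


print(collapse_spaces(line))  # a b c
```

Running collapse_spaces over the tokenised file, rather than only the raw
corpus, would remove double spaces introduced by the tokenizer itself.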

On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang <Hieu.Hoang@ed.ac.uk> wrote:

> yes, remove the double space. Sometimes, the double space is ignored,
> sometimes it's counted as a 'word' with no characters, depending on exactly
> how the program tokenizes the line.
>
>
>
>
> On 22 January 2014 10:09, amir haghighi <amir.haghighi.64@gmail.com> wrote:
>
>> Thank you Hieu,
>>
>> The corpus is UTF-8, but there is a double space in this line. Are double
>> spaces regarded as a word?
>> Should I remove double spaces from the lines manually to get the correct
>> sentence length?
>>
>>
>>
>> On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <hieuhoang@gmail.com> wrote:
>>
>>>
>>> On 20/01/2014 13:45, amir haghighi wrote:
>>>
>>> Hello
>>>
>>> I have some questions about the GIZA word alignment.
>>>
>>> 1- Where is the final alignment file? Is it the aligned.1.grow.... in
>>> the model folder?
>>>
>>> yes.
>>>
>>>
>>> 2- Do the word indexes of both the target and source sentences start
>>> from 0?
>>>
>>> yes
>>>
>>>
>>> 3- How does GIZA calculate the length of a sentence?
>>>
>>> the number of words
>>>
>>> I have a sentence with 11 tokens separated by spaces, but in
>>> the alignment file its length is 13.
>>>
>>> strange. Are you sure your corpus file is encoded as UTF8? Are there
>>> double spaces in the line?
>>>
>>>
>>> Regards
>>> Amir
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>

------------------------------

Message: 3
Date: Fri, 24 Jan 2014 13:03:54 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] word alignment-words' indexes and
sentences' length
To: moses-support@mit.edu
Message-ID: <52E2024A.5040901@precisiontranslationtools.com>
Content-Type: text/plain; charset="utf-8"

What tokenizer are you using? You can either edit/configure the
tokenizer to treat the half-spaces as non-whitespace, or escape them
before passing the text to the tokenizer.
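
The escaping option could look roughly like the Python sketch below. It
assumes the half-space is the zero-width non-joiner (U+200C), which should
be checked against the corpus; the placeholder string, the function names
and the stand-in tokenizer are hypothetical.

```python
ZWNJ = "\u200c"            # assumed code point of the Persian half-space
PLACEHOLDER = "xZWNJx"     # hypothetical marker; must not occur in the corpus


def protect_half_spaces(text):
    """Replace each half-space with a letters-only placeholder."""
    return text.replace(ZWNJ, PLACEHOLDER)


def restore_half_spaces(text):
    """Turn placeholders back into half-spaces after tokenization."""
    return text.replace(PLACEHOLDER, ZWNJ)


def tokenize_preserving_half_spaces(text, tokenize):
    """Run `tokenize` (any str -> str function) with half-spaces escaped."""
    return restore_half_spaces(tokenize(protect_half_spaces(text)))


def toy_tokenizer(text):
    """Stand-in that mimics the reported problem: half-space -> two spaces."""
    return text.replace(ZWNJ, "  ")


word = "\u0645\u06cc\u200c\u0631\u0648\u0645"  # a Persian word with a half-space
print(len(toy_tokenizer(word).split(" ")))                                   # 3
print(len(tokenize_preserving_half_spaces(word, toy_tokenizer).split(" ")))  # 1
```

The placeholder uses only plain letters so that a tokenizer which splits
off punctuation is less likely to break it apart.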

On 01/24/2014 12:36 PM, amir haghighi wrote:
>
> I removed all of the double spaces from the corpus, but there are still
> some double spaces in the tokenised file.
> My source language is Persian and I have half-spaces in my corpus. I
> noticed that after the tokenisation step, these half-spaces are
> converted to double spaces. This conversion disturbs the sentence
> length and the alignment.
> How can I prevent this conversion?
>
> Thank you again
> Amir
>
>

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

End of Moses-support Digest, Vol 87, Issue 53
*********************************************
