Send Moses-support mailing list submissions to
moses-support@mit.edu
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu
You can reach the person managing the list at
moses-support-owner@mit.edu
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."
Today's Topics:
1. Keeping track of the Moses version in your experiment logs
(Ulrich Germann)
2. Re: decoder question (Tom Hoar)
3. FinalState in Moses decoder (Vu Thuong Huyen)
4. Re: decoder question (Vincent Nguyen)
----------------------------------------------------------------------
Message: 1
Date: Fri, 4 Dec 2015 23:29:48 +0000
From: Ulrich Germann <ulrich.germann@gmail.com>
Subject: [Moses-support] Keeping track of the Moses version in your
experiment logs
To: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID:
<CAHQSRUpiK-A4_sNY_RQPMywyfYkYi+9=aV-6Eo4E06z-iqRDNQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi all,
This is a service announcement for all maintainers of experimental
frameworks for Moses (EMS, Eman, ...), as well as a request for
information from contributors of plug-in libraries that are considered
3rd-party dependencies from the Moses point of view (e.g., IRSTLM, OxLM,
etc.).
Maintainers of experimental frameworks:
Moses now has a --version switch that prints the git commit hash or tag of
the code base that Moses was compiled from (as long as you compile with
bjam; if you don't, I'd be very curious how you do it), along with
information about the 3rd-party libraries used. You may find this useful for
logging purposes, for example to improve the replicability of experiments.
The value reported for Moses is the output of "git describe --dirty".
Contributors of external libraries:
Please let me know how to determine the version of your library at compile
time, either through macros defined in some header file, or by running a
command (e.g., xmlrpc-c-config --version) from within bjam prior to actual
compilation.
Best regards - Uli
--
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151204/f81d7d07/attachment-0001.html
------------------------------
Message: 2
Date: Sat, 5 Dec 2015 09:39:51 +0700
From: Tom Hoar <tahoar@precisiontranslationtools.com>
Subject: Re: [Moses-support] decoder question
To: moses-support@mit.edu
Message-ID: <56624E77.7040602@precisiontranslationtools.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Here's another perspective. The concept of what should be translated
as a "sentence" during production depends on the training data and
tuning set that created the model. I like Ulrich's input. The period
(question mark, exclamation mark, etc.) is just a token. The newline
marker tells Moses, "start translating a new job with all the tokens
before me."
Let's say you train your translation model with a parallel corpus broken
down into paired part-of-speech phrases (noun phrases, verb phrases,
object phrases, etc.). Then build your language model using the target
half of the part-of-speech corpus. Finally, tune your SMT model using
this TM/LM pair and a tuning set with part-of-speech pairs. Your
production translation input should also be broken into those same
part-of-speech phrases to achieve optimal results. With such a model,
you will get degraded results if you translate a complete sentence or a
paragraph (multiple sentences).
Here's a modified approach. Train a translation model with the same
part-of-speech parallel corpus. Then, use a different version of the
target-language corpus with complete sentences (i.e. broken at sentence
boundaries like full stops, question marks, etc.). Next, tune your SMT
model with a tuning set of paired complete sentences that match the LM's
breaks. The tuning process optimizes performance for that type of input.
Therefore, your optimized translation results will mirror the LM corpus
and matched tuning set. You will get degraded results if you translate
part-of-speech phrases, multiple sentences, or complete paragraphs.
We call these "things" models because they're supposed to be a miniature
representation of a larger universe. So, you'll always get the best
results when your production input matches the input side of your tuning
set.
Re newline markers, I think Ulrich's "Mac: CR" applies to the legacy Mac
OS. The current OS X uses the Posix/Linux LF. We have not tested our
cross-platform updates with the older Mac CR, and I suspect it will not
work. So I suggest using either CRLF or LF, which we have used
extensively across Windows and Posix systems.
Tom
On 12/5/2015 6:13 AM, moses-support-request@mit.edu wrote:
> Date: Fri, 4 Dec 2015 23:13:10 +0000
> From: Ulrich Germann<ulrich.germann@gmail.com>
> Subject: Re: [Moses-support] decoder question
> To: Vincent Nguyen<vnguyen@neuf.fr>
> Cc: moses-support<moses-support@mit.edu>
>
> Hi Vincent,
>
> as far as Moses is concerned, the end of a sentence is marked by whatever
> the end-of-line marker is on the respective OS (Win: CRLF, Linux: LF, Mac:
> CR, apparently). A period is treated as a plain old token. The purpose of
> the sentence splitter that Kenneth mentioned is to tell Moses what the
> "sentence" boundaries are.
>
> The language model has a concept of sentences beginning and ending and
> usually doesn't like periods anywhere except at the end of a sentence, so
> it'll down-vote translation hypotheses containing isolated periods.
>
> - Uli
------------------------------
Message: 3
Date: Sat, 5 Dec 2015 11:32:47 +0700
From: "Vu Thuong Huyen" <huyenvt2211@gmail.com>
Subject: [Moses-support] FinalState in Moses decoder
To: <moses-support@mit.edu>
Message-ID: <004901d12f16$00484fa0$00d8eee0$@gmail.com>
Content-Type: text/plain; charset="us-ascii"
Hi all,
I'm integrating my LM into the Moses decoder. I followed SRI.cpp and
NeuralLMWrapper.cpp. I don't know how to assign a value to the *finalState
variable. Could you explain how it affects decoding? I logged the context
when the system calls the "GetValue" function in my LM and in the SRILM LM,
and the number of context words was different.
LMResult LanguageModelSRI::GetValue(VocabIndex wordId, VocabIndex *context) const
{
  LMResult ret;
  ret.score = FloorScore(TransformLMScore(m_srilmModel->wordProb(wordId, context)));
  ret.unknown = (wordId == m_unknownId);
  return ret;
}
What is the value of ret.score: a log probability or a probability?
Best Regards,
Huyen.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151204/d696d5de/attachment-0001.html
------------------------------
Message: 4
Date: Sat, 5 Dec 2015 13:40:45 +0100
From: Vincent Nguyen <vnguyen@neuf.fr>
Subject: Re: [Moses-support] decoder question
To: moses-support@mit.edu
Message-ID: <5662DB4D.90400@neuf.fr>
Content-Type: text/plain; charset=windows-1252; format=flowed
Tom, Uli thank you guys.
Very clear.
Now, how important is it to have an "end of sentence" delimiter in the
language model (not talking about the LF stuff)?
Should each line in the LM training data end with a "." or equivalent
(exclamation mark, question mark, ...)?
I have seen some LMs (especially for ASR) where the training text ends each
line with an explicit </s> delimiter.
------------------------------
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
End of Moses-support Digest, Vol 110, Issue 14
**********************************************