Moses-support Digest, Vol 108, Issue 66

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Issue with alignment (Per Tunedal)
2. Re: factored tuning time (Tomasz Gawryl)


----------------------------------------------------------------------

Message: 1
Date: Thu, 22 Oct 2015 13:21:21 +0200
From: Per Tunedal <per.tunedal@operamail.com>
Subject: Re: [Moses-support] Issue with alignment
To: gang tang <gangtang2014@126.com>
Cc: moses-support@mit.edu
Message-ID:
<1445512881.2024501.417254033.7DFA8E04@webmail.messagingengine.com>
Content-Type: text/plain; charset="utf-8"

Dear Gang, To clarify: I used my own scripts for my experiments. I tried
improving already aligned data (e.g. samples from the Europarl corpus)
by eliminating bad alignments. I was inspired by Hunalign to try using
dictionaries, an approach that proved useful for preventing the undue
removal of alignments by other tests.
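
In case it helps, here is a rough sketch of the idea in Python (not the
actual scripts I used; the file names, the tab-separated dictionary
format and the length-ratio test below are only illustrative
assumptions): a pair that some other cleaning test flags as suspicious
is kept anyway if the dictionary vouches for enough of its words.

from collections import defaultdict

def load_dictionary(path):
    """Map each source word to the set of its translations (tab-separated file)."""
    d = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                d[parts[0].lower()].add(parts[1].lower())
    return d

def looks_suspicious(src, tgt, max_ratio=2.0):
    """Stand-in for the 'other tests': flag pairs with very different lengths."""
    ls, lt = len(src.split()), len(tgt.split())
    return max(ls, lt) > max_ratio * max(min(ls, lt), 1)

def dictionary_support(src, tgt, dictionary):
    """Fraction of source words that have a dictionary translation in the target."""
    tgt_words = set(tgt.lower().split())
    src_words = src.lower().split()
    hits = sum(1 for w in src_words if dictionary[w] & tgt_words)
    return hits / max(len(src_words), 1)

def clean_corpus(src_path, tgt_path, dictionary, min_support=0.3):
    """Yield pairs that pass the test, or that the dictionary vouches for."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            src, tgt = src.rstrip("\n"), tgt.rstrip("\n")
            if not looks_suspicious(src, tgt) \
                    or dictionary_support(src, tgt, dictionary) >= min_support:
                yield src, tgt

# e.g. (hypothetical file names):
# d = load_dictionary("dict.it-en")
# for it, en in clean_corpus("corpus.it", "corpus.en", d):
#     print(it, "|||", en)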

BTW, there was an interesting discussion about using dictionaries in
Moses some 6 months ago. Different approaches were discussed, like
training on dictionaries or using them as a back-off. That might be of
interest to you.
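
Roughly, the simplest way to "train on a dictionary" is to append each
entry to the parallel training data as its own one-line sentence pair,
so that GIZA++ at least sees the words co-occurring. A sketch in
Python, with hypothetical file names and a tab-separated dictionary
assumed:

def append_dictionary_to_corpus(dict_path, src_corpus, tgt_corpus):
    """Append every dictionary entry as a one-line sentence pair (illustration only)."""
    with open(dict_path, encoding="utf-8") as d, \
         open(src_corpus, "a", encoding="utf-8") as fs, \
         open(tgt_corpus, "a", encoding="utf-8") as ft:
        for line in d:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                fs.write(parts[0] + "\n")
                ft.write(parts[1] + "\n")

# e.g. append_dictionary_to_corpus("dict.it-en", "corpus.it", "corpus.en")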

Good luck with your explorations!

Yours, Per Tunedal


On Thu, Oct 22, 2015, at 09:56, gang tang wrote:
>
> Dear Per,
>
> Thanks for your kind suggestions. I am digging into my data and the
> source code of giza++ to find out what happened to my precious pair of
> "sandalo vernice" and "vernice sandal". I will certainly look into how
> to utilize Hunalign to advance my cause later on.
>
>
>
> Thanks again, and best regards,
>
> Gang
>
> At 2015-10-21 22:03:36, "Per Tunedal" <per.tunedal@operamail.com> wrote:
>> Dear Gang, I don't know of any tool for word alignment using a
>> dictionary. Anyhow, Hunalign does sentence alignment with the help of
>> dictionaries. I have done some promising experiments using
>> dictionaries to clean sentence-aligned corpora. I found that:
>> - dictionaries with domain-specific vocabulary are very beneficial
>> - bad dictionaries, e.g. created with GIZA++, are somewhat beneficial
>> - dictionaries are best used to prevent suspicious sentence pairs
>> from being unduly removed. The other way around may remove a lot of
>> good pairs with uncommon words.
>>
>> Yours, Per Tunedal
>>
>>
>> On Fri, Oct 9, 2015, at 13:02, gang tang wrote:
>>> Dear All,
>>>
>>> Since there are no answers to my questions, I assume that there are
>>> no easy fixes to the alignment problem. However, just out of
>>> curiosity, shouldn't there be alignment tools that take lexical
>>> considerations into account while aligning a parallel corpus? I
>>> mean, alignment tools that look up translations for specific words
>>> in a domain-specific dictionary during alignment? Is there any
>>> reason that it is not an interesting area to explore?
>>>
>>> Best Regards, Gang
>>>
>>>
>>>
>>> On 2015-09-25 19:34:13, "gang tang" <gangtang2014@126.com> wrote:
>>>> Dear all,
>>>>
>>>> I have a problem with alignment. I'd greatly appreciate it if
>>>> anyone could help solve my issue.
>>>>
>>>> I have the following corpus:
>>>>
>>>> "sandalo camufluge" -> "camufluge sandal"
>>>> "sandalo daino" -> "daino sandal"
>>>> "sandalo madras" -> "madras sandal"
>>>> "sandalo vernice" -> "vernice sandal"
>>>>
>>>> The alignment software I used was GIZA++, and the alignment result
>>>> was always 0-0 1-1, which meant that "sandalo" wasn't aligned with
>>>> "sandal". And after training, phrase.translation.table always had
>>>> entries such as "sandalo" -> "camufluge", "sandalo" -> "daino",
>>>> "sandalo" -> "madras", and "sandalo" -> "vernice", and no
>>>> "sandalo" -> "sandal". Is there any way this problem could be
>>>> solved? Could I add more data to align "sandalo" with "sandal" and
>>>> translate "sandalo" to "sandal"? How should I tune the system?
>>>>
>>>> Thanks for your attention,
>>>>
>>>> Gang


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151022/db2ea559/attachment-0001.html

------------------------------

Message: 2
Date: Thu, 22 Oct 2015 13:25:57 +0200
From: "Tomasz Gawryl" <tomasz.gawryl@skrivanek.pl>
Subject: Re: [Moses-support] factored tuning time
To: "'Hieu Hoang'" <hieuhoang@gmail.com>
Cc: 'moses-support' <moses-support@mit.edu>
Message-ID: <005b01d10cbc$6c5f1290$451d37b0$@gawryl@skrivanek.pl>
Content-Type: text/plain; charset="utf-8"

Hi Hieu,

So I'll wait until it finishes and then reduce the tuning file to the recommended size.

Thank you for your help! :)



Best Regards,

Tomek



From: Hieu Hoang [mailto:hieuhoang@gmail.com]
Sent: Thursday, October 22, 2015 1:13 PM
To: Tomasz Gawryl
Cc: moses-support
Subject: Re: [Moses-support] factored tuning time



Your moses.ini file is fine. Your tuning set is crazily large. You only need about 2000 sentences in most cases.
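
If you want a quick way to cut it down, something like this Python sketch works (the file names are just placeholders; the two files must stay line-aligned):

import random

def sample_tuning_set(src_in, tgt_in, src_out, tgt_out, n=2000, seed=1):
    """Write a random sample of n line-aligned sentence pairs."""
    with open(src_in, encoding="utf-8") as fs, open(tgt_in, encoding="utf-8") as ft:
        pairs = list(zip(fs, ft))                  # keep the two sides together
    random.seed(seed)
    sample = random.sample(pairs, min(n, len(pairs)))
    with open(src_out, "w", encoding="utf-8") as fs, \
         open(tgt_out, "w", encoding="utf-8") as ft:
        for src, tgt in sample:                    # lines keep their newlines
            fs.write(src)
            ft.write(tgt)

# e.g. sample_tuning_set("Tatoeba.en", "Tatoeba.pl", "Tatoeba.2k.en", "Tatoeba.2k.pl")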




Hieu Hoang
http://www.hoang.co.uk/hieu



On 22 October 2015 at 12:09, Tomasz Gawryl <tomasz.gawryl@skrivanek.pl> wrote:

Sure. Below is my moses.ini file.



Current run: run9



Tuning set contains 21552 sentences

Test sets are very small, the biggest has 118 sentences.



Regards,

TG





moses@SKR-moses:~/working/experiments/FACTORED/model$ more moses.ini.1

#########################

### MOSES CONFIG FILE ###

#########################



# input factors

[input-factors]

0



# mapping steps

[mapping]

0 T 0



[distortion-limit]

6



# feature functions

[feature]

UnknownWordPenalty

WordPenalty

PhrasePenalty

PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/moses/working/experiments/FACTORED/model/phrase-table.1.0-0,1 input-factor=0 output-factor=0,1

LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/moses/working/experiments/FACTORED/model/reordering-table.1.0-0.wbe-msd-bidirectional-fe.gz

Distortion

SRILM name=LM0 factor=1 path=/home/moses/working/experiments/FACTORED/lm/ACROSS=pos.lm.1 order=7



# dense weights for feature functions

[weight]

# The default weights are NOT optimized for translation quality. You MUST tune the weights.

# Documentation for tuning is here: http://www.statmt.org/moses/?n=FactoredTraining.Tuning

UnknownWordPenalty0= 1

WordPenalty0= -1

PhrasePenalty0= 0.2

TranslationModel0= 0.2 0.2 0.2 0.2

LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3

Distortion0= 0.3

LM0= 0.5



From: Hieu Hoang [mailto:hieuhoang@gmail.com]
Sent: Thursday, October 22, 2015 12:55 PM
To: Tomasz Gawryl
Cc: moses-support
Subject: Re: [Moses-support] factored tuning time



Can you show me the moses.ini file it creates? It's usually

model/moses.ini.?

How big is your tuning set? And your test set? What iteration is the tuning on at the moment? You can find that by looking in

tuning/tmp.?




Hieu Hoang
http://www.hoang.co.uk/hieu



On 22 October 2015 at 11:50, Tomasz Gawryl <tomasz.gawryl@skrivanek.pl> wrote:

Hi Hieu,



Here is my factored config.

Regards,

TG



################################################

### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###

################################################



[GENERAL]



home-dir = /home/moses

data-dir = $home-dir/corpus

train-dir = $data-dir/training

dev-dir = $data-dir/dev



### directory in which experiment is run

#

working-dir = $home-dir/working/experiments/FACTORED



# specification of the language pair

input-extension = en

output-extension = pl

pair-extension = en-pl



### directories that contain tools and data

#

# moses

moses-src-dir = $home-dir/src/mosesdecoder

#

# moses binaries

moses-bin-dir = $moses-src-dir/bin

#

# moses scripts

moses-script-dir = $moses-src-dir/scripts

#

# directory where GIZA++/MGIZA programs resides

external-bin-dir = $moses-src-dir/tools

#

# srilm

srilm-dir = $home-dir/src/srilm-1.7.0-lite/bin/i686-m64

#

# irstlm

irstlm-dir = $home-dir/src/irstlm-5.80.08/trunk/bin

#

# randlm

#randlm-dir = $moses-src-dir/randlm/bin

#

# data

#wmt12-data = $working-dir/data



### basic tools

#

# moses decoder

decoder = $moses-bin-dir/moses



# conversion of rule table into binary on-disk format

ttable-binarizer = "$moses-bin-dir/CreateOnDiskPt 1 2 4 100 2"



# tokenizers - comment out if all your data is already tokenized

input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -threads 4 -a -l $input-extension"

output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -threads 4 -a -l $output-extension"



# truecasers - comment out if you do not use the truecaser

input-truecaser = $moses-script-dir/recaser/truecase.perl

output-truecaser = $moses-script-dir/recaser/truecase.perl

detruecaser = $moses-script-dir/recaser/detruecase.perl



# lowercaser - comment out if you use truecasing

#input-lowercaser = $moses-script-dir/tokenizer/lowercase.perl

#output-lowercaser = $moses-script-dir/tokenizer/lowercase.perl



### generic parallelizer for cluster and multi-core machines

# you may specify a script that allows the parallel execution of

# parallelizable steps (see meta file). you also need to specify

# the number of jobs (cluster) or cores (multicore)

#

#generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl

generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl



### cluster settings (if run on a cluster machine)

# number of jobs to be submitted in parallel

#

#jobs = 10



# arguments to qsub when scheduling a job

#qsub-settings = ""



# project for privileges and usage accounting

#qsub-project = iccs_smt



# memory and time

#qsub-memory = 4

#qsub-hours = 48



### multi-core settings

# when the generic parallelizer is used, the number of cores

# specified here is used

cores = 8



#################################################################

# PARALLEL CORPUS PREPARATION:

# create a tokenized, sentence-aligned corpus, ready for training



[CORPUS]



### long sentences are filtered out, since they slow down GIZA++

# and are a less reliable source of data. set here the maximum

# length of a sentence

#

max-sentence-length = 100



[CORPUS:europarl] IGNORE



### command to run to get raw corpus files

#

# get-corpus-script =



### raw corpus files (untokenized, but sentence aligned)

#

#raw-stem = $train-dir/corpus.skropus.nmk.$pair-extension



### tokenized corpus files (may contain long sentences)

#

#tokenized-stem =



### if sentence filtering should be skipped,

# point to the clean training data

#

#clean-stem =



### if corpus preparation should be skipped,

# point to the prepared training data

#

#lowercased-stem =



[CORPUS:ACROSS]

raw-stem = $train-dir/across.$pair-extension



[CORPUS:un] IGNORE

raw-stem = $wmt12-data/training/undoc.2000.$pair-extension



#################################################################

# LANGUAGE MODEL TRAINING



[LM]



### tool to be used for language model training

# kenlm training

lm-training = "$moses-script-dir/ems/support/lmplz-wrapper.perl -bin $moses-bin-dir/lmplz"

settings = "--prune '0 0 1' -T $working-dir/lm -S 20%"



# srilm

#lm-training = $srilm-dir/ngram-count

#settings = "-interpolate -kndiscount -unk"



# irstlm training

# msb = modified Kneser-Ney; p=0: no singleton pruning

#lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"

#settings = "-s msb -p 0"



# order of the language model

order = 5



### tool to be used for training randomized language model from scratch

# (more commonly, a SRILM is trained)

#

#rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"



### script to use for binary table format for irstlm or kenlm

# (default: no binarization)



# irstlm

#lm-binarizer = $irstlm-dir/compile-lm



# kenlm, also set type to 8

#lm-binarizer = $moses-bin-dir/build_binary

#type = 8



### script to create quantized language model format (irstlm)

# (default: no quantization)

#

#lm-quantizer = $irstlm-dir/quantize-lm



### script to use for converting into randomized table format

# (default: no randomization)

#

#lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"



### each language model to be used has its own section here



[LM:europarl] IGNORE



### command to run to get raw corpus files

#

#get-corpus-script = ""



### raw corpus (untokenized)

#

#raw-corpus = $wmt12-data/training/europarl-v7.$output-extension



### tokenized corpus files (may contain long sentences)

#

#tokenized-corpus =



### if corpus preparation should be skipped,

# point to the prepared language model

#

#lm =



[LM:ACROSS] IGNORE

raw-corpus = $train-dir/across.$pair-extension.$output-extension



[LM:un] IGNORE

raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension



[LM:news] IGNORE

raw-corpus = $wmt12-data/training/news.$output-extension.shuffled



[LM:ACROSS=pos]

factors = "pos"

order = 7

#settings = "-interpolate -unk"

settings = "--discount_fallback"

raw-corpus = $train-dir/across.$pair-extension.$output-extension



#################################################################

# INTERPOLATING LANGUAGE MODELS



[INTERPOLATED-LM] IGNORE



# if multiple language models are used, these may be combined

# by optimizing perplexity on a tuning set

# see, for instance [Koehn and Schwenk, IJCNLP 2008]



### script to interpolate language models

# if commented out, no interpolation is performed

#

script = $moses-script-dir/ems/support/interpolate-lm.perl



### tuning set

# you may use the same set that is used for mert tuning (reference set)

#

tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm

#raw-tuning =

#tokenized-tuning =

#factored-tuning =

#lowercased-tuning =

#split-tuning =



### group language models for hierarchical interpolation

# (flat interpolation is limited to 10 language models)

#group = "first,second fourth,fifth"



### script to use for binary table format for irstlm or kenlm

# (default: no binarization)



# irstlm

#lm-binarizer = $irstlm-dir/compile-lm



# kenlm, also set type to 8

#lm-binarizer = $moses-bin-dir/build_binary

#type = 8



### script to create quantized language model format (irstlm)

# (default: no quantization)

#

#lm-quantizer = $irstlm-dir/quantize-lm



### script to use for converting into randomized table format

# (default: no randomization)

#

#lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"



#################################################################

# FACTOR DEFINITION



[INPUT-FACTOR]



# also used for output factors

temp-dir = $working-dir/training/factor



[OUTPUT-FACTOR:pos]



### script that generates this factor

#

mxpost = $home-dir/src/mxpost

factor-script = "$moses-script-dir/training/wrappers/make-factor-en-pos.mxpost.perl -mxpost $mxpost"



#################################################################

# MODIFIED MOORE LEWIS FILTERING



[MML] IGNORE



### specifications for language models to be trained

#

#lm-training = $srilm-dir/ngram-count

#lm-settings = "-interpolate -kndiscount -unk"

#lm-binarizer = $moses-src-dir/bin/build_binary

#lm-query = $moses-src-dir/bin/query

#order = 5



### in-/out-of-domain source/target corpora to train the 4 language models

#

# in-domain: point either to a parallel corpus

#indomain-stem = [CORPUS:toy:clean-split-stem]



# ... or to two separate monolingual corpora

#indomain-target = [LM:toy:lowercased-corpus]

#raw-indomain-source = $toy-data/nc-5k.$input-extension



# point to out-of-domain parallel corpus

#outdomain-stem = [CORPUS:giga:clean-split-stem]



# settings: number of lines sampled from the corpora to train each language model on

# (if used at all, should be small as a percentage of corpus)

#settings = "--line-count 100000"



#################################################################

# TRANSLATION MODEL TRAINING



[TRAINING]



### training script to be used: either a legacy script or

# current moses training script (default)

#

script = $moses-script-dir/training/train-model.perl



### general options

# these are options that are passed on to train-model.perl, for instance

# * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza

# * "-sort-buffer-size 8G -sort-compress gzip" to reduce on-disk sorting

# * "-sort-parallel 8 -cores 8" to speed up phrase table building

# * "-parallel" for parallel execution of mkcls and giza

#

#training-options = ""

training-options = "-mgiza -mgiza-cpus 8 -cores $cores -parallel -sort-buffer-size 10G -sort-batch-size 253 -sort-compress gzip -sort-parallel 10"



### factored training: specify here which factors are used

# if none specified, single factor training is assumed

# (one translation step, surface to surface)

#

input-factors = word

output-factors = word pos

alignment-factors = "word -> word"

translation-factors = "word -> word+pos"

reordering-factors = "word -> word"

#generation-factors =

decoding-steps = "t0"



### parallelization of data preparation step

# the two directions of the data preparation can be run in parallel

# comment out if not needed

#

parallel = yes



### pre-computation for giza++

# giza++ has a more efficient data structure that needs to be

# initialized with snt2cooc. if run in parallel, this may reduce

# memory requirements. set here the number of parts

#

run-giza-in-parts = 5



### symmetrization method to obtain word alignments from giza output

# (commonly used: grow-diag-final-and)

#

alignment-symmetrization-method = grow-diag-final-and



### use of Chris Dyer's fast align for word alignment

#

#fast-align-settings = "-d -o -v"



### use of berkeley aligner for word alignment

#

#use-berkeley = true

#alignment-symmetrization-method = berkeley

#berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh

#berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh

#berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar

#berkeley-java-options = "-server -mx30000m -ea"

#berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"

#berkeley-process-options = "-EMWordAligner.numThreads 8"

#berkeley-posterior = 0.5



### use of baseline alignment model (incremental training)

#

#baseline = 68

#baseline-alignment-model = "$working-dir/training/prepared.$baseline/$input-extension.vcb \

# $working-dir/training/prepared.$baseline/$output-extension.vcb \

# $working-dir/training/giza.$baseline/${output-extension}-$input-extension.cooc \

# $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.cooc \

# $working-dir/training/giza.$baseline/${output-extension}-$input-extension.thmm.5 \

# $working-dir/training/giza.$baseline/${output-extension}-$input-extension.hhmm.5 \

# $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.thmm.5 \

# $working-dir/training/giza-inverse.$baseline/${input-extension}-$output-extension.hhmm.5"



### if word alignment should be skipped,

# point to word alignment files

#

#word-alignment = $working-dir/model/aligned.1



### filtering some corpora with modified Moore-Lewis

# specify corpora to be filtered and ratio to be kept, either before or after word alignment

#mml-filter-corpora = toy

#mml-before-wa = "-proportion 0.9"

#mml-after-wa = "-proportion 0.9"



### build memory mapped suffix array phrase table

# (binarizing the reordering table is a good idea, since filtering makes little sense)

#mmsapt = "num-features=9 pfwd=g+ pbwd=g+ smooth=0 sample=1000 workers=1"

binarize-all = $moses-script-dir/training/binarize-model.perl





### create a bilingual concordancer for the model

#

#biconcor = $moses-bin-dir/biconcor



## Operation Sequence Model (OSM)

# Durrani, Schmid and Fraser. (2011):

# "A Joint Sequence Translation Model with Integrated Reordering"

# compile Moses with --max-kenlm-order=9 if higher order is required

#

#operation-sequence-model = "yes"

#operation-sequence-model-order = 5

#operation-sequence-model-settings = "-lmplz '$moses-src-dir/bin/lmplz -S 40% -T $working-dir/model/tmp'"

#

# if OSM training should be skipped, point to OSM Model

#osm-model =



### unsupervised transliteration module

# Durrani, Sajjad, Hoang and Koehn (EACL, 2014).

# "Integrating an Unsupervised Transliteration Model

# into Statistical Machine Translation."

#

#transliteration-module = "yes"

#post-decoding-transliteration = "yes"



### lexicalized reordering: specify orientation type

# (default: only distance-based reordering model)

#

lexicalized-reordering = msd-bidirectional-fe



### hierarchical rule set

#

#hierarchical-rule-set = true



### settings for rule extraction

#

#extract-settings = ""

max-phrase-length = 5



### add extracted phrases from baseline model

#

#baseline-extract = $working-dir/model/extract.$baseline

#

# requires aligned parallel corpus for re-estimating lexical translation probabilities

#baseline-corpus = $working-dir/training/corpus.$baseline

#baseline-alignment = $working-dir/model/aligned.$baseline.$alignment-symmetrization-method



### unknown word labels (target syntax only)

# enables use of unknown word labels during decoding

# label file is generated during rule extraction

#

#use-unknown-word-labels = true



### if phrase extraction should be skipped,

# point to stem for extract files

#

# extracted-phrases =



### settings for rule scoring

#

score-settings = "--GoodTuring --MinScore 2:0.0001"



### include word alignment in phrase table

#

#include-word-alignment-in-rules = yes



### sparse lexical features

#

#sparse-features = "target-word-insertion top 50, source-word-deletion top 50, word-translation top 50 50, phrase-length"



### domain adaptation settings

# options: sparse, any of: indicator, subset, ratio

#domain-features = "subset"



### if phrase table training should be skipped,

# point to phrase translation table

#

# phrase-translation-table =



### if reordering table training should be skipped,

# point to reordering table

#

# reordering-table =



### filtering the phrase table based on significance tests

# Johnson, Martin, Foster and Kuhn. (2007): "Improving Translation Quality by Discarding Most of the Phrasetable"

# options: -n number of translations; -l 'a+e', 'a-e', or a positive real value -log prob threshold

#salm-index = /path/to/project/salm/Bin/Linux/Index/IndexSA.O64

#sigtest-filter = "-l a+e -n 50"



### if training should be skipped,

# point to a configuration file that contains

# pointers to all relevant model files

#

#config-with-reused-weights =



#####################################################

### TUNING: finding good weights for model components



[TUNING]



### instead of tuning with this setting, old weights may be recycled

# specify here an old configuration file with matching weights

#

#weight-config = $working-dir/tuning/moses.weight-reused.ini.1



### tuning script to be used

#

tuning-script = $moses-script-dir/training/mert-moses.pl

tuning-settings = "-mertdir $moses-bin-dir -threads=$cores"



### specify the corpus used for tuning

# it should contain 1000s of sentences

#

#input-sgm = $wmt12-data/dev/newstest2010-src.$input-extension.sgm

raw-input = $dev-dir/Tatoeba.$input-extension

#tokenized-input =

#factorized-input =

#input =

#reference-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm

raw-reference = $dev-dir/Tatoeba.$output-extension

#tokenized-reference =

#factorized-reference =

#reference =



### size of n-best list used (typically 100)

#

nbest = 100



### ranges for weights for random initialization

# if not specified, the tuning script will use generic ranges

# it is not clear, if this matters

#

# lambda =



### additional flags for the filter script

#

filter-settings = ""



### additional flags for the decoder

#

decoder-settings = "-threads $cores"



### if tuning should be skipped, specify this here

# and also point to a configuration file that contains

# pointers to all relevant model files

#

#config-with-reused-weights =



#########################################################

## RECASER: restore case, this part only trains the model



[RECASING] IGNORE



### training data

# raw input still needs to be tokenized,

# alternatively, already tokenized input may be specified

#

#tokenized = [LM:europarl:tokenized-corpus]



### additional settings

#

recasing-settings = ""

#lm-training = $srilm-dir/ngram-count

decoder = $moses-bin-dir/moses



# already a trained recaser? point to config file

#recase-config =



#######################################################

## TRUECASER: train model to truecase corpora and input



[TRUECASER]



### script to train truecaser models

#

trainer = $moses-script-dir/recaser/train-truecaser.perl



### training data

# data on which truecaser is trained

# if no training data is specified, parallel corpus is used

#

# raw-stem =

# tokenized-stem =



### trained model

#

# truecase-model =



######################################################################

## EVALUATION: translating a test set using the tuned system and scoring it



[EVALUATION]



### number of jobs (if parallel execution on cluster)

#

#jobs = 10



### additional flags for the filter script

#

#filter-settings = ""



### additional decoder settings

# switches for the Moses decoder

# common choices:

# "-threads N" for multi-threading

# "-mbr" for MBR decoding

# "-drop-unknown" for dropping unknown source words

# "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning

#

decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads $cores"



### specify size of n-best list, if produced

#

#nbest = 100



### multiple reference translations

#

#multiref = yes



### prepare system output for scoring

# this may include detokenization and wrapping output in sgm

# (needed for nist-bleu, ter, meteor)

#

detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"

#recaser = $moses-script-dir/recaser/recase.perl

wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"

#output-sgm =



### BLEU

#

nist-bleu = $moses-script-dir/generic/mteval-v13a.pl

nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"

#multi-bleu = "$moses-script-dir/generic/multi-bleu.perl -lc"

#multi-bleu-c = $moses-script-dir/generic/multi-bleu.perl

#ibm-bleu =



### TER: translation error rate (BBN metric) based on edit distance

# not yet integrated

#

# ter =



### METEOR: gives credit to stem / WordNet synonym matches

# not yet integrated

#

# meteor =



### Analysis: carry out various forms of analysis on the output

#

analysis = $moses-script-dir/ems/support/analysis.perl

#

# also report on input coverage

analyze-coverage = yes

#

# also report on phrase mappings used

report-segmentation = yes

#

# report precision of translations for each input word, broken down by

# count of input word in corpus and model

#report-precision-by-coverage = yes

#

# further precision breakdown by factor

#precision-by-coverage-factor = pos

#

# visualization of the search graph in tree-based models

#analyze-search-graph = yes



[EVALUATION:newstest2011] IGNORE



### input data

#

input-sgm = $wmt12-data/dev/newstest2011-src.$input-extension.sgm

# raw-input =

# tokenized-input =

# factorized-input =

# input =



### reference data

#

reference-sgm = $wmt12-data/dev/newstest2011-ref.$output-extension.sgm

# raw-reference =

# tokenized-reference =

# reference =



### analysis settings

# may contain any of the general evaluation analysis settings

# specific setting: base coverage statistics on earlier run

#

#precision-by-coverage-base = $working-dir/evaluation/test.analysis.5



### wrapping frame

# for nist-bleu and other scoring scripts, the output needs to be wrapped

# in sgm markup (typically like the input sgm)

#

wrapping-frame = $input-sgm





[EVALUATION:carlsberg]

raw-input = $dev-dir/Carlsberg.en

raw-reference = $dev-dir/Carlsberg.pl



[EVALUATION:TNS_TEKST]

raw-input = $dev-dir/TLUMACZE_NA_START2015/tekst.nmk.$input-extension

raw-reference = $dev-dir/TLUMACZE_NA_START2015/tekst.nmk.$output-extension



[EVALUATION:TNS_ETAP1]

raw-input = $dev-dir/TLUMACZE_NA_START2015/etap1.nmk.$input-extension

raw-reference = $dev-dir/TLUMACZE_NA_START2015/etap1.nmk.$output-extension





##########################################

### REPORTING: summarize evaluation scores



[REPORTING]



### currently no parameters for reporting section





From: Hieu Hoang [mailto:hieuhoang@gmail.com]
Sent: Thursday, October 22, 2015 12:22 PM
To: Tomasz Gawryl
Cc: moses-support
Subject: Re: [Moses-support] factored tuning time



Can I have a look at your moses.ini file?

You should be careful trying to use complicated factored models as they can take a long time to run.

Also, you can use multithreading to make it run faster.




Hieu Hoang
http://www.hoang.co.uk/hieu



On 22 October 2015 at 07:28, Tomasz Gawryl <tomasz.gawryl@skrivanek.pl> wrote:

Hi,



I've one question for you about the time needed for factored tuning. How many times longer does it take compared to phrase-based tuning?

I'm asking because it's the 7th day and it's still tuning (3.3 million corpus sentences). Phrase-based tuning took around 3 hours for the same corpus.

Top shows me that moses uses nearly 100% CPU, so the speed is the same.



Regards,

Tomek Gawryl




_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support







-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/mailman/private/moses-support/attachments/20151022/d3af9151/attachment.html

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 108, Issue 66
**********************************************
