Moses-support Digest, Vol 88, Issue 57

Send Moses-support mailing list submissions to
moses-support@mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-request@mit.edu

You can reach the person managing the list at
moses-support-owner@mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Moses-support digest..."


Today's Topics:

1. Re: Moses training performance (Marcin Junczys-Dowmunt)
2. Weight normalization in mert-moses.pl (Marcin Junczys-Dowmunt)
3. Re: Weight normalization in mert-moses.pl (Marcin Junczys-Dowmunt)
4. Re: --activate-features in mert-moses.perl not working?
(Marcin Junczys-Dowmunt)
5. CFP Seven SIGIR'14 Workshops on emerging areas in IR (Richi Nayak)


----------------------------------------------------------------------

Message: 1
Date: Tue, 25 Feb 2014 22:06:23 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Moses training performance
To: moses-support@mit.edu
Message-ID: <530D05CF.2090001@amu.edu.pl>
Content-Type: text/plain; charset=UTF-8; format=flowed

I guess the mkcls time is a good hint here. Could it be that the Xeon
system is much slower on a per-core basis, for example because of a lower
CPU frequency than your Mac? Mkcls runs as a single process, so this is
not a multi-threading issue. Maybe there is heavy load on that Xeon from
other sources?
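
To make the per-step comparison concrete, here is a small Python sketch
that turns the timings quoted further down in this message into slowdown
ratios. The step names and times are copied from the quoted tables; the
helper names (to_seconds, slowdowns) are only illustrative.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string into a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


# (step, MacBook Pro time, Xeon time), copied from the quoted tables
TIMINGS = [
    ("mkcls", "00:00:33", "00:04:36"),
    ("snt2cooc", "00:00:02", "00:00:23"),
    ("mgiza", "00:12:33", "00:34:09"),
    ("extract", "00:00:06", "00:00:48"),
    ("score", "00:00:10", "00:01:48"),
    ("reordering", "00:00:03", "00:00:12"),
]


def slowdowns(timings):
    """Map each step to (Xeon seconds) / (Mac seconds)."""
    return {step: to_seconds(xeon) / to_seconds(mac)
            for step, mac, xeon in timings}


if __name__ == "__main__":
    for step, ratio in slowdowns(TIMINGS).items():
        print(f"{step:<11}{ratio:5.1f}x slower on the Xeon")
```

On these figures mkcls is about 8.4x slower on the Xeon while mgiza is
only about 2.7x slower, which fits a per-core slowdown better than a
threading problem.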

On 25.02.2014 21:37, Andrzej Zydron wrote:
> Many thanks Hieu,
>
> I did specify "-mgiza-cpus 4" for the Mac and "-mgiza-cpus 12" for the
> Xeon server. Interestingly "-mgiza-cpus 10" gave slightly better
> performance (5 mins). Looking at the io stats mgiza did not appear to
> be io bound.
>
> Best Regards,
>
> Andrzej Zydroń
> CTO
> XTM International Ltd.
> PO Box 2167, Gerrards Cross, SL9 8XF, UK
> email: azydron@xtm-intl.com
> Tel: +44 (0) 1753 480 479
> Mob: +44 (0) 7966 477 181
> skype: Zydron
> www.xtm-intl.com
>
> On 25/02/2014 18:19, Hieu Hoang wrote:
>> Strange and interesting.
>>
>> I can think of 3 issues:
>> 1. The number of cores isn't relevant unless you explicitly ask mgiza
>> & the various extraction steps to use multiple cores.
>> 2. It looks like mgiza is the issue
>> 3. I'm not sure how io-bound mgiza is. However, in my test with
>> virtual machines, io-bound processes are slow
>> http://www.hanselman.com/blog/VMPerformanceChecklistBeforeYouComplainThatYourVirtualMachineIsSlow.aspx
>>
>> This may be the case with ram-disk
>>
>>
>> On 25 February 2014 18:01, Andrzej Zydron <azydron@xtm-intl.com> wrote:
>>
>> Dear Support,
>>
>> I realize that there may not be a simple answer, but I would like to
>> understand why running training on a 9300 segment corpus takes nearly
>> three times as long on a 12 core Xeon E5-1650v2 128GB RAM running
>> CentOS 6.5 than on my MacBook Pro 4 core i7 3720QM 8GB RAM running
>> Mavericks. I am at a loss to explain it. On the Xeon server I used a
>> 28GB RAMDISK to simulate an SSD to make things more equal. I have used
>> mgiza throughout. I have used the same data and identical settings
>> throughout on both machines and I have used the official Moses 2.1 Git
>> distribution and compiled and linked on the machine.
>>
>> These are the timings (hh:mm:ss) for the MacBook Pro 4 core i7 3720QM
>> 8GB RAM SSD:
>>
>> Step        Start     End       Time taken
>> mkcls       10:18:50  10:19:23  00:00:33
>> snt2cooc    10:19:23  10:19:25  00:00:02
>> mgiza       10:19:25  10:31:58  00:12:33
>> extract     10:31:58  10:32:04  00:00:06
>> score       10:32:04  10:32:14  00:00:10
>> reordering  10:32:14  10:32:17  00:00:03
>>
>> Total                           00:13:27
>>
>> and these for the 12 core Xeon E5-1650v2 128GB RAM using a 28GB RAMDISK
>> for all the data:
>>
>> Step        Start     End       Time taken
>> mkcls       09:44:24  09:49:00  00:04:36
>> snt2cooc    09:49:00  09:49:23  00:00:23
>> mgiza       09:49:23  10:23:32  00:34:09
>> extract     10:23:32  10:24:20  00:00:48
>> score       10:24:20  10:26:08  00:01:48
>> reordering  10:26:08  10:26:20  00:00:12
>>
>> Total                           00:41:56
>>
>> I know that the Mac is a superb machine (the best I have ever put my
>> hands on), but I find it difficult to understand why it should be so
>> much faster than a state of the art Xeon server for Moses training.
>>
>> Best Regards,
>>
>> Andrzej Zydroń
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>
>> --
>> Hieu Hoang
>> Research Associate
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>
>
>



------------------------------

Message: 2
Date: Tue, 25 Feb 2014 22:15:42 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: [Moses-support] Weight normalization in mert-moses.pl
To: moses-support <moses-support@MIT.EDU>
Message-ID: <530D07FE.7050404@amu.edu.pl>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi,
I am wondering about this piece of code in the get_weights_from_mert
function of mert-moses.pl:

my $sum = 0.0;
while (<$fh>) {
    if (/^F(\d+) ([\-\.\de]+)/) {            # regular features
        $WEIGHT[$1] = $2;
        $sum += abs($2);
    } elsif (/^M(\d+_\d+) ([\-\.\de]+)/) {   # mix weights
        push @$mix_weights,$2;
    } elsif (/^(.+_.+) ([\-\.\de]+)/) {      # sparse features
        $$sparse_weights{$1} = $2;
    }
}
close $fh;
die "It seems feature values are invalid or unable to read $outfile." if $sum < 1e-09;

$devbleu = "unknown";
foreach (@WEIGHT) { $_ /= $sum; }
foreach (keys %{$sparse_weights}) { $$sparse_weights{$_} /= $sum; }

I understand that the division by "$sum" is meant as a normalization,
but I notice that sparse features are not being summed, nevertheless
they are being normalized by the sum of the dense features. Does this
actually make sense? Also kbmira often produces several sets of weights
during one run of which only the last set is kept (the rest is being
overwritten), but the sum is collected over all sets. Looks kinda fishy :)
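
To make the two options concrete, here is a minimal Python sketch of the
same parsing and normalization. It is not the code from mert-moses.pl:
the function names and the include_sparse switch are hypothetical, and
only the three line patterns are taken from the Perl snippet above.

```python
import re

# Same three line patterns as in the Perl snippet
DENSE = re.compile(r"^F(\d+) ([-.\de]+)")       # regular features
MIX = re.compile(r"^M(\d+_\d+) ([-.\de]+)")     # mix weights
SPARSE = re.compile(r"^(.+_.+) ([-.\de]+)")     # sparse features


def read_weights(lines):
    """Split optimizer output lines into dense, mix and sparse weights."""
    dense, mix, sparse = {}, [], {}
    for line in lines:
        if m := DENSE.match(line):
            dense[int(m.group(1))] = float(m.group(2))
        elif m := MIX.match(line):
            mix.append(float(m.group(2)))
        elif m := SPARSE.match(line):
            sparse[m.group(1)] = float(m.group(2))
    return dense, mix, sparse


def normalize(dense, sparse, include_sparse=False):
    """Divide all weights by an L1 sum.

    include_sparse=False mimics the snippet (sum over dense weights only);
    include_sparse=True is the hypothetical alternative asked about.
    """
    total = sum(abs(w) for w in dense.values())
    if include_sparse:
        total += sum(abs(w) for w in sparse.values())
    if total < 1e-9:
        raise ValueError("weights sum to (almost) zero")
    return ({k: w / total for k, w in dense.items()},
            {k: w / total for k, w in sparse.items()})
```

For example, with the made-up input lines "F0 0.5", "F1 -1.5" and
"tm_the_le 2.0", the dense-only variant divides everything by 2.0 and the
other variant by 4.0. In both cases every weight is divided by the same
positive constant, so the ranking of hypotheses under a linear model
should not change; the variants differ only in the resulting scale.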


------------------------------

Message: 3
Date: Tue, 25 Feb 2014 22:29:43 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] Weight normalization in mert-moses.pl
To: moses-support <moses-support@mit.edu>
Message-ID: <530D0B47.8010602@amu.edu.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

OK, the second part of the question is resolved. The multiple weight sets
only occur if kbmira outputs to stdout. If a file name is given, it
overwrites the file and thus the previous weight set. I am, however, still
wondering why the sparse weights are not being summed into the
normalization factor.
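
For the record, this Python sketch shows what the ruled-out case would
have looked like if several weight sets did end up in one file. The input
lines are made up and the function name is hypothetical; only the "F<n>
<value>" pattern comes from the quoted snippet.

```python
import re

DENSE = re.compile(r"^F(\d+) ([-.\de]+)")


def last_set_with_running_sum(lines):
    """Keep only the last weight per feature, but sum over every line."""
    weights, total = {}, 0.0
    for line in lines:
        m = DENSE.match(line)
        if m:
            value = float(m.group(2))
            weights[int(m.group(1))] = value   # later sets overwrite earlier ones
            total += abs(value)                # the sum keeps growing
    return weights, total


# Two concatenated weight sets (hypothetical values)
two_sets = ["F0 0.5", "F1 0.5", "F0 0.25", "F1 0.75"]
weights, total = last_set_with_running_sum(two_sets)
normalized = {k: w / total for k, w in weights.items()}
```

Here the kept weights (0.25 and 0.75) are divided by 2.0 instead of 1.0,
so their absolute values would sum to 0.5 rather than 1.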

On 25.02.2014 22:15, Marcin Junczys-Dowmunt wrote:
> Hi,
> I am wondering about this piece of code in the get_weights_from_mert
> function of mert-moses.pl:
>
> my $sum = 0.0;
> while (<$fh>) {
>     if (/^F(\d+) ([\-\.\de]+)/) {            # regular features
>         $WEIGHT[$1] = $2;
>         $sum += abs($2);
>     } elsif (/^M(\d+_\d+) ([\-\.\de]+)/) {   # mix weights
>         push @$mix_weights,$2;
>     } elsif (/^(.+_.+) ([\-\.\de]+)/) {      # sparse features
>         $$sparse_weights{$1} = $2;
>     }
> }
> close $fh;
> die "It seems feature values are invalid or unable to read $outfile." if $sum < 1e-09;
>
> $devbleu = "unknown";
> foreach (@WEIGHT) { $_ /= $sum; }
> foreach (keys %{$sparse_weights}) { $$sparse_weights{$_} /= $sum; }
>
> I understand that the division by "$sum" is meant as a normalization,
> but I notice that sparse features are not being summed, nevertheless
> they are being normalized by the sum of the dense features. Does this
> actually make sense? Also kbmira often produces several sets of weights
> during one run of which only the last set is kept (the rest is being
> overwritten), but the sum is collected over all sets. Looks kinda fishy :)



------------------------------

Message: 4
Date: Tue, 25 Feb 2014 23:14:04 +0100
From: Marcin Junczys-Dowmunt <junczys@amu.edu.pl>
Subject: Re: [Moses-support] --activate-features in mert-moses.perl
not working?
To: Hieu Hoang <hieuhoang@gmail.com>
Cc: "moses-support@mit.edu" <moses-support@mit.edu>
Message-ID: <530D15AC.7020807@amu.edu.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Hieu, Rico,
this does not seem to be an issue with the ini-file. It actually works
as well with stand-alone moses. The issue seems to be the mert-moses.pl
script which switches off features that are not returned by the decoder
because they are set to tuneable=false.

In the function "run_decoder" in mert-moses.pl there is this line:

    $decoder_config = "-weight-overwrite '" . join(" ", values %model_weights) ."'"
        unless $___USE_CONFIG_WEIGHTS_FIRST && $run==1;

I suspect that calling -weight-overwrite with a truncated list of
model_weights is causing the issue.
Best,
Marcin
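
To illustrate the suspicion, here is a Python sketch of how such an
overwrite string is assembled and which features would be absent from it.
The feature names, the weights and the "name= w1 w2 ..." value format are
assumptions made for the example, and the sketch says nothing about how
the decoder actually treats features missing from the list.

```python
def build_weight_overwrite(model_weights):
    """Mirror of: "-weight-overwrite '" . join(" ", values %model_weights) . "'"."""
    return "-weight-overwrite '" + " ".join(model_weights.values()) + "'"


def missing_features(ini_features, model_weights):
    """Features configured in the ini file but absent from the weight list."""
    return sorted(set(ini_features) - set(model_weights))


# Hypothetical setup: LM0 has tuneable=false, so it never shows up in the
# n-best list and therefore never makes it into model_weights.
ini_features = ["LM0", "TranslationModel0", "WordPenalty0"]
model_weights = {
    "TranslationModel0": "TranslationModel0= 0.2 0.2 0.2 0.2",
    "WordPenalty0": "WordPenalty0= -1",
}

overwrite = build_weight_overwrite(model_weights)
absent = missing_features(ini_features, model_weights)
```

In this made-up case the string passed to the decoder mentions only
TranslationModel0 and WordPenalty0, and LM0 is reported as absent.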

On 24.02.2014 11:56, Hieu Hoang wrote:
> Can you please send me your ini file where you set tuneable = false.
>
> This param has to work, the unknown word penalty depends on it
>
> Sent while bumping into things
>
>> On 23 Feb 2014, at 12:13 am, Marcin Junczys-Dowmunt <junczys@amu.edu.pl> wrote:
>>
>> And with "tuneable=false" it seems the features are being ignored during
>> decoding; I understand this should not be happening. I get much worse
>> translation results with an ini-file that has "tuneable=false" for all
>> features than with the same ini without the option. The translation is
>> also much faster with the options specified, so something is clearly not
>> being evaluated.
>>
>> On 23.02.2014 00:30, Marcin Junczys-Dowmunt wrote:
>>> BTW. "tuneable=false" seems to be ignored by Kenlm, works with other
>>> features though.
>>>
>>> On 10.02.2014 21:15, Rico Sennrich wrote:
>>>> Marcin Junczys-Dowmunt <junczys@...> writes:
>>>>
>>>>> Hi,
>>>>> it seems --activate-features=STRING is not working in mert-moses.perl.
>>>>> The script prints a message that the ignored features are not being
>>>>> used, but then optimizes them anyway. I can see that the "enabled"
>>>>> information in the feature data structure is not being used anywhere in
>>>>> the script once it has been set (apart from printing the message).
>>>> I don't know too much about the --activate-features option myself, but in
>>>> recent Moses versions, you can add the option 'tuneable=false' to a feature
>>>> function in the config. The effect is that the feature score(s) won't be
>>>> reported to the n-best list, and MERT/MIRA/PRO won't even know that the
>>>> feature exists. The weight from the original config will be used for all
>>>> tuning iterations, and copied to the final config. You can now also specify
>>>> the weight of sparse features in the config, and this will override the
>>>> weight set in the weights file.
>>>>
>>>> best wishes,
>>>> Rico
>>>>
>>>>
>>>>



------------------------------

Message: 5
Date: Wed, 26 Feb 2014 05:00:45 +0000
From: Richi Nayak <r.nayak@qut.edu.au>
Subject: [Moses-support] CFP Seven SIGIR'14 Workshops on emerging
areas in IR
To: Richi Nayak <r.nayak@qut.edu.au>
Message-ID:
<032189CFEE6B094A8CB9DB0838786ADF052740@ex10mb1.qut.edu.au>
Content-Type: text/plain; charset="windows-1252"

[Apologies if you receive this more than once]


The workshop program of the

SIGIR'14: 37th Annual ACM SIGIR Conference,
Gold Coast, Australia, 6-11 July, 2014
<http://sigir.org/sigir2014/>

will host seven attractive workshops covering novel ideas and emerging
areas in IR:

* ERD'14: Entity Recognition and Disambiguation Challenge

http://web-ngram.research.microsoft.com/ERD2014/

The Entity Recognition and Disambiguation Workshop will be organized as
a challenge, where participants submit working systems that identify the
entities mentioned in text. The challenge will have two tracks, focusing
on long and short texts. All submissions will be evaluated on shared
datasets; part of the data will be withheld, to be used for the final
evaluation of all submitted systems to determine the winners. Each
participating team will be offered a spot at the workshop to present
their system.

David Carmel, Yahoo! Research
Ming-Wei Chang, Microsoft Research
Evgeniy Gabrilovich, Google
Bo-June (Paul) Hsu, Microsoft Research
Kuansan Wang, Microsoft Research

* GEAR'14: Gathering Efficient Assessments of Relevance Workshop

https://sites.google.com/site/sigirgear/

Evaluation is a fundamental part of Information Retrieval, and in the
conventional Cranfield evaluation paradigm, sets of relevance
assessments are a fundamental part of test collections. In this
workshop, we wish to revisit how relevance assessments can be
efficiently created. Potential themes include methods for generating
assessments, the process of assessment, effort involved in assessing
different materials, exploration of the concept of relevance etc. A
discussion and exploration of this issue will be facilitated through the
presentation of results based papers and position papers on the topic,
as well as a group design activity.

Martin Halvey, Glasgow Caledonian University
Robert Villa, University of Sheffield
Paul Clough, University of Sheffield

* MedIR'14: Medical Information Retrieval Workshop

http://medir.dcu.ie/

Medical information is accessible from diverse sources including the
general web, social media, journal articles, and hospital records; users
include patients and their families, researchers, practitioners and
clinicians. Challenges in medical information retrieval include:
diversity of users and user ability; variations in the format,
reliability, and quality of biomedical and medical information; the
multimedia nature of data; and the need for accuracy and reliability.
The aim of the workshop is to bring together researchers interested in
medical information search with the goal of identifying specific
challenges that need to be addressed to advance the state-of-the-art.

Eiji Aramaki, Kyoto University, Japan
Lorraine Goeuriot, Dublin City University, Ireland
Gareth JF Jones, Dublin City University, Ireland
Liadh Kelly, Dublin City University, Ireland
Henning Müller, University of Applied Sciences Western Switzerland
Justin Zobel, University of Melbourne, Australia

* PIR'14: Privacy-Preserving IR Workshop - When Information Retrieval
Meets Privacy and Security

http://www.cs.georgetown.edu/~huiyang/sigir2014-pir-workshop/

Information retrieval and information privacy/security are two
fast-growing computer science disciplines. There are many synergies and
connections between these two disciplines. However, there have been very
limited efforts to connect the two. On the other hand, due to the lack of
mature techniques in privacy-preserving IR, concerns about privacy and
security have become serious obstacles that prevent valuable user data
from being used in IR research such as studies about query logs, social
media, tweets, sessions, and medical record retrieval. This
privacy-preserving IR workshop aims to spur research that brings together
the research fields of IR and privacy/security, and to mitigate privacy
threats in information retrieval by exploring novel algorithms and tools.

Luo Si (Purdue University, USA)
Grace Hui Yang (Georgetown University, USA)

* SMIR'14: Semantic Matching in Information Retrieval

http://smir2014.noahlab.com.hk

Recently, significant progress has been made in research on what we call
semantic matching (SM), in Web search, question answering, online
advertisement, cross language information retrieval, multimedia
retrieval, and other tasks. Let us take Web search as an example of the
problem. When comparing the textual content of query and documents, the
simple term-based approaches can fail when searcher and author use
different terms. A more realistic approach beyond bag-of-words, referred
to as semantic matching (SM), is to conduct deeper query and document
analysis to encode text with richer representations and then perform
query-document matching with such representations. The main purpose of
the workshop is to bring together IR and NLP researchers working on or
interested in semantic matching, to share latest research results,
express opinions on the related issues, and discuss future directions.

Julio Gonzalo, UNED, Spain
Hang Li, Noah's Ark Lab, Huawei, Hong Kong
Alessandro Moschitti, Qatar Computing Research Institute, Qatar
Jun Xu, Noah's Ark Lab, Huawei, Hong Kong

* SoMeRA'14: Social Media Retrieval and Analysis Workshop

http://www.cp.jku.at/conferences/SoMeRA2014/

The SoMeRA 2014 workshop will present and discuss cutting edge research
on all topics of retrieval, recommendation, and browsing in social
media, as well as on the analysis of users' multifaceted traces in
social media. In particular, novel methods and ideas that address
challenges such as large quantity and noisiness of user-generated
multimedia data, user biases, cold-start problem, or integrating
contextual aspects into retrieval and recommendation techniques are
highly welcome. The workshop will further foster the exchange of ideas
between different communities; in particular, it aims at better
connecting the multimedia and recommender systems communities with the
information retrieval community. The workshop will feature both oral
presentations (full papers) and poster/demo presentations (short papers).

Markus Schedl, Johannes Kepler University, Austria
Peter Knees, Johannes Kepler University, Austria
Jialie Shen, Singapore Management University, Singapore

* TAIA'14: Temporal, social and spatially Aware Information Access Workshop

http://research.microsoft.com/en-us/people/milads/taia2014.aspx

Users provide an unprecedented volume of detailed, and continuously
updated information about where they are, what they are doing, who they
are with, and what they are thinking and feeling about their activities.
The provision of this stream creates an informal contract between the
user and the information access application in which the user will
provide the information, but the application must provide results that
are contextually relevant. In this workshop we explore spatial and
temporal context in dynamic geotagged collections, such as Wikipedia,
and traditional news sources, as well as social media sites such as
Twitter, Foursquare, Facebook and Flickr. To ground the workshop, and
provide a locus for discussion of the two aspects of user context, we
focus on event detection and recommendation. Events are a natural theme
around which to center discussions of spatial and temporal context
because events are defined by their time and place.

Fernando Diaz, Microsoft Research
Claudia Hauff, Delft University of Technology
Vanessa Murdock, Microsoft
Maarten de Rijke, University of Amsterdam
Milad Shokouhi, Microsoft

Please look at the individual websites for the calls and deadlines, and
participate in the discussion on the SIGIR'14 workshop day, on Friday
11 July 2014, in the beautiful scenery of Gold Coast, Queensland,
Australia.

Dr Richi Nayak, Associate Professor
Higher Degree Research Director, School of Electrical Engineering and Computer Science
Science and Engineering Faculty| Queensland University of Technology |Brisbane, QLD 4001
Office: S1206 | Ph: 313 81976 | Fax: 313 89390 | Email: r.nayak@qut.edu.au
Webpage: http://applieddatamining.info/

------------------------------

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


End of Moses-support Digest, Vol 88, Issue 57
*********************************************
