Posts (page 2)
In response to a query that popped up on FriendFeed:
I taught myself years ago with O'Reilly's Learning Python. The web
book 'Dive into Python' is essential, and the blog posts 'How to Code
like a Pythonista' and 'How to think like a Pythonista' will get you a
long way towards familiarizing yourself with common idioms in the
language.
I've been interested for a while now in entity disambiguation,
particularly where the entities are the names of authors in academic
journals.
Jian Huang, Seyda Ertekin and C Lee Giles have a paper from 2006 in
which they describe the method that they used to disambiguate the
CiteSeer data set of over 700,000 articles. The paper is "Efficient
Name Disambiguation for Large-Scale Databases".
They were able to use this algorithm to disambiguate this data set
over three days into just under half a million unique authors (though
I didn't see a mention of the hardware or of whether they used a
serial or parallel computing approach).
Their approach seems to be to create an online SVM to bootstrap a
distance function. This distance function can be trained using a
number of types of information: names, metadata such as emails, and
terms extracted from the associated papers.
They then block author names into groups based on name similarity, use
the distance function found with the SVM, and find groups of names
associated with the same person by scanning over the data using
DBSCAN, which is a clustering algorithm that creates clusters based on
a distance threshold and a minimum number of members. By slicing up
the parameter space based on a distance threshold, rather than on an a
priori number of clusters, the algorithm is insensitive to a change in
the number of points in the parameter space. This means you can use
this method in an iterative way and it can be adapted to new data as
it arrives. I'm reminded of some papers in astrophysics that did
clustering based on Voronoi volumes, but only in so far as the Voronoi
method is vaguely related to a density method.
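To make the clustering step concrete, here is a minimal sketch of DBSCAN in plain Python. It is not the authors' implementation. The paper learns its distance function with an SVM, whereas `name_distance` below is a toy string-similarity stand-in that I made up for illustration, and the blocking step is left out. The function names, the `eps` and `min_pts` values and the sample names are all assumptions.

```python
from difflib import SequenceMatcher


def name_distance(a, b):
    """Toy stand-in for a learned distance: 0.0 = identical, 1.0 = nothing shared."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dbscan(points, distance, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Every point (including i itself) within eps of point i.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may still be claimed later as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not UNVISITED:
                continue             # already assigned to a cluster
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:  # core point: keep expanding
                queue.extend(more)
    return labels


if __name__ == "__main__":
    # Hypothetical author strings; eps and min_pts are picked by hand.
    names = ["C. Lee Giles", "C Lee Giles", "C. L. Giles",
             "Seyda Ertekin", "S. Ertekin", "Jian Huang"]
    for name, label in zip(names, dbscan(names, name_distance, 0.3, 2)):
        print(label, name)
```

Note that the number of clusters is never passed in: it falls out of `eps` and `min_pts`, which is why the same settings can be reused as new records arrive.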
All in all this looks like a nice approach to the problem, and the
authors got 90% accuracy with their method, which is probably enough
to bootstrap a solution into existence.
The following paper "Soft peer review: Social software and distributed
scientific evaluation" was passed along to me by alf today. I think
another copy has been haunting my file system for a few days, but this
seemed like a good reason to sit down again with it.
It's by Dario Taraborelli and the abstract is as follows:
Abstract: The debate on the prospects of peer-review in the Internet
age and the increasing criticism leveled against the dominant role of
impact factor indicators are calling for new measurable criteria to
assess scientific quality. Usage-based metrics offer a new avenue to
scientific quality assessment but face the same risks as first
generation search engines that used unreliable metrics (such as raw
traffic data) to estimate content quality. In this article I analyze
the contribution that social bookmarking systems can provide to the
problem of usage-based metrics for scientific evaluation. I suggest
that collaboratively aggregated metadata may help fill the gap between
traditional citation-based criteria and raw usage factors. I submit
that bottom-up, distributed evaluation models such as those afforded
by social bookmarking will challenge more traditional quality
assessment models in terms of coverage, efficiency and scalability.
Services aggregating user-related quality indicators for online
scientific content will come to occupy a key function in the scholarly
communication system.
and I get a mention in the acknowledgments, which is cool.
It is a very nice essay on the potential of social bookmarking as a
tool for ranking academic articles, in addition to adding metadata to
scientific articles. Dario discusses the issue of ranking the
expertise of people who are bookmarking and proposes a really nice
method to get over the scaling problem that is inherent when we try to
introduce manual methods to rank people. He suggests that a user's
notes and annotations about a bookmark could be made available on an
anonymous basis. Others would have the option to copy these
annotations, or rate them. This would be a form of soft peer review on
the annotations, which would in turn affect the standing of the person
creating these annotations.
There would be ways to cheat this system, but with enough signal, one hopes that
such noise could be drowned out.
The paper also pointed out http://www.naboj.com/ which I'd not
seen before and which is
pretty amazing.
I really like this paper. Thanks Dario!
I've been thinking about the streamosphere (as coined by Euan Adie),
and how I am starting to get dragged down in it again. I'm getting
daily notifications from Pownce, Twitter, FriendFeed and LinkedIn
(I've been ignoring Facebook for ages, Facebook, you bite my ass).
What I really need is a way to manage my notifications and
communications in the way that I now manage my RSS feeds, through a
bespoke piece of aggregation kit.
Ubigraph (http://ubietylab.net/ubigraph/) (sorry for the fucked up
formatting, I think my blog host Vox is still being shitty at
recognising simple HTML formatting in input, mixed with line breaks,
it's just a frickin URL link for god's sake) is a nicely implemented
engine for making 3-D graphs and plots. I installed it and got the
default Python interface up and running in about 5 minutes. My initial
reaction was "this is cool". It runs on a server that you communicate
with using an XML-RPC interface. Then I had a moment and realised that
I couldn't think of anything to do with it, other than trying to plot
graph relationships in citations. I want to look at the graphing of
these relationships, but need to find the time to mine some data
first, so I'll just have to put this on the shelf for a moment. Before
I do, I wanted to spin off a quick blog post to remind myself that
this really was very easy to work with. You just create a graph object
G. One could use this in conjunction with NetworkX: subclass the
Ubigraph object from the NetworkX object, and as you wrangled your
NetworkX object you would see it appear in Ubigraph. That would be
cool.
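As a reminder of roughly what that looks like, here is a minimal sketch in Python 3 that pushes a list of edges to a Ubigraph server over XML-RPC. It is a simpler stand-in for the subclassing idea, not an implementation of it. The helper names `connect` and `mirror_edges` are made up for illustration, and the server address is an assumption (the default from a local install, as far as I remember), so check your own setup. The only remote calls used are `clear`, `new_vertex` and `new_edge`.

```python
import xmlrpc.client


def connect(url="http://127.0.0.1:20738/RPC2"):
    """Return the `ubigraph` namespace of a running server (URL is an assumption)."""
    return xmlrpc.client.ServerProxy(url).ubigraph


def mirror_edges(edges, graph):
    """Draw each (source, target) pair on `graph`, creating vertices on first sight.

    `graph` only needs new_vertex() and new_edge(a, b) methods, so a stub
    object works just as well as a live server proxy.
    """
    vertex_ids = {}
    for source, target in edges:
        for node in (source, target):
            if node not in vertex_ids:
                vertex_ids[node] = graph.new_vertex()
        graph.new_edge(vertex_ids[source], vertex_ids[target])
    return vertex_ids


# Usage against a live server (not run here, since it needs Ubigraph running):
#   G = connect()
#   G.clear()
#   mirror_edges([("paper A", "paper B"), ("paper B", "paper C")], G)
```

With NetworkX installed, the edge list could come straight from a graph object, for example `mirror_edges(my_graph.edges(), G)`, where `my_graph` is whatever citation graph you have built.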
I was in, of all places, Godalming, at the weekend and ended up
browsing in a book store for a few moments. I saw what looked like a
very interesting book, Predictably Irrational
(http://www.amazon.co.uk/Predictably-Irrational-Hidden-Forces-Decisions/dp/0007256523/ref=sr_1_1?ie=UTF8&s=books&qid=1211186399&sr=8-1).
The book looks at the processes behind bad or
irrational decisions. The author is an economist and it seems the aim
of the book is to help us to see the emotional effects that influence
our decisions, and that lead us to poor decision making. This is a
theme that is dear to my heart, as it is closely related to the way
that science policy gets determined, insofar as the facts about
specific scientific domains are often swamped by emotional reactions
to what people think the science is about.
I didn't buy the book yesterday, as I have a very large current
reading list, but not wanting to feel that I was costing myself some
opportunity I actually remembered the name of the book. I was
delighted this morning when I found that the author has a great site
about the book (http://www.predictablyirrational.com/), which includes
a blog (http://www.predictablyirrational.com/?page_id=17). The author,
Dan Ariely, also has a Wikipedia page
(http://en.wikipedia.org/wiki/Dan_Ariely).
From the site and the blog there are links to some papers that seem to
form the basis of the book, so now I can get some small chunks to read
to satisfy my curiosity about the subject.
I just got the following. Yay!
"Hi,
Somebody has sent you a Brightkite invite. Brightkite is a
location-based social network, currently in private beta.
Personal message:
The wait is over! Here's your Brightkite beta invite. Enjoy!"
Then on sign up discovered that it is only supported by a number of US
carriers, boo.
Having just gotten engaged and begun to look at the relative costs
involved in organising a wedding, I woke up this morning with the
unavoidable realisation that the need to save for the wedding is going
to make purchasing an iPhone untenable, while I have a perfectly
normal N95 that works rather well (especially since the broken key
unstuck and started working again). My cunning plan had been to wait
until the end of my current contract, at which point the new iPhone
would be available, but the need to actually pay for the device, in
addition to my desire to invite rather a lot of people to my wedding,
means that I probably ought to wait until after the wedding before I
think about getting a new flashy gadget.
So the question of the day is if Hillary Clinton becomes
Vice-President, would she be more or less evil than Dick Cheney?
Now Dick Cheney is evil Inc. but the guy has never really pretended to
be anything that he isn't, whereas Hillary would obviously
lie and make policy in any way possible to win votes, which means that
though she might not be as intrinsically evil as Dick Cheney, she
might make as bad a VP as him.
I think that she probably wouldn't be as actively evil as Dick Cheney,
but sadly she is beginning to point in that direction.
This morning as I was leaving Shoreditch Park a guy in a black
Maserati with the top down passed me on my bike.
I continued along the canal as usual and as I came past Sainsbury's in
Islington saw the same guy in the same car, as I passed him!
Bike 1, Maserati + London Traffic 0.