BarCamp Cambridge - James Smith, head of the internet team for Ensembl, talking about Ensembl
Ensembl came out of the Human Genome Project about 8 years ago to prevent
commercialization of genomic data.
the idea was to have an open source human genome
companies would have to do some work before they could make money off of
sequences.
the Ensembl project takes the raw data from the genes and adds other data
to it, such as reference data from other experiments
there is the Ensembl code
and there is the data
there are 41 genomes
the code is also used elsewhere, outside this project
everything is open source
there are probably about 100 installed copies worldwide
it is 1.5 million lines of Perl code
major pharma companies use it and layer their in-house data on top of the public
data
there is a public MySQL interface
www.ensembl.org (no e on the end)
there is also an archive system to see old data
everything is in CVS
there are about 40 people involved directly from the gene builders through
to the comparative groups
there is a functional annotation of the genome
there is the web team, an outreach team, a helpdesk team,
a warehouse team,
and others ...
there is support from the core web team
scale
35 species in Ensembl: human, mouse, rat, zebrafish
then there are random mammals
hedgehogs, many mammals from madagascar
the platypus has a poisoned claw
they are running half a million search index queries on one machine; this
makes them about the 5th
largest search index in the world
about 2 million page impressions a week
100 GB of data traffic
they have 20 4-core machines, about 80 cores to run the site
BLAST and SSAHA servers
using 40 TB of data at the moment
you expect a hardware failure every week, and they don't let you know;
at this point there is about one hardware failure every day
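A quick back-of-the-envelope check of the scale figures, as a minimal Python sketch. The variable names are mine, and spreading the traffic evenly over a seven-day week is an assumption; the talk gave no peak figures.

```python
# Figures quoted in the talk.
machines = 20
cores_per_machine = 4
page_impressions_per_week = 2_000_000

# Assumption: traffic is spread evenly over a 7-day week.
seconds_per_week = 7 * 24 * 60 * 60

total_cores = machines * cores_per_machine
avg_impressions_per_second = page_impressions_per_week / seconds_per_week

print(f"total cores: {total_cores}")
print(f"average page impressions per second: {avg_impressions_per_second:.1f}")
```

That works out at 80 cores and roughly 3.3 page impressions per second on average; real traffic would be burstier than this.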
currently on 3rd set of web code
2000 human
2001 mouse
2001 fly
2003 Vega site
2004 archive site started
2005 web code v3
2006 users and groups
in about a month Ensembl 50 will be released
also have a number of other sites
they have a two-month cycle for releasing data and code.
the day after each release they start building genes again
many data sets take longer than this; for example, the new mouse sequence was
released by NCBI 6 months ago,
but it has taken this long for Sanger to do the annotation and comparative
work.
there is a pre-site for data that didn't quite finish within the two month
cycle
VectorBase - Ensembl for disease vectors
Gramene - Ensembl for plants
Cosmic - uses the drawing code
they are moving over to AJAX because people don't realize that items in the
interface are buttons or forms.
a lot of the interaction is human interaction
they hope they can make AJAX that does not break screen readers, and hope
that AJAX will offer a web services
platform. this leads to issues of display vs data markup.
webcode is extensible by plug-ins.
can add code which resides outside the main Ensembl CVS tree - but is
accessible from within.
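To make the plug-in idea concrete, here is a minimal Python sketch of one way code outside a main tree can be made reachable from within it. This is not Ensembl's actual mechanism (their web code is Perl); the registry, the `register` decorator, and the page and gene names are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a page name to the function that renders it.
RENDERERS: Dict[str, Callable[[str], str]] = {}


def register(page: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that installs (or overrides) the renderer for a page."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        RENDERERS[page] = func
        return func
    return wrap


# --- core code, living inside the main tree ---
@register("gene_summary")
def core_gene_summary(gene_id: str) -> str:
    return f"Gene {gene_id}"


# --- plug-in code, living outside the main tree ---
@register("gene_summary")
def plugin_gene_summary(gene_id: str) -> str:
    # Reuse the core view and layer extra (for example in-house) data on top.
    return core_gene_summary(gene_id) + " (with in-house annotations)"


def render(page: str, gene_id: str) -> str:
    """The core looks renderers up by name, so plug-ins are picked up automatically."""
    return RENDERERS[page](gene_id)


print(render("gene_summary", "EXAMPLE0001"))
```

The design point is that the core only ever calls `render`, so a plug-in loaded later can replace or extend a page without any change to the main tree.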
and that's it
Questions:
Q: how does MySQL cope?
it copes really well; they have 150 GB, of which about 5 GB is in a read-write DB and the rest is in
read-only DBs
the issue is not the size of the data, but the number of tables.
one of the DBs has 3000 tables, so they balance the data very carefully
across the servers
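How the balancing is actually done was not described. As an illustration only, here is a minimal Python sketch of one common approach: place the database with the most tables first, always onto the server that currently holds the fewest tables. The function name, database names, and all table counts other than the 3000 figure are made up.

```python
import heapq
from typing import Dict, List


def balance_by_table_count(table_counts: Dict[str, int], n_servers: int) -> Dict[int, List[str]]:
    """Greedy placement: biggest database first, onto the least-loaded server."""
    # Min-heap of (tables assigned so far, server index).
    heap = [(0, i) for i in range(n_servers)]
    heapq.heapify(heap)
    placement: Dict[int, List[str]] = {i: [] for i in range(n_servers)}

    biggest_first = sorted(table_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in biggest_first:
        load, server = heapq.heappop(heap)   # server with the fewest tables
        placement[server].append(name)
        heapq.heappush(heap, (load + count, server))
    return placement


# Hypothetical databases; only the 3000-table figure comes from the talk.
dbs = {"db_a": 3000, "db_b": 1200, "db_c": 900, "db_d": 800, "db_e": 100}
placement = balance_by_table_count(dbs, 2)
for server, names in placement.items():
    print(server, names, sum(dbs[n] for n in names))
```

With these made-up numbers the 3000-table database gets a server to itself and the other four share the second one, so each server ends up with 3000 tables.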
some problems come from MySQL not being able to have key
tables larger than 4 GB,
and when you have 60 GB of memory you run into this problem.
the bottlenecks tend to be in the code layer, not in the DB
this is one of the largest MySQL DBs in the world
currently using 4.something, keep planning to move to 5, but keep finding
other things that are more important.
there are a lot of left joins in some queries.
sometimes it is easier to do these joins in Perl rather than in
MySQL;
that can be millions of times faster than in MySQL
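For readers who have not done a join in application code, here is a minimal sketch of the idea, in Python rather than the Perl used by the project. It fetches two plain result sets and left-joins them through an in-memory hash. The row shapes, column names, and sample values are hypothetical, and nothing here says this is how the Ensembl code does it.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_join(left_rows: List[Row], right_rows: List[Row], key: str) -> List[Row]:
    """Hash-based LEFT JOIN done in application code instead of in SQL."""
    # Index the right-hand rows by join key: one pass, constant-time lookups after.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined: List[Row] = []
    for left in left_rows:
        matches = index.get(left[key])
        if matches:
            for right in matches:
                joined.append({**right, **left})
        else:
            # LEFT JOIN semantics: keep the left row even with no match.
            joined.append(dict(left))
    return joined


# Hypothetical rows, as if returned by two simple single-table SELECTs.
genes = [{"gene_id": 1, "name": "geneA"}, {"gene_id": 2, "name": "geneB"}]
xrefs = [{"gene_id": 1, "xref": "X100"}, {"gene_id": 1, "xref": "X101"}]

result = left_join(genes, xrefs, "gene_id")
for row in result:
    print(row)
```

Unmatched left rows simply lack the right-hand columns here; a stricter version would fill them with `None` the way SQL fills them with NULL.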
connected to the net via a 1 Gb link to JANET.