[Biojava-l] Hackathon Wrap-up

Ewan Birney birney@ebi.ac.uk
Tue, 5 Mar 2002 09:31:20 +0000 (GMT)


  BioHackathon


Over the last 6 weeks we have held the first "hackathon" where developers
of the open-bio projects met. The hackathon was split over two sessions,
the first one being at the O'Reilly Bioinformatics Technology Conference
in Arizona and the second in Cape Town, South Africa, organised by Electric
Genetics. As well as this key support from O'Reilly and Electric Genetics,
the hackathon was additionally sponsored by Astra Zeneca and Dalke
Scientific. All the code generated was immediately committed to the
publicly accessible CVS system on open-bio (instructions at
http://cvs.open-bio.org/).


The hackathon drew together 20 developers across a number of different
open source projects. Our aim was to develop an infrastructure for
accessing sequence databases transparently that scales from a small
single computer in a molecular biology lab to a large scale pipeline
project. This infrastructure can be transparently shared between the
different language projects - eg, building a sequence database in
BioPerl but accessing it from BioJava. The hope is that we can both
reduce the time it takes to build and test applications in different
languages and, at the same time, reduce the overhead in managing and
deploying sequence databases in bioinformatics installations. Aware of
the need for snazzy acronyms for standards to allow people to dazzle
their managers/sales force/bosses we have named this the "Open
Bioinformatics Database Access" scheme (OBDA for short).


We settled on a standard set of 6 implementations to retrieve
sequences, differing in their complexity, network requirements and
throughput.  In all cases we were taking an existing system from an
open source project and wherever possible we followed existing
standards. Having discussed the specifications of these methods we
then implemented the system in 5 languages - Perl, Java, Python, Ruby
and C (not all languages got all implementations due to limitations in
programming time, but Perl, Java and Python had a full suite). The
implementations were then tested between different languages to ensure
programmatic and data transfer capabilities. Finally the different
methods were performance tested and a number of performance bottlenecks
removed. There are more technical details at the end of this mail and
a list of what each participant achieved.


At the same time a number of other projects were advanced. A framework
for Bibliographic objects was discussed and Perl and Java code provided.
The Genquire Perl GUI was adapted to work on top of aspects of the
OBDA system. Bio::Graphics, a GIF drawing system for Perl, was integrated
into BioPerl. The OmniGene project became more plug-and-play with BioJava.



One important corollary of our work was strengthening the common
conceptual view of our data. For the last five years all the projects
have by and large been sticking to a common core of EMBL/GenBank
format information in their data model. It was unclear how to extend
this model into other areas without losing cross-project
interoperability. The requirement of all projects to read and
write to a relational database (BioSQL) forced us to re-examine our
common data model away from the perspective of a data format.  The
result was in fact closer cooperation and a clearer understanding of
how to extend our data models in a cross-project compatible manner. In
particular we have decided to make ontology integration an explicit
option for our information, allowing more flexibility and richness in
describing the additional data attached to sequences.



Finally, we had fun. Some of that fun was deliberately scheduled such
as the trip to the fast-food Mexican "chuys" joint in Tucson where we
acquired a stuffed toy (which became our mascot). South Africa was a real
eye opener for us, with incredible scenery, lovely people and real
attention to detail from our hosts, Electric Genetics. But we are also
hackers, and all of us got a kick out of simply being able to work
together with few distractions and an open 802.11b network. Having a
turn around time of minutes in a Q/A session, rather than potential
days when people are working via email in different time zones 
was sensational.


All the projects, and open-bio in general, were strengthened immeasurably by
the hackathon. We'd like to thank our organisers (Electric Genetics and
O'Reilly) and sponsors (Astra Zeneca and Dalke Scientific) and in general
the support from open-bio over this time.



Person by Person report.
------------------------


Michele Clamp:  Perl flatfile indexing works quite fast - going into Ensembl
production in the next month


Heikki Levashilo:  BioFetch has been taking more work than expected.
Server side has one outstanding bug in error reports; BioPerl
implementation keeps changing (becoming more generic) to work with a wide
variety of dbs; RefSeq, SP, EMBL are in


James Cuff:  testing arm; overview of all different language projects.
Scalability and information transfer testing.


Steve Searle:  C Berkeley db & flatfile connectors

Lincoln Stein:  Berkeley DB/Perl implementation; very fast but not as fast
as C. Bio::Graphics into Bioperl. File Caching in Bioperl.


Martin Senger: Tie ins from some languages to do biblio as web service;
few more implementation and testing; BioPerl is most complete; Java is
complete but needs commenting; Python has also made good progress. 


Chris Mungal:  BioSQL core is pretty much done; BioPerl DB both
Postgres and MySQL. Also will be working on
ontology module for both BioSQL and BioPerl. 


Brian Gilman: BioJava hooks with a BioSQL back end.  Will put up a DAS server
at home that serves anything that is in BioSQL.  Will make ER diagram from
BioSQL DDL and put on web so people can see it.


Elia Stupka: Registry in Bioperl. Promised world-accessible BioSQL
server of EMBL will be put up when he gets home. 


Ewan Birney:  Bioperl CORBA, Memory caching in Bioperl. Performance
enhancement of parsers.

Katayama Toshiaki: BioRuby BioSQL, Biofetch client and server and
Registry


Jason Staijch: Bioperl CORBA and prepping for 1.0 release of BioPERL.  
Make decisions on how to release.  Will include registry, index, Medline,
parser, etc.  Hopefully people will hammer on it and will get feedback
from people who aren't such good programmers.


Thomas Down:  BioJava - all this stuff was cut off
from 1.2.  BioJava should release 1.3 in ~2 months or less to include this
stuff.  Testing on BioSQL, working nicely.  Tidied up BioJava registry
code, added access point for normal users.  BioFetch client/server and
CORBA interoperability.


Mark Wilkinson:  GenQuire on Windows bugs have been squashed
and put up for download. BioSQL server.  GenQuire works with schema if
only one contig is specified. All genes displayed in + strand.  May not
all be in GenQuire, might be in BioSQL adapter. 


Chris Dadigidan:  Some BioPERL; working BioSQL in Boston with GenBank.  
Love to start playing with client/server stuff cross-language.  
Documenting for system admin


Andrew Dalke: Flatfile indexing with flatfile and
BerkeleyDB; regression testing.


Matthew Pocock:  Java flatfile indexer; debugging and performance
enhancing.


Brad Chapman:  Registry in Python, BioSQL indexing, BioCORBA and regular
http is all hooked in to get from one interface.  BioSQL is all set with
new schema. 



Technical Details
-----------------
  

The OBDA specifications are available via anonymous cvs from cvs.bioperl.org,
/home/repository/obf-common cvs module obda-specs. We will have a web page
off open-bio.org soon and we are hoping to publish a paper on OBDA this year.


In brief, the 6 implementations are:


 (1) Flat file, raw index. This implementation requires no technology
beyond reading files. It works off a fixed-length sorted record with
byte offsets into a flat file dump of sequences.

 (2) Flat file, Berkeley DB. This implementation is the same data model as the
flat file index, but using Berkeley DB as the back-end store having byte offsets
into a flat file dump of sequences.

 (3) Biofetch. This is a simple EMBL/GenBank/Fasta format over http: protocol,
where clients have to provide a suitably formatted query string and the server
responds with the entry as an ASCII stream over http.

 (4) XEMBL. This is a SOAP protocol with the data format being one of Agave
XML or BSML XML. We are debating how much stress we should put on this as Biofetch
seems to work more cleanly for us currently.

 (5) BioCorba. We use the BSANE/BioCorba 0.5 spec and did cross-platform testing.

 (6) BioSQL. A relational schema which we tested on both MySQL and Postgres.
This was perhaps the project which stretched our conceptual understanding
of the area the most, with gratifying results as we round-tripped
information between the different projects.
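
As a rough illustration of method (1), here is a minimal Python sketch of a
sorted, fixed-length index holding byte offsets into a flat file. The record
layout (20-byte ID, 12-byte offset, newline), the FASTA-style ">" entry marker
and all function names are assumptions made up for this example; they are not
taken from the OBDA specification, which should be consulted for the real
layout.

```python
import io

# Assumed layout for this sketch only (not the OBDA on-disk format):
# each index record is a space-padded ID, a right-aligned offset, a newline.
ID_WIDTH = 20
OFFSET_WIDTH = 12
RECORD_WIDTH = ID_WIDTH + OFFSET_WIDTH + 1


def build_index(flat_file):
    """Scan a binary FASTA-style stream and return the sorted index as bytes."""
    records = []
    offset = 0
    for line in flat_file:
        if line.startswith(b">"):
            seq_id = line[1:].split()[0]
            if len(seq_id) > ID_WIDTH:
                raise ValueError(f"ID too long for index: {seq_id!r}")
            records.append(
                seq_id.ljust(ID_WIDTH)
                + str(offset).encode().rjust(OFFSET_WIDTH)
                + b"\n"
            )
        offset += len(line)  # byte offset of the next line
    records.sort()  # sorted records make binary search possible
    return b"".join(records)


def lookup(index_file, seq_id):
    """Binary-search the fixed-length records; return a byte offset or None."""
    key = seq_id.ljust(ID_WIDTH)
    index_file.seek(0, io.SEEK_END)
    lo, hi = 0, index_file.tell() // RECORD_WIDTH
    while lo < hi:
        mid = (lo + hi) // 2
        index_file.seek(mid * RECORD_WIDTH)
        record = index_file.read(RECORD_WIDTH)
        rec_key = record[:ID_WIDTH]
        if rec_key == key:
            return int(record[ID_WIDTH:ID_WIDTH + OFFSET_WIDTH].decode())
        if rec_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None


def fetch(flat_file, offset):
    """Read one entry starting at offset, stopping at the next '>' line."""
    flat_file.seek(offset)
    lines = [flat_file.readline()]
    for line in flat_file:
        if line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines)


# Tiny in-memory demo; real use would open files in binary mode instead.
flat = io.BytesIO(b">seqB second\nGGCC\nTTAA\n>seqA first\nACGT\n")
index = io.BytesIO(build_index(flat))
print(fetch(flat, lookup(index, b"seqA")).decode())
```

Because every record has the same width, a lookup needs only seeks and reads
on the index file, which is what lets this method run with nothing installed
beyond the language itself. Method (2) keeps the same ID-to-offset mapping but
stores it in Berkeley DB instead of a sorted file.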



Finally there is a simple discovery system (called the "Registry") which
associates database namespaces (eg, EMBL) with implementations (eg, BioSQL
at this location). The Registry is found by searching the path

  $HOME/.bioinformatics/seqdatabase.ini
  /etc/bioinformatics/seqdatabase.ini
  http://www.open-bio.org/registry/seqdatabase.ini


The aim here is to have a path of personal, local and internet-wide
specifications for where databases are located. The internet accessible
registry will
mean that just by installing bioperl users will get transparent (if
potentially a little slow) access to databases. We expect the "local"
configuration mode to be the most widely used across bioinformatics
installations.
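
To make the search order concrete, here is a small Python sketch that walks a
list of registry locations and returns the first one it can read. The function
names are hypothetical and not part of any of the open-bio toolkits, and the
file contents are returned unparsed because the format of seqdatabase.ini is
not described in this mail.

```python
import os
import tempfile
import urllib.request

# The three locations listed above: personal, local, then internet-wide.
REGISTRY_PATH = [
    os.path.join(os.path.expanduser("~"), ".bioinformatics", "seqdatabase.ini"),
    "/etc/bioinformatics/seqdatabase.ini",
    "http://www.open-bio.org/registry/seqdatabase.ini",
]


def read_location(location, timeout=10):
    """Return the text found at a file path or http URL, or None on failure."""
    try:
        if location.startswith(("http://", "https://")):
            with urllib.request.urlopen(location, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        with open(location, encoding="utf-8") as handle:
            return handle.read()
    except (OSError, ValueError):
        # Missing file, unreadable file, network error or malformed URL.
        return None


def find_registry(search_path=REGISTRY_PATH):
    """Return (location, text) for the first readable entry, else (None, None)."""
    for location in search_path:
        text = read_location(location)
        if text is not None:
            return location, text
    return None, None


# Demo with temporary stand-ins for the personal and local files, so that
# nothing on the real system or network is touched.
with tempfile.TemporaryDirectory() as tmp:
    personal = os.path.join(tmp, "personal.ini")  # deliberately not created
    local = os.path.join(tmp, "local.ini")
    with open(local, "w", encoding="utf-8") as handle:
        handle.write("placeholder registry contents\n")
    found, text = find_registry([personal, local])
    print(found == local)
```

Calling find_registry() with no arguments walks the three real locations in
order, so a personal file overrides the site-wide one, and the internet
registry is consulted only when neither exists.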


(The web accessible registry is currently just a testing version. Once we
have built up the correct services worldwide we will replace it with a set
of internet accessible services).




We will soon be setting up (Chris - is it up already?) a mailing list
explicitly for cross-project projects and in particular to allow the
development of the common data/concept model to be put in place.







-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------