[BioPython] Biopython object serialization

Wed, 13 Nov 2002 19:02:28 -0800

On Tue, Nov 12, 2002 at 12:29:33PM +0200, Estienne Swart wrote:
> I've been wondering about a decent way of storing biopython objects for 
> some time now. It looks like there has been some progress (CVS) on 
> interfacing Biopython with a relational DB system, but is this the best 
> approach? For instance, say you'd actually like to store sequences 
> within the database (which require one of the large text field types), 
> you then find yourself having to deal with relatively long data 
> retrieval times (if memory serves me right, it takes on the order of a 
> couple seconds to retrieve a single sequence from a database containing 
> a few thousand entries, with sequences stored in the medium text field).

Aahhh, you're referring to the BioSQL project, I think.  Is this a
theoretical concern, or are you really having problems with the
performance of relational databases?
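
One way to settle that is to measure a single lookup.  Here is a minimal
sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in
database.  The table name, column names and accession format are made up
for illustration and are not the BioSQL schema, and the sequences are
random DNA rather than real entries.

```python
import random
import sqlite3
import time


def build_db(n_entries=2000, seq_len=5000, seed=0):
    """Create an in-memory table of random DNA sequences.

    The schema (table "seqs", columns "accession" and "residues") is a
    hypothetical one chosen for this sketch, not the BioSQL schema.
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seqs (accession TEXT PRIMARY KEY, residues TEXT)"
    )
    rows = (
        ("ACC%05d" % i, "".join(rng.choice("ACGT") for _ in range(seq_len)))
        for i in range(n_entries)
    )
    conn.executemany("INSERT INTO seqs VALUES (?, ?)", rows)
    conn.commit()
    return conn


def fetch_sequence(conn, accession):
    """Return the sequence string for one accession, or None if absent."""
    row = conn.execute(
        "SELECT residues FROM seqs WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None


def time_fetch(conn, accession, repeats=100):
    """Return the mean wall-clock time in seconds of one indexed lookup."""
    start = time.perf_counter()
    for _ in range(repeats):
        fetch_sequence(conn, accession)
    return (time.perf_counter() - start) / repeats
```

Calling time_fetch(build_db(), "ACC01000") gives a per-lookup figure for
this toy setup.  A real test would have to use the actual database
server, schema and indexes, since those are what determine the retrieval
time.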

> Have any of the biopython developers attempted/considered using an 
> object database, such as ZODB, or at least assessed the relative merits 
> of some different approaches to data storage/object persistence?

No, although perhaps it is warranted.

The BioSQL project was originally started by Ewan Birney with bioperl.
Since then, interfaces to it have been written in Python and Java.
Thus, one of the requirements is that the data stored must be
accessible from those other languages as well.  This ruled out ZODB
and many other systems that were not as well supported across
different languages.
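
To make the cross-language point concrete, here is a rough sketch that
stores the same record two ways.  It uses Python's pickle module as a
stand-in for a Python-only object store (it is not ZODB), and SQLite as
a stand-in relational database.  The record layout and the table are
hypothetical and are not the BioSQL schema.

```python
import pickle
import sqlite3


def to_pickle(record):
    """Serialize a record to bytes in Python's own pickle format."""
    return pickle.dumps(record)


def from_pickle(blob):
    """Rebuild the record; reading this format requires Python."""
    return pickle.loads(blob)


def create_table(conn):
    """Create a hypothetical one-table schema with plain text columns."""
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )


def to_row(conn, record):
    """Store the record as ordinary columns that any SQL client can read."""
    conn.execute(
        "INSERT INTO entry VALUES (?, ?, ?)",
        (record["accession"], record["description"], record["residues"]),
    )
    conn.commit()


def from_row(conn, accession):
    """Fetch one record back as a dict, or None if it is not stored."""
    row = conn.execute(
        "SELECT accession, description, residues FROM entry "
        "WHERE accession = ?",
        (accession,),
    ).fetchone()
    if row is None:
        return None
    return {"accession": row[0], "description": row[1], "residues": row[2]}
```

The pickled bytes are convenient from Python but opaque to a Perl or
Java client, while the table rows can be read with a plain SELECT from
any language that has a database driver.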

That's not to say that Biopython wouldn't benefit from supporting
other types of data storage systems that might have different
performance characteristics than relational databases.  However, unless
you're volunteering, I'm not optimistic that it's going to happen,
since the relational DB approach works well enough for most people.
;)

> I recently came across an article about object persistence 
> (http://www-106.ibm.com/developerworks/linux/library/l-pypers.html), by 
> Patrick O'Brien (the name should ring a bell to those of you that read 
> his O'Reilly article on Bioinformatics). He advocates the use of his own 
> solution to persistence,
> PyPerSyst <http://sourceforge.net/projects/pypersyst/>, which is 
> supposedly faster than ZODB, and simpler to implement too.
> 
> Do you think that some benchmarking would be in order (not that I'm 
> volunteering)?

Dang, not volunteering.

> What course will Biopython be pursuing in the near future (as far as 
> object serialization is concerned)? Is there room for alternatives 
> besides those that use relational databases, i.e. will they be 
> competitive as far as performance is concerned?

For the foreseeable future, we'll be working on improving BioSQL
support.  It seems to be working well enough that there are no
imminent plans to change.

Jeff