[Biopython] SQL Alchemy based BioSQL

Kyle Ellrott kellrott at gmail.com
Wed Aug 26 16:40:00 UTC 2009


> I'm not sure about that - all the other "lookup" functions already in
> BioSeqDatabase return DBSeqRecord objects don't they? See
> below for an alternative...

Although the example I provided didn't illustrate it, the reason I did
it this way was to provide a function that could look up features
without having to find their DBSeqRecords first.  In my particular
case, I've loaded all of the GenBank files from
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want
to be able to find proteins by their protein_id:

lookupFeature(protein_id="NP_808921.1")

> Have you tested this on composite features (e.g. a join)?
> Without looking into the details of your code this isn't clear.

That isn't supported in the current syntax.  But when using SQLAlchemy
it's pretty easy to generate new queries by treating the tables and
selection criteria like lists.  Right now I just shove rules into an
'and' list, but one could also create an 'or' list, as sketched below.
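
For example, here is a minimal sketch of how the criteria-list
approach could express the protein_id lookup above, written against
current SQLAlchemy syntax (the table and column names come from the
BioSQL schema; the database URL is just a placeholder):

from sqlalchemy import MetaData, Table, and_, create_engine, select

engine = create_engine("sqlite:///biosql.db")  # placeholder URL
metadata = MetaData()

# reflect the two BioSQL tables involved in a qualifier-value lookup
sfqv = Table("seqfeature_qualifier_value", metadata, autoload_with=engine)
term = Table("term", metadata, autoload_with=engine)

# collect the selection criteria in a plain Python list...
criteria = [
    sfqv.c.term_id == term.c.term_id,
    term.c.name == "protein_id",
    sfqv.c.value == "NP_808921.1",
]
# ...then AND them together; swapping in or_(*criteria) would give
# the 'or' list variant instead
query = select(sfqv.c.seqfeature_id).where(and_(*criteria))
with engine.connect() as conn:
    seqfeature_ids = [row[0] for row in conn.execute(query)]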

> I wonder how well this would scale with a big BioSQL database
> with hundreds of bioentry rows, and millions of seqfeature
> and location rows? You'd have to search all the location rows,
> filtering on the seqfeature_id linked to the bioentry_id you
> wanted. The performance would depend on the database
> server, the database software, how big the database is, and
> any indexing etc.

Like I said, my test database holds all of the published bacterial
GenBank files from the NCBI FTP site: about 1,784 bioentry rows and
about 27,838,905 seqfeature_qualifier_value rows.  The location-based
searches were pretty much instant.
The only way that I've augmented the database is by adding an index:

MySQL:
CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10));

SQLite:
CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value);

This lets me look up features by their value quickly (like getting a
protein by its protein_id).  Note the '(10)' for MySQL: it will only
index TEXT/BLOB columns if you give it a prefix length to index...


> On the other hand, if all the record's features have already been
> loaded into memory, there would just be thousands of locations
> to look at - it might be quicker.

My experience so far is that pulling all the features for a single
large chromosome can take a while.
One of the other things that I did to speed things up (it's not
actually faster, it just spreads the work out) is to build a
DBSeqFeature with a lazy loader.  It just stores its seqfeature_id
and type until __getattr__ is hit, and only then does it bother to
load the data in from the database.  So if you bring in 2,000
seqfeatures, you get the list back and can read the first entry
without having to first load the other 1,999 entries.
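
Roughly, the lazy loader looks something like this (a sketch, not the
actual class; fetch_feature_data is a hypothetical adaptor method
standing in for the real database call):

class DBSeqFeature(object):
    """Stores only seqfeature_id and type until another attribute is read."""

    def __init__(self, adaptor, seqfeature_id, type):
        self._adaptor = adaptor
        self.seqfeature_id = seqfeature_id
        self.type = type

    def __getattr__(self, name):
        # only reached when normal attribute lookup fails, i.e. the first
        # time an unloaded field like .location or .qualifiers is touched
        if "_loaded" not in self.__dict__:
            self.__dict__["_loaded"] = True
            # hypothetical adaptor call returning the full feature data
            data = self._adaptor.fetch_feature_data(self.seqfeature_id)
            self.location = data.location
            self.qualifiers = data.qualifiers
        return object.__getattribute__(self, name)

With that in place, building a list of 2,000 of these is just 2,000
cheap constructor calls; the database only gets hit for the features
you actually inspect.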

> This brings me to another idea for how this interface might work,
> via the SeqRecord - how about adding a method like this:
>
> def filtered_features(self, start=None, end=None, type=None):
>
> Note I think it would also be nice to filter on the feature type (e.g.
> CDS or gene). This method would return a sublist of the full
> feature list (i.e. a list of those SeqFeature objects within the
> range given, and of the appropriate type). This could initially
> be implemented with a simple loop, but there would be scope
> for building an index or something more clever.

It may be worthwhile to just use the SQLite in-memory database.  We
could store the schema in the module, and have a simple wrapper module
that builds the SQLite RAM database and loads the sequence file into
it.
Something like:

from Bio import SeqIODB
handle = open("ls_orchid.gbk")
for seq_record in SeqIODB.parse(handle, "genbank"):
    print seq_record.id

But in the background SeqIODB would be creating a SQLite in-memory
database and loading ls_orchid into it; its __iter__ function would
simply spit out DBSeqRecords for each of the bioentries...
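
A sketch of what that wrapper's parse function might look like,
assuming a sqlite3 BioSQL driver and a bundled copy of the SQLite
schema (the file name "biosqldb-sqlite.sql" and the database name
"temp" are just placeholders):

from Bio import SeqIO
from BioSQL import BioSeqDatabase

def parse(handle, format):
    # build the throwaway RAM database and load the BioSQL schema into it
    server = BioSeqDatabase.open_database(driver="sqlite3", db=":memory:")
    server.load_database_sql("biosqldb-sqlite.sql")  # bundled schema (assumed)
    db = server.new_database("temp")
    # push every record from the file into the database, then hand back
    # database-backed DBSeqRecords instead of plain in-memory SeqRecords
    db.load(SeqIO.parse(handle, format))
    server.commit()
    for key in db:
        yield db[key]

Filtered feature lookups like the ones discussed above would then come
along for free, since every record the loop yields is database-backed.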


Kyle



