[Biopython] SQL Alchemy based BioSQL

Peter biopython at maubp.freeserve.co.uk
Wed Aug 26 20:24:34 UTC 2009


On Wed, Aug 26, 2009 at 5:40 PM, Kyle Ellrott <kellrott at gmail.com> wrote:
>> I'm not sure about that - all the other "lookup" functions already in
>> BioSeqDatabase return DBSeqRecord objects don't they? See
>> below for an alternative...
>
> Although the example I provided didn't illustrate it, the reason I did
> it this way was to provide a function that could look up features
> without having to find their DBSeqRecords first.  In my particular
> case, I've loaded all of the GenBank files from
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz and I want
> to be able to find proteins by their protein_id.
>
> lookupFeature( protein_id="NP_808921.1" )

That makes sense.
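
As a rough sketch (not Kyle's actual code) of the sort of query such a
lookupFeature() call has to issue, here it is written against the
standard BioSQL schema using the sqlite3 module from the Python
standard library - the "biosql.db" filename is just a made up example:

import sqlite3

# Hypothetical SQLite database using the BioSQL schema:
conn = sqlite3.connect("biosql.db")
cursor = conn.cursor()
# Find seqfeatures whose "protein_id" qualifier has the given value:
cursor.execute(
    "SELECT sf.seqfeature_id, sf.bioentry_id"
    " FROM seqfeature AS sf"
    " JOIN seqfeature_qualifier_value AS qv"
    "   ON qv.seqfeature_id = sf.seqfeature_id"
    " JOIN term AS t ON t.term_id = qv.term_id"
    " WHERE t.name = ? AND qv.value = ?",
    ("protein_id", "NP_808921.1"))
for seqfeature_id, bioentry_id in cursor.fetchall():
    print "seqfeature %i (bioentry %i)" % (seqfeature_id, bioentry_id)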

>> Have you tested this on composite features (e.g. a join)?
>> Without looking into the details of your code this isn't clear.
>
> That isn't supported in the current syntax.  But when using
> SQLAlchemy it's pretty easy to generate new queries by
> treating the tables and selection criteria like lists.  Right
> now I just shove rules into an 'and' list, but we could
> also create an 'or' list.

Even the bacterial GenBank files have joins ;)
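
As an aside, here is a minimal illustration (not Kyle's code) of that
"criteria as lists" idea using SQLAlchemy's and_/or_ helpers - the
column names are from the BioSQL location table, and printing the
clause just shows the generated SQL:

from sqlalchemy.sql import and_, or_, column

# Simple case: one contiguous location, rules AND-ed together
rules = [column("start_pos") >= 1000,
         column("end_pos") <= 5000]
print and_(*rules)

# A composite (join) location could OR together one rule set per part
part1 = and_(column("start_pos") >= 1000, column("end_pos") <= 2000)
part2 = and_(column("start_pos") >= 3000, column("end_pos") <= 5000)
print or_(part1, part2)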

>> I wonder how well this would scale with a big BioSQL database
>> with hundreds of bioentry rows, and millions of seqfeature
>> and location rows? You'd have to search all the location rows,
>> filtering on the seqfeature_id linked to the bioentry_id you
>> wanted. The performance would depend on the database
>> server, the database software, how big the database is, and
>> any indexing etc.
>
> Like I said, my test database is all of the published bacterial
> GenBank files from the NCBI FTP site.  It's about 1784 bioentry rows
> and about 27,838,905 seqfeature_qualifier_value rows.  The location-based
> searches were pretty much instant.

Cool. That sounds like more or less the same amount of data
we have in our BioSQL database.

> The only way that I've augmented the database is by adding an index:
>
> MySQL:
> CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value(10));
> SQLite:
> CREATE INDEX seqfeaturequal_value ON seqfeature_qualifier_value(value);
>
> So that I could look up features by their value quickly (like getting
> a protein by its protein_id).  Note the '(10)' for MySQL, because it
> will only index text blobs if you give it a prefix length to index...

We've done something similar here too - I'd have to check
exactly what we used for the index though.

>> On the other hand, if all the record's features have already been
>> loaded into memory, there would just be thousands of locations
>> to look at - it might be quicker.
>
> My experience so far is that pulling all the features for a single
> large chromosome can take a while.
> One of the other things that I did to speed things up (it's not
> actually faster, it just spreads the work out) is to build a
> DBSeqFeature with a lazy loader.  It just stores its seqfeature_id
> and type until __getattr__ is hit, and only then does it bother to
> load the data in from the database.  So if you bring in 2000
> seqfeatures, you get the list back and read the first entry without
> having to first load the other 1999 entries.

Yes - Leighton Pritchard has also done a DBSeqFeature object
(with lazy loading of the qualifiers too). I guess your code will
be similar. This is something I think could well be worth merging
into BioSQL (and doesn't depend on SQLAlchemy at all).
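
Here's a toy sketch of that lazy loading idea (not Kyle's or Leighton's
actual code) - the feature object only keeps its database id and type,
and __getattr__ triggers the real load the first time anything else is
asked for.  The demo_loader function below is just a stand-in for the
real database query:

from Bio.SeqFeature import SeqFeature, FeatureLocation

class LazySeqFeature(object):
    """Stores just an id and type; defers the real database work."""
    def __init__(self, seqfeature_id, type, loader):
        self.seqfeature_id = seqfeature_id
        self.type = type
        self._loader = loader   # callable fetching the full SeqFeature
        self._real = None

    def __getattr__(self, name):
        # Only called for attributes NOT set in __init__ (e.g. .location
        # or .qualifiers), so the database hit happens on first real use
        if self._real is None:
            self._real = self._loader(self.seqfeature_id)
        return getattr(self._real, name)

def demo_loader(seqfeature_id):
    print "Loading feature %i from the database..." % seqfeature_id
    return SeqFeature(FeatureLocation(10, 20), type="CDS")

f = LazySeqFeature(42, "CDS", demo_loader)
print f.type       # no database hit
print f.location   # triggers the (fake) load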

>> This brings me to another idea for how this interface might work,
>> via the SeqRecord - how about adding a method like this:
>>
>> def filtered_features(self, start=None, end=None, type=None):
>>
>> Note I think it would also be nice to filter on the feature type (e.g.
>> CDS or gene). This method would return a sublist of the full
>> feature list (i.e. a list of those SeqFeature objects within the
>> range given, and of the appropriate type). This could initially
>> be implemented with a simple loop, but there would be scope
>> for building an index or something more clever.
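
For what it's worth, a first pass at that "simple loop" might look
something like this (written as a standalone function rather than a
SeqRecord method, and assuming the feature's location start/end values
compare sensibly with plain integers):

def filtered_features(record, start=None, end=None, type=None):
    """Return the SeqFeature objects of the given type within the range."""
    answer = []
    for feature in record.features:
        if type is not None and feature.type != type:
            continue
        if start is not None and feature.location.start < start:
            continue
        if end is not None and feature.location.end > end:
            continue
        answer.append(feature)
    return answer
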
>
> It may be worthwhile to just use the SQLite memory database.  We
> store the schema in the module, and have a simple wrapper module that
> builds the SQLite RAM database and loads the sequence file into the
> database.
> Something like:
>
> from Bio import SeqIODB
> handle = open("ls_orchid.gbk")
> for seq_record in SeqIODB.parse(handle, "genbank") :
>    print seq_record.id
>
> But in the background SeqIODB would be creating an SQLite memory
> database and loading ls_orchid into it; its __iter__ function would
> simply spit out DBSeqRecords for each of the bioentries...

Is this idea just a shortcut for explicitly loading the GenBank
file into a BioSQL database (which hopefully will include an
SQLite backend option soon), and then iterating over its
records? e.g.

from Bio import SeqIO
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(...)
db = server["Orchids"]
db.load(SeqIO.parse(open("ls_orchid.gbk"), "genbank"))
server.commit()
#Now retrieve the records one by one...

It would make sense to define __iter__ on the BioSQL
database object (i.e. the BioSeqDatabase class) to
allow iteration on ALL the records in the database
(as DBSeqRecord objects). That should be a nice
simple enhancement, allowing:

for seq_record in db :
   print seq_record.id

[And again, this has no SQLAlchemy dependence]
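
As a sketch (written as a helper function here rather than the real
__iter__ method, and leaning on the existing get_all_primary_ids()
method and DBSeqRecord class), it might boil down to something like:

from BioSQL import BioSeq

def iter_records(db):
    """Yield every record in a BioSeqDatabase as a DBSeqRecord."""
    for primary_id in db.get_all_primary_ids():
        yield BioSeq.DBSeqRecord(db.adaptor, primary_id)

Hooking that up as BioSeqDatabase.__iter__ would then give exactly the
loop shown above.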

Peter



