[Bioperl-l] bioperl based database infrastructure for directed graphs

Wed Jan 9 18:12:55 UTC 2008

cc'ing the gbrowse list in case Lincoln hasn't seen this.

I believe Bio::DB::SeqFeature::Store was primarily intended as a more
GFF3-compatible replacement for Bio::DB::GFF (unlimited feature
nesting, support for any SeqFeatureI, etc.), streamlined for faster
lookups by GBrowse. I don't think adding tables would affect
performance dramatically, though maybe Lincoln would have a better idea.
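
To make the proposal concrete, here is a minimal sketch of the two
tables Robson mentions, feature_relationship (edges) and feature_path
(transitive closure). It is only an illustration under stated
assumptions: it uses Python with an in-memory SQLite database rather
than the MySQL backend, the column names (subject_id, object_id,
distance) and the helpers build_closure and descendants are
hypothetical, and it is not part of Bio::DB::SeqFeature::Store. It also
assumes the graph is a DAG; a cyclic graph would make the recursive
query run forever.

```python
import sqlite3


def build_closure(edges):
    """Load (subject_id, object_id) edges and materialise their closure.

    Table names follow the proposal; the columns are assumptions.
    The edges must form a DAG.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE feature_relationship (
            subject_id INTEGER NOT NULL,
            object_id  INTEGER NOT NULL,
            PRIMARY KEY (subject_id, object_id)
        );
        CREATE TABLE feature_path (
            subject_id INTEGER NOT NULL,
            object_id  INTEGER NOT NULL,
            distance   INTEGER NOT NULL,
            PRIMARY KEY (subject_id, object_id)
        );
    """)
    db.executemany("INSERT INTO feature_relationship VALUES (?, ?)", edges)
    # Walk every edge outwards, then keep the shortest distance per pair.
    db.execute("""
        WITH RECURSIVE walk(subject_id, object_id, distance) AS (
            SELECT subject_id, object_id, 1 FROM feature_relationship
            UNION
            SELECT w.subject_id, r.object_id, w.distance + 1
            FROM walk AS w
            JOIN feature_relationship AS r ON r.subject_id = w.object_id
        )
        INSERT INTO feature_path (subject_id, object_id, distance)
        SELECT subject_id, object_id, MIN(distance)
        FROM walk
        GROUP BY subject_id, object_id
    """)
    db.commit()
    return db


def descendants(db, feature_id):
    """Return every feature reachable from feature_id, sorted by id."""
    rows = db.execute(
        "SELECT object_id FROM feature_path"
        " WHERE subject_id = ? ORDER BY object_id",
        (feature_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Four exons; the 1 -> 3 edge models skipping exon 2.
    db = build_closure([(1, 2), (2, 3), (3, 4), (1, 3)])
    print(descendants(db, 1))
```

With the closure precomputed, "everything downstream of feature X"
becomes a single indexed lookup on feature_path rather than a
recursive walk at query time. The cost is that feature_path has to be
rebuilt or updated whenever edges change.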

chris

On Jan 9, 2008, at 7:20 AM, Robson Francisco de Souza wrote:

> Hello All!
>
> Greetings to everybody, and happy new year to those following a
> Western calendar!
>
> I'm starting a new project to store and analyze distinct sets of
> sequence annotation data which are related in a way suitable for
> representation in a directed (e.g. transcript splicing) or undirected
> (e.g. gene product interaction) graph. Analysis will require frequent
> queries based on interval overlaps, feature neighbourhood, annotation
> and, most importantly, feature relationships and stored paths.
>
> At first, I thought of building an entirely new database structure to
> store project-specific data (e.g. alternative splicing or protein
> interaction), but as I have some experience with Lincoln's
> Bio::DB::SeqFeature::Store, I'm now considering extending it for the
> purpose of storing graphs describing relationships among features.
>
> I'm aware that some other bioperl related databases, specifically
> BioSQL and Chado, do have components which might be suitable for
> storing all or some of these data but, since Lincoln's feature storage
> and interval binning implementations in
> Bio::DB::SeqFeature::Store::mysql are clean, simple and very fast,
> perhaps extending it in a seemingly modular way is desirable. A good
> extension to Lincoln's database could include tables like
> feature_relationship and feature_path, for edges and transitive
> closures (just like in BioSQL) and feature_stored_path, for exclusion
> of biologically irrelevant paths in DAGs, like certain splicing
> isoforms. These tables could be used  to store sequence assemblies or
> EST alignments efficiently, including scaffolds inferred by connecting
> contigs.
>
> Before starting, I would like to know if the BioSQL and Chado schemata
> do have accelerators for querying intervals among billions of features
> and feature relationships (some examples using these databases would
> also help, if they show that these databases are efficient for such
> tasks). If these or other databases are not as suitable as
> Bio::DB::SeqFeature for feature retrieval based on interval overlap and
> attributes, then I might consider extending Bio::DB::SeqFeature
> and contributing such extensions back to bioperl...
>
> Any thoughts?
>
> Best regards,
> Robson
>
> PS: sorry if anyone gets two copies of this post, but it took me some
> time to realize my new e-mail wasn't subscribed to bioperl-l...

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign