[Bioperl-l] truncating a sequence and remapping annotations

Thu Aug 27 20:19:56 UTC 2009

On Aug 27, 2009, at 1:41 PM, Jason Stajich wrote:

> Yeah one thought that we batted around at a hackathon many moons ago  
> had been to use Bio::DB::SeqFeature in a lightweight way under the  
> hood to represent sequences in layers, rather than the arbitrary
> data model that is set up by focusing on handling GenBank records.  A
> lot of the architecture development (that is like 10-15 years old  
> now!) was initially just focused on round-tripping the sequence  
> files. We more recently felt like a new model was more appropriate.   
> With the fast SQLite implementation that Lincoln has put in for  
> DB::SeqFeature we could in theory map every sequence into a SQLite  
> DB and then have the power of the interface.
>
> Some more bells and whistles might be needed but the basic API is  
> respected AFAIK and it prevents needing to store whole sequences in  
> memory.  The SeqIO->DB::SeqFeature loading would need some finessing
> so that the sequence object could be updated efficiently as it is parsed.

Exactly my thought.  Probably worth pushing the FeatureHolderI  
interface into something like a SeqFeature::Collection.  What about  
annotation?  Maybe add that to the 'source' feature?

Also makes me think Seq needs to be RangeI (or potentially locatable  
to another sequence).  Bio::DB::SF::Segment is.

I'm thinking the old way of doing it (parsing a file) is still
possible, but underneath would be a Bio::Index or similar, and the
returned Bio::Seq would have a backend
Bio::Index/Bio::SeqFeature::Collection database (the latter maybe
being lazily implemented).
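
To make the idea concrete, here is a rough, language-neutral sketch
(in Python with only the standard sqlite3 module, NOT BioPerl code) of
a feature store that opens its SQLite backend lazily and remaps
feature coordinates when a sequence is truncated.  Every name in it
(LazyFeatureStore, add, overlapping, truncate, the table layout) is
made up for illustration and does not correspond to the
Bio::DB::SeqFeature or Bio::SeqFeature::Collection API.  It assumes
1-based, inclusive coordinates and ignores strand and split locations.

```python
import sqlite3


class LazyFeatureStore:
    """Hypothetical SQLite-backed feature store (illustration only)."""

    def __init__(self, path=":memory:"):
        self._path = path
        self._conn = None  # the database is not opened until first use

    @property
    def conn(self):
        # Open the connection and create the table on first access only.
        if self._conn is None:
            self._conn = sqlite3.connect(self._path)
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS feature ("
                "seq_id TEXT, tag TEXT, start INTEGER, stop INTEGER)"
            )
        return self._conn

    def add(self, seq_id, tag, start, stop):
        """Store one feature; coordinates are 1-based and inclusive."""
        self.conn.execute(
            "INSERT INTO feature VALUES (?, ?, ?, ?)",
            (seq_id, tag, start, stop),
        )

    def overlapping(self, seq_id, start, stop):
        """Return (tag, start, stop) rows overlapping [start, stop]."""
        rows = self.conn.execute(
            "SELECT tag, start, stop FROM feature "
            "WHERE seq_id = ? AND start <= ? AND stop >= ? "
            "ORDER BY start",
            (seq_id, stop, start),
        )
        return rows.fetchall()


def truncate(store, seq_id, start, stop):
    """Clip features to [start, stop] and shift them so that the
    truncated sequence starts at position 1."""
    offset = start - 1
    return [
        (tag, max(s, start) - offset, min(e, stop) - offset)
        for tag, s, e in store.overlapping(seq_id, start, stop)
    ]
```

Only the features overlapping the requested range are ever pulled out
of the database, so nothing here needs the whole annotation set in
memory.  Features that straddle a boundary are simply clipped; whether
they should be clipped, dropped, or flagged as partial is a design
decision this sketch does not try to settle.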

> Actually this might also help reduce the number of objects needed to  
> be created by basically efficiently serializing sequences into the  
> DB on parsing (and with some simple caching this could make for a
> pretty fast system).  Since disk is basically not a limitation now,
> this could be an interesting experiment?

Yes.

> Maybe it is too out there, but if not it could be something major  
> enough that it has to go in a bioperl-2/bioperl-ng.   It sort of  
> assumes the data model of Bio::DB::SeqFeature is adequate for all  
> the messiness of sequence data formats, and one problem for some
> people has been the seq file format => GFF conversion needed to load
> it into a SeqFeature DB for Gbrowse... So I don't know what the
> boundary cases are here.  Certainly for FASTA it should be
> straightforward.
>
> -jason

Well, one could possibly test something like this on a branch, or with  
their own Bio::Seq, or in Biome ;>

Just sayin'....

chris