[Bioperl-l] truncating a sequence and remapping annotations
Chris Fields
cjfields at illinois.edu
Thu Aug 27 20:19:56 UTC 2009
On Aug 27, 2009, at 1:41 PM, Jason Stajich wrote:
> Yeah, one thought that we batted around at a hackathon many moons ago
> had been to use Bio::DB::SeqFeature in a lightweight way under the
> hood to represent sequences in layers, rather than the arbitrary
> data model that is set up by focusing on handling GenBank records. A
> lot of the architecture development (that is like 10-15 years old
> now!) was initially just focused on round-tripping the sequence
> files. We more recently felt like a new model was more appropriate.
> With the fast SQLite implementation that Lincoln has put in for
> DB::SeqFeature we could in theory map every sequence into a SQLite
> DB and then have the power of the interface.
>
> Some more bells and whistles might be needed but the basic API is
> respected AFAIK and it prevents needing to store whole sequences in
> memory. The SeqIO->DB::SeqFeature loading would need some finessing
> so that as parsed the sequence object could be updated efficiently.
Exactly my thought. Probably worth pushing the FeatureHolderI
interface into something like a SeqFeature::Collection. What about
annotation? Maybe add that to the 'source' feature?
That also makes me think Seq needs to implement RangeI (or potentially
be locatable on another sequence). Bio::DB::SeqFeature::Segment already
does.
I'm thinking the old way of doing it (parsing a file) is still
possible, but underneath would be a Bio::Index or similar, and the
returned Bio::Seq would have a backend
Bio::Index/Bio::SeqFeature::Collection database (the latter maybe
being lazily implemented).
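To make the idea concrete, here is a rough sketch of a sequence object whose features live in a SQLite table and are only fetched when asked for. It is written in Python with the standard sqlite3 module purely so that it is self-contained; it is not BioPerl code. The names FeatureStore and LazySeq, the table layout, and the per-range cache are all hypothetical, and the schema is not the one Bio::DB::SeqFeature actually uses.

```python
import sqlite3


class FeatureStore:
    """Hypothetical SQLite-backed feature collection (illustrative schema only)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS feature ("
            " seq_id TEXT, type TEXT, start INTEGER, stop INTEGER, strand INTEGER)"
        )
        # Index on the columns used by range queries.
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_feature_range"
            " ON feature (seq_id, start, stop)"
        )

    def add(self, seq_id, ftype, start, stop, strand=1):
        """Serialize one feature into the database as it is parsed."""
        self.conn.execute(
            "INSERT INTO feature VALUES (?, ?, ?, ?, ?)",
            (seq_id, ftype, start, stop, strand),
        )

    def overlapping(self, seq_id, start, stop):
        """Return features on seq_id that overlap the closed range [start, stop]."""
        cur = self.conn.execute(
            "SELECT type, start, stop, strand FROM feature"
            " WHERE seq_id = ? AND start <= ? AND stop >= ?"
            " ORDER BY start, stop",
            (seq_id, stop, start),
        )
        return cur.fetchall()


class LazySeq:
    """Hypothetical sequence handle: features are not held in memory up front."""

    def __init__(self, seq_id, store):
        self.seq_id = seq_id
        self._store = store
        self._cache = {}  # simple per-range cache of query results

    def features(self, start, stop):
        """Fetch overlapping features from the store on first request, then reuse them."""
        key = (start, stop)
        if key not in self._cache:
            self._cache[key] = self._store.overlapping(self.seq_id, start, stop)
        return self._cache[key]


# Example usage with made-up coordinates.
store = FeatureStore()
store.add("chr1", "gene", 100, 500)
store.add("chr1", "exon", 100, 200)
store.add("chr1", "exon", 400, 500)
seq = LazySeq("chr1", store)
hits = seq.features(150, 250)
```

The point of the sketch is only the division of labour: the parser writes features to disk once, and the object handed back to the user is a thin handle that queries by range. Cache invalidation after further inserts is deliberately left out.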
> Actually this might also help reduce the number of objects needed to
> be created by basically efficiently serializing sequences into the
> DB on parsing (and with some simple caching this could make for a
> pretty fast system). Since disk is basically not a limitation now,
> this could be an interesting experiment?
Yes.
> Maybe it is too out there, but if not it could be something major
> enough that it has to go in a bioperl-2/bioperl-ng. It sort of
> assumes the data model of Bio::DB::SeqFeature is adequate for all
> the messiness of sequence data formats, and one problem for some
> people has been the seq file format => GFF conversion needed to load
> data into a SeqFeature DB for GBrowse... So I don't know what the
> boundary cases are here. Certainly for FASTA it should be straightforward.
>
> -jason
Well, one could possibly test something like this on a branch, or with
their own Bio::Seq, or in Biome ;>
Just sayin'....
chris