[Bioperl-l] truncating a sequence and remapping annotations

Thu Aug 27 18:41:28 UTC 2009

Yeah, one thought that we batted around at a hackathon many moons ago
was to use Bio::DB::SeqFeature in a lightweight way under the hood to
represent sequences in layers, rather than the arbitrary data model
that was set up by focusing on handling GenBank records.  A lot of the
architecture development (that is like 10-15 years old now!) was
initially just focused on round-tripping the sequence files.  More
recently we felt that a new model was more appropriate.  With the fast
SQLite implementation that Lincoln has put in for DB::SeqFeature, we
could in theory map every sequence into a SQLite DB and then have the
power of that interface.
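
To make the idea concrete, here is a rough sketch of what "features in a
SQLite DB, queried by range" could look like.  It is written in Python with
the standard sqlite3 module purely for illustration; the table layout, the
column names (fstart, fstop, ftype) and both function names are made up for
this example and are not the Bio::DB::SeqFeature schema or API.

```python
import sqlite3

def build_feature_db(features):
    """Load (seq_id, fstart, fstop, strand, ftype) tuples into an
    in-memory SQLite DB.  Hypothetical schema, for illustration only."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE feature ("
        " id INTEGER PRIMARY KEY, seq_id TEXT,"
        " fstart INTEGER, fstop INTEGER, strand INTEGER, ftype TEXT)"
    )
    db.executemany(
        "INSERT INTO feature (seq_id, fstart, fstop, strand, ftype)"
        " VALUES (?, ?, ?, ?, ?)",
        features,
    )
    # An index on the coordinates keeps range lookups cheap.
    db.execute("CREATE INDEX feature_range ON feature (seq_id, fstart, fstop)")
    return db

def features_in_range(db, seq_id, start, stop):
    """Return features overlapping [start, stop] (1-based, inclusive)."""
    rows = db.execute(
        "SELECT fstart, fstop, strand, ftype FROM feature"
        " WHERE seq_id = ? AND fstart <= ? AND fstop >= ?"
        " ORDER BY fstart, fstop",
        (seq_id, stop, start),
    )
    return rows.fetchall()
```

The point is only that the features live on disk (or in a DB handle) and
are pulled out per region, instead of every feature object being held in
memory alongside the sequence.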

Some more bells and whistles might be needed, but the basic API is
respected AFAIK, and it avoids having to store whole sequences in
memory.  The SeqIO->DB::SeqFeature loading would need some finessing
so that the sequence object could be updated efficiently as it is parsed.

Actually, this might also help reduce the number of objects that need to
be created, by efficiently serializing sequences into the DB on parsing
(and with some simple caching this could make for a pretty fast
system).  Since disk is basically not a limitation now, it could be an
interesting experiment.  Maybe it is too out there, but if not, it
could be something major enough that it has to go in a
bioperl-2/bioperl-ng.  It sort of assumes the data model of
Bio::DB::SeqFeature is adequate for all the messiness of sequence data
formats, and one problem for some people has been the seq file format
=> GFF conversion needed to load data into a SeqFeature DB for
GBrowse... So I don't know what the boundary cases are here.  Certainly
for FASTA it should be straightforward.
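
For the original question in this thread (cut out a region and carry the
annotations along), the coordinate arithmetic itself is simple.  Below is a
minimal sketch in Python, assuming 1-based inclusive coordinates and
features given as plain (start, stop, strand, type) tuples.  The function
name and the choice to clip features that only partly overlap the region
are assumptions made for the example; this is not how trunc() or
Bio::DB::SeqFeature is implemented.

```python
def truncate_with_features(seq, features, start, stop):
    """Return (subsequence, remapped_features) for the region
    [start, stop], 1-based and inclusive.

    Features entirely outside the region are dropped; features that
    overlap an edge are clipped to it (one possible policy among several).
    """
    if not (1 <= start <= stop <= len(seq)):
        raise ValueError("region must lie within the sequence")
    sub = seq[start - 1:stop]
    offset = start - 1  # amount subtracted from every old coordinate
    remapped = []
    for fstart, fstop, strand, ftype in features:
        if fstop < start or fstart > stop:
            continue  # no overlap with the kept region
        new_start = max(fstart, start) - offset
        new_stop = min(fstop, stop) - offset
        remapped.append((new_start, new_stop, strand, ftype))
    return sub, remapped
```

With the 150 to 250 nt example from Rob's mail, a feature at 160..180 ends
up at 11..31 on the truncated sequence, and a feature at 100..200 is
clipped to 1..51.  Whether clipped features should instead be dropped or
flagged as partial is exactly the kind of boundary case that would need
deciding.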

-jason
On Aug 27, 2009, at 11:20 AM, Chris Fields wrote:

> It's not implemented completely.  As Jason mentioned in the bug  
> report, it was meant to be part of an overall system to truncate  
> sequences with remapped features, but the implementation in place is  
> substandard.  It's open for implementation if anyone wants to take  
> it up.
>
> I should point out, though, that in my opinion Bio::DB::GFF/SeqFeature
> deal with this in a more elegant and lightweight way, and that is
> probably the direction I would take.  YMMV.
>
> chris
>
> On Aug 27, 2009, at 12:40 PM, Robert Buels wrote:
>
>> Looks like bug 1572 is related to this: http://bugzilla.open-bio.org/show_bug.cgi?id=1572
>>
>> Rob
>>
>> Robert Buels wrote:
>>> Hi all,
>>> Recently a user came into #bioperl looking to truncate an
>>> annotated sequence (keeping only a region, e.g. from 150 to 250
>>> nt), and have the annotations from the original sequence be
>>> remapped onto the new truncated sequence.
>>> Poking through code, I came across an undocumented function  
>>> trunc() that from the comments looks like it was written by Jason  
>>> as part of a master plan to implement this very functionality.
>>> Just wondering, what's the status of that?
>>> Rob
>>
>>
>> -- 
>> Robert Buels
>> Bioinformatics Analyst, Sol Genomics Network
>> Boyce Thompson Institute for Plant Research
>> Tower Rd
>> Ithaca, NY  14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org