[Bioperl-l] truncating a sequence and remapping annotations

Thu Aug 27 19:23:05 UTC 2009

I should weigh in here since I am the above-mentioned 'user' who posed the
question in #bioperl.

To clarify: to train one particular gene finder, I need to take a full
GenBank file with annotation for a whole genome and create separate GenBank
(gbk) records, one for each gene.  Each record would then contain the gene,
the exon coordinates for the CDS, and the sequence for the gene.

I can iterate through the features of the full record and do the math myself
for each spliced coordinate, making and writing individual records as I go,
but I thought I would see if BioPerl had any mechanism to extract a region of
an annotated record and treat the starting base of that extraction as
position 1, remapping the coordinates of all the other features that were
present.  Then I could just iterate through the features of the whole entry,
extracting a region for each gene as I see it.
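
For concreteness, the "do the math myself" version can be sketched as below.
This is only an illustration of the coordinate arithmetic, written in Python
rather than BioPerl, and it is not an existing BioPerl mechanism.  The names
Feature, extract_region and per_gene_records are hypothetical, coordinates
are assumed to be 1-based and inclusive as in GenBank, and dropping features
that extend outside the extracted region is an assumption (clipping them
would be another reasonable choice).

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Feature:
    """Hypothetical stand-in for an annotated feature (not a BioPerl class)."""
    kind: str                         # e.g. "gene" or "CDS"
    segments: List[Tuple[int, int]]   # 1-based inclusive (start, end) pairs, e.g. exons
    strand: int = 1


def extract_region(seq: str, features: List[Feature],
                   start: int, end: int) -> Tuple[str, List[Feature]]:
    """Cut out start..end (1-based, inclusive) and renumber so start becomes 1."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region lies outside the sequence")
    offset = start - 1
    subseq = seq[offset:end]
    remapped = []
    for f in features:
        # Assumption: keep only features lying entirely inside the region.
        if all(start <= s and e <= end for s, e in f.segments):
            shifted = [(s - offset, e - offset) for s, e in f.segments]
            remapped.append(Feature(f.kind, shifted, f.strand))
    return subseq, remapped


def per_gene_records(seq: str,
                     features: List[Feature]) -> Iterator[Tuple[str, List[Feature]]]:
    """Yield one (subsequence, remapped features) pair per 'gene' feature."""
    for gene in features:
        if gene.kind == "gene":
            lo = min(s for s, _ in gene.segments)
            hi = max(e for _, e in gene.segments)
            yield extract_region(seq, features, lo, hi)
```

With this sketch, a gene at 150..250 whose CDS is join(150..180,200..250)
comes out as a 101-base record with the gene at 1..101 and the CDS at
join(1..31,51..101).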

Hopefully this makes sense.

Joshua

On Thu, Aug 27, 2009 at 2:41 PM, Jason Stajich <jason at bioperl.org> wrote:

>
> Yeah, one thought that we batted around at a hackathon many moons ago was
> to use Bio::DB::SeqFeature in a lightweight way under the hood to
> represent sequences in layers, rather than the arbitrary data model that
> was set up by focusing on handling GenBank records.  A lot of the architecture
> development (that is like 10-15 years old now!) was initially just focused
> on round-tripping the sequence files.  More recently we felt that a new model
> was more appropriate.  With the fast SQLite implementation that Lincoln has
> put in for DB::SeqFeature we could in theory map every sequence into a
> SQLite DB and then have the power of the interface.
>
> Some more bells and whistles might be needed, but the basic API is respected
> AFAIK and it avoids needing to store whole sequences in memory.  The
> SeqIO->DB::SeqFeature loading would need some finessing so that the
> sequence object could be updated efficiently as it is parsed.
>
> Actually this might also help reduce the number of objects that need to be
> created, by efficiently serializing sequences into the DB on
> parsing (and with some simple caching this could make for a pretty fast
> system).  Since disk is basically not a limitation now, this could be an
> interesting experiment.  Maybe it is too out there, but if not it could be
> something major enough that it has to go in a bioperl-2/bioperl-ng.  It
> sort of assumes the data model of Bio::DB::SeqFeature is adequate for all
> the messiness of sequence data formats, and one problem for some people has
> been the seq file format => GFF conversion needed to load data into a
> SeqFeature DB for GBrowse...  So I don't know what the boundary cases are
> here.  Certainly for FASTA it should be straightforward.
>
> -jason
>
> On Aug 27, 2009, at 11:20 AM, Chris Fields wrote:
>
>> It's not implemented completely.  As Jason mentioned in the bug report, it
>> was meant to be part of an overall system to truncate sequences with
>> remapped features, but the implementation in place is substandard.  It's
>> open for implementation if anyone wants to take it up.
>>
>> I should point out, though, that in my opinion Bio::DB::GFF/SeqFeature deal
>> with this in a more elegant and lightweight way, and that is probably the
>> direction I would take.  YMMV.
>>
>> chris
>>
>> On Aug 27, 2009, at 12:40 PM, Robert Buels wrote:
>>
>>> Looks like bug 1572 is related to this:
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=1572
>>>
>>> Rob
>>>
>>> Robert Buels wrote:
>>>
>>>> Hi all,
>>>> Recently a user came into #bioperl looking to truncate an annotated
>>>> sequence (keeping, e.g., the region from nt 150 to 250), and have the
>>>> annotations from the original sequence be remapped onto the new truncated
>>>> sequence.
>>>> Poking through the code, I came across an undocumented function, trunc(),
>>>> that from its comments looks like it was written by Jason as part of a
>>>> master plan to implement this very functionality.
>>>> Just wondering, what's the status of that?
>>>> Rob
>>>>
>>>
>>>
>>> --
>>> Robert Buels
>>> Bioinformatics Analyst, Sol Genomics Network
>>> Boyce Thompson Institute for Plant Research
>>> Tower Rd
>>> Ithaca, NY  14853
>>> Tel: 503-889-8539
>>> rmb32 at cornell.edu
>>> http://www.sgn.cornell.edu
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>
> --
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>