[Bioperl-l] truncating a sequence and remapping annotations

Thu Aug 27 20:00:24 UTC 2009

So when I did this for the retraining of AUGUSTUS I loaded all my gene  
models in Bio::DB::GFF as GFF3 and then just extracted each locus I  
needed +/- some surrounding sequence context  and wrote it out as  
genbank file.  There might have been one or two problems collapsing  
the features back into Genbank's concept of a CDS as a single-feature  
rather than individual, but I just make a split-location and added the  
sub-pieces to it.  It was only a few lines of code to do it right -  
the flatten/unflatten being one of the most annoying parts maybe we  
could work out to streamline.

-jason
On Aug 27, 2009, at 12:23 PM, Joshua Orvis wrote:

> I should weigh in here since I am the above-mentioned 'user' who  
> posed the
> question in #bioperl.
>
> To clarify, to train one particular gene finder I need to take a full
> genbank file with annotation for a whole genome and create separate  
> gbk
> records, one for each gene.  Each record will then contain the gene,  
> exon
> coordinates for the CDS and sequence for the gene.
>
> I can iterate through the features of the full record and do the  
> math myself
> for each spliced coordinate, making/writing individual records as I  
> go, but
> thought I would see if BioPerl had any mechanism to extract a region  
> of an
> annotated record and treat the starting base of that extraction as  
> position
> 1, recoordinating all the other features that were present.  Then I  
> could
> just iterate through the features of the whole entry, extracting  
> regions for
> each gene as I see them.
>
> Hopefully this makes sense.
>
> Joshua
>
> On Thu, Aug 27, 2009 at 2:41 PM, Jason Stajich <jason at bioperl.org>  
> wrote:
>
>>
>> Yeah one thought that we batted around at a hackathon many moons  
>> ago had
>> been to use Bio::DB::SeqFeature in a lightweight way under the hood  
>> to
>> represent sequences in layers more rather than the arbitrary data  
>> model that
>> is setup by focusing on handling GenBank records.  A lot of the  
>> architecture
>> development (that is like 10-15 years old now!) was initially just  
>> focused
>> on round-tripping the sequence files. We more recently felt like a  
>> new model
>> was more appropriate.  With the fast SQLite implementation that  
>> Lincoln has
>> put in for DB::SeqFeature we could in theory map every sequence  
>> into a
>> SQLite DB and then have the power of the interface.
>>
>> Some more bells and whistles might be needed but the basic API is  
>> respected
>> AFAIK and it prevents needing to store whole sequences in memory.   
>> The
>> SeqIO->DB::SeqFeature loading would need some finessing so that as  
>> parsed
>> the sequence object could be updated efficiently.
>>
>> Actually this might also help reduce the number of objects needed  
>> to be
>> created by basically efficiently serializing sequences into the DB on
>> parsing (and with some simple caching this could make for pretty fast
>> system).  Since disk is basically not a limitation now could be an
>> interesting experiment?  Maybe it is too out there, but if not it  
>> could be
>> something major enough that it has to go in a bioperl-2/bioperl- 
>> ng.   It
>> sort of assumes the data model of Bio::DB::SeqFeature is adequate  
>> for all
>> the messiness of sequence data formats and one problem for some  
>> people has
>> been the seq file format => GFF in order to load it into a  
>> SeqFeature DB for
>> Gbrowse... So I don't know what are the boundary cases here.   
>> Certainly for
>> FASTA it should be straightforward.
>>
>> -jason
>>
>> On Aug 27, 2009, at 11:20 AM, Chris Fields wrote:
>>
>> It's not implemented completely.  As Jason mentioned in the bug  
>> report, it
>>> was meant to be part of an overall system to truncate sequences with
>>> remapped features, but the implementation in place is  
>>> substandard.  It's
>>> open for implementation if anyone wants to take it up.
>>>
>>> I should point out, though, in my opinion Bio::DB::GFF/SeqFeature  
>>> deal
>>> with this in a more elegant and lightweight way, and is probably the
>>> direction I would take.  YMMV.
>>>
>>> chris
>>>
>>> On Aug 27, 2009, at 12:40 PM, Robert Buels wrote:
>>>
>>> Looks like bug 1572 is related to this:
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=1572
>>>>
>>>> Rob
>>>>
>>>> Robert Buels wrote:
>>>>
>>>>> Hi all,
>>>>> Recently a user came into #bioperl looking to truncate an  
>>>>> annotated
>>>>> sequence (leaving the region between e.g. 150 to 250 nt), and  
>>>>> have the
>>>>> annotations from the original sequence be remapped onto the new  
>>>>> truncated
>>>>> sequence.
>>>>> Poking through code, I came across an undocumented function  
>>>>> trunc() that
>>>>> from the comments looks like it was written by Jason as part of  
>>>>> a master
>>>>> plan to implement this very functionality.
>>>>> Just wondering, what's the status of that?
>>>>> Rob
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Buels
>>>> Bioinformatics Analyst, Sol Genomics Network
>>>> Boyce Thompson Institute for Plant Research
>>>> Tower Rd
>>>> Ithaca, NY  14853
>>>> Tel: 503-889-8539
>>>> rmb32 at cornell.edu
>>>> http://www.sgn.cornell.edu
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>

--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org