[Bioperl-l] truncating a sequence and remapping annotations
Jason Stajich
jason at bioperl.org
Thu Aug 27 20:00:24 UTC 2009
So when I did this for the retraining of AUGUSTUS I loaded all my gene
models in Bio::DB::GFF as GFF3 and then just extracted each locus I
needed +/- some surrounding sequence context and wrote it out as
genbank file. There might have been one or two problems collapsing
the features back into Genbank's concept of a CDS as a single-feature
rather than individual, but I just make a split-location and added the
sub-pieces to it. It was only a few lines of code to do it right -
the flatten/unflatten being one of the most annoying parts maybe we
could work out to streamline.
-jason
On Aug 27, 2009, at 12:23 PM, Joshua Orvis wrote:
> I should weigh in here since I am the above-mentioned 'user' who
> posed the
> question in #bioperl.
>
> To clarify, to train one particular gene finder I need to take a full
> genbank file with annotation for a whole genome and create separate
> gbk
> records, one for each gene. Each record will then contain the gene,
> exon
> coordinates for the CDS and sequence for the gene.
>
> I can iterate through the features of the full record and do the
> math myself
> for each spliced coordinate, making/writing individual records as I
> go, but
> thought I would see if BioPerl had any mechanism to extract a region
> of an
> annotated record and treat the starting base of that extraction as
> position
> 1, recoordinating all the other features that were present. Then I
> could
> just iterate through the features of the whole entry, extracting
> regions for
> each gene as I see them.
>
> Hopefully this makes sense.
>
> Joshua
>
> On Thu, Aug 27, 2009 at 2:41 PM, Jason Stajich <jason at bioperl.org>
> wrote:
>
>>
>> Yeah one thought that we batted around at a hackathon many moons
>> ago had
>> been to use Bio::DB::SeqFeature in a lightweight way under the hood
>> to
>> represent sequences in layers more rather than the arbitrary data
>> model that
>> is setup by focusing on handling GenBank records. A lot of the
>> architecture
>> development (that is like 10-15 years old now!) was initially just
>> focused
>> on round-tripping the sequence files. We more recently felt like a
>> new model
>> was more appropriate. With the fast SQLite implementation that
>> Lincoln has
>> put in for DB::SeqFeature we could in theory map every sequence
>> into a
>> SQLite DB and then have the power of the interface.
>>
>> Some more bells and whistles might be needed but the basic API is
>> respected
>> AFAIK and it prevents needing to store whole sequences in memory.
>> The
>> SeqIO->DB::SeqFeature loading would need some finessing so that as
>> parsed
>> the sequence object could be updated efficiently.
>>
>> Actually this might also help reduce the number of objects needed
>> to be
>> created by basically efficiently serializing sequences into the DB on
>> parsing (and with some simple caching this could make for pretty fast
>> system). Since disk is basically not a limitation now could be an
>> interesting experiment? Maybe it is too out there, but if not it
>> could be
>> something major enough that it has to go in a bioperl-2/bioperl-
>> ng. It
>> sort of assumes the data model of Bio::DB::SeqFeature is adequate
>> for all
>> the messiness of sequence data formats and one problem for some
>> people has
>> been the seq file format => GFF in order to load it into a
>> SeqFeature DB for
>> Gbrowse... So I don't know what are the boundary cases here.
>> Certainly for
>> FASTA it should be straightforward.
>>
>> -jason
>>
>> On Aug 27, 2009, at 11:20 AM, Chris Fields wrote:
>>
>> It's not implemented completely. As Jason mentioned in the bug
>> report, it
>>> was meant to be part of an overall system to truncate sequences with
>>> remapped features, but the implementation in place is
>>> substandard. It's
>>> open for implementation if anyone wants to take it up.
>>>
>>> I should point out, though, in my opinion Bio::DB::GFF/SeqFeature
>>> deal
>>> with this in a more elegant and lightweight way, and is probably the
>>> direction I would take. YMMV.
>>>
>>> chris
>>>
>>> On Aug 27, 2009, at 12:40 PM, Robert Buels wrote:
>>>
>>> Looks like bug 1572 is related to this:
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=1572
>>>>
>>>> Rob
>>>>
>>>> Robert Buels wrote:
>>>>
>>>>> Hi all,
>>>>> Recently a user came into #bioperl looking to truncate an
>>>>> annotated
>>>>> sequence (leaving the region between e.g. 150 to 250 nt), and
>>>>> have the
>>>>> annotations from the original sequence be remapped onto the new
>>>>> truncated
>>>>> sequence.
>>>>> Poking through code, I came across an undocumented function
>>>>> trunc() that
>>>>> from the comments looks like it was written by Jason as part of
>>>>> a master
>>>>> plan to implement this very functionality.
>>>>> Just wondering, what's the status of that?
>>>>> Rob
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Buels
>>>> Bioinformatics Analyst, Sol Genomics Network
>>>> Boyce Thompson Institute for Plant Research
>>>> Tower Rd
>>>> Ithaca, NY 14853
>>>> Tel: 503-889-8539
>>>> rmb32 at cornell.edu
>>>> http://www.sgn.cornell.edu
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> jason.stajich at gmail.com
>> jason at bioperl.org
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org
More information about the Bioperl-l
mailing list