[Biopython] "raw" genbank locations?

Peter Cock p.j.a.cock at googlemail.com
Thu Mar 10 08:57:20 UTC 2011


On Thu, Mar 10, 2011 at 2:05 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Cedar;
> Glad to hear Biopython has been helping out with your work.
>
>> I'm trying to get the "raw" genbank locations for a downstream
>> application after parsing a genbank file. Is there any way to get at
>> this (or reproduce it)? As it is, the SeqRecord feature has start and
>> stop information for the whole feature, and a list of sub-features
>> each with it's own start and stops. I'm looking for one concise text
>> string the describes the entire feature location, much like the
>> original raw genbank locations do.
>
> You can do this with the GenBank RecordParser, which doesn't parse
> the location strings:
>
>>>> from Bio.GenBank import RecordParser
>>>> parser = RecordParser()
>>>> handle = open("NT_019265.gb")
>>>> rec = parser.parse(handle)
>>>> for f in rec.features:
> ...     print f.location
> ...
> <cut>
>
> If you have SeqRecord objects from SeqIO you can do this in a ugly
> way by reaching into the internals of the GenBank writer:
>
>>>> from Bio import SeqIO
>>>> from Bio.SeqIO import InsdcIO
>>>> handle = open("NT_019265.gb")
>>>> for rec in SeqIO.parse(handle, "genbank"):
> ...     for f in rec.features:
> ...         print InsdcIO._insdc_feature_location_string(f, len(rec.seq))
> ...
> <cut>
>
> That might work for a quick hack but is not necessarily future proof
> is the internal change. Peter, do you think this would be useful to
> expose as a function of a SeqFeature directly, so you could do
> feature.insdc_string() or something similar?

A couple of people have asked for this, and since adding SeqIO
output in GenBank/EMBL format (the code you refer to in InsdcIO)
this would be very possible... the issue holding me back is the
annoying special case(s) requiring to know the parent sequence's
length. The problem is that currently the SeqFeature doesn't
have this information - it doesn't have any link back to a parent
SeqRecord (and indeed it doesn't even have to be created in
the context of a SeqRecord).

Perhaps we can handle the case of between features N^1 on
circular sequences of length N differently, maybe with a dedicated
SeqFeature location class which would tell us it was at the origin?
Then we'd be able to avoid the need to know the parent length.

Once that is resolved, an orphan SeqFeature could generate its
own INSDC (GenBank/EMBL) location string without needing any
extra information, and exposing this as an object method would
be fine.

Peter

P.S. If we ever add a CircularSeq object - see other thread- then
SeqFeature locations spanning the origin might need reworking
too.




More information about the Biopython mailing list