[Biopython] SeqIO feature.location.start and end for genes spanning origin
Peter Cock
p.j.a.cock at googlemail.com
Thu May 8 18:25:48 UTC 2014
On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
>> > What numbers are you hoping to get out of this location?
>
> Great question. I can see that having 0,end is useful as a flag for origin
> spanning. However, it is also the least informative, as neither 0 or the
> end are actual locations of the gene starting/ending.
Your view "least informative" is subjective. They are end points of
the "fake" exons used to describe the feature, and more importantly
represent the min/max of the region spanned (very useful for drawing
features or considering intersections etc).
> My code would have
> expected the start and end to be the sequence locations (so start >> end),
> and it would have marked this as a special case of origin spanning. But it
> does require special handling. I currently use negative numbers for the
> start in this situation, though this has its own problems.
You mean the biological start and end? You didn't answer my
question but I am guessing you wanted start 879 (adjusted for
Python counting?) and end 490883 given this location string:
complement(join(490883..490885,1..879))
That would be possible for any stranded feature, although it
is less well defined for strand-less features (which in GenBank
and EMBL are by default forward strand).
Anyway, using NC_005213 as the example:
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
Sample code:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005213.gbk", "genbank")
>>> feature = record.features[4] # the CDS record spanning origin
>>> print(feature.location)
join{[0:879](-), [490882:490885](-)}
>>> print(feature.location.start)
0
>>> print(feature.location.end)
490885
Perhaps there is a case for "biological" start and end properties
(calculated from the current data structure on demand)? i.e.
something like this:
>>> feature.location.parts[0].end if
feature.location.parts[0].strand == -1 else
feature.location.parts[0].start
ExactPosition(879)
>>> feature.location.parts[-1].start if
feature.location.parts[-1].strand == -1 else
feature.location.parts[-1].end
ExactPosition(490882)
Note as a convenience, even basic non-compound locations have
a parts property - returning a list of one entry, themselves. So that
code should work in general :)
The potential enhancement would be to define these are new
properties, feature.location.bio_start and feature.location.bio_end
or similar naming? Would that be useful? What would name them?
> In the code I'm currently debugging I treat DNA molecules as partial order
> graphs in order to deal with overlapping or mutually-exclusive genes [or
> gene predictions ;-) ]. The graphs are cyclical for circular molecules.
> The graphs are responsible for generating distances between genes, so each
> node in the graph (a gene) contains the entire molecule length if circular.
> The definition of 'distance' here is also an issue -- typically I want the
> shortest intergenic distance.
Have you found any mixed-strand trans-spliced genes in your
dataset yet? They are really special cases which cause pain!
Peter
More information about the Biopython
mailing list