[Biopython] SeqIO feature.location.start and end for genes spanning origin

Peter Cock p.j.a.cock at googlemail.com
Thu May 8 18:25:48 UTC 2014


On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
>> > What numbers are you hoping to get out of this location?
>
> Great question.  I can see that having 0,end is useful as a flag for origin
> spanning.  However, it is also the least informative, as neither 0 or the
> end are actual locations of the gene starting/ending.

Your view "least informative" is subjective. They are end points of
the "fake" exons used to describe the feature, and more importantly
represent the min/max of the region spanned (very useful for drawing
features or considering intersections etc).

> My code would have
> expected the start and end to be the sequence locations (so start >> end),
> and it would have marked this as a special case of origin spanning.  But it
> does require special handling.  I currently use negative numbers for the
> start in this situation, though this has its own problems.

You mean the biological start and end? You didn't answer my
question but I am guessing you wanted start 879 (adjusted for
Python counting?) and end 490883 given this location string:

complement(join(490883..490885,1..879))

That would be possible for any stranded feature, although it
is less well defined for strand-less features (which in GenBank
and EMBL are by default forward strand).

Anyway, using NC_005213 as the example:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk

Sample code:

  >>> from Bio import SeqIO
  >>> record = SeqIO.read("NC_005213.gbk", "genbank")
  >>> feature = record.features[4] # the CDS record spanning origin
  >>> print(feature.location)
  join{[0:879](-), [490882:490885](-)}
  >>> print(feature.location.start)
  0
  >>> print(feature.location.end)
  490885

Perhaps there is a case for "biological" start and end properties
(calculated from the current data structure on demand)? i.e.
something like this:

  >>> feature.location.parts[0].end if
feature.location.parts[0].strand == -1 else
feature.location.parts[0].start
  ExactPosition(879)

  >>> feature.location.parts[-1].start if
feature.location.parts[-1].strand == -1 else
feature.location.parts[-1].end
  ExactPosition(490882)

Note as a convenience, even basic non-compound locations have
a parts property - returning a list of one entry, themselves. So that
code should work in general :)

The potential enhancement would be to define these are new
properties, feature.location.bio_start and feature.location.bio_end
or similar naming? Would that be useful? What would name them?

> In the code I'm currently debugging I treat DNA molecules as partial order
> graphs in order to deal with overlapping or mutually-exclusive genes [or
> gene predictions ;-)  ].  The graphs are cyclical for circular molecules.
> The graphs are responsible for generating distances between genes, so each
> node in the graph (a gene) contains the entire molecule length if circular.
> The definition of 'distance' here is also an issue -- typically I want the
> shortest intergenic distance.

Have you found any mixed-strand trans-spliced genes in your
dataset yet? They are really special cases which cause pain!

Peter



More information about the Biopython mailing list