[Biopython] SeqIO feature.location.start and end for genes spanning origin

Richard Llewellyn llewelr at gmail.com
Thu May 8 17:47:50 UTC 2014


>
> > What numbers are you hoping to get out of this location?


Great question.  I can see that having 0,end is useful as a flag for origin
spanning.  However, it is also the least informative, as neither 0 nor the
end is an actual location where the gene starts or ends.  My code would have
expected the start and end to be the sequence locations (so start >> end),
and it would have marked this as a special case of origin spanning.  But it
does require special handling.  I currently use negative numbers for the
start in this situation, though this has its own problems.
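
A minimal sketch of the start > end convention described above, under some
assumptions: the parts are plain (start, end) pairs in 0-based half-open
coordinates, the function name origin_aware_bounds is hypothetical, and the
example coordinates are illustrative rather than read from the record.  It
only handles the simple prokaryotic case of one part ending at the sequence
length and another starting at 0.

```python
def origin_aware_bounds(parts, seq_len):
    """Return (start, end, wraps) for a feature on a circular molecule.

    parts: iterable of (start, end) pairs, 0-based half-open.  With Biopython
    these could be built (assumption: a version where locations expose
    .parts) as:
        [(int(p.start), int(p.end)) for p in feature.location.parts]
    When the feature wraps the origin, start > end in the returned pair.
    """
    parts = list(parts)
    # Parts that run up to the end of the molecule (but do not cover all of it).
    ends_at_length = [p for p in parts if p[1] == seq_len and p[0] != 0]
    # Parts that begin at the origin (but do not cover the whole molecule).
    starts_at_zero = [p for p in parts if p[0] == 0 and p[1] != seq_len]
    if len(parts) > 1 and ends_at_length and starts_at_zero:
        start = min(s for s, _ in ends_at_length)
        end = max(e for _, e in starts_at_zero)
        return start, end, True
    # No origin wrap detected: fall back to the overall min/max.
    return min(s for s, _ in parts), max(e for _, e in parts), False


# Illustrative two-part join on a 490885 bp circular chromosome.
start, end, wraps = origin_aware_bounds([(490882, 490885), (0, 879)], 490885)
# The negative-start variant is then just start - seq_len when wraps is True.
negative_start = start - 490885 if wraps else start
```

Returning the wrap flag separately avoids relying on start > end alone, and
the negative-start form can be derived from it when needed.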

In the code I'm currently debugging I treat DNA molecules as partial order
graphs in order to deal with overlapping or mutually-exclusive genes [or
gene predictions ;-)  ].  The graphs are cyclic for circular molecules.
The graphs are responsible for generating distances between genes, so each
node in the graph (a gene) stores the entire molecule length if circular.
The definition of 'distance' here is also an issue -- typically I want the
shortest intergenic distance.
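
One way to read 'shortest intergenic distance' on a circular molecule can be
sketched as below.  This is only an illustration, not the graph code
mentioned above: the function name circular_gap is hypothetical, genes are
assumed to be (start, end) pairs in 0-based half-open coordinates with
start > end allowed for an origin-spanning gene, and the two genes are
assumed not to overlap.

```python
def circular_gap(a, b, seq_len):
    """Shortest gap in bases between two non-overlapping genes on a circle.

    a, b: (start, end) pairs, 0-based half-open; start > end is allowed for
    a gene that wraps the origin.  seq_len is the molecule length.
    """
    a_start, a_end = a
    b_start, b_end = b
    # Walk forward from the end of a to the start of b, wrapping if needed.
    a_to_b = (b_start - a_end) % seq_len
    # Walk forward from the end of b to the start of a, wrapping if needed.
    b_to_a = (a_start - b_end) % seq_len
    return min(a_to_b, b_to_a)


# Two ordinary genes on a 100 bp circle: the short way round crosses the origin.
gap_plain = circular_gap((10, 50), (80, 95), 100)
# An origin-spanning gene (start > end) next to an ordinary one.
gap_wrapped = circular_gap((95, 5), (20, 40), 100)
```

The modulo arithmetic is what lets the same formula cover both the ordinary
case and the origin-spanning case; overlapping genes would need separate
handling.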






On Thu, May 8, 2014 at 11:19 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, May 8, 2014 at 5:02 PM, Richard Llewellyn <llewelr at gmail.com>
> wrote:
> > I was surprised to find that the start and end of a SeqIO
> > record.feature.location for a gene spanning the origin was
> > given as 0 and the end the length of the circular chromosome
> > (see below).
> >
> > I know it is difficult to deal with features spanning the origin, and
> > imagine that there are issues if the start location is given as greater
> > than the end.
>
> Yes. The Biopython model has start <= end, regardless of
> strand - like GFF etc.
>
> > I wonder if you have a suggested work around.
> >
> > Off the top of my head, I could test whether the feature.location is of
> > type CompoundLocation, and if so, determine whether it spans the origin
> > (for instance, test if the end of one location is chromosome length and
> > the start of another is 0), and then take the minimum of the former and
> > the max of the latter.  Since I am currently working with prokaryotic
> > sequence this would just add the type test to each parse, a relatively
> > small overhead.
>
> You could (in theory) have some (trans) splicing going on, but in
> most origin-wrapping cases, yes, you have a 0/length join point.
>
> It depends what the goal of your code is - if just to get the
> sequence described, the extract method does all the hard
> work. But generally you are going to have to special case
> features wrapping the origin - however the parser/object
> model handles it.
>
> What numbers are you hoping to get out of this location?
>
> > Thanks for the great work.
> >
> > #####################################
> >
> >
> > I ran into this problem with Nanoarchaeum equitans Kin4-M,
> > http://www.ncbi.nlm.nih.gov/nuccore/38349555,
> >
> > where parsing the first CDS, location.start is 0 and location.end is
> > 490885.
> >
> > ...
>
> This is one of my favourite test cases for features wrapping
> the origin :)
>
> Peter
>


