[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Wed Sep 5 20:34:19 EDT 2012

On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>> I agree that an "upgraded" FeatureLocation could be more
>> elegant.
>
> It could turn out to be simpler having just one location object...
> certainly worth trying out before committing this branch as is.

Such a new  "upgraded" FeatureLocation would need to hold
a list/tuple of its parts (rather like the proposed CompoundLocation),
and those could be simply as tuples of start, end, strand, db_ref
etc (essentially everything currently held in a FeatureLocation).

I'm not sure that that is any better than the new class
CompoundLocation holding a list of existing FeatureLocation
objects.

On the bright side, the branch still works nicely with the
extra BioSQL tests I added.

One of the issues worth a bit more discussion is the start
and end values of the CompoundLocation - which I am
considering making act as the left/minimum and right/
maximum boundary of the region spanned by the parts.

For normal forward strand features this does give the
biological start and end, likewise for reverse strand
features but inverted (location's start gives the biological
end). i.e. for *most* features this means no change to
the current behaviour.

My proposal would mean that for a feature spanning
the origin on a circular genome of length N, the start
would be 0 and the end N.

Similarly for weird cases from trans-splicing, the
start/end coordinates would give the total region
spanned. As shown below, sometimes that happens
to match the current behaviour, but in other cases
the current behaviour isn't useful anyway.

Adopting start/end as the spanned region makes a
lot of sense for things like drawing features in a
region of interest, or other more abstract tasks
doing feature/region intersection. Here knowing the
min/max boundaries of the region spanned is more
useful than any attempt to capture the biological
start/end of the feature.

Note that already for the simple FeatureLocation for
reverse strand features we have start < end, i.e. the
start coordinate property does NOT represent the
biological starting point.

Under the proposed CompoundLocation behaviour,
the desirable property of the FeatureLocation that
start < end would also hold for compound locations.

Pathological examples at the end,

Regards,

Peter

P.S.  One of the advantages of the CompoundLocation
is when constructing the location you don't give the
overall start/end - there are inferred from the list of parts
automatically. Currently the GenBank/EMBL parser
is having to do this.

P.P.S. I've also confirmed Lenna's testing that sum
of feature locations works if we define integer
addition with locations (so that sum can include
zero and several locations), see:

https://github.com/peterjc/biopython/commit/dc6bc658141cc42e7e6802bbe8baf6c87a6874c0

-----------------------------------------------------------------
Trans-splicing: Mixed Strands

An example where the range/span idea is simpler is
mixed strand features like this trans-spliced example
from NC_000932 (in our unit tests),

join(complement(69611..69724),139856..140650)

What would you expect as the start/end here? The
biological start is base 69724 (one based) and the
last base is 140650. Currently:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "gb").features[135]
>>> print f.location
[69610:140650]
>>> f.location.start
ExactPosition(69610)
>>> f.location.end
ExactPosition(140650)
>>> for sub in f.sub_features: print sub.location
...
[69610:69724](-)
[139855:140650](+)

Here the end value does match the last base in the
feature following the biological order - the start value
is actually a base in the middle of the combined
sequence. In fact, for this example the start/end
are already acting like the range/span idea.

-----------------------------------------------------------------
Trans-splicing: Reverse strand

The example above is a real corner case, and so is this
single strand trans-splcing example, also in NC_000932,
which is a bit like an circular genome origin spanning
annotation:

complement(join(97999..98793,69611..69724))

With the current master branch:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "genbank").features[1]
>>> print f.location
[97998:69724](-)
>>> f.location.start
ExactPosition(97998)
>>> f.location.end
ExactPosition(69724)
>>> for sub in f.sub_features: print sub.location
...
[97998:98793](-)
[69610:69724](-)

Notice that we do not have start < end as you might
expect. However the start and end DO capture the
biological end and start (order inverted - this is on
the reverse strand). To verify this I find it helps to
transform the GenBank style location:

complement(join(97999..98793,69611..69724))

into the old EMBL equivalent:

join(complement(69611..69724),complement(97999..98793))

i.e. The first base is 69724 (one based counting), and
the last base is 97999 (one based counting). So if
you wanted to look at the upstream or downstream
(assuming that makes sense for a trans-spliced
gene), the current start/end values are useful (but
you have to choose start vs end dependent on the
strand).

On the other hand, the range of co-ordindate values
is 69611 to 98793 (one based, inclusive). Therefore
one might expect start 69610 and end 98793 (Python
counting), giving the spanned region.