[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Peter Cock p.j.a.cock at googlemail.com
Tue Jul 24 17:19:31 UTC 2012


On Tue, Jul 24, 2012 at 5:57 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>> This is what I have tried to do on this branch:
>> https://github.com/peterjc/biopython/tree/f_loc4
>>
>> As part of this, adding two FeatureLocations will give a
>> CompoundLocation - similarly you can add a simple
>> FeatureLocation and a CompoundLocation or two
>> CompoundLocation objects. I think this makes creating
>> a SeqFeature describing a Eukaryotic gene model
>> MUCH simpler than with the existing approach.
>>
>> (A potential refinement not implemented yet would be
>> to merge abutting exact locations automatically, so that
>> adding 123..456 and 457..999 would give 123..999
>> instead of join(123..456,457..999), but that might be
>> too much magic?)
>
> Hi Peter,
>
> I have been testing the new CompoundLocation w.r.t. coordinate mapping
> and for the most part, I find it simplifies things.

That's encouraging.

> The documentation suggests using + to combine FeatureLocations, which
> invites the use of sum. However, sum doesn't work properly. I explain
> why in my StackOverflow question:
> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior

Huh, I hadn't anticipated that - but I agree trying to use sum seems
natural.
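
For anyone following along, here is a minimal sketch of the failure
using a toy class (ToyLocation is a hypothetical stand-in, not the
code on the f_loc4 branch). The built-in sum() starts from the
integer 0, so its first step is 0 + location, which fails unless the
class knows how to be added to an int. functools.reduce has no
implicit start value and so sidesteps the problem:

```python
from functools import reduce
import operator


class ToyLocation:
    """Hypothetical stand-in for a location; not the Biopython class."""

    def __init__(self, parts):
        self.parts = list(parts)  # list of (start, end) tuples

    def __add__(self, other):
        # Only location + location is defined here
        if not isinstance(other, ToyLocation):
            return NotImplemented
        return ToyLocation(self.parts + other.parts)


exons = [ToyLocation([(0, 10)]), ToyLocation([(20, 30)]), ToyLocation([(40, 50)])]

try:
    sum(exons)  # first step is 0 + exons[0], which nothing handles
    sum_failed = False
except TypeError:
    sum_failed = True

# reduce() only ever evaluates location + location
combined = reduce(operator.add, exons)
```

Passing an explicit start value, as in sum(exons[1:], exons[0]), is
another way round it, at the cost of special-casing an empty list.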

> I have considered a number of workarounds:
>
> 1. Implementing __radd__ on FeatureLocation to return self if other ==
> 0 allows sum() to work in place, but I am uncomfortable with
> hard-coding such a condition.

Another idea is to define addition of an integer to a FeatureLocation
or CompoundLocation, exposing the current private method _shift, i.e.
applying an offset to the co-ordinates. This is something I'd been
pondering as a (previously unrelated) enhancement. In this
interpretation, adding zero would have no effect on the co-ordinates,
and thus as a side effect should also make sum(locations) work. We'd
need to test this to see if it actually works.
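
A rough sketch of that idea, again with a toy class rather than the
real objects (ShiftableLocation and its tuple-based parts list are
assumptions made purely for illustration; only the name _shift comes
from the existing private method). Adding an int shifts the
co-ordinates, and __radd__ applies the same rule when the int is on
the left, which is the case sum() hits with its default start of 0:

```python
class ShiftableLocation:
    """Hypothetical toy location; integer addition means 'apply an offset'."""

    def __init__(self, parts):
        self.parts = [(int(s), int(e)) for s, e in parts]

    def _shift(self, offset):
        # Return a new location with every co-ordinate moved by offset
        return ShiftableLocation([(s + offset, e + offset) for s, e in self.parts])

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, ShiftableLocation):
            return ShiftableLocation(self.parts + other.parts)
        return NotImplemented

    def __radd__(self, other):
        # Reached for int + location, e.g. the 0 + first_item step in sum()
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented


exons = [ShiftableLocation([(0, 10)]), ShiftableLocation([(20, 30)])]
total = sum(exons)        # 0 + exons[0] is a shift by zero, then a normal join
moved = exons[0] + 5      # offset by five
```

One consequence worth noting: with this rule any integer start value
is accepted, so sum(exons, 100) would silently shift the first part
rather than raise an error.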

> 2. Changing the location to subclass set and use xrange for generation
> would easily allow a number of things: an empty location
> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the
> 'magic' of merging abutting locations that you mention. However, using
> + and sum() on sets is dubious from a mathematically pure standpoint,
> and this would only work for ExactPositions. Note that I haven't
> attempted this yet and it may have disadvantages even for
> ExactPositions that I've failed to imagine.
>
> Let me know your thoughts.

I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty
location, but rather as a between location - in this case between
the last and first base on a circular genome. In GenBank notation
for a circular genome of length 1234, this would be 1234^1
(already an annoying special case we have to handle in the
parser and the writer - although I'd have to check the code
to see if we store this as [0:0] or [1234:1234] since both make
sense).

On the other hand, a CompoundLocation with zero parts might
make sense. There is something to be said for simply having
a single (upgraded) FeatureLocation object with a parts list,
which in the typical case would be length one, and proxy
methods for start/end as currently defined in CompoundLocation.
Maybe I should try that on another branch... it might be more
elegant overall.
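
To make the single-object idea concrete, a minimal sketch under
stated assumptions: PartsLocation is a hypothetical name, parts are
plain (start, end) tuples, and start/end are taken as the minimum and
maximum over the parts, which may not match how CompoundLocation
defines them on the branch. A location with zero parts then acts as a
natural start value for sum():

```python
class PartsLocation:
    """Hypothetical single location type holding zero or more parts."""

    def __init__(self, parts=()):
        self.parts = list(parts)  # (start, end) tuples

    @property
    def start(self):
        # Proxy over all parts; raises ValueError if there are no parts
        return min(s for s, _ in self.parts)

    @property
    def end(self):
        # Proxy over all parts; raises ValueError if there are no parts
        return max(e for _, e in self.parts)

    def __add__(self, other):
        if not isinstance(other, PartsLocation):
            return NotImplemented
        return PartsLocation(self.parts + other.parts)


# The typical case: a single part
simple = PartsLocation([(122, 456)])

# A zero-part location as an explicit start value for sum()
gene = sum([PartsLocation([(0, 10)]), PartsLocation([(20, 30)])], PartsLocation())
```

What start and end should return for a zero-part location is left
open here; the sketch simply lets min() and max() raise ValueError.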

Peter


