[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Peter Cock p.j.a.cock at googlemail.com
Tue Jul 24 21:38:59 UTC 2012


On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>>> The documentation suggests using + to combine FeatureLocations, which
>>> invites the use of sum. However, sum doesn't work properly. I explain
>>> why in my StackOverflow question:
>>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior
>>
>> Huh, I hadn't anticipated that - but I agree trying to use sum seems
>> natural.
>>
>>> I have considered a number of workarounds:
>>>
>>> 1. Implementing __radd__ on FeatureLocation to return self if other ==
>>> 0 allows sum() to work in place, but I am uncomfortable with
>>> hard-coding such a condition.
>>
>> Another idea is to define FeatureLocation or CompoundLocation
>> addition with an integer to expose the current private method _shift.
>> i.e. Apply an offset to the co-ordinates. Something I'd been pondering
>> as a (previously unrelated) enhancement. In this interpretation, adding
>> zero would have no effect on the co-ordinates and thus as a side
>> effect should also make sum(locations) work. We'd need to test this
>> to see if that actually works.
>
> Yes, this works fine:
>
> Modifying FeatureLocation.__add__ with the condition:
>
>     if isinstance(other, int):
>         return self._shift(other)
>
> and adding FeatureLocation.__radd__:
>
>     def __radd__(self, other):
>         return self.__add__(other)
>
> After these changes, FeatureLocation(3,6) + 3 yields [6:9] and
> sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6],
> [10:13]}. (+ of FeatureLocations also still works, as does summing
> lists with length > 2)

OK - good. That might be worthwhile then.
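
For anyone following along, here is a rough self-contained sketch of why
this makes sum() work. The classes Loc and Join are toy stand-ins invented
for illustration (they are not the Bio.SeqFeature code on the branch), and
only the integer-offset behaviour described above is assumed:

```python
class Loc:
    """Toy stand-in for a simple location; NOT Biopython's FeatureLocation."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def _shift(self, offset):
        # Return a new location moved by the given integer offset.
        return Loc(self.start + offset, self.end + offset)

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, Loc):
            return Join([self, other])
        return NotImplemented

    def __radd__(self, other):
        # sum() starts from 0, so its first step is 0 + loc.
        # int.__add__ gives up on a Loc, and Python then calls this method.
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "[%d:%d]" % (self.start, self.end)


class Join:
    """Toy stand-in for a compound location made of several Loc parts."""

    def __init__(self, parts):
        self.parts = list(parts)

    def __add__(self, other):
        if isinstance(other, Loc):
            return Join(self.parts + [other])
        if isinstance(other, Join):
            return Join(self.parts + other.parts)
        return NotImplemented

    def __repr__(self):
        return "join{%s}" % ", ".join(repr(p) for p in self.parts)


print(Loc(3, 6) + 3)                    # [6:9]
print(sum([Loc(3, 6), Loc(10, 13)]))    # join{[3:6], [10:13]}
```

The key point is that shifting by zero is a no-op, so the implicit
0 + first_location at the start of sum() just hands back an equivalent
location without any special-casing of zero.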

>>> 2. Changing the location to subclass set and use xrange for generation
>>> would easily allow a number of things: an empty location
>>> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the
>>> 'magic' of merging abutting locations that you mention. However, using
>>> + and sum() on sets is dubious from a mathematically pure standpoint,
>>> and this would only work for ExactPositions. Note that I haven't
>>> attempted this yet and it may have disadvantages even for
>>> ExactPositions that I've failed to imagine.
>>>
>>> Let me know your thoughts.
>>
>> I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty
>> location, but rather as a between location - in this case between
>> the last and first base on a circular genome. In Genbank notation
>> for a circular genome of length 1234, this would be 1234^1
>> (already an annoying special case we have to handle in the
>> parser and the writer - although I'd have to check the code
>> to see if we store this as [0:0] or [1234:1234] since both make
>> sense).
>
> If the length is 1234, [1234] would be an index error.  I don't think
> [1233:1233] would make sense either; for space-counted genomic
> coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html),
> the index refers to the space to the left of the base pair. By that
> convention, [0:0] would refer to the gap between the last base and the
> first base.

The point is that with a circular sequence of length n, base 0
is also base n, so [0:0] is sort of the same as [n:n], or [n:0]. Of
these I guess [0:0] is the most sensible representation for
following Python norms.

But we digress - this certainly isn't an 'empty location', something
which doesn't really make sense (other than in the sense of None
meaning missing data).
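
To make the circular equivalence concrete, here is a tiny sketch. The
helper normalize_between is hypothetical (nothing like it is claimed to
exist in Biopython); it just assumes between-base positions on a circular
sequence can be reduced modulo the sequence length:

```python
def normalize_between(position, length):
    """Reduce a between-bases position on a circular sequence of the
    given length into range(length). Hypothetical helper for illustration."""
    if length <= 0:
        raise ValueError("length must be positive")
    return position % length


n = 1234
# On a circular sequence, the gap after the last base is the gap before
# the first one, so position n and position 0 reduce to the same value.
print(normalize_between(0, n))   # 0
print(normalize_between(n, n))   # 0
```

Under that assumption [n:n] and [0:0] describe the same gap, and [0:0] is
the form that stays within normal Python index bounds.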

>>
>> On the other hand, a CompoundLocation with zero parts might
>> make sense. There is something to be said for simply having
>> a single (upgraded) FeatureLocation object with a parts list,
>> which in the typical case would be length one, and proxy
>> methods for start/end as currently defined in CompoundLocation.
>> Maybe I should try that on another branch... it might be more
>> elegant overall.
>>
>
> I haven't tested sum() on CompoundLocations but I would guess they
> would need similar treatment to FeatureLocation. Should
> CompoundLocation + int also shift each part?

If we make those changes to the FeatureLocation, then yes,
the CompoundLocation should get them too.
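
Roughly what I have in mind, as an untested sketch with made-up toy
classes (Part and Compound are illustrative names only, not the
CompoundLocation on the branch): adding an int shifts every part, and
__radd__ is defined so that sum() over compound locations also works.

```python
class Part:
    """Toy simple location with integer start/end; illustration only."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def _shift(self, offset):
        return Part(self.start + offset, self.end + offset)

    def __repr__(self):
        return "[%d:%d]" % (self.start, self.end)


class Compound:
    """Toy compound location holding a list of Part objects."""

    def __init__(self, parts):
        self.parts = list(parts)

    def _shift(self, offset):
        # Apply the same offset to every part, preserving their order.
        return Compound([p._shift(offset) for p in self.parts])

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)
        if isinstance(other, Compound):
            return Compound(self.parts + other.parts)
        return NotImplemented

    def __radd__(self, other):
        # Reached for int + Compound, e.g. the initial 0 + x inside sum().
        if isinstance(other, int):
            return self._shift(other)
        return NotImplemented

    def __repr__(self):
        return "join{%s}" % ", ".join(repr(p) for p in self.parts)


c = Compound([Part(3, 6), Part(10, 13)])
print(c + 5)                              # join{[8:11], [15:18]}
print(sum([c, Compound([Part(20, 25)])]))  # join{[3:6], [10:13], [20:25]}
```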

> I agree that an "upgraded" FeatureLocation could be more
> elegant.

It could turn out to be simpler having just one location object...
certainly worth trying out before committing this branch as is.

Peter


