[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
Peter Cock
p.j.a.cock at googlemail.com
Tue Apr 21 09:51:26 EDT 2009
On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Unless you are thinking of having an object representation as being too
> heavy, the non-light part of SeqFeature is all the FeatureLocation
> fuzziness.
I've just had a quick go at what should be a 100% backwards compatible
modification to the FeatureLocation class to store ExactPosition start
or end positions as integers. This should be more memory efficient,
since the complex position objects are then used only when required.
The new __init__ method would look like this:
def __init__(self, start, end):
    """Specify the start and end of a sequence feature."""
    #Keeps exact locations as plain integers
    #Calculates the non-fuzzy versions now to make accessing
    #them simpler and faster (expected to be used more often)
    if isinstance(start, int) or isinstance(start, long):
        self._start = None
        self._start_int_nofuzzy = start
    elif isinstance(start, ExactPosition):
        #Don't need to keep the full object
        self._start = None
        self._start_int_nofuzzy = start.position
    else:
        assert isinstance(start, AbstractPosition), repr(start)
        self._start = start
        self._start_int_nofuzzy = min(start.position,
                                      start.position + start.extension)
    if isinstance(end, int) or isinstance(end, long):
        self._end = None
        self._end_int_nofuzzy = end
    elif isinstance(end, ExactPosition):
        #Don't need to keep the full object
        self._end = None
        self._end_int_nofuzzy = end.position
    else:
        assert isinstance(end, AbstractPosition), repr(end)
        self._end = end
        self._end_int_nofuzzy = max(end.position,
                                    end.position + end.extension)
The associated methods are then updated accordingly. When a position
object is requested, self._start or self._end is returned if it is not
None; otherwise an ExactPosition is generated on the fly from the
integer self._start_int_nofuzzy or self._end_int_nofuzzy. When the
non-fuzzy integer approximation is wanted (the typical use case), it
is already cached as a plain integer.
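
As a rough illustration of the accessors described above (not the actual
patch), the sketch below uses stand-in position classes and a hypothetical
LocationSketch class instead of the real Bio.SeqFeature ones. It assumes
Python 3, where int also covers what the code above calls long, and the
property names are chosen for illustration.

```python
class AbstractPosition:
    """Stand-in for a position with optional fuzziness (illustration only)."""

    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ExactPosition(AbstractPosition):
    """Stand-in for an exact position; its extension stays 0."""


class LocationSketch:
    """Hypothetical cut-down location class, not the real FeatureLocation."""

    def __init__(self, start, end):
        self._start, self._start_int_nofuzzy = self._split(start, min)
        self._end, self._end_int_nofuzzy = self._split(end, max)

    @staticmethod
    def _split(value, pick):
        """Return (position object or None, cached non-fuzzy integer)."""
        if isinstance(value, int):
            return None, value
        if isinstance(value, ExactPosition):
            # No need to keep the full object for an exact position.
            return None, value.position
        assert isinstance(value, AbstractPosition), repr(value)
        return value, pick(value.position, value.position + value.extension)

    @property
    def start(self):
        # Rebuild an ExactPosition on demand if only the integer was kept.
        if self._start is None:
            return ExactPosition(self._start_int_nofuzzy)
        return self._start

    @property
    def end(self):
        if self._end is None:
            return ExactPosition(self._end_int_nofuzzy)
        return self._end

    @property
    def nofuzzy_start(self):
        # The typical use case: just hand back the cached integer.
        return self._start_int_nofuzzy

    @property
    def nofuzzy_end(self):
        return self._end_int_nofuzzy
```

Note that with this approach each access to start or end on an exact
location builds a fresh ExactPosition, so callers should not rely on
getting the same object back twice.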
The unit tests all pass (except test_BioSQL_SeqIO.py), but we'd need
to have some sort of benchmark to demonstrate any memory gains in
order to justify this kind of change. Maybe try it with Brad's GFF
parser on a very large file? I could stick the full patch on Bugzilla
(or perhaps github) if this sounds worth pursuing...
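
One possible shape for such a benchmark, as a toy sketch only: it uses the
standard library tracemalloc module (Python 3) and a hypothetical
PositionStandIn class rather than real SeqFeature objects or a GFF file, so
the numbers it prints say nothing about Biopython itself. It only shows how
the two storage strategies could be compared.

```python
import tracemalloc


class PositionStandIn:
    """Hypothetical stand-in for a position object (not a Biopython class)."""

    def __init__(self, position):
        self.position = position


def traced_bytes(build):
    """Return the bytes still allocated after calling build()."""
    tracemalloc.start()
    data = build()  # keep a reference so the memory stays allocated
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return current


N = 50_000  # arbitrary number of (start, end) pairs


def as_objects():
    return [(PositionStandIn(i), PositionStandIn(i + 100)) for i in range(N)]


def as_integers():
    return [(i, i + 100) for i in range(N)]


object_bytes = traced_bytes(as_objects)
integer_bytes = traced_bytes(as_integers)
print("position objects: %i bytes" % object_bytes)
print("plain integers:   %i bytes" % integer_bytes)
```

A realistic test would instead parse a large annotation file with and
without the patch applied and compare the totals.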
An alternative implementation would use a single private variable to
store either the integer position or the position object, and check
the type when the public properties are accessed. This should be an
even bigger memory saving, but may be slower.
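
A minimal sketch of this single-variable alternative, again with stand-in
position classes and a hypothetical SingleSlotLocation class, and assuming
Python 3. The type check moves from __init__ to every property access,
which is where the possible slowdown would come from.

```python
class AbstractPosition:
    """Stand-in for a position with optional fuzziness (illustration only)."""

    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ExactPosition(AbstractPosition):
    """Stand-in for an exact position; its extension stays 0."""


def _compact(value):
    """Store exact positions as plain integers, keep anything else as is."""
    if isinstance(value, ExactPosition):
        return value.position
    return value


def _as_position(stored):
    """Return a position object, building an ExactPosition if needed."""
    if isinstance(stored, int):
        return ExactPosition(stored)
    return stored


def _as_int(stored, pick):
    """Return the non-fuzzy integer, computing it from the object if needed."""
    if isinstance(stored, int):
        return stored
    return pick(stored.position, stored.position + stored.extension)


class SingleSlotLocation:
    """Hypothetical location class with one private variable per end."""

    def __init__(self, start, end):
        self._start = _compact(start)
        self._end = _compact(end)

    @property
    def start(self):
        return _as_position(self._start)

    @property
    def end(self):
        return _as_position(self._end)

    @property
    def nofuzzy_start(self):
        return _as_int(self._start, min)

    @property
    def nofuzzy_end(self):
        return _as_int(self._end, max)
```

Here the non-fuzzy integer for a fuzzy position is recomputed on each
access rather than cached, trading a little time for the smaller object.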
Peter