[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)

Peter Cock p.j.a.cock at googlemail.com
Tue Apr 21 09:51:26 EDT 2009


On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Unless you are thinking of having an object representation as being too
> heavy, the non-light part of SeqFeature is all the FeatureLocation
> fuzziness.

I've just had a quick go at what should be a 100% backwards compatible
modification to the FeatureLocation class to store ExactPosition start
or end positions as integers.  The idea is that this should be more
memory efficient, using the complex position objects only when required.

The new __init__ method would look like this:

    def __init__(self, start, end):
        """Specify the start and end of a sequence feature."""
        #Keeps exact locations as plain integers
        #Calculates the non-fuzzy versions now to make accessing
        #them simpler and faster (expected to be used more often)
        if isinstance(start, int) or isinstance(start, long):
            self._start = None
            self._start_int_nofuzzy = start
        elif isinstance(start, ExactPosition) :
            #Don't need to keep the full object
            self._start = None
            self._start_int_nofuzzy = start.position
        else :
            assert isinstance(start, AbstractPosition), repr(start)
            self._start = start
            self._start_int_nofuzzy = min(start.position,
                                          start.position + start.extension)
        if isinstance(end, int) or isinstance(end, long) :
            self._end = None
            self._end_int_nofuzzy = end
        elif isinstance(end, ExactPosition) :
            #Don't need to keep the full object
            self._end = None
            self._end_int_nofuzzy = end.position
        else :
            assert isinstance(end, AbstractPosition), repr(end)
            self._end = end
            self._end_int_nofuzzy = max(end.position,
                                        end.position + end.extension)

The associated methods are then updated accordingly.  When a position
object is requested, self._start or self._end is used, unless it is
None, in which case an ExactPosition is generated on the fly from the
integer self._start_int_nofuzzy or self._end_int_nofuzzy.  When the
non-fuzzy integer approximation is wanted (the typical use case), we
have those cached as the integers.
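
Roughly, the updated accessors might look like the sketch below.  This
is not the actual patch: the position classes are simplified stand-ins
for the ones in Bio.SeqFeature, the class name FeatureLocationSketch and
the _split helper are made up for illustration, the property names are
assumptions, and it is written for Python 3 (so there is no long check).

```python
class AbstractPosition(object):
    """Simplified stand-in for Bio.SeqFeature.AbstractPosition."""
    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ExactPosition(AbstractPosition):
    """Simplified stand-in for an exact (non-fuzzy) position."""


class FeatureLocationSketch(object):
    """Hypothetical cut-down location class, for illustration only."""

    def __init__(self, start, end):
        self._start, self._start_int_nofuzzy = self._split(start, min)
        self._end, self._end_int_nofuzzy = self._split(end, max)

    @staticmethod
    def _split(pos, pick):
        """Return (position object or None, cached non-fuzzy integer)."""
        if isinstance(pos, int):
            return None, pos
        if isinstance(pos, ExactPosition):
            # Don't need to keep the full object
            return None, pos.position
        assert isinstance(pos, AbstractPosition), repr(pos)
        return pos, pick(pos.position, pos.position + pos.extension)

    @property
    def start(self):
        """Position object, built on the fly if only an integer is held."""
        if self._start is None:
            return ExactPosition(self._start_int_nofuzzy)
        return self._start

    @property
    def end(self):
        """Position object, built on the fly if only an integer is held."""
        if self._end is None:
            return ExactPosition(self._end_int_nofuzzy)
        return self._end

    @property
    def nofuzzy_start(self):
        """Cached integer, no object creation needed."""
        return self._start_int_nofuzzy

    @property
    def nofuzzy_end(self):
        """Cached integer, no object creation needed."""
        return self._end_int_nofuzzy
```

One consequence of this design is that asking for the start or end of an
exact location returns a new ExactPosition each time, so callers should
not rely on getting the same object back twice.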

The unit tests all pass (except test_BioSQL_SeqIO.py), but we'd need
to have some sort of benchmark to demonstrate any memory gains in
order to justify this kind of change.  Maybe try it with Brad's GFF
parser on a very large file? I could stick the full patch on Bugzilla
(or perhaps github) if this sounds worth pursuing...
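
As a starting point before trying a real GFF file, a rough synthetic
comparison could be done along the following lines.  This is only a
sketch for Python 3 using the standard library tracemalloc module; the
classes ObjectLocation and IntLocation and the function traced_bytes are
hypothetical stand-ins, not Biopython code, and the numbers it gives say
nothing about real parsing workloads.

```python
import tracemalloc


class ExactPosition(object):
    """Simplified stand-in for Bio.SeqFeature.ExactPosition."""
    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ObjectLocation(object):
    """Keeps a position object for each end (the existing approach)."""
    def __init__(self, start, end):
        self._start = ExactPosition(start)
        self._end = ExactPosition(end)


class IntLocation(object):
    """Keeps plain integers for exact positions (the proposed approach)."""
    def __init__(self, start, end):
        self._start = None
        self._start_int_nofuzzy = start
        self._end = None
        self._end_int_nofuzzy = end


def traced_bytes(location_class, count=100000):
    """Return the bytes still allocated after building count locations."""
    tracemalloc.start()
    locations = [location_class(i, i + 1000) for i in range(count)]
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del locations
    return current


if __name__ == "__main__":
    print("position objects:", traced_bytes(ObjectLocation))
    print("plain integers:  ", traced_bytes(IntLocation))
```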

An alternative implementation would use a single private variable to
store either the integer position or the position object, and check
the type when the public properties are accessed.  This should be an
even bigger memory saving, but may be slower.
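
A minimal sketch of that alternative, again with simplified stand-in
position classes and made-up names (SingleSlotLocation, _compact), and
written for Python 3 rather than taken from any real patch:

```python
class AbstractPosition(object):
    """Simplified stand-in for Bio.SeqFeature.AbstractPosition."""
    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ExactPosition(AbstractPosition):
    """Simplified stand-in for an exact (non-fuzzy) position."""


class SingleSlotLocation(object):
    """Hypothetical variant with one private variable per end."""

    def __init__(self, start, end):
        self._start = self._compact(start)
        self._end = self._compact(end)

    @staticmethod
    def _compact(pos):
        """Reduce exact positions to integers, keep anything fuzzy."""
        if isinstance(pos, ExactPosition):
            return pos.position
        assert isinstance(pos, (int, AbstractPosition)), repr(pos)
        return pos

    @property
    def start(self):
        # Type check on every access; this is the possible slow-down
        if isinstance(self._start, int):
            return ExactPosition(self._start)
        return self._start

    @property
    def end(self):
        if isinstance(self._end, int):
            return ExactPosition(self._end)
        return self._end

    @property
    def nofuzzy_start(self):
        pos = self._start
        if isinstance(pos, int):
            return pos
        # Not cached, so recomputed each time for fuzzy positions
        return min(pos.position, pos.position + pos.extension)

    @property
    def nofuzzy_end(self):
        pos = self._end
        if isinstance(pos, int):
            return pos
        return max(pos.position, pos.position + pos.extension)
```

Here the saving comes from dropping the two cached integer attributes,
at the cost of an isinstance check on every property access.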

Peter

