[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)

Peter Cock p.j.a.cock at googlemail.com
Fri Apr 24 10:14:10 EDT 2009


On Fri, Apr 24, 2009 at 1:45 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> With the unix 'time' command; those are the values reported by %M,
> which is the maximum memory used during the process.
>

You said 70k features, but how big was the file on disk?

>>
>> And was your 15% comparison between the current "heavy" SeqFeature +
>> FeatureLocation system as in CVS, and my lightweight alternative
>> described earlier?
>>
>
> This was with an even lighter version. I just added start/end as
> attributes to the SeqFeatures. So there was no FeatureLocation or
> individual position objects. This was a hack to look at the best case
> scenario to save memory. The baseline was the default SeqFeatures
> before we started thinking about changing them.

Right - so even if the FeatureLocation is a bit "heavy", getting rid of it
wouldn't make that much difference based on your simple profiling.

>> How does this version look? It should save more memory than the
>> version I sent you three days ago, and again aims for 100% backwards
>> compatibility - all the unit tests pass.
>
> That is nice. Do we still want to keep a FeatureLocation, or
> condense this all onto the SeqFeature itself?

For the moment I was exploring ways to avoid wasting memory in the
FeatureLocation object while retaining 100% compatibility.  If your
simple profiling numbers are telling the whole story, then there isn't
a great deal of point in adding any internal complexity for a small
memory saving.

If we do want to preserve the current SeqFeature and FeatureLocation
API, then the proposal on Bug 2818 is a worthwhile incremental
improvement.

However, we can probably come up with something even nicer if we
change the SeqFeature and FeatureLocation in a
non-backwards-compatible way. If we did change the API, I would want
to stop using the sub_features list to hold join information as child
SeqFeatures.
I was thinking the FeatureLocation object should hold this, but
merging the SeqFeature and FeatureLocation could make sense.  Are
there any other non-join location operators we really have to deal
with?

Internally the FeatureLocation (or SeqFeature) could hold its child
locations in a private list of two-entry tuples (start and end
positions).  Typically for a non-join feature this would be just
_loc_list=[(start,end)], while more generally it would be
_loc_list=[(start1,end1),...,(startN,endN)].  The FeatureLocation (or
SeqFeature) would have (fuzzy/non-fuzzy) start and end properties
which would access _loc_list[0][0] for the start, and _loc_list[-1][1]
for the end.  I would still use the existing position objects to store
fuzzy positions.
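
A minimal sketch of the idea above, not actual Biopython code.  The
class body, the is_join helper and the use of plain integers are
assumptions made for illustration only; the tuples could just as well
hold the existing position objects when a position is fuzzy.

```python
class FeatureLocation(object):
    """Hypothetical location holding one or more (start, end) pairs."""

    def __init__(self, parts):
        # parts: iterable of (start, end) pairs, in order along the sequence.
        # A simple (non-join) location is just a single pair.
        self._loc_list = [(start, end) for start, end in parts]
        if not self._loc_list:
            raise ValueError("need at least one (start, end) pair")

    @property
    def start(self):
        # Overall start is the start of the first child location.
        return self._loc_list[0][0]

    @property
    def end(self):
        # Overall end is the end of the last child location.
        return self._loc_list[-1][1]

    @property
    def is_join(self):
        # Illustrative helper (an assumption): more than one pair means a join.
        return len(self._loc_list) > 1


simple = FeatureLocation([(5, 20)])
joined = FeatureLocation([(5, 20), (40, 60), (90, 120)])
print(simple.start, simple.end, simple.is_join)
print(joined.start, joined.end, joined.is_join)
```

With this layout a join needs no child SeqFeature objects at all, only
one extra tuple per sub-location.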

Peter

