[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)

Brad Chapman chapmanb at 50mail.com
Thu Apr 23 12:36:35 UTC 2009


Hi all;

> > Unless you are thinking of having an object representation as being too
> > heavy, the non-light part of SeqFeature is all the FeatureLocation
> > fuzziness.
> 
> I've just had a quick go at what should be a 100% backwards compatible
> modification to the FeatureLocation class to store ExactPosition start
> or end positions as integers.  The idea should be more memory
> efficient, using the complex position objects only when required.

I like the idea here, but I would go a step further and get rid of
FeatureLocation, collapsing the start and end locations onto the
SeqFeature itself. FeatureLocation is basically just a holder for
start and end coordinates. In that design, you would store the
positions plus extensions and fuzzy types on the feature, and then
instantiate fuzzy objects on demand.
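
A minimal sketch of that idea follows. Everything in it is hypothetical:
the class and attribute names (LightFeature, start_ext, start_fuzzy and
so on) are illustrative stand-ins, not the Bio.SeqFeature API. It only
shows plain values stored on the feature, with position objects built
when asked for.

```python
class Position:
    """Hypothetical stand-in for a position object (not Bio.SeqFeature)."""
    __slots__ = ("position", "extension")

    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension

    def __repr__(self):
        return f"{type(self).__name__}({self.position}, {self.extension})"


class ExactPosition(Position):
    __slots__ = ()


class BeforePosition(Position):
    __slots__ = ()


class AfterPosition(Position):
    __slots__ = ()


class WithinPosition(Position):
    __slots__ = ()


# Maps the fuzzy type stored on the feature to the class built on demand.
POSITION_CLASSES = {
    "exact": ExactPosition,
    "before": BeforePosition,
    "after": AfterPosition,
    "within": WithinPosition,
}


class LightFeature:
    """Feature holding plain ints and strings instead of a location object."""
    __slots__ = ("type", "start", "end", "start_ext", "end_ext",
                 "start_fuzzy", "end_fuzzy")

    def __init__(self, type, start, end, start_ext=0, end_ext=0,
                 start_fuzzy="exact", end_fuzzy="exact"):
        self.type = type
        self.start, self.end = start, end
        self.start_ext, self.end_ext = start_ext, end_ext
        self.start_fuzzy, self.end_fuzzy = start_fuzzy, end_fuzzy

    @property
    def start_position(self):
        # The position object only exists while the caller holds it.
        return POSITION_CLASSES[self.start_fuzzy](self.start, self.start_ext)

    @property
    def end_position(self):
        return POSITION_CLASSES[self.end_fuzzy](self.end, self.end_ext)
```

Code that only needs coordinates reads feature.start and feature.end
directly; code that cares about fuzziness asks for
feature.start_position or feature.end_position.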

I took a look at the resource usage of these objects versus
a lightweight implementation. For a GFF file with 70k features, the
maximum memory usage is 128M versus 111M for the lightweight
version. So the improvement is rather modest, ~15%.
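
One way to run this kind of comparison is with the standard library's
tracemalloc module. The sketch below is not the measurement reported
above: HeavyFeature, Location, Pos and FlatFeature are hypothetical toy
classes, and because real features also carry qualifiers and other
data, the ratio it prints says nothing about the 128M and 111M figures.

```python
import tracemalloc


class Pos:
    """Toy position object (hypothetical)."""
    def __init__(self, position):
        self.position = position


class Location:
    """Toy holder for a start and an end position (hypothetical)."""
    def __init__(self, start, end):
        self.start = start
        self.end = end


class HeavyFeature:
    """Feature -> location -> two position objects."""
    def __init__(self, start, end):
        self.location = Location(Pos(start), Pos(end))


class FlatFeature:
    """Feature storing the two coordinates directly as ints."""
    __slots__ = ("start", "end")

    def __init__(self, start, end):
        self.start = start
        self.end = end


def peak_bytes(factory, n=70_000):
    """Return peak traced memory while n features are held in a list."""
    tracemalloc.start()
    features = [factory(i, i + 100) for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del features
    return peak


if __name__ == "__main__":
    heavy = peak_bytes(HeavyFeature)
    flat = peak_bytes(FlatFeature)
    print(f"heavy: {heavy / 1e6:.1f} MB, flat: {flat / 1e6:.1f} MB")
```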

> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string.  I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting).  This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.

I admittedly haven't looked at this in a while, but the GenBank
parsing was designed to be round-tripped. The GenBank Record class
can be written back out in GenBank format, and test_GenBank explicitly
checks that the original and regenerated records are the same.
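
As an illustration of the off-by-one issue in the quoted text, here is
a hedged sketch of converting a Python-style location (0-based start,
exclusive end) to a simple GenBank-style string and back. The functions
to_genbank and from_genbank are hypothetical helpers, not part of
Bio.GenBank, and they only handle plain "start..end" spans with
optional "<" and ">" markers.

```python
import re

# Matches only simple spans such as "1..200", "<10..50" or "<10..>50".
_SIMPLE_LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def to_genbank(start, end, start_before=False, end_after=False):
    """Format a 0-based, end-exclusive span as a 1-based, inclusive string.

    Only the start needs +1: the exclusive Python end already equals
    the inclusive GenBank end.
    """
    left = ("<" if start_before else "") + str(start + 1)
    right = (">" if end_after else "") + str(end)
    return f"{left}..{right}"


def from_genbank(text):
    """Parse a simple location string back to Python-style values."""
    match = _SIMPLE_LOCATION.match(text)
    if match is None:
        raise ValueError(f"unsupported location string: {text!r}")
    before, start, after, end = match.groups()
    return int(start) - 1, int(end), bool(before), bool(after)
```

A round trip then amounts to checking that
from_genbank(to_genbank(*loc)) returns loc for each location tested.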

Brad


