[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
Brad Chapman
chapmanb at 50mail.com
Thu Apr 23 08:36:35 EDT 2009
Hi all;
> > Unless you are thinking of having an object representation as being too
> > heavy, the non-light part of SeqFeature is all the FeatureLocation
> > fuzziness.
>
> I've just had a quick go at what should be a 100% backwards compatible
> modification to the FeatureLocation class to store ExactPosition start
> or end positions as integers. The idea should be more memory
> efficient, using the complex position objects only when required.
I like the idea here but I would go a step further and get rid of
FeatureLocation, collapsing the start and end location onto the
SeqFeature itself. FeatureLocation is basically just a holder for
start and end coordinates. In this version, you would store the
positions plus extensions and fuzzy type on the Feature, and then
instantiate fuzzy objects on demand.
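
To make that concrete, here is a rough sketch of what I mean. The class
and attribute names (LightFeature, start_fuzz, end_fuzz) are made up for
illustration and the position classes are simple stand-ins, not the real
Bio.SeqFeature ones. Extensions are left out to keep it short.

```python
class ExactPosition(object):
    """Stand-in for a position class (hypothetical, not Bio.SeqFeature)."""

    def __init__(self, position):
        self.position = position

    def __repr__(self):
        return "%s(%d)" % (self.__class__.__name__, self.position)


class BeforePosition(ExactPosition):
    """Stand-in for a fuzzy 'starts before here' position."""


class AfterPosition(ExactPosition):
    """Stand-in for a fuzzy 'ends after here' position."""


# Map the stored fuzzy-type tag to the class built on demand.
_POSITION_CLASSES = {
    None: ExactPosition,
    "before": BeforePosition,
    "after": AfterPosition,
}


class LightFeature(object):
    """Hypothetical feature holding plain ints plus a fuzzy-type tag."""

    # __slots__ avoids a per-instance __dict__, which is where much of
    # the saving would come from when there are many features.
    __slots__ = ("type", "start", "end", "strand", "start_fuzz", "end_fuzz")

    def __init__(self, type, start, end, strand=None,
                 start_fuzz=None, end_fuzz=None):
        self.type = type
        self.start = start
        self.end = end
        self.strand = strand
        self.start_fuzz = start_fuzz
        self.end_fuzz = end_fuzz

    @property
    def start_position(self):
        # Position object is created only when somebody asks for it.
        return _POSITION_CLASSES[self.start_fuzz](self.start)

    @property
    def end_position(self):
        return _POSITION_CLASSES[self.end_fuzz](self.end)
```

Code that only needs coordinates reads feature.start and feature.end as
integers; code that cares about fuzziness asks for start_position or
end_position and gets an object built at that moment.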
I took a look at the resource usage of these objects versus
a lightweight implementation. For a GFF file with 70k features, the
maximum memory usage is 128M for the current objects versus 111M for
the lightweight version. So the improvement is rather modest, ~15%.
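
If anyone wants to try this kind of comparison themselves, something
along these lines would do as a starting point. This is not the script
behind the numbers above: the classes are toy stand-ins, peak_bytes is a
made-up helper, and it uses the standard library tracemalloc module, so
the absolute figures will not match a real GFF parse.

```python
import tracemalloc


class Pos(object):
    """Toy position object wrapping one integer."""

    def __init__(self, position):
        self.position = position


class Loc(object):
    """Toy location holding two position objects."""

    def __init__(self, start, end):
        self.start = Pos(start)
        self.end = Pos(end)


class HeavyFeature(object):
    """Toy feature with a nested location, as in the object-based design."""

    def __init__(self, start, end):
        self.location = Loc(start, end)


class SlimFeature(object):
    """Toy feature storing the two integers directly."""

    __slots__ = ("start", "end")

    def __init__(self, start, end):
        self.start = start
        self.end = end


def peak_bytes(factory, n=70000):
    """Return peak traced memory (bytes) while building n features."""
    tracemalloc.start()
    features = [factory(i, i + 100) for i in range(n)]
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del features
    return peak


if __name__ == "__main__":
    print("heavy:", peak_bytes(HeavyFeature))
    print("slim: ", peak_bytes(SlimFeature))
```

In a real parse the qualifiers and other per-feature data take up a
share of the memory too, which would shrink the relative difference
compared with a toy like this.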
> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string. I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting). This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.
I admittedly haven't looked at this in a while, but the GenBank
code was designed to be round-tripped. The GenBank Record class can
be written back out in GenBank format, and test_GenBank explicitly
checks that the original and re-written records are the same.
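
On the off-by-one issue in the quoted text: if the start and end are
available together as plain integers, the conversion is small. A minimal
sketch, assuming Python-style zero-based, end-exclusive coordinates and
only simple locations (no joins); genbank_location and its arguments are
hypothetical names, not an existing Biopython function.

```python
def genbank_location(start, end, strand=None,
                     start_fuzzy=False, end_fuzzy=False):
    """Format a simple location as a GenBank/EMBL style string.

    start and end use Python counting (zero-based, end-exclusive), so
    only the start needs one added to give one-based inclusive numbers.
    """
    left = "%s%d" % ("<" if start_fuzzy else "", start + 1)
    right = "%s%d" % (">" if end_fuzzy else "", end)
    if start + 1 == end and not (start_fuzzy or end_fuzzy):
        # A single exact base is written as one number.
        text = left
    else:
        text = "%s..%s" % (left, right)
    if strand == -1:
        text = "complement(%s)" % text
    return text


if __name__ == "__main__":
    print(genbank_location(0, 10))
    print(genbank_location(99, 200, strand=-1))
    print(genbank_location(4, 20, start_fuzzy=True))
```

Because the function sees both ends at once, it knows which number is
the start and can add one there, which is the piece the individual
Position objects are missing.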
Brad