[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
Peter Cock
p.j.a.cock at googlemail.com
Thu Apr 23 14:06:14 UTC 2009
On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
>
>> > Unless you are thinking of having an object representation as being too
>> > heavy, the non-light part of SeqFeature is all the FeatureLocation
>> > fuzziness.
>>
>> I've just had a quick go at what should be a 100% backwards compatible
>> modification to the FeatureLocation class to store ExactPosition start
>> or end positions as integers. The idea should be more memory
>> efficient, using the complex position objects only when required.
>
> I like the idea here but I would go a step further and get rid of
> FeatureLocation, collapsing the start and end location onto the
> SeqFeature itself. FeatureLocation is basically just a holder for a
> start and end coordinates. In this version, you would store the
> positions plus extensions and fuzzy type on the Feature, and then
> instantiate fuzzy objects on demand.
>
> I took a look at the resource usage of these objects versus
> a lightweight implementation. For a GFF file with 70k features, the
> maximum memory usage is 128M versus 111M for the lightweight
> version. So the improvement is rather modest, ~15%.
Thanks for that. Perhaps the variant idea of using a single
reference for each location would save more (currently it uses two
references, one for the object and one for the integer - so in general
we are wasting memory on a pointer to None).
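
To make the single-reference idea concrete, here is a rough sketch. It is not the actual patch: ExactPosition and BeforePosition below are simplified stand-ins for the real Bio.SeqFeature classes, and CompactLocation is a hypothetical name. Each end of the location lives in one slot, holding either a plain integer (exact) or a position object (fuzzy).

```python
class ExactPosition:
    """Simplified stand-in for Bio.SeqFeature.ExactPosition (assumption)."""

    def __init__(self, position):
        self.position = position


class BeforePosition(ExactPosition):
    """Simplified stand-in for a fuzzy '<' position (assumption)."""


class CompactLocation:
    """Hypothetical location holding one reference per end."""

    __slots__ = ("_start", "_end")

    def __init__(self, start, end):
        self._start = self._pack(start)
        self._end = self._pack(end)

    @staticmethod
    def _pack(pos):
        # Exact positions collapse to a bare int; anything else
        # (fuzzy objects, or ints passed directly) is stored as given.
        if type(pos) is ExactPosition:
            return pos.position
        return pos

    @staticmethod
    def _unpack(value):
        # Rebuild an ExactPosition on demand so callers see the old API.
        if isinstance(value, int):
            return ExactPosition(value)
        return value

    @property
    def start(self):
        return self._unpack(self._start)

    @property
    def end(self):
        return self._unpack(self._end)
```

With this layout there is no second slot left pointing at None, at the cost of building a temporary ExactPosition whenever an exact start or end is read.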
Certainly merging the SeqFeature and FeatureLocation should save even
more memory. We could do this with full backward compatibility by
generating the FeatureLocation object on request (using a property
method for the SeqFeature's location), and this can also trigger a
deprecation warning. We'd have to think about what to do with the
SeqFeature's __init__ method more carefully.
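
As an illustration only, the on-request approach might look something like the following. The classes here are cut-down stand-ins rather than the real Bio.SeqFeature code, and the attribute names (start, end) and the __init__ signature are assumptions, since that part still needs thought.

```python
import warnings


class FeatureLocation:
    """Cut-down stand-in for Bio.SeqFeature.FeatureLocation (assumption)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class SeqFeature:
    """Hypothetical merged feature that stores its coordinates directly."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @property
    def location(self):
        # Old code asking for feature.location still works, but a
        # FeatureLocation is only built at this point, with a warning.
        warnings.warn(
            "SeqFeature.location is deprecated; use the start and end "
            "attributes of the SeqFeature instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return FeatureLocation(self.start, self.end)
```

Note that each access builds a fresh FeatureLocation, so old code which modifies feature.location in place would silently stop updating the feature. That is one of the compatibility corners we would need to check.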
>> I forgot to mention the second major use case I'm concerned about,
>> which is recovering the GenBank/EMBL style location string. I have
>> looked at this in the past, by adding methods to the FeatureLocation
>> and all the Position objects, but it is complicated by the fact the
>> Position objects don't know if they are at the start or end (and for
>> the start locations we need to add one to convert from Python
>> counting). This is the main block on having Bio.SeqIO support writing
>> GenBank (or EMBL) files with their features included.
>
> I admittedly haven't looked at this in a while, but this was
> designed to be round tripped. The GenBank Record class can be
> written out back in GenBank format, and test_GenBank explicitly
> checks that the start and end records are the same.
Yes - the Bio.GenBank.Record class should round-trip; from memory, it
stores feature locations as strings.
I'm interested in writing a SeqRecord out as a GenBank file (which
we can already do, but without the features). This would let you do things
like load an EMBL or GFF3 file as a SeqRecord, and output it as a
GenBank file.
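
For the simplest case, the conversion from Python-style coordinates back to a GenBank/EMBL location string can be sketched as below. This is only a minimal illustration: genbank_location is a hypothetical helper, it assumes exact integer positions, and it ignores fuzzy positions, joins and references to other records, which are where the real complications lie.

```python
def genbank_location(start, end, strand=1):
    """Return a GenBank style location for a 0-based, end-exclusive span.

    Hypothetical helper: handles exact integer positions only.
    """
    if start + 1 == end:
        # A single base is written as one number.
        text = "%i" % end
    else:
        # Add one to the start to convert from Python counting;
        # the end needs no adjustment.
        text = "%i..%i" % (start + 1, end)
    if strand == -1:
        text = "complement(%s)" % text
    return text
```

Here the function is told which value is the start and which is the end, which is exactly the information the individual Position objects lack at the moment.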
Peter