[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Tue Jul 24 16:57:34 UTC 2012

On Sun, Jul 22, 2012 at 10:19 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html
> http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html
>
> Consider a single gene like this from NC_000932 in our test
> suite: complement(join(97999..98793,69611..69724))
>
> Currently that becomes three SeqFeature objects, a parent
> object present in the SeqRecord's feature list, and two child
> objects (one for each exon) within that parent feature's
> sub_features list. The parent feature gets a location which
> summarises the span, so start 97999-1 (Pythonic counting),
> end 69724, and strand -1.
>
> This usage of the sub_features property in this way has
> been present in Biopython for a very long time, and prevents
> us using it for nesting features based on the parent/child
> relationship models used in GFF (e.g. gene and CDS, or
> gene, mRNA, CDS, and exon).
>
> As Brad and I had discussed, a new separate mechanism
> might be added for explicit parent/child relationships
> between SeqFeature objects useful for GFF3, since the
> current name sub_features has this historical baggage.
>
> Suggestion
> ==========
>
> What I had proposed was we get rid of sub_features
> (deprecate it, so for the next couple of releases our parser
> and BioSQL access will populate it, but our writers and
> the BioSQL loader will ignore it) and replace it with a new
> subclass of the FeatureLocation object specifically for these
> compound locations.
>
> This will then map much more closely to the tables used
> in BioSQL, and therefore I suspect the BioPerl object
> model too.
>
> Once the sub_feature support is dropped, the objects for
> the complement(join(97999..98793,69611..69724)) example
> becomes just one SeqFeature object, whose location is a new
> CompoundLocation containing two parts (the two exons).
>
> Note that in order to handle mixed strand features and
> to make iteration etc simpler, the parts are stored in the
> biological order (5' to 3'). To put this another way, for this
> example I find it helps to think about example this as the
> old EMBL variant form of the location string:
>
> join(complement(69611..69724),complement(97999..98793))
>
> i.e. The first part of this gene (the 5' end of the gene)
> is complement(69611..69724), and the last part (with
> the 3' end of the gene) is complement(97999..98793).
>
> For iteration over the bases of this CompoundLocation
> you'd get 69723, 69723, ..., 69610 (the first exon), then
> 98792, ..., 97998 (the second exon) which is exactly what
> happens now when iterating over the parent SeqFeature.
>
> This is what I have tried to do on this branch:
> https://github.com/peterjc/biopython/tree/f_loc4
>
> As part of this, adding two FeatureLocations will give a
> CompoundLocation - similarly you can add a simple
> FeatureLocation and a CompoundLocation or two
> CompoundLocation objects. I think this makes creating
> a SeqFeature describing a Eukaryotic gene model
> MUCH simpler than with the existing approach.
>
> (A potential refinement not implemented yet would be
> to merge abutting exact locations automatically, so that
> adding 123..456 and 457..999 would give 123..999
> instead of join(123..456,457..999), but that might be
> too much magic?)
>
> Impact
> ======
>
> What does this mean for Biopython users? It will only
> really affect people using annotated nucleotide files,
> (i.e. GenBank or EMBL files), and only those doing
> anything clever with 'join' type features.
>
> The deprecation process will allow scripts just reading
> files to continue to be used unmodified in the short
> term.
>
> However, as the branch currently stands, scripts
> building SeqFeature objects using sub_features
> would have to be updated immediately. I believe
> this is only going to affect a handful of people
> though, and will (once done) simplify their code.
>
> Thoughts? I've tried to balance backwards compatibility
> with providing something more intuitive - and fixing this
> should help with merging the GFF support.
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

Hi Peter,

I have been testing the new CompoundLocation w.r.t. coordinate mapping
and for the most part, I find it simplifies things.

The documentation suggests using + to combine FeatureLocations, which
invites the use of sum. However, sum doesn't work properly. I explain
why in my StackOverflow question:
http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior

I have considered a number of workarounds:

1. Implementing __radd__ on FeatureLocation to return self if other ==
0 allows sum() to work in place, but I am uncomfortable with
hard-coding such a condition.

2. Changing the location to subclass set and use xrange for generation
would easily allow a number of things: an empty location
(FeatureLocation(0,0) prints as [0:0]), union for iteration, and the
'magic' of merging abutting locations that you mention. However, using
+ and sum() on sets is dubious from a mathematically pure standpoint,
and this would only work for ExactPositions. Note that I haven't
attempted this yet and it may have disadvantages even for
ExactPositions that I've failed to imagine.

Let me know your thoughts.

Lenna