[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Peter Cock p.j.a.cock at googlemail.com
Sun Jul 22 14:19:31 UTC 2012

Dear all,

One of the 'warts' in the current SeqRecord/SeqFeature object
model is how non-trivial features are stored - in particular joins
(in the terminology of GenBank/EMBL).

Previous discussions include:

Consider a single gene like this from NC_000932 in our test
suite: complement(join(97999..98793,69611..69724))

Currently that becomes three SeqFeature objects, a parent
object present in the SeqRecord's feature list, and two child
objects (one for each exon) within that parent feature's
sub_features list. The parent feature gets a location which
summarises the span, so start 97999-1 (Pythonic counting),
end 69724, and strand -1.

This usage of the sub_features property in this way has
been present in Biopython for a very long time, and prevents
us using it for nesting features based on the parent/child
relationship models used in GFF (e.g. gene and CDS, or
gene, mRNA, CDS, and exon).

As Brad and I had discussed, a new separate mechanism
might be added for explicit parent/child relationships
between SeqFeature objects useful for GFF3, since the
current name sub_features has this historical baggage.


What I had proposed was we get rid of sub_features
(deprecate it, so for the next couple of releases our parser
and BioSQL access will populate it, but our writers and
the BioSQL loader will ignore it) and replace it with a new
subclass of the FeatureLocation object specifically for these
compound locations.

This will then map much more closely to the tables used
in BioSQL, and therefore I suspect the BioPerl object
model too.

Once the sub_feature support is dropped, the objects for
the complement(join(97999..98793,69611..69724)) example
becomes just one SeqFeature object, whose location is a new
CompoundLocation containing two parts (the two exons).

Note that in order to handle mixed strand features and
to make iteration etc simpler, the parts are stored in the
biological order (5' to 3'). To put this another way, for this
example I find it helps to think about example this as the
old EMBL variant form of the location string:


i.e. The first part of this gene (the 5' end of the gene)
is complement(69611..69724), and the last part (with
the 3' end of the gene) is complement(97999..98793).

For iteration over the bases of this CompoundLocation
you'd get 69723, 69723, ..., 69610 (the first exon), then
98792, ..., 97998 (the second exon) which is exactly what
happens now when iterating over the parent SeqFeature.

This is what I have tried to do on this branch:

As part of this, adding two FeatureLocations will give a
CompoundLocation - similarly you can add a simple
FeatureLocation and a CompoundLocation or two
CompoundLocation objects. I think this makes creating
a SeqFeature describing a Eukaryotic gene model
MUCH simpler than with the existing approach.

(A potential refinement not implemented yet would be
to merge abutting exact locations automatically, so that
adding 123..456 and 457..999 would give 123..999
instead of join(123..456,457..999), but that might be
too much magic?)


What does this mean for Biopython users? It will only
really affect people using annotated nucleotide files,
(i.e. GenBank or EMBL files), and only those doing
anything clever with 'join' type features.

The deprecation process will allow scripts just reading
files to continue to be used unmodified in the short

However, as the branch currently stands, scripts
building SeqFeature objects using sub_features
would have to be updated immediately. I believe
this is only going to affect a handful of people
though, and will (once done) simplify their code.

Thoughts? I've tried to balance backwards compatibility
with providing something more intuitive - and fixing this
should help with merging the GFF support.


More information about the Biopython-dev mailing list