[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Mon Jul 23 13:05:34 UTC 2012

Peter;
Thanks for working through the sub_feature issue and coming up with this
proposal. I'm 100% on board with converting over to something more
general and this looks like a great approach.

A couple of quick thoughts:

- Would it be possible to have a back-compatible 'sub_features' that
  reconstituted features based on the compound location? This could help
  us avoid breaking scripts that use sub_features, even if we no longer
  fill those in going forward.

- How do you envision storing GFF feature hierarchies? The location
  object is more lightweight with only position and strand information.
  Nested child GFF features would have key/value pairs associated with
  them as well. Would we want to use sub_features (or some new nested
  structure) for these?

Brad

> Dear all,
>
> One of the 'warts' in the current SeqRecord/SeqFeature object
> model is how non-trivial features are stored - in particular joins
> (in the terminology of GenBank/EMBL).
>
> Previous discussions include:
> http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005830.html
> ...
> http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009183.html
> http://lists.open-bio.org/pipermail/biopython-dev/2011-October/009221.html
>
> Consider a single gene like this from NC_000932 in our test
> suite: complement(join(97999..98793,69611..69724))
>
> Currently that becomes three SeqFeature objects, a parent
> object present in the SeqRecord's feature list, and two child
> objects (one for each exon) within that parent feature's
> sub_features list. The parent feature gets a location which
> summarises the span, so start 97999-1 (Pythonic counting),
> end 69724, and strand -1.
>
> This usage of the sub_features property in this way has
> been present in Biopython for a very long time, and prevents
> us using it for nesting features based on the parent/child
> relationship models used in GFF (e.g. gene and CDS, or
> gene, mRNA, CDS, and exon).
>
> As Brad and I had discussed, a new separate mechanism
> might be added for explicit parent/child relationships
> between SeqFeature objects useful for GFF3, since the
> current name sub_features has this historical baggage.
>
> Suggestion
> ==========
>
> What I had proposed was we get rid of sub_features
> (deprecate it, so for the next couple of releases our parser
> and BioSQL access will populate it, but our writers and
> the BioSQL loader will ignore it) and replace it with a new
> subclass of the FeatureLocation object specifically for these
> compound locations.
>
> This will then map much more closely to the tables used
> in BioSQL, and therefore I suspect the BioPerl object
> model too.
>
> Once the sub_feature support is dropped, the objects for
> the complement(join(97999..98793,69611..69724)) example
> becomes just one SeqFeature object, whose location is a new
> CompoundLocation containing two parts (the two exons).
>
> Note that in order to handle mixed strand features and
> to make iteration etc simpler, the parts are stored in the
> biological order (5' to 3'). To put this another way, for this
> example I find it helps to think about example this as the
> old EMBL variant form of the location string:
>
> join(complement(69611..69724),complement(97999..98793))
>
> i.e. The first part of this gene (the 5' end of the gene)
> is complement(69611..69724), and the last part (with
> the 3' end of the gene) is complement(97999..98793).
>
> For iteration over the bases of this CompoundLocation
> you'd get 69723, 69723, ..., 69610 (the first exon), then
> 98792, ..., 97998 (the second exon) which is exactly what
> happens now when iterating over the parent SeqFeature.
>
> This is what I have tried to do on this branch:
> https://github.com/peterjc/biopython/tree/f_loc4
>
> As part of this, adding two FeatureLocations will give a
> CompoundLocation - similarly you can add a simple
> FeatureLocation and a CompoundLocation or two
> CompoundLocation objects. I think this makes creating
> a SeqFeature describing a Eukaryotic gene model
> MUCH simpler than with the existing approach.
>
> (A potential refinement not implemented yet would be
> to merge abutting exact locations automatically, so that
> adding 123..456 and 457..999 would give 123..999
> instead of join(123..456,457..999), but that might be
> too much magic?)
>
> Impact
> ======
>
> What does this mean for Biopython users? It will only
> really affect people using annotated nucleotide files,
> (i.e. GenBank or EMBL files), and only those doing
> anything clever with 'join' type features.
>
> The deprecation process will allow scripts just reading
> files to continue to be used unmodified in the short
> term.
>
> However, as the branch currently stands, scripts
> building SeqFeature objects using sub_features
> would have to be updated immediately. I believe
> this is only going to affect a handful of people
> though, and will (once done) simplify their code.
>
> Thoughts? I've tried to balance backwards compatibility
> with providing something more intuitive - and fixing this
> should help with merging the GFF support.
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev