[BioPython] Extracting SeqFeature locations from sequences

Mon Sep 3 16:47:06 UTC 2007

I was prompted to actually write this email based on Sebastian Bassi's 
recent email where he was having trouble getting to grips with this topic.

I had been thinking that Biopython really should have code built in to 
take a SeqFeature's location and extract this from the full record 
sequence. This would particularly apply to SeqRecord objects read from 
GenBank or EMBL files (using Bio.SeqIO or using Bio.GenBank directly).

As far as I am aware, right now it is up to the user to take the 
information stored in a SeqFeature and apply this "by hand" to the 
parent record's sequence.  Adding some more detailed examples to the 
tutorial is probably a good idea - for example based on 
http://www.warwick.ac.uk/go/peter_cock/python/genbank/

In addition to improving the documentation, we could add a new method to 
the Seq and/or SeqRecord object which would return the sub-sequence 
defined by a SeqFeature.

We could even do this via the __getitem__ method, normally used for 
accessing elements of a sequence (as strings) or splicing to get a 
sub-sequence. e.g.
print seq[index]
print seq[start:end]
print seq[feature]
or,
print record[feature]

I think this is quite elegant, but a separate explicitly named method 
might be clearer and more discoverable.

To do this properly covering all cases is actually non-trivial - a good 
reason to have it built into Biopython (with a good test suite) rather 
than having end users reimplement it themselves.

Messy details to take care of include being aware of both joins and 
complements (stored as sub-features and the strand property 
respectively), and fuzzy locations.  Most situations should be resolved 
relatively easily - but in the worst case we could throw a ValueError if 
there really is no sensible solution.

Peter