[BioPython] Extracting SeqFeature locations from sequences
Peter
biopython at maubp.freeserve.co.uk
Mon Sep 3 16:47:06 UTC 2007
I was prompted to actually write this email based on Sebastian Bassi's
recent email where he was having trouble getting to grips with this topic.
I had been thinking that Biopython really should have code built in to
take a SeqFeature's location and extract this from the full record
sequence. This would particularly apply to SeqRecord objects read from
GenBank or EMBL files (using Bio.SeqIO or using Bio.GenBank directly).
As far as I am aware, right now it is up to the user to take the
information stored in a SeqFeature and apply this "by hand" to the
parent record's sequence. Adding some more detailed examples to the
tutorial is probably a good idea - for example based on
http://www.warwick.ac.uk/go/peter_cock/python/genbank/
In addition to improving the documentation, we could add a new method to
the Seq and/or SeqRecord object which would return the sub-sequence
defined by a SeqFeature.
We could even do this via the __getitem__ method, normally used for
accessing elements of a sequence (as strings) or splicing to get a
sub-sequence. e.g.
print seq[index]
print seq[start:end]
print seq[feature]
or,
print record[feature]
I think this is quite elegant, but a separate explicitly named method
might be clearer and more discoverable.
To do this properly covering all cases is actually non-trivial - a good
reason to have it built into Biopython (with a good test suite) rather
than having end users reimplement it themselves.
Messy details to take care of include being aware of both joins and
complements (stored as sub-features and the strand property
respectively), and fuzzy locations. Most situations should be resolved
relatively easily - but in the worst case we could throw a ValueError if
there really is no sensible solution.
Peter
More information about the Biopython
mailing list