[Biopython-dev] Getting nucleotide sequence for GenBank features

Wed Oct 28 08:07:57 EDT 2009

Hi all,

I've been following a thread on the BioPerl mailing list about how to
get the mature peptide amino acid sequences for mat_peptide
features in a GenBank file (given in general these features do not
include the translation, nor a GI number of Protein ID which can be
looked up online).

Chris summarised a working approach here:
http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html

Step one of this process is to be able to take a GenBank feature
(here a mat_peptide) and use the location information to extract
the relevant part of the parent nucleotide sequence (at the foot
of the file). For example,

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank
Consider mat_peptide nsp12, whose location is a little complex,
join(12332..12358,12358..15117) - in Python terms, we need
seq[12331:12358] + seq[12357:15117], although in general there
are other concerns like the strand.

Step two (in Chris' workflow) is to translate this into amino acids,
and as a precaution, verify this is a subsequence of the precursor
protein given in the previous CDS entry (protein ID ABI14446.1
in this case). This is quite straightforward.

The first operation is tricky, but is actually a very general problem,
and has come up before on the Biopython mailing lists, e.g.
http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html

As noted in the linked threads, I have some (apparently) working code
as function get_feature_nuc in the unit test file test_SeqIO_features.py

I think this should be part of Biopython proper (with unit tests etc), and
would like to discuss where to put it. My ideas include:

(1) Method of the SeqFeature object taking the parent sequence (as a
string, Seq, ...?) as a required argument. Would return an object of the
same type as the parent sequence passed in.

(2) Separate function, perhaps in Bio.SeqUtils taking the parent
sequence (as a string, Seq, ...?) and a SeqFeature object. Would
return an object of the same type as the parent sequence passed in.

(3) Method of the Seq object taking a SeqFeature, returning a Seq.
[A drawback is Bio.Seq currently does not depend on Bio.SeqFeature]

(4) Method of the SeqRecord object taking a SeqFeature. Could
return a SeqRecord using annotation from the SeqFeature. Complex.

Any other ideas?

We could even offer more than one of these approaches, but ideally
there should be one obvious way for the end user to do this. My
question is, which is most intuitive? I quite like idea (1).

In terms of code complexity, I expect (1), (2) and (3) to be about the
same. Building a SeqRecord in (4) is trickier.

Options (1) and (2) are not tied to the sequence object, and could
in theory support any Seq like object, plain strings - or in future
even a SeqRecord: I have a git branch where the SeqRecord
object supports addition and the reverse_complement method,
which would work nicely here. See:
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

Peter