[Biopython] Getting the sequence for a SeqFeature

Peter biopython at maubp.freeserve.co.uk
Fri Nov 6 12:47:42 UTC 2009


On Fri, Nov 6, 2009 at 12:22 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I am planing to add a new method to the SeqFeature object, but
> would like a little feedback first. This email is really just the
> background - I'll write up a few examples later to try and make
> this a bit clearer...

OK, here is a non-trivial example - the first CDS feature in the
GenBank file NC_000932.gb (included as a Biopython unit test),
which is a three part join on the reverse strand. In this case, the
GenBank file includes the protein translation for the CDS features
so we can use it to check our results.

We can parse this GenBank file into a SeqRecord with:

from Bio import SeqIO
record = SeqIO.read(open("../biopython/Tests/GenBank/NC_000932.gb"), "gb")

Let's have a look at the first CDS feature (index 2):

f = record.features[2]
print f.type, f.location, f.strand, f.location_operator
for sub_f in f.sub_features :
    print " - ", sub_f.location, sub_f.strand
table = f.qualifiers.get("transl_table",[1])[0] # List of one int
print "Table", table

Giving:

CDS [97998:69724] -1 join
 -  [97998:98024] -1
 -  [98561:98793] -1
 -  [69610:69724] -1
Table 11

Looking at the raw GenBank file, this feature has location string:

complement(join(97999..98024,98562..98793,69611..69724))

i.e. To get the sequence you need to do this (note zero based
Python counting as in the output above):

print (record.seq[97998:98024] + record.seq[98561:98793] +
record.seq[69610:69724]).reverse_complement()

And then translate it using NCBI genetic code table 11,

print "Manual translation:"
print (record.seq[97998:98024] + record.seq[98561:98793] +
record.seq[69610:69724]).reverse_complement().translate(table=11,
cds=True)
print "Given translation:"
print f.qualifiers["translation"][0] # List of one string
print "Biopython translation (with proposed code):"
print f.extract(record.seq).translate(table, cds=True)

And the output, together with the provided translation in the
feature annotation, and the shortcut with the new code I am
proposing to include in Biopython:

Manual translation:
MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK
Given translation:
MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK
Biopython translation:
MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK

The point of all this was with the proposed new extract method,
you just need:

feature_seq = f.extract(record.seq)

instead of:

feature_seq = (record.seq[97998:98024] + record.seq[98561:98793] +
record.seq[69610:69724]).reverse_complement()

which is in itself a slight simplification since you'd need to get the
those coordinates from the sub features, worry about strands, etc.

Peter



More information about the Biopython mailing list