[Biopython] Getting the sequence for a SeqFeature

Michael Thon mike.thon at gmail.com
Wed Dec 2 07:31:45 EST 2009


I was wondering if you have implemented this method yet and if so, is it in a repository somewhere where I can try it?  I was about to post a message on this and I searched the archives first (!) and found this thread.  I have genbank genomic sequences and I need to get the transcript sequences for the CDS features.  

thanks
Mike

On Nov 6, 2009, at 1:47 PM, Peter wrote:

> On Fri, Nov 6, 2009 at 12:22 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>> 
>> I am planing to add a new method to the SeqFeature object, but
>> would like a little feedback first. This email is really just the
>> background - I'll write up a few examples later to try and make
>> this a bit clearer...
> 
> OK, here is a non-trivial example - the first CDS feature in the
> GenBank file NC_000932.gb (included as a Biopython unit test),
> which is a three part join on the reverse strand. In this case, the
> GenBank file includes the protein translation for the CDS features
> so we can use it to check our results.
> 
> We can parse this GenBank file into a SeqRecord with:
> 
> from Bio import SeqIO
> record = SeqIO.read(open("../biopython/Tests/GenBank/NC_000932.gb"), "gb")
> 
> Let's have a look at the first CDS feature (index 2):
> 
> f = record.features[2]
> print f.type, f.location, f.strand, f.location_operator
> for sub_f in f.sub_features :
>    print " - ", sub_f.location, sub_f.strand
> table = f.qualifiers.get("transl_table",[1])[0] # List of one int
> print "Table", table
> 
> Giving:
> 
> CDS [97998:69724] -1 join
> -  [97998:98024] -1
> -  [98561:98793] -1
> -  [69610:69724] -1
> Table 11
> 
> Looking at the raw GenBank file, this feature has location string:
> 
> complement(join(97999..98024,98562..98793,69611..69724))
> 
> i.e. To get the sequence you need to do this (note zero based
> Python counting as in the output above):
> 
> print (record.seq[97998:98024] + record.seq[98561:98793] +
> record.seq[69610:69724]).reverse_complement()
> 
> And then translate it using NCBI genetic code table 11,
> 
> print "Manual translation:"
> print (record.seq[97998:98024] + record.seq[98561:98793] +
> record.seq[69610:69724]).reverse_complement().translate(table=11,
> cds=True)
> print "Given translation:"
> print f.qualifiers["translation"][0] # List of one string
> print "Biopython translation (with proposed code):"
> print f.extract(record.seq).translate(table, cds=True)
> 
> And the output, together with the provided translation in the
> feature annotation, and the shortcut with the new code I am
> proposing to include in Biopython:
> 
> Manual translation:
> MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK
> Given translation:
> MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK
> Biopython translation:
> MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK
> 
> The point of all this was with the proposed new extract method,
> you just need:
> 
> feature_seq = f.extract(record.seq)
> 
> instead of:
> 
> feature_seq = (record.seq[97998:98024] + record.seq[98561:98793] +
> record.seq[69610:69724]).reverse_complement()
> 
> which is in itself a slight simplification since you'd need to get the
> those coordinates from the sub features, worry about strands, etc.
> 
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython




More information about the Biopython mailing list