[BioPython] More string methods for the Seq object

Peter peter at maubp.freeserve.co.uk
Fri Sep 26 12:52:13 EDT 2008


> Suppose you had translated a nucleotide sequence into, for example,
> "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ" (as a Seq object with a protein
> alphabet).  You might want to split the sequence at terminators, to
> get the open reading frames (and then filter them on length).  Right
> now the Seq object doesn't have a split method so you would have to
> switch to using Python strings (and then go back to a Biopython Seq
> object later if need be).

Using pure Python strings:

str_seq = "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ"
orf_str_list = str_seq.split("*")

Using Biopython Seq objects:

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
seq = Seq("SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ", generic_protein)
# I want to be able to do this:
orf_seq_list = seq.split("*")
# Right now I have to do something like this:
orf_seq_list = [Seq(x, generic_protein) for x in seq.tostring().split("*")]
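
The "filter them on length" step from the quoted text could look roughly
like this with plain strings. This is only a sketch: split_orfs and its
min_length parameter are hypothetical names, not part of Biopython, and the
fragments are simply the stretches between terminators (no check for a
start codon).

```python
def split_orfs(protein, min_length=1, stop="*"):
    """Split a translated sequence at stop symbols and keep the
    fragments that are at least min_length residues long.

    Hypothetical helper for illustration; not a Biopython function.
    """
    return [frag for frag in protein.split(stop) if len(frag) >= min_length]


str_seq = "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ"
# Keep only fragments of five or more residues; the short "AK" is dropped.
orf_str_list = split_orfs(str_seq, min_length=5)
print(orf_str_list)
```

With a Seq.split() method the same filter would work on Seq objects directly,
without the round trip through a string.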

Another example of using a Seq object .split() method would be for
restriction enzymes (although the Bio.Restriction package should be
more general).

> Suppose you have some sequences which you have aligned in ClustalW,
> and most have leading or trailing gap characters.  For example, given
> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
> you might want to strip off the leading and trailing gaps to have just
> "SAD-KCNKADND"  (as a Seq object with the same alphabet).  Right now
> the Seq object doesn't have a strip method, so you would have to
> switch to a string and back again.

Using pure Python strings:

long_seq_str = "---SAD-KCNKADND---"
trimmed_seq_str = long_seq_str.strip("-")

Using Biopython Seq objects:

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
long_seq = Seq("---SAD-KCNKADND---", generic_protein)
# I want to be able to do this:
trimmed_seq = long_seq.strip("-")
# Right now, I have to do something like this:
trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)
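
To make the idea concrete, here is a rough sketch of how such methods might
delegate to the underlying string while keeping the alphabet. MiniSeq is a
hypothetical stand-in class written just for this example; it is not the
Biopython Seq object, and the real thing would need more care (e.g. accepting
Seq arguments as well as strings).

```python
class MiniSeq:
    """Hypothetical Seq-like object: string data plus an alphabet tag."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def strip(self, chars=None):
        # Delegate to str.strip and re-wrap, preserving the alphabet.
        return MiniSeq(self.data.strip(chars), self.alphabet)

    def split(self, sep=None, maxsplit=-1):
        # Delegate to str.split; every piece keeps the parent's alphabet.
        return [MiniSeq(part, self.alphabet)
                for part in self.data.split(sep, maxsplit)]


long_seq = MiniSeq("---SAD-KCNKADND---", "gapped protein")
trimmed_seq = long_seq.strip("-")
print(trimmed_seq)           # internal gap is kept, outer gaps removed
print(trimmed_seq.alphabet)  # same alphabet tag as long_seq
```

The point of the design is that the caller never has to convert to a string
and back, so the alphabet cannot get lost along the way.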

Another possible example is if you have some EST sequences and you
want to strip the poly A tail on the trailing end (right side), e.g.
"ACACTGCAGCATCAGCAAAAAAA".rstrip("A")
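
One caveat with a bare rstrip("A") is that it also removes a single genuine
trailing A. A slightly more cautious sketch, using plain strings, only trims
when the run of As is long enough to look like a tail. strip_poly_a and the
min_tail threshold are hypothetical, chosen only for illustration.

```python
def strip_poly_a(est, min_tail=5):
    """Remove a trailing run of 'A' only if it is at least min_tail long.

    Hypothetical helper for illustration; the threshold is arbitrary.
    """
    trimmed = est.rstrip("A")
    tail_length = len(est) - len(trimmed)
    return trimmed if tail_length >= min_tail else est


print(strip_poly_a("ACACTGCAGCATCAGCAAAAAAA"))  # seven trailing As: trimmed
print(strip_poly_a("ACGTA"))                    # one trailing A: left alone
```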

Peter

