[BioPython] More string methods for the Seq object
Peter
peter at maubp.freeserve.co.uk
Fri Sep 26 12:52:13 EDT 2008
> Suppose you had translated a nucleotide sequence into, for example,
> "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ" (as a Seq object with a protein
> alphabet). You might want to split the sequence at terminators, to
> get the open reading frames (and then filter them on length). Right
> now the Seq object doesn't have a split method so you would have to
> switch to using Python strings (and then go back to a Biopython Seq
> object later if need be).
Using pure Python strings:
str_seq = "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ"
orf_str_list = str_seq.split("*")
Using Biopython Seq objects:
from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
seq = Seq("SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ", generic_protein)
# I want to be able to do this:
orf_seq_list = seq.split("*")
# Right now I have to do something like this:
orf_seq_list = [Seq(x, generic_protein) for x in seq.tostring().split("*")]
Another example of using a Seq object .split() method would be for
restriction enzymes (although the Bio.Restriction package should be
more general).
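
As a rough sketch of the workflow described above (split at terminators, then filter on length), here it is with plain Python strings. The helper name split_at_stops and its min_length parameter are made up for illustration. The fragments are simply the stretches between stop symbols; no check for a start codon is made.

```python
def split_at_stops(protein, min_length=1, stop="*"):
    """Split a translated sequence at stop symbols.

    Returns the fragments between stops that are at least
    min_length residues long (hypothetical helper, not Biopython).
    """
    return [frag for frag in protein.split(stop) if len(frag) >= min_length]


translated = "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ"
# Keep only fragments of five or more residues; the short "AK" is dropped.
orfs = split_at_stops(translated, min_length=5)
print(orfs)
```

With a Seq.split() method, the same list comprehension would work directly on Seq objects and the alphabet would be carried along.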
> Suppose you have some sequences which you have aligned in ClustalW,
> and most have leading or trailing gap characters. For example, given
> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
> you might want to strip off the leading and trailing gaps to have just
> "SAD-KCNKADND" (as a Seq object with the same alphabet). Right now
> the Seq object doesn't have a strip method, so you would have to
> switch to a string and back again.
Using pure Python strings:
long_seq_str = "---SAD-KCNKADND---"
trimmed_seq_str = long_seq_str.strip("-")
Using Biopython Seq objects:
from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
long_seq = Seq("---SAD-KCNKADND---", generic_protein)
# I want to be able to do this:
trimmed_seq = long_seq.strip("-")
# Right now, I have to do something like this:
trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)
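
To make the request concrete, here is a minimal sketch of the behaviour I have in mind, using a hypothetical stand-in class. SimpleSeq is not part of Biopython, and the alphabet is just a text label here. The point is only that strip, rstrip and split return objects of the same type with the same alphabet, so there is no round trip through a string in user code.

```python
class SimpleSeq:
    """Hypothetical stand-in for a Seq-like object: a string plus an alphabet label."""

    def __init__(self, data, alphabet="protein"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def _wrap(self, text):
        # Build a new object that keeps this object's alphabet.
        return SimpleSeq(text, self.alphabet)

    def strip(self, chars=None):
        return self._wrap(self.data.strip(chars))

    def rstrip(self, chars=None):
        return self._wrap(self.data.rstrip(chars))

    def split(self, sep=None, maxsplit=-1):
        return [self._wrap(part) for part in self.data.split(sep, maxsplit)]


long_seq = SimpleSeq("---SAD-KCNKADND---", "gapped protein")
trimmed_seq = long_seq.strip("-")
print(trimmed_seq, trimmed_seq.alphabet)
```

Only the leading and trailing gaps are removed; the internal gap in "SAD-KCNKADND" is left alone, just as with str.strip.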
Another possible example: if you have some EST sequences, you might
want to strip the poly-A tail from the trailing end (right side), e.g.
"ACACTGCAGCATCAGCAAAAAAA".rstrip("A")
Peter