[BioPython] More string methods for the Seq object

Fri Sep 26 18:45:58 UTC 2008

Peter wrote:
>> Suppose you had translated a nucleotide sequence into, for example,
>> "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ" (as a Seq object with a protein
>> alphabet).  You might want to split the sequence at terminators, to
>> get the open reading frames (and then filter them on length).  Right
>> now the Seq object doesn't have a split method so you would have to
>> switch to using python strings (and then go back to a Biopython Seq
>> object later if need be).
>>     
>
> Using pure python strings:
>
> str_seq = "SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ"
> orf_str_list = str_seq.split("*")
>
> Using Biopython Seq objects:
>
> from Bio.Seq import Seq
> from Bio.Alphabet import generic_protein
> seq = Seq("SADKCNKADND*AKDNCDNADK*AK*NCAKNSHJ", generic_protein)
> #I want to be able to do this:
> orf_seq_list = seq.split("*")
> #Right now I have to do something like this:
> orf_seq_list = [Seq(x, generic_protein) for x in seq.tostring().split("*")]
>
> Another example of using a Seq object .split() method would be for
> restriction enzymes (although the Bio.Restriction package should be
> more general).
>
>   
>> Suppose you have some sequences which you have aligned in ClustalW,
>> and most have leading or trailing gap characters.  For example, given
>> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
>> you might want to strip off the leading and trailing gaps to have just
>> "SAD-KCNKADND"  (as a Seq object with the same alphabet).  Right now
>> the Seq object doesn't have a strip method, so you would have to
>> switch to a string and back again.
>>     
>
> Using pure python strings:
>
> long_seq_str = "---SAD-KCNKADND---"
> trimmed_seq_str = long_seq_str.strip("-")
>
> Using Biopython Seq objects:
>
> from Bio.Seq import Seq
> from Bio.Alphabet import generic_protein
> long_seq = Seq("---SAD-KCNKADND---", generic_protein)
> #I want to be able to do this:
> trimmed_seq = long_seq.strip("-")
> #Right now, I have to do something like this:
> trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)
>
> Another possible example is if you have some EST sequences and you
> want to strip the poly A tail on the trailing end (right side), e.g.
> "ACACTGCAGCATCAGCAAAAAAA".rstrip("A")
>
> Peter
>   
Hi,
While I do like the idea, strip(), as defined here, is inconsistent with 
the Python string version.
Python documentation: strip([chars]): "Return a copy of the string with 
the leading and trailing characters removed."

Rather, you should use a different name, such as compress, for a method
that removes the given character from within a sequence.
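
To illustrate the distinction, here is a minimal sketch using plain Python
strings only. The name compress() is hypothetical; it is not an existing
str method or Biopython Seq method, just one way such a helper might look.

```python
# Plain Python strings only; compress() is a hypothetical helper name,
# not an existing str or Biopython Seq method.

def compress(sequence, char="-"):
    """Return a copy of sequence with every occurrence of char removed."""
    return sequence.replace(char, "")


gapped = "---SAD-KCNKADND---"

# str.strip() only touches the two ends, so the internal gap survives.
print(gapped.strip("-"))   # SAD-KCNKADND

# The hypothetical compress() also removes the gap inside the sequence.
print(compress(gapped))    # SADKCNKADND
```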

Bruce