[BioPython] More string methods for the Seq object

Bruce Southey bsouthey at gmail.com
Fri Sep 26 21:53:35 UTC 2008


Peter wrote:
>>>> Suppose you have some sequences which you have aligned in ClustalW,
>>>> and most have leading or trailing gap characters.  For example, given
>>>> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
>>>> you might want to strip off the leading and trailing gaps to have just
>>>> "SAD-KCNKADND"  (as a Seq object with the same alphabet).  Right now
>>>> the Seq object doesn't have a strip method, so you would have to
>>>> switch to a string and back again.
>>>>         
>>> Using pure Python strings:
>>>
>>> long_seq_str = "---SAD-KCNKADND---"
>>> trimmed_seq_str = long_seq_str.strip("-")
>>>       
>
> This gives "SAD-KCNKADND", it does NOT remove the internal "-" character.
>
>   
>>> Using Biopython Seq objects:
>>>
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_protein
>>> long_seq = Seq("---SAD-KCNKADND---", generic_protein)
>>> #I want to be able to do this:
>>> trimmed_seq = long_seq.strip("-")
>>> #Right now, I have to do something like this:
>>> trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)
>>>       
>
> This gives Seq("SAD-KCNKADND", ProteinAlphabet()), i.e. it would NOT
> remove the internal "-" character.
>
>   
>> While I do like the idea, strip(), as defined here, is inconsistent with the
>> Python string version.  Python documentation: strip([chars]): "Return a
>> copy of the string with the leading and trailing characters removed."
>>     
>
> My proposed Seq strip method is intended to behave EXACTLY like the Python
> string method, apart from the default strip characters (I would suggest
> defaulting to the gap character rather than white space).  My proposed
> implementation even calls the Python string strip method internally.
> Have another look at the suggested code:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2596
>
>   
>> Rather you should use an alternative word like compress to remove the said
>> character from within a sequence.
>>     
>
> I suspect you have misunderstood my intention.  My Seq object .strip()
> method would NOT remove the given characters from the interior of the
> sequence - only from the ends.
>
> However, there is certainly a case for wanting an .ungap() method for
> the Seq class (or a more general method to remove all of a particular
> character), but I hadn't intended to raise this issue yet.
>
> Peter
>
>   
Yes, sorry about that. I misunderstood because I confused myself with 
the first part that uses the split.

Bruce
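
For reference, the distinction discussed in this thread (stripping gaps from
the ends only, versus removing every gap) can be sketched with a small
wrapper around a plain Python string. This is only an illustration: the
GappedSeq class and its strip() and ungap() methods are hypothetical
stand-ins, not Biopython's Seq class and not the code attached to the bug
report. The sketch assumes "-" is the gap character and, as proposed above,
delegates to the built-in string strip method.

```python
class GappedSeq:
    """Hypothetical stand-in for a Seq-like object (not Biopython's Seq)."""

    def __init__(self, data, gap="-"):
        self.data = data
        self.gap = gap  # assumed gap character

    def strip(self, chars=None):
        """Remove chars from both ends only; defaults to the gap character."""
        if chars is None:
            chars = self.gap
        # Delegate to the built-in str.strip, which never touches the interior.
        return GappedSeq(self.data.strip(chars), self.gap)

    def ungap(self):
        """Remove every gap character, including internal ones."""
        return GappedSeq(self.data.replace(self.gap, ""), self.gap)

    def __str__(self):
        return self.data


long_seq = GappedSeq("---SAD-KCNKADND---")
trimmed = str(long_seq.strip())    # ends trimmed, internal gap kept
ungapped = str(long_seq.ungap())   # all gaps removed
print(trimmed)
print(ungapped)
```

Here strip() keeps the internal "-" while ungap() removes it, which is the
difference that caused the confusion above.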


