[BioPython] More string methods for the Seq object

Peter biopython at maubp.freeserve.co.uk
Fri Sep 26 21:22:48 UTC 2008


>>> Suppose you have some sequences which you have aligned in ClustalW,
>>> and most have leading or trailing gap characters.  e.g.  Given
>>> "---SAD-KCNKADND---" (as a Seq object with a gapped protein alphabet)
>>> you might want to strip off the leading and trailing gaps to have just
>>> "SAD-KCNKADND"  (as a Seq object with the same alphabet).  Right now
>>> the Seq object doesn't have a strip method, so you would have to
>>> switch to a string and back again.
>>
>> Using pure python strings:
>>
>> long_seq_str = "---SAD-KCNKADND---"
>> trimmed_seq_str = long_seq_str.strip("-")

This gives "SAD-KCNKADND"; it does NOT remove the internal "-" character.
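
As a quick check of the difference between trimming the ends and removing
every gap, here is a small sketch using only plain Python strings.  The
name ungapped_seq_str is just an illustrative variable, not something from
the proposal:

```python
# Leading/trailing gaps versus internal gaps, using plain Python strings.
long_seq_str = "---SAD-KCNKADND---"

# strip("-") removes "-" only from the two ends of the string.
trimmed_seq_str = long_seq_str.strip("-")

# replace("-", "") removes every "-", including the internal one.
ungapped_seq_str = long_seq_str.replace("-", "")

print(trimmed_seq_str)   # SAD-KCNKADND
print(ungapped_seq_str)  # SADKCNKADND
```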

>> Using Biopython Seq objects:
>>
>> from Bio.Seq import Seq
>> from Bio.Alphabet import generic_protein
>> long_seq = Seq("---SAD-KCNKADND---", generic_protein)
>> #I want to be able to do this:
>> trimmed_seq = long_seq.strip("-")
>> #Right now, I have to do something like this:
>> trimmed_seq = Seq(long_seq.tostring().strip("-"), generic_protein)

This gives Seq("SAD-KCNKADND", ProteinAlphabet()), i.e. it would NOT
remove the internal "-" character.

> While I do like the idea, strip(), as defined here, is inconsistent with the
> Python string version.  Python documentation: strip([chars]): "Return a
> copy of the string with the leading and trailing characters removed."

My Seq strip method is intended to behave EXACTLY like the Python
string method, with one exception: I would suggest defaulting to the
gap character rather than white space.  My proposed implementation
even calls the Python string strip method internally.  Have another
look at the suggested code:
http://bugzilla.open-bio.org/show_bug.cgi?id=2596

> Rather you should use an alternative word like compress to remove the said
> character from within a sequence.

I suspect you have misunderstood my intention.  My Seq object .strip()
method would NOT remove the given characters from the interior of the
sequence - only from the ends.
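
To illustrate the behaviour described here, the following is a rough
sketch only.  SimpleSeq is a hypothetical stand-in class written for this
example; it is not Biopython's Seq and not the code attached to the bug
report.  It assumes the sequence is held as a plain string, that the
alphabet is carried through unchanged, and that the default strip
character is the gap "-":

```python
class SimpleSeq:
    """Hypothetical stand-in for a sequence object; not Biopython's Seq."""

    def __init__(self, data, alphabet=None):
        self.data = data          # the sequence held as a plain string
        self.alphabet = alphabet  # carried through unchanged

    def __str__(self):
        return self.data

    def strip(self, chars="-"):
        """Remove chars from both ends only, keeping the alphabet."""
        # Delegate to str.strip, so interior characters are untouched.
        return SimpleSeq(self.data.strip(chars), self.alphabet)


long_seq = SimpleSeq("---SAD-KCNKADND---", "protein")
trimmed_seq = long_seq.strip()
print(trimmed_seq)  # SAD-KCNKADND
```

Because the work is delegated to the string method, passing a different
argument (for example strip(" ")) trims those characters instead, just as
with a Python string.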

However, there is certainly a case for wanting an .ungap() method for
the Seq class (or a more general method to remove all of a particular
character), but I hadn't intended to raise this issue yet.
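
For comparison, removing all of a particular character could be sketched
on plain strings as below.  The function name remove_char and its default
argument are assumptions made for this example, not an existing or
proposed Biopython method:

```python
def remove_char(seq_str, char="-"):
    """Return seq_str with every occurrence of char removed."""
    # Unlike strip, replace acts on the whole string, interior included.
    return seq_str.replace(char, "")


print(remove_char("---SAD-KCNKADND---"))  # SADKCNKADND
print(remove_char("AC*GT*", "*"))         # ACGT
```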

Peter


