[BioPython] More string methods for the Seq object

Bruce Southey bsouthey at gmail.com
Sat Sep 27 21:06:54 EDT 2008


On Sat, Sep 27, 2008 at 7:57 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>>
>>> Anyway - do think adding the split and strip methods to the Seq object
>>> is worthwhile?
>>
>> Yes - in fact probably essential now that many users are likely to have
>> to, and want to, parse genome sequences.
>>
>> I really would like to see many of the sequence methods 'work' in the
>> same manner as Python string methods. The string methods that I use a lot
>> for sequences are:
>> strip
>> split
>> join
>> find
>>
>> (I don't use the 'l' and 'r' versions very much.)
>> So you would address the first two.
>
> I was planning to deal with strip and split first, and then move on to
> discuss the remaining string methods.
>
> No one has objected to adding strip and split (plus lstrip and rstrip)
> so if we take that as a consensus, the only point we should still
> debate is their default arguments.  Other alternatives to what I have
> already put forward include following the python string and defaulting
> to white space (which would never normally be present in a sequence),
> or making the arguments non optional.

I do agree, especially in terms of attempting to keep the standard
Python defaults and behavior.
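
For reference, a small sketch of what the standard Python defaults do on
plain strings (no Seq object is involved, and the example sequences are
made up for illustration):

```python
# Plain Python strings only; this says nothing about how Seq would behave.
padded = "  ACGT-ACGT\n"

# With no argument, strip() removes leading and trailing whitespace.
stripped = padded.strip()

# With no argument, split() splits on runs of whitespace.
words = "ACGT  TTGA\tCCAT".split()

# With an explicit argument, strip() treats it as a set of characters
# to remove from both ends, while split() treats it as a separator.
trimmed = "--ACGT--".strip("-")
pieces = "ACGT-TTGA--CCAT".split("-")  # adjacent gaps give an empty string
```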

>
> Adding the join method is more complicated as regards the alphabet of
> the sequence and the list of sequences given (which could be strings
> or Seq objects) - but in principle I think we should support it.  I'd
> prefer to leave this one till last!

Well I have the view that if this is easier to do now then it should
be done now.

>
> Adding support for find should be straightforward.

This would be great to have.
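
As a rough illustration of the string behaviour a Seq find method would
presumably mirror (this uses plain strings only, and the sequence is the
same made-up one as in the partition example further down):

```python
dna = "GTATGCGTAATG"

# find() returns the index of the first match ...
first = dna.find("ATG")

# ... an optional start position lets you look for later matches ...
second = dna.find("ATG", first + 1)

# ... and -1 is returned (no exception) when there is no match.
missing = dna.find("CCC")
```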

>
>> I do something like your ungap() idea with strings using join:
>>>>> ''.join(sequence.split('-'))
>
> That would work but to get a sensible alphabet forces a much longer
> version with Seq objects - something like this:
> Seq("", generic_protein).join(my_seq.split("-"))
>
> Having my_seq.ungap() or my_seq.ungap("-") would in my opinion be much
> clearer for the reader, plus the ungap method would also be able to
> amend the alphabet appropriately.

I do agree and the terminology is appropriate.
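
To make the comparison concrete, here is a minimal sketch of the join/split
idiom wrapped in a helper for plain strings. The name ungap is borrowed from
the proposal, but this function is only a hypothetical illustration: it is
not a Seq method and it does nothing about alphabets.

```python
def ungap(sequence, gap="-"):
    """Return a plain string with every occurrence of gap removed.

    Hypothetical helper for str objects only; a real Seq.ungap would
    also need to adjust the alphabet, which is not attempted here.
    """
    # split() drops the gap characters, join() glues the pieces together.
    return "".join(sequence.split(gap))


aligned = "MK-LV--TA"
ungapped = ungap(aligned)
dotted = ungap("MK.LV..TA", ".")
```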


>
>> Python 2.5 introduced 'partition(sep): Split the string at the first
>> occurrence of sep, and return a 3-tuple containing the part before the
>> separator, the separator itself, and the part after the separator'.
>> While I don't use it (because I usually split multiple times) it has
>> advantages if you are looking for the first occurrence of a pattern:
>>>>> a='GTATGCGTAATG'
>>>>> a.partition('ATG')
>> ('GT', 'ATG', 'CGTAATG')
>
> Thanks for pointing that out.  I hadn't noticed the addition of the
> partition method to python - until recently my main machine ran python
> 2.4 (and even now I still use python 2.3 on some occasions).  However,
> we could still add a partition method to the Seq object, but wouldn't
> be able to take advantage of the string implementation on the older
> versions of python.
>

The real question is would this functionality be sufficiently useful
to justify it?

I can see that it is useful for very special cases like open reading
frames but I do not think that this is sufficient.
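
For what it is worth, the behaviour is simple enough to sketch without the
built-in string method, which is roughly what older versions of Python
would need. The function below is a hypothetical stand-in working on plain
strings, not a proposed Seq implementation:

```python
def partition(text, sep):
    """Mimic str.partition using find(), for plain strings only."""
    if not sep:
        # str.partition also rejects an empty separator.
        raise ValueError("empty separator")
    index = text.find(sep)
    if index == -1:
        # No match: everything goes in the first element.
        return (text, "", "")
    # Part before the separator, the separator, and the part after it.
    return (text[:index], sep, text[index + len(sep):])


before, match, after = partition("GTATGCGTAATG", "ATG")
```

This gives the same 3-tuple as the quoted example above.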

Regards
Bruce

