[BioPython] More string methods for the Seq object

Peter biopython at maubp.freeserve.co.uk
Sat Sep 27 12:57:41 UTC 2008


>
>> Anyway - do think adding the split and strip methods to the Seq object
>> is worthwhile?
>
> Yes - in fact probably essential, now that many users are likely to
> have to, and want to, parse genome sequences.
>
> I really would like to see many of the sequence methods 'work' in the
> same manner as Python string methods. The string methods that I use a
> lot for sequences are:
> strip
> split
> join
> find
>
> (I don't use the 'l' and 'r' versions very much.)
> So you would address the first two.

I was planning to deal with strip and split first, and then move on to
discuss the remaining string methods.

No one has objected to adding strip and split (plus lstrip and rstrip)
so if we take that as a consensus, the only point we should still
debate is their default arguments.  Other alternatives to what I have
already put forward include following the Python string methods and
defaulting to whitespace (which would never normally be present in a
sequence), or making the arguments mandatory.
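
For reference, here is a minimal sketch of what the whitespace defaults
mean in practice. It uses plain Python strings only (no Seq object), and
the example sequences are made up for illustration:

```python
# Plain Python strings only; the sequences are made-up examples.
padded = "  ACGT-ACGT\n"

# With no argument, strip() removes leading and trailing whitespace.
stripped = padded.strip()

# With no argument, split() breaks on runs of whitespace.
fields = "ACGT ACGT".split()

# With an explicit separator, adjacent separators give empty fields.
gap_fields = "ACGT--ACGT".split("-")

# With an explicit argument, strip() removes those characters instead.
trimmed = "--ACGT--".strip("-")
```

On a typical sequence, which contains no whitespace, the no-argument
calls would leave the data unchanged, which is why the choice of default
matters.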

Adding the join method is more complicated as regards the alphabet of
the sequence and the list of sequences given (which could be strings
or Seq objects) - but in principle I think we should support it.  I'd
prefer to leave this one till last!

Adding support for find should be straightforward.

> I do something like your ungap() idea with strings using join:
> >>> ''.join(sequence.split('-'))

That would work, but getting a sensible alphabet with Seq objects
forces a much longer version, something like this:
Seq("", generic_protein).join(my_seq.split("-"))

Having my_seq.ungap() or my_seq.ungap("-") would in my opinion be much
clearer for the reader, plus the ungap method would also be able to
amend the alphabet appropriately.
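
As a rough illustration of the readability argument only, here is a
sketch on plain Python strings. The ungap function below is a
hypothetical stand-alone helper, not an existing Seq method, and it does
not model the alphabet handling at all:

```python
def ungap(sequence, gap="-"):
    """Hypothetical helper: return sequence with all gap characters removed.

    Works on plain strings only; alphabets are not modelled here.
    """
    return "".join(sequence.split(gap))


# A made-up gapped protein fragment, as might come from an alignment.
gapped = "MK-LV--T"
ungapped = ungap(gapped)
```

The call site reads as ungap(gapped) rather than as a join wrapped
around a split, which is the clarity gain I have in mind.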

> Python 2.5 introduced 'partition(sep): Split the string at the first
> occurrence of sep, and return a 3-tuple containing the part before the
> separator, the separator itself, and the part after the separator'.
> While I don't use it (because I usually split multiple times) it has
> advantages if you are looking for the first occurrence of a pattern:
> >>> a='GTATGCGTAATG'
> >>> a.partition('ATG')
> ('GT', 'ATG', 'CGTAATG')

Thanks for pointing that out.  I hadn't noticed the addition of the
partition method to Python - until recently my main machine ran Python
2.4 (and even now I still use Python 2.3 on some occasions).  We could
still add a partition method to the Seq object, but we wouldn't be able
to take advantage of the string implementation on the older versions
of Python.
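
A fallback that avoids the built-in string method could be sketched
along these lines. This is a plain-string illustration built on find and
slicing; the function name is my own choice, and a real Seq version
would also have to decide what alphabet each part gets:

```python
def partition(text, sep):
    """Split text at the first occurrence of sep, using only find().

    Returns (before, sep, after), or (text, "", "") if sep is absent,
    which mirrors what the built-in str.partition does.
    """
    if not sep:
        raise ValueError("empty separator")
    index = text.find(sep)
    if index == -1:
        return (text, "", "")
    return (text[:index], sep, text[index + len(sep):])


# The example from the quoted message above.
parts = partition("GTATGCGTAATG", "ATG")
```

Here parts comes out as ('GT', 'ATG', 'CGTAATG'), the same result as
the quoted a.partition('ATG') example.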

Peter


