[BioPython] Making the Seq object act more like a string

Wed Aug 22 15:05:31 UTC 2007

Dear Biopython users,

A couple of times (on bugs or the developers mailing list), Michiel de 
Hoon has suggested we could make the Seq class (Bio.Seq.Seq) a subclass 
of the Python string. I agree with him - the Seq object should act more 
like a string.

I'm posting on the main discussion list to get some feedback on this, as 
any sudden changes could affect lots of people.

As a simple example, although there are functions in Biopython to 
calculate GC percentages, for a beginner playing with Seq objects, it 
would be nice to be able to do things like this (where s is a Seq object):

print float(s.count("G") + s.count("C")) / len(s)

rather than, for example, this:

print float(s.tostring().count("G") + s.tostring().count("C")) / len(s)

I also don't think the current behaviour of str(seq) is helpful.  To 
recap, here is a short example using a plain string:

 >>> ss = 'ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA'
 >>> print repr(ss)
'ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA'
 >>> print str(ss)
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA
 >>> print ss
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA

And the equivalent using a Seq object as of Biopython 1.43:

 >>> s = Seq("ACAGGTACGATCGATCCTGTTGTACGTGCTC"\
... +"GTCGACTGCTAGCTCGTCGTGGTCCGATGA")
 >>> print repr(s)
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA', 
Alphabet())
 >>> print s.tostring()
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA
 >>> print str(s)
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATG ...', 
Alphabet())
 >>> print s
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATG ...', 
Alphabet())

Note that currently doing str() on a Seq object gives a truncated 
version of what repr() gives. The only nice thing about this is that 
when working at the Python command line, doing "print s" will only take 
one line even when working with a genome. It also makes it really clear 
that you don't just have a string object.

That was the motivation/background part. Feel free to chime in :)

-----------------------------------------------------------------------

This next bit of the email gets a bit more technical... and might be 
better off on the developers mailing list. We'll see where any 
discussion goes.

If we can agree that making Seq inherit from the basic string is a 
good idea, then I would advocate a gradual transition... my thoughts are 
that for the next release of Biopython we:

(1) Modify the Seq .__str__() method to act like the existing 
.tostring(), i.e. return self.data

I don't think changing the __str__ method will break any serious code, 
because as shown above, currently it's like a truncated version of 
__repr__, so all it's useful for at the moment is getting a truncated 
sequence for display.

(2) Consider adding alphabet-aware versions of selected string methods 
to the Seq object (e.g. count, find)

Adding new methods to the Seq class should have no effect on existing usage.

Then, for the release afterwards:
(3) actually do the class inheritance with all the horrors entailed.
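
As a taste of the horrors, here is a small sketch of one of them, 
using a hypothetical StrSeq class (not a proposed implementation) with 
a plain string label in place of a real alphabet object. Inherited 
string methods work for free, but anything that builds a new string 
hands back a plain str, so the alphabet is silently lost unless each 
such method is overridden.

```python
class StrSeq(str):
    """Hypothetical str subclass carrying an alphabet label."""

    def __new__(cls, data, alphabet="generic"):
        # str is immutable, so the value must be set in __new__.
        obj = str.__new__(cls, data)
        obj.alphabet = alphabet
        return obj


s = StrSeq("ACGT", "dna")

# Inherited methods such as count and find work without extra code.
g_count = s.count("G")

# But results of slicing, concatenation and case changes are plain
# str objects, with no alphabet attribute.
piece = s[1:3]
joined = s + "A"
lowered = s.lower()
print(type(piece).__name__)
print(hasattr(lowered, "alphabet"))
```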

As part of this, we'll need to address how the __eq__ method of the Seq 
object should act: Looking at the sequence only, or considering the 
alphabet too? Currently this method is not implemented at all.
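
The two options can be sketched as plain functions on a toy class. 
ToySeq, equal_by_sequence and equal_with_alphabet are hypothetical 
names used only for illustration, and the alphabets are simple string 
labels rather than real alphabet objects.

```python
class ToySeq(object):
    """Hypothetical stand-in for a Seq object; defines no __eq__."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet


def equal_by_sequence(a, b):
    # Option A: look at the sequence only.
    return a.data == b.data


def equal_with_alphabet(a, b):
    # Option B: the sequence and the alphabet must both match.
    return a.data == b.data and a.alphabet == b.alphabet


dna = ToySeq("ACG", "dna")
dna_copy = ToySeq("ACG", "dna")
protein = ToySeq("ACG", "protein")

# With no __eq__ defined, == falls back to object identity, so two
# separate objects with the same content compare as unequal.
print(dna == dna_copy)
print(equal_by_sequence(dna, protein))
print(equal_with_alphabet(dna, protein))
```

The letters "ACG" are valid in both DNA and protein, which is exactly 
the case where the two options disagree.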

This is part of a larger question - how to cope with multiple Seq/string 
operations where there is more than one alphabet.  e.g. 
comparing/adding/joining a nucleotide Seq to a protein Seq object. I 
would opt for the simple solution that the alphabets must match or some 
sort of ValueError is raised. Alternatively, as the alphabets have a 
class hierarchy, we could choose the parent alphabet (e.g. the generic 
single letter alphabet when dealing with DNA and protein; or the 
generic single nucleotide alphabet when dealing with RNA and DNA).
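
Both policies can be sketched with a toy class hierarchy. The classes 
below are defined locally for illustration (their names only resemble 
those in Bio.Alphabet), and strict_alphabet and common_parent_alphabet 
are hypothetical helper names.

```python
class Alphabet(object):
    pass

class SingleLetterAlphabet(Alphabet):
    pass

class ProteinAlphabet(SingleLetterAlphabet):
    pass

class NucleotideAlphabet(SingleLetterAlphabet):
    pass

class DNAAlphabet(NucleotideAlphabet):
    pass

class RNAAlphabet(NucleotideAlphabet):
    pass


def strict_alphabet(a, b):
    # Simple policy: the alphabet classes must match exactly.
    if type(a) is not type(b):
        raise ValueError("Alphabets do not match: %s and %s"
                         % (type(a).__name__, type(b).__name__))
    return a


def common_parent_alphabet(a, b):
    # Alternative policy: walk up a's class hierarchy, most specific
    # class first, and return the first class that b also belongs to.
    for cls in type(a).__mro__:
        if cls is not object and isinstance(b, cls):
            return cls()
    raise ValueError("No common alphabet")


dna_rna = common_parent_alphabet(DNAAlphabet(), RNAAlphabet())
dna_protein = common_parent_alphabet(DNAAlphabet(), ProteinAlphabet())
print(type(dna_rna).__name__)
print(type(dna_protein).__name__)
```

The strict version is easier to reason about; the parent-alphabet 
version never fails within one hierarchy but can quietly produce a 
very generic alphabet.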

Any thoughts?

Peter

[In fact, Michiel has also suggested making the SeqRecord class a 
subclass of the Seq class, which raises even more questions]