[BioPython] Making the Seq object act more like a string
Peter
biopython at maubp.freeserve.co.uk
Wed Aug 22 11:05:31 EDT 2007
Dear Biopython users,
A couple of times (on bugs or the developers mailing list), Michiel de
Hoon has previously suggested we could make the Seq class (Bio.Seq.Seq)
a subclass of the Python string. I agree with him - the Seq object should
act more like a string.
I'm posting on the main discussion list to get some feedback on this, as
any sudden changes could affect lots of people.
As a simple example, although there are functions in Biopython to
calculate GC percentages, for a beginner playing with Seq objects, it
would be nice to be able to do things like this (where s is a Seq object):
print float(s.count("G") + s.count("C")) / len(s)
rather than, for example, this:
print float(s.tostring().count("G") + s.tostring().count("C")) / len(s)
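To make the goal concrete, here is a minimal sketch of the sort of helper a beginner could then write. The name gc_fraction is hypothetical (it is not an existing Biopython function), and the only assumption is that the argument supports count() and len(), which a plain string already does and a string-like Seq would:

```python
def gc_fraction(seq):
    """Return the fraction of G and C letters in seq (0.0 if seq is empty).

    Hypothetical helper for illustration only. It relies on nothing but
    count() and len(), so a plain string works, and so would a Seq object
    that offered the same two methods.
    """
    if len(seq) == 0:
        return 0.0
    return float(seq.count("G") + seq.count("C")) / len(seq)


example = "ACGTGGCC"
fraction = gc_fraction(example)  # 6 of the 8 letters are G or C
```

The same function would accept a Seq object unchanged once Seq has its own count() method.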
I also don't think the current behaviour of str(seq) is helpful. To
recap, here is a simple example using a simple string:
>>> ss = 'ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA'
>>> print repr(ss)
'ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA'
>>> print str(ss)
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA
>>> print ss
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA
And the equivalent using a Seq object as of Biopython 1.43:
>>> from Bio.Seq import Seq
>>> s = Seq("ACAGGTACGATCGATCCTGTTGTACGTGCTC"\
... +"GTCGACTGCTAGCTCGTCGTGGTCCGATGA")
>>> print repr(s)
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA',
Alphabet())
>>> print s.tostring()
ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATGA
>>> print str(s)
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATG ...',
Alphabet())
>>> print s
Seq('ACAGGTACGATCGATCCTGTTGTACGTGCTCGTCGACTGCTAGCTCGTCGTGGTCCGATG ...',
Alphabet())
Note that currently doing str() on a Seq object gives a truncated
version of what repr() gives. The only nice thing about this is that,
when working at the Python command line, doing "print s" will only take
one line even when working with a genome. It also makes it really clear
that you don't just have a string object.
That was the motivation/background part. Feel free to chime in :)
-----------------------------------------------------------------------
This next bit of the email gets a bit more technical... and might be
better off on the developers mailing list. We'll see where any
discussion goes.
If we can agree that making Seq inherit from the basic string is a
good idea, then I would advocate a gradual transition... my thoughts are
that for the next release of Biopython we:
(1) Modify the Seq.__str__() method to act like the existing .tostring(),
i.e. return self.data
I don't think changing the __str__ method will break any serious code,
because as shown above, currently it's like a truncated version of
__repr__ so all it's useful for at the moment is getting a truncated
sequence for display.
(2) Consider adding alphabet-aware versions of selected string methods to
the Seq object (e.g. count, find)
Adding new methods to the Seq class should have no effect on existing usage.
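Steps (1) and (2) might look roughly like the sketch below. SketchSeq is a hypothetical stand-in, not the real Bio.Seq.Seq; it assumes the sequence is held as a plain string in a data attribute (as "return self.data" above implies), and it leaves out the alphabet checks that a properly alphabet-aware method would need:

```python
class SketchSeq(object):
    """Hypothetical stand-in for Seq, for illustration only."""

    def __init__(self, data, alphabet=None):
        self.data = data          # the sequence as a plain string
        self.alphabet = alphabet  # stored, but not checked in this sketch

    def __repr__(self):
        return "%s(%r, %r)" % (self.__class__.__name__, self.data, self.alphabet)

    def __str__(self):
        # Step (1): return the full sequence, as tostring() does today
        return self.data

    def __len__(self):
        return len(self.data)

    def _as_string(self, other):
        # Accept either a plain string or another SketchSeq
        if isinstance(other, SketchSeq):
            return other.data
        return other

    def count(self, sub):
        # Step (2): delegate to the underlying string
        return self.data.count(self._as_string(sub))

    def find(self, sub):
        # Step (2): delegate to the underlying string
        return self.data.find(self._as_string(sub))


s = SketchSeq("ACAGGTACGATCGATCC")
gc = float(s.count("G") + s.count("C")) / len(s)
```

Delegating to self.data keeps the existing attribute intact, so code that already reads s.data or calls s.tostring() would be unaffected.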
Then, for the release afterwards:
(3) actually do the class inheritance with all the horrors entailed.
As part of this, we'll need to address how the __eq__ method of the Seq
object should act: Looking at the sequence only, or considering the
alphabet too? Currently this method is not implemented at all.
This is part of a larger question - how to cope with multiple Seq/string
operations where there is more than one alphabet. e.g.
comparing/adding/joining a nucleotide Seq to a protein Seq object. I
would opt for the simple solution that the alphabets must match or some
sort of ValueError is raised. Alternatively, as the alphabets have a
class hierarchy, we could choose the parent alphabet (e.g. the generic
single letter alphabet when dealing with DNA and protein; or the
generic single nucleotide alphabet when dealing with RNA and DNA).
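Both options can be sketched in a few lines. The classes below are hypothetical placeholders that only mimic the shape of the hierarchy described above (a generic single letter alphabet, a generic nucleotide alphabet beneath it, then DNA, RNA and protein); they are not the real Bio.Alphabet classes, and the two function names are made up for this example:

```python
class Alphabet(object):
    """Placeholder root of the alphabet hierarchy."""

class SingleLetterAlphabet(Alphabet):
    """Placeholder for the generic single letter alphabet."""

class NucleotideAlphabet(SingleLetterAlphabet):
    """Placeholder for the generic single nucleotide alphabet."""

class DNAAlphabet(NucleotideAlphabet):
    pass

class RNAAlphabet(NucleotideAlphabet):
    pass

class ProteinAlphabet(SingleLetterAlphabet):
    pass


def strict_alphabet(a, b):
    """Option 1: the alphabets must be of the same class, else ValueError."""
    if type(a) is not type(b):
        raise ValueError("Incompatible alphabets: %s and %s"
                         % (type(a).__name__, type(b).__name__))
    return a


def common_parent_alphabet(a, b):
    """Option 2: return an instance of the closest shared parent class."""
    # Walk up from a's own class; the first class that b also belongs to
    # is the most specific alphabet the two have in common.
    for cls in type(a).__mro__:
        if cls is not object and isinstance(b, cls):
            return cls()
    raise ValueError("No common alphabet")


dna_rna = common_parent_alphabet(DNAAlphabet(), RNAAlphabet())
dna_protein = common_parent_alphabet(DNAAlphabet(), ProteinAlphabet())
```

Whichever policy is preferred, the same function could be shared by addition, joining and comparison so that they all behave consistently.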
Any thoughts?
Peter
[In fact, Michiel has also suggested making the SeqRecord class a
subclass of the Seq class, which raises even more questions]