[Biopython-dev] Rethinking Seq objects

Thu Apr 28 04:56:51 EDT 2005

On Thu, 28 Apr 2005, Michiel Jan Laurens de Hoon wrote:

> Michael Hoffman wrote:
>
>>  On Wed, 27 Apr 2005, Michiel Jan Laurens de Hoon wrote:
>> 
>> >  1) Make Seq objects mutable, and get rid of MutableSeq.
>> 
>>  I imagine it will be a lot slower to replace built-in strings with
>>  character arrays. Right now, I only use Seq when I absolutely have to.

> Well, I wouldn't replace them with character arrays; the idea would be to 
> reimplement the Seq class in C. So it would not be slower than built-in 
> strings, maybe even a bit faster. The Seq object would look like a string 
> object, but be mutable.

If you can make a sequence class that is faster than the current
built-in string, I would suggest you submit a patch to the Python
tracker to make it a replacement for the current built-in string. :P

> OK, then how about this:
> - By default, don't assume a particular alphabet. Same as how it works now:
> >>> from Bio.Seq import *
> >>> Seq('ATCG')
> Seq('ATCG', Alphabet())

+1

>> > >>> my_seq = MutableSeq('GATCGATGGGCCTATTAGGATCGAAAATCGC', my_alpha)
>> > >>> my_seq[:10] = "weirdstuff"
>> > >>> my_seq
>> > MutableSeq(array('c', 'weirdstuffCCTATTAGGATCGAAAATCGC'),
>> > IUPACUnambiguousDNA())
>> 
>>  "Doctor, it hurts when I do this."
>>  "Don't do that."
>
> Well you would be right if this were Biofortran. For a higher-level language, 
> I would expect better checking to make sure an object is self-consistent. 
> Python itself is full of checks and assertions.

How often have you actually put "weirdstuff" into the middle of a
MutableSeq? Have you ever done this, or are you just imagining that it
might happen?

You Aren't Gonna Need It. The number of checks and assertions you can
make is limitless and you have to know where to draw the line. To me,
the line should be drawn at user input, but not at every internal
change to a sequence made within a program. Maybe optional alphabet
checking would help with this.
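
A rough sketch of what optional alphabet checking could look like. The
CheckedMutableSeq class and its "letters" and "check" parameters are
hypothetical names made up for illustration, not Biopython API. Checking
is off by default, so internal edits cost nothing extra unless asked for:

```python
class CheckedMutableSeq:
    """Hypothetical mutable sequence with opt-in alphabet validation."""

    def __init__(self, data, letters=None, check=False):
        self.letters = letters  # e.g. "GATC"; None means any letter is allowed
        self.check = check      # validation only runs when this is True
        self._validate(data)
        self.data = list(data)

    def _validate(self, chunk):
        # Skip all work unless the caller opted in to checking.
        if not self.check or self.letters is None:
            return
        bad = sorted(set(chunk) - set(self.letters))
        if bad:
            raise ValueError("letters not in alphabet: " + "".join(bad))

    def __setitem__(self, index, value):
        self._validate(value)
        if isinstance(index, slice):
            self.data[index] = list(value)
        else:
            self.data[index] = value

    def __str__(self):
        return "".join(self.data)
```

With check=True, assigning "weirdstuff" to a slice raises ValueError and
leaves the sequence untouched; with the default check=False the assignment
goes through unchecked, as it does today.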

> Another option would be to get rid of alphabets altogether. What good are 
> they otherwise?

They're useful for transcription/translation/reverse complement
operations. And as far as I'm concerned, that's a good place to do
error checking, should it be necessary.

>>  But having a CircularSeq subclass would make it easier to keep
>>  this extra functionality from impacting the primary use case.
>
> My feeling is that having a subclass is a bit of overkill. The idea is to 
> have an optional topology argument, which defaults to "linear". So the 
> primary use case would not be affected.

If you're doing this in C, then my performance assumptions are perhaps
incorrect. I wouldn't want every slice of my linear sequence to have
to go through "is this circular?" logic in Python.
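
To make the concern concrete, here is a minimal pure-Python sketch of a
sequence with an optional topology argument. TopoSeq is a hypothetical
class, and the wraparound rule (a slice whose start is past its stop wraps
around the origin) is an assumption for illustration only. Every slice of
a linear sequence still pays for the topology test at the top:

```python
class TopoSeq:
    """Hypothetical sequence with an optional topology argument."""

    def __init__(self, data, topology="linear"):
        self.data = data
        self.topology = topology

    def __getitem__(self, index):
        # This is the test that every linear lookup has to go through.
        if self.topology != "circular":
            return self.data[index]
        if isinstance(index, slice):
            start = 0 if index.start is None else index.start
            stop = len(self.data) if index.stop is None else index.stop
            # Assumed rule: wrap only for plain slices with 0 <= stop < start.
            if index.step is None and 0 <= stop < start:
                return self.data[start:] + self.data[:stop]
            return self.data[index]
        # Integer positions on a circular sequence wrap modulo the length.
        return self.data[index % len(self.data)]
```

For example, TopoSeq("GATCGATGGG", "circular")[8:2] gives "GGGA", while
the same slice on the default linear sequence gives the empty string.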

> If the alphabet defaults to Alphabet() when creating a Seq object,
> then I'd think the transcribe and translate methods should work even
> if a user doesn't specify the sequence to be DNA or RNA. My current
> gripe with the Seq object is that there are too many steps to
> translate a DNA sequence.

Good point. Perhaps a warning when it has to guess?
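
Something along these lines, using the standard warnings module. The
transcribe function and its "kind" parameter are hypothetical and only
illustrate the idea; the guess here is simply "looks like DNA if it only
contains A, C, G, T or N":

```python
import warnings


def transcribe(sequence, kind=None):
    """Hypothetical helper: DNA -> RNA, guessing the type if none is given."""
    if kind is None:
        # No type specified: guess from the letters and tell the user.
        if set(sequence.upper()) <= set("ACGTN"):
            warnings.warn("sequence type not specified; assuming DNA",
                          stacklevel=2)
            kind = "DNA"
        else:
            raise ValueError("cannot guess sequence type")
    if kind != "DNA":
        raise ValueError("can only transcribe DNA")
    return sequence.replace("T", "U").replace("t", "u")
```

Calling transcribe("ATCG") returns "AUCG" and emits one warning;
transcribe("ATCG", "DNA") returns the same result silently.
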
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute