[BioPython] Sequence numbering. Moving on...

Iddo Friedberg idoerg@cc.huji.ac.il
Tue, 5 Oct 1999 18:08:44 +0200 (GMT+0200)


On Mon, 4 Oct 1999, Andrew Dalke wrote:

: Iddo Friedberg <idoerg@cc.huji.ac.il> said:

: 
: > 2) A method which returns a sequence, will return a sequence object,
: > not a string.
: 
: I'm not sure I follow you here. In your code you have:
: >      def seq(self, min, max):
: >              # with a base of 1 and including the end
: >              # Negative slice notation not allowed
: >              assert max >= min and min > 0
: >              return self[min-1:max]


: 
: This returns a sequence, but not a sequence object. In fact, as
: your example shows, it returns a list of character strings.
: > >>> seq(1,3)
: > ['A', 'T', 'G']

The proper use would be...

>>> MySeq = DNASeq("ATGCGGGTTTGGGC")
>>> MySlice = MySeq(5,9)
>>> print MySlice
GGGTT
# Complement a part of the new sequence
>>> MySlice(1,3).complement()

...without direct access to the .seq .omg or all such attributes. After
all, __call__ is supposed to wrap those, with a slicing method of our
choice.

Oops, that last one creates an error in the version you got. I just fixed
it. Anyhow, MySlice is now a fresh DNASeq instantiation, carrying all the
methods on such a type.
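
To make the calling convention concrete, here is a minimal sketch of the idea, not the code I actually sent around. It assumes modern Python syntax (print as a function), stores the residues in a plain string, and uses a hypothetical "data" attribute; only the names DNASeq and complement come from the examples above.

```python
class DNASeq:
    """Hypothetical stand-in for the DNASeq class discussed in this thread."""

    # Assumption: only the four unambiguous bases are handled.
    _COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

    def __init__(self, data):
        # Assumption: residues are kept internally as a plain string.
        self.data = str(data)

    def __call__(self, start, end):
        # 1-based and end-inclusive; negative positions are not allowed.
        if not (0 < start <= end):
            raise ValueError("need 0 < start <= end")
        # Return a fresh object of the same class, not a string or list.
        return self.__class__(self.data[start - 1:end])

    def complement(self):
        # Base-by-base complement, again returned as a sequence object.
        return self.__class__("".join(self._COMPLEMENT[b] for b in self.data))

    def __str__(self):
        return self.data


my_seq = DNASeq("ATGCGGGTTTGGGC")
my_slice = my_seq(5, 9)
print(my_slice)                     # GGGTT
print(my_slice(1, 3).complement())  # CCC
```

Because __call__ builds its result with self.__class__, a slice of a slice is still a DNASeq and keeps every method of that type.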

: 
: I would rather deal with strings than arrays, for a few reasons.
: 
: 1) Strings are immutable, and I prefer dealing with immutable data types
: in general (you can make better assertions about them). It is possible
: to edit sequences, but I would rather not deal with that for now.

I'm not sure I understand what you mean by "assertions", and how this
applies here. An example, perhaps?


: 
: 2) I would rather have a Numeric array of characters, which is much less
: of a memory hog, than a list of strings, but then this requires my
: hoped-for addition of the Numeric array types to standard Python in
: the 1.6 release :)
: 

I agree with the memory inefficiency bit. If and when we are graced with a
numeric array type, it may well be a good alternative.


: 3) Now you have three data types for sequences: a string (used in
: the constructor), the Seq class, and the list of strings. It is
: better to reduce the number of fundamental data types.
:

The list of strings is for implementation by the methods, and is supposed
to be user-transparent... hang on... so maybe it would be a good idea to
try a string instead of a list of strings. I'll try to drum up a .seq and
a method implementation. I still hold, however, that when dealing with a
sequence, that sequence should be an object, with predefined methods
applying to that object. And as a default, values returned by that
object's methods (slicing, complementing etc.) which are sequences (unlike
molecular weight, charge, or composition) should themselves be sequence
objects.

Example of use:

>>> MySlice = MySeq(200,300).complement()
>>> MySlice(7,30).seqReverse().translate()

This can be a common enough operation, and should return a sequence object
(in this case, a protein sequence). Returning sequence objects should be
the default rule. The exception would be "str"-ing the object. This comes
from a rather rigorous OOP school of thought, I know.
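
As a rough sketch of how such chaining could work (an illustration under assumptions, not the implementation under discussion): every sequence-valued method returns an object, and translate() switches the type to a protein sequence. The Seq base class, the ProteinSeq name, the "data" attribute and the codon-table construction are all hypothetical; the table is the standard genetic code, and this translate() does not stop at stop codons.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


class Seq:
    """Hypothetical base class: slicing by call returns the same type."""

    def __init__(self, data):
        self.data = str(data)

    def __call__(self, start, end):
        # 1-based, end-inclusive, no negative positions.
        if not (0 < start <= end):
            raise ValueError("need 0 < start <= end")
        return self.__class__(self.data[start - 1:end])

    def __str__(self):
        return self.data


class ProteinSeq(Seq):
    pass


class DNASeq(Seq):
    def complement(self):
        return DNASeq(self.data.translate(str.maketrans("ATGC", "TACG")))

    def seqReverse(self):
        return DNASeq(self.data[::-1])

    def translate(self):
        # Ignore a trailing partial codon; '*' marks a stop codon.
        usable = len(self.data) - len(self.data) % 3
        codons = (self.data[i:i + 3] for i in range(0, usable, 3))
        return ProteinSeq("".join(CODON_TABLE[c] for c in codons))


my_seq = DNASeq("ATGGCCTAA")
protein = my_seq(1, 6).complement().seqReverse().translate()
print(type(protein).__name__, protein)  # ProteinSeq GH
```

Only str() at the very end turns the result back into a plain string; everything before that stays an object with its own methods.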
 
: I proposed having sequences act like strings so that there is,
: in essence, only 1.5 data types-- most algorithms will work on strings
: and the 0.5 is for those algorithms which might want to get, say,
: information about the alphabet used in a given data type.
: 
: Besides, if you stay with strings you can always create a new
: subsequence of the same data type with:
: 
: sub_s = seq.__class__(seq[51:92])
: 
: If you take my example from Unambiguous_IUPAC_ProteinSeqType you'll
: even preserve the correct seqtypes field.
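
For reference, a small sketch of how I read that idiom. It assumes a Python in which a class can subclass str directly (not possible in the interpreter we have today), and the class name and "alphabet" attribute are made up for illustration; they are not Andrew's Unambiguous_IUPAC_ProteinSeqType.

```python
class ProteinSeq(str):
    """Hypothetical string-like sequence type carrying alphabet information."""

    # Assumption: the 20 standard amino acid one-letter codes.
    alphabet = "ACDEFGHIKLMNPQRSTVWY"


seq = ProteinSeq("MKTAYIAKQR")

# Plain slicing (0-based, end-exclusive) falls back to an ordinary string...
plain = seq[2:6]

# ...so the slice is wrapped again to get the original type back.
sub_s = seq.__class__(seq[2:6])

print(type(plain).__name__, plain)  # str TAYI
print(type(sub_s).__name__, sub_s)  # ProteinSeq TAYI
print(sub_s.alphabet == seq.alphabet)
```

The re-wrapping step is what my __call__ proposal would do behind the scenes, so the two approaches differ mostly in who does the wrapping and in the numbering convention.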



: 
: > 4) I'm not very good at streamlining Python code. If anyone likes
: > this and has a proposition on how to make this faster, I'd like to
: > know.
: 

[certain good suggestions follow]

Thanks Andrew, that was very helpful.

All the best,

Iddo

--

/* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2;
/*  |  */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2*
/*  |  */x*y+b,x=t)t=x*x-y*y+a;}
/* --- ddo Friedberg */