[BioPython] Sequence numbering. Moving on...

Iddo Friedberg idoerg@cc.huji.ac.il
Wed, 6 Oct 1999 13:06:41 +0200 (GMT+0200)


On Wed, 6 Oct 1999, Andrew Dalke wrote:

: Iddo Friedberg <idoerg@cc.huji.ac.il> said:
: > : I would rather deal with strings than arrays, for a few reasons.
: > :
: > : 1) Strings are immutable, and I prefer dealing with immutable data types
: > : in general (you can make better assertions about them). It is possible
: > : to edit sequences, but I would rather not deal with that for now.
: >
: > I'm not sure I understand what you mean by "assertions", and how this
: > applies here. An example, perhaps?
: 
: In this case I wasn't speaking about Python "assert" assertions.
: I was meaning that you could do things like:
:  * cache results, like residue frequencies
:  * hash on them (using the string as a hash)
:  * have subsequence implementations which don't need to make copies
: (eg, keep a reference to the original string and the start/stop
: info, and do the extra dereference. This could be useful for memory
: constrained cases.)
:  * not worry as much about multithreaded code. If the sequence
: is mutable, then the sequence could change in one thread while being
: read in another. If the sequence is immutable, then you can assert
: that this will not be a problem.
: 
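A minimal sketch of the points in that list, in present-day Python rather than the interpreter of the time. SubSeqView and residue_frequencies are hypothetical names made up for illustration; nothing here is an agreed design.

```python
from collections import Counter
from functools import lru_cache


class SubSeqView:
    """Hypothetical read-only view onto part of an immutable string."""

    def __init__(self, parent, start, stop):
        self.parent = parent  # reference to the original string, no copy made
        self.start = start
        self.stop = stop

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, i):
        # The "extra dereference": map a view index onto the parent string.
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.parent[self.start + i]

    def __str__(self):
        # A copy is made only when a real string is asked for.
        return self.parent[self.start:self.stop]


@lru_cache(maxsize=None)
def residue_frequencies(seq):
    """Cache keyed on the string itself; works because str is hashable."""
    return tuple(sorted(Counter(seq).items()))


genome = "ATGGCCATTGTAATGGGCCGC"
view = SubSeqView(genome, 3, 9)
print(view[0], len(view), str(view))
print(residue_frequencies(genome))  # computed once, then served from the cache
```

The view stays valid only because the parent string can never change underneath it; with a mutable parent the same class would need some invalidation scheme.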
: > And as a default, values returned by that object's methods (slicing,
: > complementing etc.) which are sequences (unlike molecular weight,
: > charge, or composition) should be sequence objects by themselves.
: 
: Ohhh, I can tell this is going to be confusing. Let me define
: "Seq objects" as the biological sequence objects, and "sequence
: objects" as "something which meets the Python sequence protocol."
: In this definition, a string is a sequence object but not a Seq
: object.

Yes, hehe. OK, using this terminology then, I think that a Seq object
should be returned for any operation that generates a biological sequence.
A Seq object's data attribute (the attribute which holds the sequence per
se) should, IMHO, be implemented as a mutable list of single-character
strings. That makes it much easier to code the various slice-and-dice
operations which molecular biologists need. But I need to think a bit more
about the implications of using immutable types. The disadvantages of
using a list of strings have been pointed out.
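To make the proposal concrete, here is a rough sketch in present-day Python. The class layout, the attribute name "data" and the method name "reversed" are assumptions for illustration only, not a settled interface.

```python
class Seq:
    """Hypothetical Seq object; `data` is a mutable list of 1-character strings."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return Seq(self.data[index])  # slicing yields another Seq
        return self.data[index]           # a single residue is a plain string

    def __setitem__(self, index, residue):
        self.data[index] = residue        # in-place edit of one position

    def reversed(self):
        # Returns a new Seq instead of None, unlike list.reverse().
        return Seq(self.data[::-1])

    def __str__(self):
        return "".join(self.data)


s = Seq("ATGGCC")
s[0] = "C"                  # editing in place is trivial with a list
sub = s[1:4].reversed()     # every sequence-producing step returns a Seq
print(s, sub)
```

Because each operation hands back a Seq, calls can be chained, which is the point of the MySlice(7,30) style of use.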

If, as you showed, we would like to use accompanying data (say,
residue frequencies), it should be the application programmer's
responsibility to make sure that this data is still up to date after
the various operations have been performed.
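A small sketch of what that responsibility looks like in practice, again in present-day Python. CountedSeq and its refresh() method are hypothetical names, used only to show the accompanying data going stale.

```python
from collections import Counter


class CountedSeq:
    """Hypothetical mutable sequence carrying precomputed residue frequencies."""

    def __init__(self, data):
        self.data = list(data)
        self.frequencies = Counter(self.data)  # accompanying data, computed once

    def refresh(self):
        # The application programmer has to call this after editing `data`.
        self.frequencies = Counter(self.data)


seq = CountedSeq("AAGT")
seq.data[0] = "C"               # edit the sequence in place
stale = seq.frequencies["A"]    # still reflects the sequence before the edit
seq.refresh()
fresh = seq.frequencies["A"]    # recomputed from the edited sequence
print(stale, fresh)
```

Nothing forces the refresh() call, so the two can drift apart silently; that is the cost of mutability that the immutable-string approach avoids.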


: 
: I do see your point, and I need to consider it some more. The use
: case you have,
: 
: > MySlice(7,30).seqReverse().translate()
: 
: is slightly problematical, since Python sequences return None from
: a call to reverse(). Perhaps a better sample case is:

See my answer to that in the letter posted in return to Thomas's post.

: 
:  subseq1 = MySlice(7,30)
:  subseq2 = subseq1.revcomp()
:  protein = subseq2.translate()
: 
: or, in the move to have these functions be stand-alone,
: 
:  subseq2 = revcomp(subseq1)
:  protein = translate(subseq2, GeneticCode.Bacterial)
: 
: So there are times when the subsequence should be a Seq object,
: as compared to a raw string. And actually, if the ".seq" (or
: whichever) always returns a string representation, then I can get
: the substring by doing either
: 
: protein[7:30].seq
: or
: protein.seq[7:30]
: 
: where the first is better for huge (genome) sized data and the
: latter is faster if the internal representation just happens to
: be a string stored as the attribute "seq".

Yep. Moving back and forth between Seq objects (for actual operations on
biosequences and accompanying data) and strings (for, say, hashing on
them) might let us enjoy the best of both worlds.
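Roughly what I have in mind, sketched in present-day Python. The ".seq" attribute name follows the quoted example; the class body, the sample protein and the annotation table are made up for illustration.

```python
class Seq:
    """Hypothetical Seq object whose `.seq` always gives a plain string."""

    def __init__(self, text):
        self._text = str(text)

    @property
    def seq(self):
        return self._text

    def __getitem__(self, index):
        if isinstance(index, slice):
            return Seq(self._text[index])  # stay in Seq-land for further work
        return self._text[index]


protein = Seq("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

a = protein[7:30].seq   # slice the Seq first, then ask for the string
b = protein.seq[7:30]   # ask for the string first, then slice it
print(a == b)

# The plain string is hashable, so it can serve as a dictionary key.
annotations = {protein.seq: "made-up annotation"}
print(annotations[protein.seq])
```

Both spellings give the same string here; which one is cheaper depends on how the Seq stores its data internally, as discussed above.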

Iddo

--

/* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2;
/*  |  */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2*
/*  |  */x*y+b,x=t)t=x*x-y*y+a;}
/* --- ddo Friedberg */