[BioPython] Sequence numbering. Moving on...

Andrew Dalke dalke@bioreason.com
Wed, 06 Oct 1999 01:22:11 -0600


Iddo Friedberg <idoerg@cc.huji.ac.il> said:
> : I would rather deal with strings than arrays, for a few reasons.
> :
> : 1) Strings are immutable, and I prefer dealing with immutable data types
> : in general (you can make better assertions about them). It is possible
> : to edit sequences, but I would rather not deal with that for now.
> 
> I'm not sure I understand what you mean by "assertions", and how this
> applies here. An example, perhaps?

In this case I wasn't speaking about Python "assert" assertions.
I meant guarantees you can rely on when the data cannot change, so
that you can do things like:
 * cache results, like residue frequencies
 * hash on them (using the string as a hash)
 * have subsequence implementations which don't need to make copies
(eg, keep a reference to the original string and the start/stop
info, and do the extra dereference.  This could be useful for memory
constrained cases.)
 * not worry as much about multithreaded code.  If the sequence
is mutable, then the sequence could change in one thread while being
read in another.  If sequences are immutable, then you can assert
that this will not be a problem.
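
To make the first three points concrete, here is a rough sketch in
present-day Python.  The names ImmutableSeq and SubSeqView are made up
for illustration and are not part of any Biopython API; the sketch only
assumes the underlying data is a plain (immutable) string.

```python
from collections import Counter


class ImmutableSeq:
    """Hypothetical immutable wrapper around a string (illustration only)."""

    __slots__ = ("_data", "_freq")

    def __init__(self, data):
        self._data = str(data)  # str is immutable, so this never changes
        self._freq = None       # residue counts, computed on first request

    def frequencies(self):
        # Safe to cache: nothing can edit the underlying string later.
        if self._freq is None:
            self._freq = Counter(self._data)
        return self._freq

    def __hash__(self):
        # Hash on the string itself, so equal sequences share a dict slot.
        return hash(self._data)

    def __eq__(self, other):
        if not isinstance(other, ImmutableSeq):
            return NotImplemented
        return self._data == other._data


class SubSeqView:
    """Hypothetical subsequence: a reference plus start/stop, no copy."""

    def __init__(self, parent, start, stop):
        self._parent = parent  # reference to the original string
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, i):
        # Non-negative integer indexes only; this is the extra dereference.
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self._parent[self._start + i]

    def __str__(self):
        # A copy is made only when a real string is requested.
        return self._parent[self._start:self._stop]
```

The view is only trustworthy because the parent cannot change under it;
with a mutable parent, the start/stop offsets could silently go stale.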

> And as a default, values returned by that object's methods (slicing,
> complementing etc.) which are sequences (unlike molecular weight,
> charge, or composition) should be sequence objects by themselves.

Ohhh, I can tell this is going to be confusing.  Let me define
"Seq objects" as the biological sequence objects, and "sequence
objects" as "something which meets the Python sequence protocol."
In this definition, a string is a sequence object but not a Seq
object.

I do see your point, and I need to consider it some more.  The use
case you have,

> MySlice(7,30).seqReverse().translate()

is slightly problematic, since Python's list.reverse() works in place
and returns None, so a reverse-style method that returns a new object
may surprise people.  Perhaps a better sample case is:

 subseq1 = MySlice(7,30)
 subseq2 = subseq1.revcomp()
 protein = subseq2.translate()

or, in the move to have these be stand-alone functions,

 subseq2 = revcomp(subseq1)
 protein = translate(subseq2, GeneticCode.Bacterial)
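
A minimal sketch of what such stand-alone functions might look like on
plain strings, in present-day Python.  This is only an illustration:
the "table" argument is a plain dict from codon to amino acid standing
in for something like GeneticCode.Bacterial, whose real form has not
been decided here, and ambiguity codes are ignored.

```python
# Complement table for unambiguous DNA letters only (an assumption of
# this sketch; IUPAC ambiguity codes are passed through unchanged).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(dna):
    """Return the reverse complement of a DNA string as a new string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna, table):
    """Translate complete codons of dna using a codon -> residue dict.

    Trailing bases that do not fill a codon are dropped; an unknown
    codon raises KeyError.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))
```

Because both functions take and return immutable strings, they can be
chained freely, e.g. translate(revcomp(s), table), without any caller
having to worry about its input being modified.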

So there are times when the subsequence should be a Seq object,
as compared to a raw string.  And actually, if the ".seq" (or
whichever) always returns a string representation, then I can get
the substring by doing either

  protein[7:30].seq
or
  protein.seq[7:30]

where the first is better for huge (genome-sized) data and the
latter is faster if the internal representation just happens to
be a string stored as the attribute "seq".
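
As a rough sketch of how both spellings could work, assuming a Seq
class whose slices are Seq objects and whose ".seq" attribute is the
string form (the class below is hypothetical, written in present-day
Python, and is not a proposal for the final interface):

```python
class Seq:
    """Hypothetical minimal Seq object (illustration only)."""

    def __init__(self, data):
        # In this sketch the internal representation is just a string.
        # A genome-sized implementation could instead keep a file
        # offset or a view and build the string only on demand.
        self.seq = data

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a Seq object yields another Seq object ...
            return Seq(self.seq[index])
        # ... while a single position yields a single character.
        return self.seq[index]


protein = Seq("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

via_object = protein[7:30].seq   # slice the Seq, then ask for the string
via_string = protein.seq[7:30]   # ask for the string, then slice it
```

With a string-backed Seq like this one the two results are identical;
the difference only matters when producing the whole ".seq" string is
expensive, in which case slicing the object first avoids that cost.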

I think you've persuaded me :)

						Andrew
						dalke@acm.org