[BioPython] strings or arrays?

Sat, 1 Apr 2000 03:29:04 -0700

I've been trying to figure out if the main character
storage data type for biopython should be a string or
an array.  I was leaning towards using an array.array
of characters, because I thought that would be faster.

It isn't.  Guido used his time machine to optimize
single character strings in Python.  I can't find a
good alternative to using anything else over strings.

Here's what I mean.  Consider a simple translate
function using strings:

def translate(dna, table = Translation.standard_table):
  protein = []
  for i in range(0, len(dna), 3):
    protein.append(table[dna[i:i+3]])
  return string.join(x, "")

This uses an intermediate list of strings before conversion
back to a string, and I don't like the extra storage required.
However, CPython optimizes single character strings to
reference the same object (there's a lookup table of
single character strings), so the overhead is only the
32 bit pointer, rather than the 8 bit character, so a factor
of 4.

That's still a lot of memory if everything is stored as a
list of characters, but not as much if only used for temporary
store.

There still might be a performance problem, as compared to
an array of characters.  An implementation of the latter would
look like:

def translate(dna, table = Translation.standard_table):
  dna = dna.tostring()
  protein = array.array( ("c", "") )
  for i in range(0, len(dna), 3):
    protein.append(table[dna[i:i+3]])
  return protein

(the tostring() is used as otherwise the table lookup would be
  dna[i] + dna[i+1] + dna[i+2], or
  (dna[i], dna[i+1], dna[i+2]), or
  dna[i:i+3].tostring(), or
  the table lookup needs to understand array objects.
All of which have a higher performance hit. )

To time the differences, I used something like:

for i in range(2, 22):
  s = "a" * (2**i)
  t1 = time.time()
  x = []
  for c in s:
    x.append(c)
  t = string.join(x, "")
  t2 = time.time()
  y = array.array("c", "")
  for c in s:
    y.append(c)
  t3 = time.time()

Interestingly, building a list and converting to a string
is faster than using an array.array, by about 20%.  The
array code converts the .append object to a character
twice - first it checks if the object can be converted
to a character by calling c_setitem with a fake index,
then it resizes the array, then it converts the object
to a character, using a real index.

Getting rid of the double call, by optimistically assuming
the conversion will work, then backing the size down if
it doesn't, makes the append performance on par with string
conversion.  Adding an extra variable to preallocate an
extra 64 bytes when the array is resized makes the array
code about 20% faster.  But if I need the string, adding
"t = string.join(y, '')" before setting t3 brings the overall
times back on par.

The big problem with array.array, which I didn't realize
until a couple of hours ago, is the internal data is not
exposed via a C API.  The prototypes are in arraymodule.c.
Thus, it isn't accessible to, say, a C based sequence
alignment library.

It might be possible to use a Numeric.array instead, but
it isn't available on every machine and the size cannot
be changed (it's a restriction because slices refer to the
original data).  Sometimes the correct size can be
preallocated, as in the translation above.  However, if
the translation was allowed to stop at a stop codon, then
the size cannot be determined a priori.

So I'm going to suggest that the primary character data
storage for biopython be a string.

That leaves the problem of mutability, as with editing
a sequence.  There is no good mutable solution in Python.
Either it has to be emulated with strings (eg, seq[2] = "A"
causes seq.data = seq.data[:2] + "A" + seq.data[3:]), or
implemented with some other data type, and converted to
a string when doing analysis.

I think the latter of these is fine, using array.array as
the data type, but converting to a string based sequence
before doing analysis.

                    Andrew Dalke
                    dalke@acm.org