[BioPython] seq objects etc...

Andrew Dalke dalke@bioreason.com
Sun, 19 Sep 1999 23:00:16 -0600


I said:
>   if seq1 == seq2:
>      # could, for example, compare the two string
> representations

Brad replied:
> This is fine for if ('atg') = ('atg')
> 
> but what about:  if (chromosome_A) == (chromosome_B)
> Isn't that a pretty expensive comparison?

That's why I said "for example."  All you really need to
have is that some appropriate __cmp__ exists for the two
object.

Here's an example of how to compare chromosome sized record
(note that there's little security in this system and if
you really do this you'll want to cache file handles so you
won't exceed your machine's buffer limit)

######################

# "cmp" module compares two files using the cmp() function.
# Work around an ugliness in Python since cmp() is also a builtin!
old_cmp = cmp
from cmp import cmp
file_compare = cmp
cmp = old_cmp

import os

class Chromosome:
  def __init__(self, id):
    self.id = id
    self._filename = "chromosome_" + str(id) + ".seq"
    self._file = open(self._filename)
  def __getslice__(self, i, j):
    assert i>=0 and j>=0 and j>i  # doesn't support full slice semantics!
    self._file.seek(i)
    return self._file.read(j-i)
  def __getitem__(self, i):
    return self.__getslice__(i, i+1)
  def __cmp__(self, other):
    if isinstance(other, Chromosome):
      print "compare chromosome"
      # special case working with other Chromosomes
      return file_compare(self._filename, other._filename) == 0
    # else just check the sequence
    print "compare non-chromosome"
    size = len(other)
    return cmp(self[:size], other.seq)
  def __rcmp__(self, other):
    # used in reverse comparison
    print "reverse",
    return -self.__cmp__(other)
  def __len__(self):
    return os.stat(self._filename)[6] # return file size
  def __getattr__(self, key):
    if key == "seq":
       # return the whole file!
       self._file.seek(0)
       return self._file.read()
    raise KeyError, key

# the original class
class Seq:
  def __init__(self, seq):
    self.seq = seq
  def __getitem__(self, i):
    return self.seq[i]
  def __getslice__(self, i, j):
    return self.seq[i:j]
  def __cmp__(self, other):
    print "Standard compare"
    return cmp(self.seq, other.seq)
  def __len__(self):
    return len(self.seq)

x = "ANDREW"
open("chromosome_1.seq", "w").write(x)
open("chromosome_2.seq", "w").write("DALKE")
s1 = Seq(x)
s2 = Chromosome(1)
s3 = Chromosome(2)
print "s1 is", s1.seq
print " size= ", len(s1)
print "s2 is", s2.seq
print " size= ", len(s2)
print "s3 is", s3.seq
print " size= ", len(s3)
print "Compare 1 with 2", s1 == s2
print "Compare 2 with 1", s2 == s1
print "Compare 2 with 3", s2 == s3
print "Compare 3 with 2", s3 == s2
print "fifth characters are", s1[4], s2[4], s3[4]
print "first two characters are", s1[:2], s2[:2], s3[:3]

#########

You can see the interface for Chromosome is exactly the same as
the one for Seq, but the implementation is very different, and
the Chromosome version can be arbitrarily long -- as much space
as you have disk.  The only time you'll have big memory usage
is if you ask for it with "[:]" (the __getslice__ notation, or
Ewan's subseq method) or if you compare a sequence to a chromosome.

The last is a problem because the Chromosome implements "seq" for
backwards compatibility, and the Seq.__cmp__ uses .seq to compare
terms, so if you did

  seq == chromosome

you would have problems, but if you did

  chromosome == seq

then you wouldn't, because the Chromosome __cmp__ is smart enough
to only get as much of itself as it needs.

Hmm, a possible fix for Seq.__cmp__ would be

  def __cmp__(self, other):
    return cmp(self.seq, other[:len(self.seq)]

which guarantees at most a doubling of memory.

>        Having said that, I can see the utility of having a
> seq_record which has-a sequence.  The record could have ID,
> Accession, type, etc and would be useful for archival, retrieving,
> etc.  The sequence object would contain the functionality.  Is
> this what you're suggesting?

Yes, exactly.

> You're saying two different people ( or the same person at
> different dates ) put in a sequence for the same region, but
> it had differences in sequence?  That's why I think context is
> important.  Those two sequences would have different accession
> #'s, right?

Yes.  Or be in two different databases, and that's the problem
(and hence your `context').  The PDB sequence for 9ZZ9 chain Q in
one database might be the SEQRES fields and in the other extracted
from the ATOM fields, and could be difference.

Ahh, the problem is that the identifer is not fully qualified.
One should be "pdb 9ZZ9Q from SEQRES" and the other "pdb 9ZZ9Q
from ATOM", so there is a solution, and the full qualification
is a function of the database and the database id (eg, SCOP uses
the SEQRES fields, so it is `pdb|9ZZ9Q|SEQRES'. Or something
URNish :)

						Andrew Dalke
						dalke@bioreason.com