[BioPython] sequence proposals (long)

Mon, 27 Mar 2000 04:05:56 -0700

Hello,

I've been thinking about how to define the basic sequence classes for
biopython.  I've got a set of proposals on the topic which I would
love to get feedback about.  Since I have a tendency to write long
emails, I've broken them up into several messages, which I'll be
sending over the next week or so.  This one contains my proposals for
the basic sequence protocol, and an idea of how to handle alphabets
and encodings.  I have a bias towards Python, but I also compare the
Perl, Java and C++ ways of doing things.

Proposal 1)  the basic sequence interface

Conceptually, people think about sequences as a string - a list of
residues.  A residue can be an object (as with biojava's ResidueList)
or a character (as with bioperl's PrimarySeq).  Since I am a fan of
generic programming, I want the basic sequence interface to act the
same as a string, when possible.

Here is a list of the interface I think all sequences, and
sequence-like objects, should implement.

1.1) There is a way to iterate forward through the sequence, element
by element.  When finished, all elements will have been visited in
order.  Forward-only iterators should be rarely implemented.

(This requirement is here because it's the most minimal
"sequence-like" description I can think of.)

In C++ or Java, this means that sequences implement a forward
iterator.  This is sufficient for many algorithms, like computing the
molecular weight or translating from DNA to protein.

Not all languages support different types of iterators as well as C++.
For example, Python only has a random access iterator.  If the
underlying data structure is random access (eg, the residues are held
in an array), then this is not a problem.  If the underlying data is
unidirectional (eg, a linked list), then there are problems.

Resolution of the problem is outside the scope of this proposal.  For
Python, I *suggest* the following non-thread safe solution:

  def __init__(self, ...):
     self._pos = 0
  def __getitem__(self, i):
     assert i == self._pos, "forward iteration only"
     return self.next()
  def next(self):
     self._pos = self._pos + 1
     # do whatever is needed to get and return the next item,
     # or raise IndexError if not available

but better yet, don't use unidirectional data storage.

Perl doesn't really have an iterator over string characters.  Instead,
explicit ones are implemented by conversion to list (via 'split'), or
via substr of an integer position.  Implicit iterators exist in
functions like split, s///, tr//, etc.

It's hard for me to think of cases when only having a forward iterator
(as compared to random access) makes sense, implementation-wise.  It's
needed to simplify algorithms.

1.2) Random access sequences must be integer subscriptable via the
appropriate means for a string for the given language.

C++ and Python let user classes redefine subscripting (via operator[]
or __getitem__).

Perl can also let user classes act like arrays, using TIEARRAY,
according to perltie.  This should mean that element lookups can look
like $seq[5] instead of substr($seq->seq(), 5, 1).  Bioperl does not
implement their class this way, prefering people access the underlying
data object as a string.  This is appropriate since strings are not
subscriptable in perl (I can't do: $a = "Perl"; print $a[1]).

If non-character based sequence classed are ever implemented in Perl
(eg, storing 3-letter codes instead of 1-letter, or used as a data
view to a 3D structure), then I suggest they look into tieing element
lookup.

Java, from what I recall, doesn't have operator overloading, so the
"appropriate means" is to use a method.  In my attempt to understand
how Java works, I see that ResidueList has a way to return the list of
residues as a List.  Java's java.util.List uses "get(int index)" to
return the object at the given position, so I would think that
ResidueList implements a method named "get()" as well.

ResidueList, instead, uses "Residue residueAt(int index)" to return
the given position.  Looking at my src.jar, I see that String has a
"charAt" and Vector has "elementAt", so I guess this determined the
appropriate naming scheme.  Could someone explain to me the different
names, and the reason for the variation in prefixes (that is,
{residue,char,element}At, instead of just elementAt)?

1.3) If the sequence length is known, there must be a way to get
access to it in constant time.  The returned value must be usable to
get access to the last element of the list.

That last part means that with a random access container, I should be
able to use the length (possibly +/- 1) to seek directly to the last
element.  And with a forward iterator, I should be able to count
"length" (again, +/-1) times and be at the end.

I would like to add the requirement that access to the length be
consistent with other objects in the language.  For example, in C++
STL containers, the standard method is "size()", and for
Python,"__len__()".

Looking at the standard Java containers, it looks like they took their
method names from STL, so the length of a container is "size()".
However, biojava uses "length()" .. and so does the Java String class.
He he, and GNU C++ defines length() *and* size() with identical
implementations.  I'm getting the impression that access to a Java
String doesn't act like access to a list of characters.

In Perl, again, $# could be tied to the length of a Seq object, but $#
doesn't work on strings, only lists.  The generic solution is to do
length($seq->seq()), which is actually what PrimarySeqI.pm does.  It
is conceivable that a PrimarySeq implementation may work on genome
size data (eg, someone want to find out how many bases are in
chromosome 11), with the interface talking to a database.  Because
defining a true string-like object in perl seems hard, the length()
method can be used as a work-around to keep from loading the whole
sequence in memory.

Again, I am hard pressed to come up with a real use for an unspecified
length.  The best I can think of is considering a interface to a
sequencing machine.  Before you start you don't know how much data
will come out.  This is also a possible real-life case for having a
forward-only iterator mentioned in 1.1).

1.4) Position, lengths and ranges should be used in language
appropriate form.

This is the standard "do sequences start at 1, like biologists think
about them or 0 like the language (except for Visual Basic and
Fortran)."  I strongly believe if the language is 0 based, then the
sequence implementation is 0 based.  Output meant for a biologist
should be converted to 1 based, but not the internals.

I say "should" in the proposal instead of a "must" because there is
much contention about the subject.  Biojava has sequences starting a 1
even though Java is a 0 based langauge.

The part about "lengths" is because some languages have the length of
a container be the number of elements in it, while others have it be
the offset to the last element.  

The part about ranges is because the string "ABCD" when sliced by
[1:3] can be:
   "BC" - Python and C++
   "BCD" - Perl
   "ABC" - biologist
   "AB" - CORBA/LSR

Also, some languages allow negative subscripts (in Python, seq[-1]
refers to the last element of the sequence) and some define
out-of-bound semantics (in Python, indicies can be out-of-bound, but
not slices.)

In my mind, consistency with the language is more important than
consistency with the domain, if all such matters can be solved purely
as an I/O translation.

1.5) Subsequences can be extracted from sequences given a range, and
the subsequence implements the sequence interface.

Again, this is all language dependent.  Java containers uses
"sublist(int fomIndex, int toIndex)", and so does biojava (though with
a base of 1).

C++ would have a constructor taking iterators for the first and last
positions.  This is nice because the programmer has a chance to define
the proper return type - it's the constructor.

Python has a slice operator, so the standard version could probably be
something like:

    def __getslice__(self, i, j):
        return self.__class__(self.data[i:j])

(should strides be supported?)

Again, Perl can tie to act like an array, but because the
implementation assumes a string, there needs to be some way to get a
substring without accessing the full string, and it needs to preserve
the interface, so they use a method called "subseq".  The name is
analagous to the name "substr."

1.6) Mutability - if the sequence class is mutable, then
  a) elements must be changed by a means equivalent to subscripts
  b) ranges must be changed by a means equivalent to subslicing
At the implementor's discretion, neither or both of the following
are implemented:
  c) assignment to slices may change the length of the list
      (eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
  d) elements may be able to be added and removed
At the implementor's discretion:
  e) modification to slices may affect the original sequence

This part of my proposal is meant mostly as a guideline to show that
mutability is a complex problem.

The phrase "by a means equivalent to" means to use the language
appropriate complement to the subscripting of 1.2) and subslicing of
1.5).  In C++ and Python, it's via [].  In Perl it's substr.  In
biojava, I'm not sure of the method names.

The discretionary parts are present because they can be hard to
implement.  In Python, there are 4 possible sequence-of-character
structures: string, list of characters, array.array, and
Numeric.array, with different trade offs:

  i) strings cannot be editing in place, so even non-size changing
  modifications require building new strings

  ii) "list of characters" uses the most memory and slices copy the
  original sequence (so can be slower and use even more memory)

  iii) array.array of characters also copies from original sequence,

  iv) Numeric.array cannot change size in-place but does e) It isn't
  installed on every machine.

As an example of e), should the last line of following return 'C' or
'x'?

>>> a = Seq("ABCDE")
>>> b = a[2:4]
>>> str(b[0])
'C'
>>> str(a[2])
'C'
>>> b[0] = "x"
>>> b[0]
'x'
>>> str(a[2])

It turns out that an implementation with array.array will return 'C'
while one with Numeric.array will return 'x'.  Which is best?

1.7) If the sequence elements are meant to be viewable, their string
value is found through the normal stringification operation.

In the simplest case, str(seq[i]) should return the single, (or three
letter) code for the given element.  Adding this clause makes it
easier to support aligned displays of different sequence types.

For example, consider the example where I have a sequence, and an
object (a Prosite match object from PS00028) implementing the sequence
protocol.

seq = Seq("KPYECSECRKAFRERSSLINHQRTHTGE")
pattern = Prosite.compile("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
match = pattern.search(seq)
# finds       "CSECRKAFRERSSLINHQRTH"

offset = match.start()
for i in range(len(match_text)):
   print str(pattern[i]), "\t", str(seq[offset+i])
C       C
x       S
x       E
C       C
x       R
x       K
x       A
[LIVMFYWC]      F
x       R

 (you get the idea)

Since the pattern match acts like a sequence object, it can be used
anywhere the concept of "show me each element" can work; with the
provisio that the display may need to be able to show more than one
character.

Note that there can be other ways to get a name for a residue.  You
may prefer the three letter name, or the english name, or the spanish
name.  There is a general need for looking up an property of a
residue, which will be discussed in one of my future proposals.

1.8) If possible, the stringification of a sequence object must return
the string value of the elements such that, when given a sequence of
size N and some x in a valid subrange, then:

   str(seq) == str(seq[:x]) + str(seq[x:])

I'm not sure about this one, which is why I added the "if possible".
The problem comes when there are seperators.  Suppose you use a three
letter code, then you might want the stringification of a sequence to
be: ALA-GLY-PRO-ASP instead of ALAGLYPROASP.  But splitting the
sequence in half and joining them may create

    "ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"

There are a couple of solutions to this.  Track if the end-point is
really a terminal, and add the appropriate seperator:

    "ALA-GLY-" + "PRO-ASP" -> "ALA-GLY-PRO-ASP"

Use lower cases, so the sequence is

    "AlaGly" + "ProAsp" -> "AlaGlyProAsp"

(blech; I can't read that easily)

Require that sequence objects implement a "join" method to join the
strings in the right fashion.  This is also ugly.

This method will only be used when converting the sequence to a string
for use with, eg, a regular expression engine, or for performance
reasons.  (In Python, it's faster to iterate over string elements than
having the __getitem__ overhead for each element.)

This is not a problem in regular strings, since they only have a
single character.  Thus, the "if possible" phrase probably means "if
the underlying encoding is from a single letter alphabet".

 ===============================

Proposal 2)  Encodings

My first proposal topic works fine for sequences which can be
expressed as a set of letters/elements.  In addition to the normal
IUPAC protein/nucleotide usage, here are some of the ways it can be
used (if you know of more, please let me know):

  A) non-standard residues have their own characters.  Some are
    mentioned in the IUPAC definitons, like:
     X = selenocysteine; for proteins
     B = Asx = aspartic acid or asparagine; for protein
     W = wyosine; for DNA
  B) non-IUPAC usage
    - perhaps the most common is using upper/lower case as an
      indication of certainty.
    - Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points
      out that SEG can convert low-information residues to lower case
      so they act like X in FASTA, but you can still see the original
      characters in the output
  C) non-residue information
    (I especially want more information about these.)
    - gaps, as in alignments (often with "-", "." or " ")
    - stop codons; "*" is often used to show where a protein residue
       would be, if the corresponding codon wasn't a stop codon.
  D) extra-residue information, still with a single character
    - show secondary structure prediction using "H"elix, "S"trand,
       "C"oil, "T"urn
    - digit to show % properties, like % conservation
  E) non-single character definitions, still per-residue
    - a prosite pattern match ("[FILAPVM]"), esp. as when aligned
        to the matching protein,
    - alignment of 3-character nucleotides to 1-character amino acids

All of the above scenerios can be modeled with objects implementing
the sequence protocol.  That is, they are subscriptable, sliceable,
stringifiable, etc.  I know this is true for A)-D) because I've seen
FASTA files used to store all of them, and the records of a FASTA file
are easily mapped to the sequence protocol (trivially so, since the
data is natually stored as a string of characters).  And E) works
because I've restricted it to have a one-to-one mapping to a residue.

When read from a generic FASTA file, the sequence isn't really
anything other than a glorified string.  In a formal sense, you can't
calculate the molecular weight because you don't know if
"CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a
secondary structure prediction!  (BTW, I consider the autoguessing
requirement of bioperl to be a flaw because other encodings exist.)

You can't even get the biological sequence length, because it can
contain gap characters.  Should the length of "A--T" be 2 or 4?  I
think 2, but that violates my "1.3)", above.

The problem is two-fold; what's the per-residue alphabet, and what's
the encoding for non-residue information?  (The last part is C) in my
list.)

I got to thinking about this problem a lot.  Staring from basics, if
space and performance were of no concern, a sequence like "APGA..."
could be represented as

  from IUPAC.Protein import ALA, PRO, GLY
  sequence = [ALA(), PRO(), GLY(), ALA(), ...]

where PRO, GLY and ALA are constructors for a residue object.  This
would even solve the typing problem because each residue would be
typed, and all you need to assert is that the element types are
homogenous.  Also, in Python, a list (the "[]") implements the
sequence interface.

In the general case, each residue may be different.  For example, some
of the residues may be modified - perhaps methylated, either at
specific sites or in a statistical sense - or you want to store the 3D
coordinates for each one.

If the residues are identical and position independent, there's no
need to create new objects, so this could be turned into:

  sequence = [ALA, PRO, GLY, ALA, ...]

These are sometimes called fly-weight objects because the objects are
reused.  The new "objects" are really just the pointers/references to
the real object.  This, BTW, is the biojava approach.

Space and performance are important, and there are fewer than 256 of
them (okay, fewer than 27) , so this is further reduced to:

  sequence_type = "protein"
  sequence = "APGA"
or
  sequence = Seq("APGA", moltype = "protein")

This is the bioperl approach.

This compression hinges on several predicates:
  a) all residues are identical
  b) there is one-to-one mapping from letters+seqtype to the "real"
       residue objects
  c) residues are of homogenous data types (can't mix protein and
       rna types)

Using a gap character or stop codon symbol violates c) because
non-homogenous types in the container.  A stop codon symbol violates
b) because there is nothing real about it.  It indicates the
non-existence of an object.

So those characters (and likely others in different contexts) do not
belong as elements of a sequence.  They really should be considered
part of the sequence itself.

Flipping things around, how would I define something which holds a
gapped sequence?  I would make a new class which has the sequence and
the location/length of the gaps.  It takes the sequence string and
knows the appropriate way to interpret the gap character.  Very much
like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding
only one sequence.

There is an interesting question here - should GappedSeq be derived
from Seq, or vice versa?  If GappedSeq is derived from Seq, then by
good class design, a GappedSeq is usable anywhere a Seq is usable, but
this is false - I can't use the same algorithm to calculate the
molecular weight.

Should Seq be derived from GappedSeq?  No, since then Seq would also
need to be derived from any other possible encoding property, like the
"contains stop codons" class.

Should they be the same class?  No, since that's just too heavyweight-
the Seq must implement every property.

So GappedSeq is not a type of Seq, although there is an data view of a
GappedSeq which acts like a Seq.  (That's the one encoded in the FASTA
file.)

Pause for a moment and reconsider the generic FASTA file reader.  What
does it return?  Assuming you know nothing about what's in the file,
it can only be a set of sequence-like objects, with some unknown
encoding scheme.  At some point, I have to specify the encoding.  Only
then can I make the "real" object, whether it be a protein, dna,
secondary structure, or whatever.

2.1) Sequences using a finite alphabet must contain an "alphabet"
member, which is used to map the alphabet back to the appropriate
per-residue type.

Bioperl does not do this - for the most part they assume IUPAC-encoded
sequences.

Biojava does store the alphabet as part of the sequence.  Even after
looking through the code, I can't tell how they know if the sequence
type is protein, dna, or whatever.  Grepping through everything, the 
only hits to "protein" are in these two files:

Annotator.java:
 * domains in proteins, genes in genomes and all sorts of other things.
/Biocorba/Seqcore/SeqType.java:     public static final int _PROTEIN = 0,

Here's how I want to define the alphabet for biopython:

class Alphabet:
  size = None   # unspecified/non-constant size per sequence element

class SingleLetterAlphabet(Alphabet):
  size = 1
  letters = None    # No restriction on the alphabet

single_letter_alphabet = SingleLetterAlphabet()

class Protein(SingleLetterAlphabet):
  pass

class IUPACProtein(Protein):
  letters = "ACDEFGHIKLMNPQRSTVWY"

class DNA(SingleLetterAlphabet):
  pass

class IUPACAmbiguousDNA(DNA):
  letters = "GATCRYWSMKHBVDN"

class IUPACUnambiguousDNA(IUPACAmbiguousDNA):
  pass

class HasStopCodon:
  stop_codon = "*"
  def __init__(self, stop_codon = stop_codon):
    self.stop_codon = codon

class ProteinWithStopCodon(IUPACProtein, HasStopCodon):
  def __init__(self, stop_codon = HasStopCodon.stop_codon):
    HasStopCodon.__init__(self, stop_codon)

class SecondaryStructure(SingleLetterAlphabet):
  letters = "HTSC"

class Percentage(SingleLetterAlphabet):
  letters = "0123456789"

class IgnoreLowercaseProteinAlphabet(ProteinAlphabet):
  pass

This approach heavily uses multiple inheritence, which Java won't
like, and which I've rarely had to use.  Another way is to have a
generic associative array, and do lookups.  The problem is, I haven't
yet implemented this to see how well it will work.

By tagging a sequence with an alphabet, the generic FASTA file parser
might look like:

class EncodedSeq:
    def __init__(self, seq, alphabet):
        self.seq = seq
        self.alphabet = alphabet
class FastaRecord:
    def __init__(self, desc, seq):
        self.desc = desc
        self.seq = seq

def read_fasta_record(infile, alphabet = single_letter_alphabet):
    # get the description line
    # get the sequence lines
    # merge sequence lines into one
    return FastaRecord(desc, seq = EncodedSeq(sequence_string, encoding))

and used like:

  a_fasta_record = read_fasta_record(open("file.aa"))

This input data is untyped, because I didn't specify a type, and it
could be something non-DNA/RNA/protein, like secondary structure
labels.  However, adding guess support would be
  a_fasta_record = guess_enoding(read_fasta_record(open("file.aa")))

where guess_encoding() would modify the "encoding" attribute of the
sequence, as need be.  Better yet, if I knew that the file contained
standard IUPAC defined protein sequences, then I would do

  a_fasta_record = read_fasta_record(open("file.aa"), protein_alphabet)

(One of my future proposals will be having types Seq classes, like
ProteinSeq, DNASeq, and RNASeq.  The generic FASTA parser then
wouldn't take the alphabet type, but rather it would get an
appropriate factory object.)

But even without guessing, there are still things I can do with a
generic alphabet encoding, like display the sequence, or save it back
as a FASTA file.

This same framework also lets me do some normalizations, like force
sequence to uppercase using a transformation done after the read.  I
could also call "verify_encoding" which would check that all
characters are appropriately defined.

For sequence records which specify type, (SWISS-PROT always contains
proteins), then there's no need to specify the encoding type on the
call.

Consider once again the "A--T" question.  I've read in that record
from my generic FASTA reader, and tied it to a "standard DNA sequence
with '-' encoded gaps."  I still have the problem that the length of
that sequence is 2, since nothing has been done to treat it different
than a normal Seq.

Instead, it has to have one more transformation to turn it into the
"GappedSeq" class mentioned earlier.  The way to think of it is as a
GappedSeq which is encoded as a Seq.  The problem with the class
definition earlier was the confusion in that normal sequences have no
special encoding, or rather, a Seq class is encoded at itself.

So if I want to read a gapped sequence from a file, I could do
something like:

   gapped_seq = to_gapped(read_fasta_record(StringIO(">test\nA--T\n")))

This would get the sequence data from the record, find the encoding
for the gap character (if any), and use it to make the GappedSeq
object.  I can then do:

   len(gapped_seq.sequence)

to find the that unencoded length is indeed 2.

Also, for performance reasons there may be specialized parsers which 
implement the whole call chain of

   file -> fasta record + encoding -> to_gapped()

(A biopython project brought up last year was the possibility of
generating some of these specialied parsers automatically.)

The end result of all this is that almost nothing changes from the
current bioperl data structure, except that "moltype" becomes
"encoding" and takes on many more properties.

There will be more in future emails about how the properties work
together, so that you can preserve lower case letters if you want, or
have lower case DNA go to upper case protein, or compute the molecular
weight of a specialized encoding (eg, for selenocysteine).  But I
really need to implement it first to make sure it works

                    Andrew Dalke
                    dalke@acm.org