From katel@worldpath.net Sun, 19 Mar 2000 22:59:43 -0800
Date: Sun, 19 Mar 2000 22:59:43 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] SwissProt

So far, my tests have shown these mismatches. I don't understand the third. Is there a non-printing character in the text? The file is o59832.sp. I've already posted it to ftp://bio.perl.org/pub/katel/SwissProt/

C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_organelle 111
expected is ['Tight junction', 'Transmembrane.'] actual is ['Transmembrane.\012'
]
C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_keywords 135
expected is ['O59832'] actual is ['']
C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_accessions 89
expected is Homo sapiens (Human). actual is Homo sapiens (Human).

Cayte

From jchang@SMI.Stanford.EDU Sun, 19 Mar 2000 22:08:57 -0800 (PST)
Date: Sun, 19 Mar 2000 22:08:57 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] SwissProt

On Sun, 19 Mar 2000, Cayte wrote:
> So far, my tests have shown these mismatches. I don't understand
> the third. Is there a non-printing character in the text? The file
> is o59832.sp. I've already posted it to
> ftp://bio.perl.org/pub/katel/SwissProt/
>
> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_organelle 111
> expected is ['Tight junction', 'Transmembrane.'] actual is ['Transmembrane.\012'
> ]

I'm not sure I understand this output. It seems like this goes with test_keywords, and the next error message goes with test_accessions. However, the groupings of the lines seem to suggest otherwise. Am I not understanding things correctly?

Yep, this is definitely broken. I've made the fixes and checked them into CVS.

> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_keywords 135
> expected is ['O59832'] actual is ['']

Yep, this is fixed too.

> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_accessions 89
> expected is Homo sapiens (Human). actual is Homo sapiens (Human).

Yeah, I'm not sure what's going on here. When reporting values of variables, try using the repr function. This will output strings without any character interpretation.

>>> a = 'hello world!\n'
>>> print repr(a)
'hello world!\012'
>>>

Another thing I've noticed from 095832 is that it has unusually problematic database cross reference lines! I guess that's why you chose this entry. I made some changes so the parser will handle these better. I've also made some fixes so that the parser now strips the trailing newlines at the end of some strings.

So when are you going to get write access so you can integrate your testing code into the CVS tree?
:)

Jeff

From chrisf@fagmed.uit.no Mon, 20 Mar 2000 12:32:30 +0100
Date: Mon, 20 Mar 2000 12:32:30 +0100
From: Chris Fenton chrisf@fagmed.uit.no
Subject: [BioPython] Hello and congrats on the marriage of python with biology

Firstly, I am new to the list, so hello.
Secondly, I am rather new to python, but am convinced it is a better language for large co-op projects than perl (do not misunderstand, I like perl).
Thirdly, I am no coding guru, but am willing to help.

I am unsure as to what has already been done as far as a code base goes, but I reckon that if biopython is to grow it is dependent on people donating code.
I am polishing some simple python code that allows users to download sequences from the net (Entrez site):

Specify (query, db, max display) to retrieve a list of uid numbers, retrieve records (uid, format) and save to file (file).

I do not know if you guys have similar code, or if this could help.
If not, and there is something that needs to be done (testing or porting some perl to python), I will certainly give it a try.
At the moment I am looking at 'pythonizing' Aceperl, but again I am unsure whether it has already been done.

From jchang@SMI.Stanford.EDU Mon, 20 Mar 2000 08:51:10 -0800 (PST)
Date: Mon, 20 Mar 2000 08:51:10 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] Hello and congrats on the marriage of python with biology

Hi Chris,

Welcome to biopython! There's certainly a lot of code to be written, so we're accepting contributions.

There's already some code in there to query Entrez for medline entries. Perhaps your sequence retrieval code could be merged into that.

Do you have access to CVS? If so, please download a copy of the repository (instructions at http://cvs.biopython.org/) and take a browse around. The code is still under heavy development, though, and stability varies across files. There's also a README file in there that outlines things that will need to be done before a 0.1-alpha release.

Jeff

On Mon, 20 Mar 2000, Chris Fenton wrote:
> Firstly, I am new to the list, so hello.
> Secondly, I am rather new to python, but am convinced it is a better
> language for large co-op projects than perl (do not misunderstand, I like
> perl).
> Thirdly, I am no coding guru, but am willing to help.
>
> I am unsure as to what has already been done as far as a code base
> goes, but I reckon that if biopython is to grow it is dependent on people
> donating code.
> I am polishing some simple python code that allows users to download
> sequences from the net (Entrez site):
>
> Specify (query, db, max display) to retrieve a list of uid numbers,
> retrieve records (uid, format) and save to file (file).
>
> I do not know if you guys have similar code, or if this could help.
> If not, and there is something that needs to be done (testing or porting
> some perl to python), I will certainly give it a try.
> At the moment I am looking at 'pythonizing' Aceperl, but again I am
> unsure whether it has already been done.
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
>

From katel@worldpath.net Sun, 26 Mar 2000 18:44:29 -0800
Date: Sun, 26 Mar 2000 18:44:29 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] SwissProt

Under organism classification, my tests gave this result for o95832.

expected is ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Eutheria', 'Primates', 'Catarrhini', 'Hominidae', 'Homo.']
actual is ['Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Primates', 'Catarrhini', 'Hominidae', 'Homo']

The first item in each line is dropped.

Cayte

From jchang@SMI.Stanford.EDU Sun, 26 Mar 2000 17:45:54 -0800 (PST)
Date: Sun, 26 Mar 2000 17:45:54 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] SwissProt

It is, indeed. I've fixed this and checked the fixes into the repository.

Thanks,
Jeff

On Sun, 26 Mar 2000, Cayte wrote:
> under organism classification, my tests gave this result for o95832.
>
> expected is ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Eutheria', 'Primates', 'Catarrhini', 'Hominidae', 'Homo.']
> actual is ['Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Primates', 'Catarrhini', 'Hominidae', 'Homo']
>
> The first item in each line is dropped.
>
> Cayte
>

From katel@worldpath.net Sun, 26 Mar 2000 22:39:48 -0800
Date: Sun, 26 Mar 2000 22:39:48 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] PyUnit

I just fixed a bug in PyUnit and I added code to report success or failure for each test. The new code can be found at:
ftp://bio.perl.org/pub/katel/biopython/UnitTests/

Cayte

From eugene.leitl@lrz.uni-muenchen.de Sun, 26 Mar 2000 23:16:20 -0800 (PST)
Date: Sun, 26 Mar 2000 23:16:20 -0800 (PST)
From: Eugene Leitl eugene.leitl@lrz.uni-muenchen.de
Subject: [BioPython] PyUnit

Please do not use HTML in mail messages. HTML is insecure (via scripting languages), can be used to trace when you've read your message and otherwise reveal your whereabouts. Also, malicious HTML tags can simply crash your machine. After saving your files, go to http://4.3.78.106/ for a demonstration. Strange how a quite innocent tag can crash your *operating system*.

Cayte writes:
> I just fixed a bug in PyUnit and I added code to report success or failure for each test. The new code can be found at:
> ftp://bio.perl.org/pub/katel/biopython/UnitTests/
>
> Cayte

From dalke@acm.org Mon, 27 Mar 2000 04:05:56 -0700
Date: Mon, 27 Mar 2000 04:05:56 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] sequence proposals (long)

Hello,

I've been thinking about how to define the basic sequence classes for biopython. I've got a set of proposals on the topic which I would love to get feedback about. Since I have a tendency to write long emails, I've broken them up into several messages, which I'll be sending over the next week or so. This one contains my proposals for the basic sequence protocol, and an idea of how to handle alphabets and encodings. I have a bias towards Python, but I also compare the Perl, Java and C++ ways of doing things.

Proposal 1) the basic sequence interface

Conceptually, people think about sequences as a string - a list of residues. A residue can be an object (as with biojava's ResidueList) or a character (as with bioperl's PrimarySeq). Since I am a fan of generic programming, I want the basic sequence interface to act the same as a string, when possible. Here is a list of the interface I think all sequences, and sequence-like objects, should implement.

1.1) There is a way to iterate forward through the sequence, element by element. When finished, all elements will have been visited in order. Forward-only iterators should be rarely implemented.

(This requirement is here because it's the most minimal "sequence-like" description I can think of.)

In C++ or Java, this means that sequences implement a forward iterator. This is sufficient for many algorithms, like computing the molecular weight or translating from DNA to protein.

Not all languages support different types of iterators as well as C++. For example, Python only has a random access iterator. If the underlying data structure is random access (eg, the residues are held in an array), then this is not a problem. If the underlying data is unidirectional (eg, a linked list), then there are problems.
Resolution of the problem is outside the scope of this proposal. For Python, I *suggest* the following non-thread safe solution:

    def __init__(self, ...):
        self._pos = 0
    def __getitem__(self, i):
        assert i == self._pos, "forward iteration only"
        return self.next()
    def next(self):
        self._pos = self._pos + 1
        # do whatever is needed to get and return the next item,
        # or raise IndexError if not available

but better yet, don't use unidirectional data storage.

Perl doesn't really have an iterator over string characters. Instead, explicit ones are implemented by conversion to a list (via 'split'), or via substr at an integer position. Implicit iterators exist in functions like split, s///, tr//, etc.

It's hard for me to think of cases when only having a forward iterator (as compared to random access) makes sense, implementation-wise. It's needed to simplify algorithms.

1.2) Random access sequences must be integer subscriptable via the appropriate means for a string for the given language.

C++ and Python let user classes redefine subscripting (via operator[] or __getitem__).

Perl can also let user classes act like arrays, using TIEARRAY, according to perltie. This should mean that element lookups can look like $seq[5] instead of substr($seq->seq(), 5, 1). Bioperl does not implement its class this way, preferring that people access the underlying data object as a string. This is appropriate since strings are not subscriptable in perl (I can't do: $a = "Perl"; print $a[1]). If non-character based sequence classes are ever implemented in Perl (eg, storing 3-letter codes instead of 1-letter, or used as a data view to a 3D structure), then I suggest they look into tying element lookup.

Java, from what I recall, doesn't have operator overloading, so the "appropriate means" is to use a method. In my attempt to understand how Java works, I see that ResidueList has a way to return the list of residues as a List.
Java's java.util.List uses "get(int index)" to return the object at the given position, so I would think that ResidueList implements a method named "get()" as well. ResidueList, instead, uses "Residue residueAt(int index)" to return the given position. Looking at my src.jar, I see that String has a "charAt" and Vector has "elementAt", so I guess this determined the appropriate naming scheme. Could someone explain to me the different names, and the reason for the variation in prefixes (that is, {residue,char,element}At, instead of just elementAt)?

1.3) If the sequence length is known, there must be a way to get access to it in constant time. The returned value must be usable to get access to the last element of the list.

That last part means that with a random access container, I should be able to use the length (possibly +/- 1) to seek directly to the last element. And with a forward iterator, I should be able to count "length" (again, +/- 1) times and be at the end.

I would like to add the requirement that access to the length be consistent with other objects in the language. For example, in C++ STL containers, the standard method is "size()", and for Python, "__len__()". Looking at the standard Java containers, it looks like they took their method names from STL, so the length of a container is "size()". However, biojava uses "length()" .. and so does the Java String class. He he, and GNU C++ defines length() *and* size() with identical implementations. I'm getting the impression that access to a Java String doesn't act like access to a list of characters.

In Perl, again, $# could be tied to the length of a Seq object, but $# doesn't work on strings, only lists. The generic solution is to do length($seq->seq()), which is actually what PrimarySeqI.pm does. It is conceivable that a PrimarySeq implementation may work on genome size data (eg, someone wants to find out how many bases are in chromosome 11), with the interface talking to a database.
Because defining a true string-like object in perl seems hard, the length() method can be used as a work-around to keep from loading the whole sequence in memory.

Again, I am hard pressed to come up with a real use for an unspecified length. The best I can think of is an interface to a sequencing machine. Before you start you don't know how much data will come out. This is also a possible real-life case for having the forward-only iterator mentioned in 1.1).

1.4) Position, lengths and ranges should be used in language appropriate form.

This is the standard "do sequences start at 1, like biologists think about them, or 0 like the language (except for Visual Basic and Fortran)." I strongly believe that if the language is 0 based, then the sequence implementation is 0 based. Output meant for a biologist should be converted to 1 based, but not the internals.

I say "should" in the proposal instead of a "must" because there is much contention about the subject. Biojava has sequences starting at 1 even though Java is a 0 based language.

The part about "lengths" is because some languages have the length of a container be the number of elements in it, while others have it be the offset to the last element.

The part about ranges is because the string "ABCD" when sliced by [1:3] can be:

    "BC"  - Python and C++
    "BCD" - Perl
    "ABC" - biologist
    "AB"  - CORBA/LSR

Also, some languages allow negative subscripts (in Python, seq[-1] refers to the last element of the sequence) and some define out-of-bound semantics (in Python, slices can be out-of-bound, but not indices.)

In my mind, consistency with the language is more important than consistency with the domain, if all such matters can be solved purely as an I/O translation.

1.5) Subsequences can be extracted from sequences given a range, and the subsequence implements the sequence interface.

Again, this is all language dependent.
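One way to read 1.4 and 1.5 together is to keep the internals in the language's own convention and translate only at the I/O boundary. The sketch below is a minimal illustration in current Python 3 (not the Python of this thread); the helper names `one_based_to_slice` and `slice_to_one_based` are hypothetical and not part of any biopython proposal.

```python
def one_based_to_slice(start, end):
    """Turn a 1-based, inclusive range (biologist style) into a Python slice.

    Hypothetical helper: positions 1..3 of "ABCD" mean "ABC".
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Python slices are 0-based and exclude the stop position.
    return slice(start - 1, end)


def slice_to_one_based(start, stop):
    """Turn a 0-based, half-open range into 1-based, inclusive positions."""
    return (start + 1, stop)


seq = "ABCD"
internal = seq[1:3]                        # language convention
biologist = seq[one_based_to_slice(1, 3)]  # domain convention, converted on input
print(internal, biologist, slice_to_one_based(1, 3))
```

The stored sequence never changes its indexing; only the values shown to or read from a biologist are converted.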
Java containers use "subList(int fromIndex, int toIndex)", and so does biojava (though with a base of 1).

C++ would have a constructor taking iterators for the first and last positions. This is nice because the programmer has a chance to define the proper return type - it's the constructor.

Python has a slice operator, so the standard version could probably be something like:

    def __getslice__(self, i, j):
        return self.__class__(self.data[i:j])

(should strides be supported?)

Again, Perl can tie to act like an array, but because the implementation assumes a string, there needs to be some way to get a substring without accessing the full string, and it needs to preserve the interface, so they use a method called "subseq". The name is analogous to the name "substr."

1.6) Mutability - if the sequence class is mutable, then
   a) elements must be changed by a means equivalent to subscripts
   b) ranges must be changed by a means equivalent to subslicing

At the implementor's discretion, neither or both of the following are implemented:
   c) assignment to slices may change the length of the list (eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
   d) elements may be able to be added and removed

At the implementor's discretion:
   e) modification to slices may affect the original sequence

This part of my proposal is meant mostly as a guideline to show that mutability is a complex problem. The phrase "by a means equivalent to" means to use the language appropriate complement to the subscripting of 1.2) and subslicing of 1.5). In C++ and Python, it's via []. In Perl it's substr. In biojava, I'm not sure of the method names.

The discretionary parts are present because they can be hard to implement.
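For readers following along in current Python 3, where `__getslice__` no longer exists and slices reach `__getitem__` as `slice` objects, points 1.5 and 1.6 a) to c) can be sketched as follows. The class name `SimpleSeq` and its list-backed storage are assumptions made for illustration only; because a list slice is a copy, this version does not implement option e).

```python
class SimpleSeq:
    """Hypothetical list-backed sequence; a sketch, not a biopython class."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # 1.5: a subsequence implements the same interface.
            return self.__class__(self.data[index])
        return self.data[index]

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            # 1.6 b) and c): slice assignment, which may change the length.
            self.data[index] = list(value)
        else:
            # 1.6 a): element assignment through subscripting.
            self.data[index] = value

    def __str__(self):
        return "".join(self.data)


a = SimpleSeq("ABCDE")
b = a[2:4]
b[0] = "x"          # changes the copy only
a[1:3] = "NNNN"     # replaces two elements with four
print(str(b), str(a), len(a))
```

Backing the class with a structure that shares memory between a slice and its parent would give the other answer to e), which is the trade-off the proposal leaves open.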
In Python, there are 4 possible sequence-of-character structures: string, list of characters, array.array, and Numeric.array, with different trade offs:
   i) strings cannot be edited in place, so even non-size changing modifications require building new strings
   ii) "list of characters" uses the most memory and slices copy the original sequence (so can be slower and use even more memory)
   iii) array.array of characters also copies from the original sequence
   iv) Numeric.array cannot change size in-place but does e). It isn't installed on every machine.

As an example of e), should the last line of the following return 'C' or 'x'?

>>> a = Seq("ABCDE")
>>> b = a[2:4]
>>> str(b[0])
'C'
>>> str(a[2])
'C'
>>> b[0] = "x"
>>> b[0]
'x'
>>> str(a[2])

It turns out that an implementation with array.array will return 'C' while one with Numeric.array will return 'x'. Which is best?

1.7) If the sequence elements are meant to be viewable, their string value is found through the normal stringification operation.

In the simplest case, str(seq[i]) should return the single (or three) letter code for the given element.

Adding this clause makes it easier to support aligned displays of different sequence types. For example, consider the example where I have a sequence, and an object (a Prosite match object from PS00028) implementing the sequence protocol.

    seq = Seq("KPYECSECRKAFRERSSLINHQRTHTGE")
    pattern = Prosite.compile("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
    match = pattern.search(seq)   # finds "CSECRKAFRERSSLINHQRTH"
    offset = match.start()
    for i in range(len(match)):
        print str(pattern[i]), "\t", str(seq[offset+i])

    C            C
    x            S
    x            E
    C            C
    x            R
    x            K
    x            A
    [LIVMFYWC]   F
    x            R
    (you get the idea)

Since the pattern match acts like a sequence object, it can be used anywhere the concept of "show me each element" can work, with the proviso that the display may need to be able to show more than one character.

Note that there can be other ways to get a name for a residue.
You may prefer the three letter name, or the English name, or the Spanish name. There is a general need for looking up a property of a residue, which will be discussed in one of my future proposals.

1.8) If possible, the stringification of a sequence object must return the string value of the elements such that, when given a sequence of size N and some x in a valid subrange, then:

    str(seq) == str(seq[:x]) + str(seq[x:])

I'm not sure about this one, which is why I added the "if possible". The problem comes when there are separators. Suppose you use a three letter code, then you might want the stringification of a sequence to be ALA-GLY-PRO-ASP instead of ALAGLYPROASP. But splitting the sequence in half and joining them may create

    "ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"

There are a couple of solutions to this.

Track if the end-point is really a terminal, and add the appropriate separator:

    "ALA-GLY-" + "PRO-ASP" -> "ALA-GLY-PRO-ASP"

Use lower case, so the sequence is

    "AlaGly" + "ProAsp" -> "AlaGlyProAsp"

(blech; I can't read that easily)

Require that sequence objects implement a "join" method to join the strings in the right fashion. This is also ugly.

This method will only be used when converting the sequence to a string for use with, eg, a regular expression engine, or for performance reasons. (In Python, it's faster to iterate over string elements than to have the __getitem__ overhead for each element.) This is not a problem in regular strings, since they only have a single character. Thus, the "if possible" phrase probably means "if the underlying encoding is from a single letter alphabet".

===============================

Proposal 2) Encodings

My first proposal topic works fine for sequences which can be expressed as a set of letters/elements. In addition to the normal IUPAC protein/nucleotide usage, here are some of the ways it can be used (if you know of more, please let me know):

A) non-standard residues have their own characters.
Some are mentioned in the IUPAC definitions, like:
     X = selenocysteine; for proteins
     B = Asx = aspartic acid or asparagine; for protein
     W = wyosine; for DNA

B) non-IUPAC usage
   - perhaps the most common is using upper/lower case as an indication of certainty.
   - Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points out that SEG can convert low-information residues to lower case so they act like X in FASTA, but you can still see the original characters in the output

C) non-residue information (I especially want more information about these.)
   - gaps, as in alignments (often with "-", "." or " ")
   - stop codons; "*" is often used to show where a protein residue would be, if the corresponding codon wasn't a stop codon.

D) extra-residue information, still with a single character
   - show secondary structure prediction using "H"elix, "S"trand, "C"oil, "T"urn
   - digit to show % properties, like % conservation

E) non-single character definitions, still per-residue
   - a prosite pattern match ("[FILAPVM]"), esp. as when aligned to the matching protein,
   - alignment of 3-character nucleotides to 1-character amino acids

All of the above scenarios can be modeled with objects implementing the sequence protocol. That is, they are subscriptable, sliceable, stringifiable, etc. I know this is true for A)-D) because I've seen FASTA files used to store all of them, and the records of a FASTA file are easily mapped to the sequence protocol (trivially so, since the data is naturally stored as a string of characters). And E) works because I've restricted it to have a one-to-one mapping to a residue.

When read from a generic FASTA file, the sequence isn't really anything other than a glorified string. In a formal sense, you can't calculate the molecular weight because you don't know if "CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a secondary structure prediction! (BTW, I consider the autoguessing requirement of bioperl to be a flaw because other encodings exist.)
You can't even get the biological sequence length, because it can contain gap characters. Should the length of "A--T" be 2 or 4? I think 2, but that violates my "1.3)", above.

The problem is two-fold; what's the per-residue alphabet, and what's the encoding for non-residue information? (The last part is C) in my list.)

I got to thinking about this problem a lot. Starting from basics, if space and performance were of no concern, a sequence like "APGA..." could be represented as

    from IUPAC.Protein import ALA, PRO, GLY
    sequence = [ALA(), PRO(), GLY(), ALA(), ...]

where PRO, GLY and ALA are constructors for a residue object. This would even solve the typing problem because each residue would be typed, and all you need to assert is that the element types are homogeneous. Also, in Python, a list (the "[]") implements the sequence interface.

In the general case, each residue may be different. For example, some of the residues may be modified - perhaps methylated, either at specific sites or in a statistical sense - or you want to store the 3D coordinates for each one.

If the residues are identical and position independent, there's no need to create new objects, so this could be turned into:

    sequence = [ALA, PRO, GLY, ALA, ...]

These are sometimes called fly-weight objects because the objects are reused. The new "objects" are really just the pointers/references to the real object. This, BTW, is the biojava approach.

Space and performance are important, and there are fewer than 256 of them (okay, fewer than 27), so this is further reduced to:

    sequence_type = "protein"
    sequence = "APGA"

or

    sequence = Seq("APGA", moltype = "protein")

This is the bioperl approach.
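The step from shared residue objects to a compact string can be sketched in current Python 3 as below. This is only an illustration under stated assumptions: the `Residue` class, the `RESIDUES` table and the `expand` and `compress` helpers are hypothetical names, and only three residues are defined.

```python
class Residue:
    """Hypothetical fly-weight residue: one shared instance per residue type."""

    def __init__(self, code, name):
        self.code = code
        self.name = name


ALA = Residue("A", "alanine")
PRO = Residue("P", "proline")
GLY = Residue("G", "glycine")

# One-to-one mapping from letter to the "real" residue object.
RESIDUES = {r.code: r for r in (ALA, PRO, GLY)}


def expand(letters):
    """Compact string -> list of shared residue objects."""
    return [RESIDUES[ch] for ch in letters]


def compress(residues):
    """List of residue objects -> compact string."""
    return "".join(r.code for r in residues)


sequence = expand("APGA")
# Both alanines are the same object, so no per-position storage is paid.
print(sequence[0] is sequence[3], compress(sequence))
```

The round trip only works while every letter maps to exactly one residue object of one molecule type.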
This compression hinges on several predicates:
   a) all residues are identical
   b) there is a one-to-one mapping from letters+seqtype to the "real" residue objects
   c) residues are of homogeneous data types (can't mix protein and rna types)

Using a gap character or stop codon symbol violates c) because there are non-homogeneous types in the container. A stop codon symbol violates b) because there is nothing real about it. It indicates the non-existence of an object.

So those characters (and likely others in different contexts) do not belong as elements of a sequence. They really should be considered part of the sequence itself.

Flipping things around, how would I define something which holds a gapped sequence? I would make a new class which has the sequence and the location/length of the gaps. It takes the sequence string and knows the appropriate way to interpret the gap character. Very much like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding only one sequence.

There is an interesting question here - should GappedSeq be derived from Seq, or vice versa? If GappedSeq is derived from Seq, then by good class design, a GappedSeq is usable anywhere a Seq is usable, but this is false - I can't use the same algorithm to calculate the molecular weight.

Should Seq be derived from GappedSeq? No, since then Seq would also need to be derived from any other possible encoding property, like the "contains stop codons" class.

Should they be the same class? No, since that's just too heavyweight - the Seq must implement every property.

So GappedSeq is not a type of Seq, although there is a data view of a GappedSeq which acts like a Seq. (That's the one encoded in the FASTA file.)

Pause for a moment and reconsider the generic FASTA file reader. What does it return? Assuming you know nothing about what's in the file, it can only be a set of sequence-like objects, with some unknown encoding scheme. At some point, I have to specify the encoding.
Only then can I make the "real" object, whether it be a protein, dna, secondary structure, or whatever.

2.1) Sequences using a finite alphabet must contain an "alphabet" member, which is used to map the alphabet back to the appropriate per-residue type.

Bioperl does not do this - for the most part they assume IUPAC-encoded sequences.

Biojava does store the alphabet as part of the sequence. Even after looking through the code, I can't tell how they know if the sequence type is protein, dna, or whatever. Grepping through everything, the only hits to "protein" are in these two files:

    Annotator.java: * domains in proteins, genes in genomes and all sorts of other things.
    /Biocorba/Seqcore/SeqType.java: public static final int _PROTEIN = 0,

Here's how I want to define the alphabet for biopython:

    class Alphabet:
        size = None     # unspecified/non-constant size per sequence element

    class SingleLetterAlphabet(Alphabet):
        size = 1
        letters = None  # No restriction on the alphabet

    single_letter_alphabet = SingleLetterAlphabet()

    class Protein(SingleLetterAlphabet):
        pass

    class IUPACProtein(Protein):
        letters = "ACDEFGHIKLMNPQRSTVWY"

    class DNA(SingleLetterAlphabet):
        pass

    class IUPACAmbiguousDNA(DNA):
        letters = "GATCRYWSMKHBVDN"

    class IUPACUnambiguousDNA(IUPACAmbiguousDNA):
        pass

    class HasStopCodon:
        stop_codon = "*"
        def __init__(self, stop_codon = stop_codon):
            self.stop_codon = stop_codon

    class ProteinWithStopCodon(IUPACProtein, HasStopCodon):
        def __init__(self, stop_codon = HasStopCodon.stop_codon):
            HasStopCodon.__init__(self, stop_codon)

    class SecondaryStructure(SingleLetterAlphabet):
        letters = "HTSC"

    class Percentage(SingleLetterAlphabet):
        letters = "0123456789"

    class IgnoreLowercaseProteinAlphabet(Protein):
        pass

This approach heavily uses multiple inheritance, which Java won't like, and which I've rarely had to use. Another way is to have a generic associative array, and do lookups. The problem is, I haven't yet implemented this to see how well it will work.
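The associative-array alternative mentioned above is not spelled out in the message, so the following is only a guess at what it might look like, written in current Python 3. Each alphabet is a plain dictionary of properties and lookups replace inheritance; the property names (`kind`, `letters`, `size`, `stop_codon`) and the `is_valid` helper are assumptions made for this sketch.

```python
def make_alphabet(kind, letters=None, size=1, **properties):
    """Build an alphabet as a plain dictionary of properties (hypothetical)."""
    alphabet = {"kind": kind, "letters": letters, "size": size}
    alphabet.update(properties)
    return alphabet


iupac_protein = make_alphabet("protein", "ACDEFGHIKLMNPQRSTVWY")
# Adding a property is a dictionary copy plus one key, not a new subclass.
protein_with_stop = dict(iupac_protein, stop_codon="*")


def is_valid(text, alphabet):
    """Check that every character is defined by the alphabet's properties."""
    letters = alphabet.get("letters")
    if letters is None:
        return True  # no restriction on the alphabet
    allowed = set(letters)
    stop = alphabet.get("stop_codon")
    if stop is not None:
        allowed.add(stop)
    return all(ch in allowed for ch in text)


print(is_valid("APGA*", iupac_protein), is_valid("APGA*", protein_with_stop))
```

Properties compose freely this way, at the cost of losing the type checks that the class hierarchy gives for free.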
By tagging a sequence with an alphabet, the generic FASTA file parser
might look like:

  class EncodedSeq:
      def __init__(self, seq, alphabet):
          self.seq = seq
          self.alphabet = alphabet

  class FastaRecord:
      def __init__(self, desc, seq):
          self.desc = desc
          self.seq = seq

  def read_fasta_record(infile, alphabet = single_letter_alphabet):
      # get the description line
      # get the sequence lines
      # merge sequence lines into one
      return FastaRecord(desc, seq = EncodedSeq(sequence_string, alphabet))

and used like:

  a_fasta_record = read_fasta_record(open("file.aa"))

This input data is untyped, because I didn't specify a type, and it
could be something non-DNA/RNA/protein, like secondary structure
labels. However, adding guess support would be

  a_fasta_record = guess_encoding(read_fasta_record(open("file.aa")))

where guess_encoding() would modify the "encoding" attribute of the
sequence, as need be. Better yet, if I knew that the file contained
standard IUPAC defined protein sequences, then I would do

  a_fasta_record = read_fasta_record(open("file.aa"), protein_alphabet)

(One of my future proposals will be having typed Seq classes, like
ProteinSeq, DNASeq, and RNASeq. The generic FASTA parser then
wouldn't take the alphabet type, but rather it would get an
appropriate factory object.)

But even without guessing, there are still things I can do with a
generic alphabet encoding, like display the sequence, or save it back
as a FASTA file.

This same framework also lets me do some normalizations, like forcing
the sequence to uppercase using a transformation done after the read.
I could also call "verify_encoding" which would check that all
characters are appropriately defined.

For sequence records which specify the type (SWISS-PROT always
contains proteins), there's no need to specify the encoding type on
the call.

Consider once again the "A--T" question. I've read in that record
from my generic FASTA reader, and tied it to a "standard DNA sequence
with '-' encoded gaps."
I still have the problem that the length of that sequence is 2, since
nothing has been done to treat it different than a normal Seq.
Instead, it has to have one more transformation to turn it into the
"GappedSeq" class mentioned earlier.

The way to think of it is as a GappedSeq which is encoded as a Seq.
The problem with the class definition earlier was the confusion in
that normal sequences have no special encoding, or rather, a Seq class
is encoded as itself.

So if I want to read a gapped sequence from a file, I could do
something like:

  gapped_seq = to_gapped(read_fasta_record(StringIO(">test\nA--T\n")))

This would get the sequence data from the record, find the encoding
for the gap character (if any), and use it to make the GappedSeq
object. I can then do:

  len(gapped_seq.sequence)

to find that the unencoded length is indeed 2.

Also, for performance reasons there may be specialized parsers which
implement the whole call chain of

  file -> fasta record + encoding -> to_gapped()

(A biopython project brought up last year was the possibility of
generating some of these specialized parsers automatically.)

The end result of all this is that almost nothing changes from the
current bioperl data structure, except that "moltype" becomes
"encoding" and takes on many more properties.

There will be more in future emails about how the properties work
together, so that you can preserve lower case letters if you want, or
have lower case DNA go to upper case protein, or compute the molecular
weight of a specialized encoding (eg, for selenocysteine). But I
really need to implement it first to make sure it works.

                    Andrew Dalke
                    dalke@acm.org

From katel@worldpath.net Mon, 27 Mar 2000 23:59:23 -0800
Date: Mon, 27 Mar 2000 23:59:23 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] PyUnit

----- Original Message -----
From: Eugene Leitl
To: Cayte
Cc:
Sent: Sunday, March 26, 2000 11:16 PM
Subject: [BioPython] PyUnit

>
> Please do not use HTML in mail messages.
> HTML is insecure (via scripting languages), can be used to trace when
> you've read your message and otherwise reveal your whereabouts.
>

Thank you for the info. Is there a safe way to send web sites?

                    Cayte

From jchang@SMI.Stanford.EDU Thu, 30 Mar 2000 01:50:07 -0800 (PST)
Date: Thu, 30 Mar 2000 01:50:07 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] sequence proposals (long)

On Mon, 27 Mar 2000, Andrew Dalke wrote:
> Hello,
>
> I've been thinking about how to define the basic sequence classes for
> biopython. I've got a set of proposals on the topic which I would
> love to get feedback about. Since I have a tendency to write long
> emails, I've broken them up into several messages, which I'll be
> sending over the next week or so.

Yikes! And this is the broken-up one!

> This one contains my proposals for the basic sequence protocol, and an
> idea of how to handle alphabets and encodings. I have a bias towards
> Python, but I also compare the Perl, Java and C++ ways of doing
> things.

Sure. I don't think it would be a failure if biopython were to make
sequence classes that were biased (even heavily) toward python's way
of doing things. I'd rather have something that works well here,
rather than sequences that suck equally on all languages! ;)

> Proposal 1) the basic sequence interface
>
> Conceptually, people think about sequences as a string - a list of
> residues. A residue can be an object (as with biojava's ResidueList)
> or a character (as with bioperl's PrimarySeq). Since I am a fan of
> generic programming, I want the basic sequence interface to act the
> same as a string, when possible.

Do you mean sequences should support the same slicing semantics?
After python 1.6, strings will become objects, with their own methods,
and how string objects and biological sequences act will diverge.

> Here is a list of the interface I think all sequences, and
> sequence-like objects, should implement.
>
> 1.1) There is a way to iterate forward through the sequence, element
> by element. When finished, all elements will have been visited in
> order. Forward-only iterators should be rarely implemented.

[discussion of forward iterators]

> 1.2) Random access sequences must be integer subscriptable via the
> appropriate means for a string for the given language.

[... I'm liberally cutting things from the email, for length and
relevance reasons. I hope that I'm not leaving anything without the
proper context. I apologize if I do!]

> 1.3) If the sequence length is known, there must be a way to get
> access to it in constant time. The returned value must be usable to
> get access to the last element of the list.

[...]

> 1.4) Position, lengths and ranges should be used in language
> appropriate form.
>
> This is the standard "do sequences start at 1, like biologists think
> about them or 0 like the language (except for Visual Basic and
> Fortran)." I strongly believe if the language is 0 based, then the
> sequence implementation is 0 based. Output meant for a biologist
> should be converted to 1 based, but not the internals.

Yes, this has been covered here before, and IIRC, that was the
consensus.

> I say "should" in the proposal instead of a "must" because there is
> much contention about the subject. Biojava has sequences starting at 1
> even though Java is a 0 based language.

In defense of the biojava people, having 1-based sequences seems less
offensive in java than it would in python, where the subscripting is
overloadable. It feels to me like the semantics of the indexes for a
method call is less stringently enforced than that of subscripting,
where the syntax is built into the language.

> 1.5) Subsequences can be extracted from sequences given a range, and
> the subsequence implements the sequence interface.

[...]

> (should strides be supported?)

What's a stride?
> 1.6) Mutability - if the sequence class is mutable, then
>   a) elements must be changed by a means equivalent to subscripts
>   b) ranges must be changed by a means equivalent to subslicing
> At the implementor's discretion, neither or both of the following
> are implemented:
>   c) assignment to slices may change the length of the list
>        (eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
>   d) elements may be able to be added and removed
> At the implementor's discretion:
>   e) modification to slices may affect the original sequence
>
> This part of my proposal is meant mostly as a guideline to show that
> mutability is a complex problem.

Agreed.

(Slight tangent) I'm not sure if you've mentioned it explicitly, but
we're going to need both mutable and immutable sequences. Immutable
sequences are necessary in order to guarantee the consistency between
the sequence information and any annotations that may have been
carried with it. Because it would be so hairy otherwise, I propose
that any annotated sequences must be immutable.

> As an example of e), should the last line of the following return
> 'C' or 'x'?
>
>   >>> a = Seq("ABCDE")
>   >>> b = a[2:4]
>   >>> str(b[0])
>   'C'
>   >>> str(a[2])
>   'C'
>   >>> b[0] = "x"
>   >>> b[0]
>   'x'
>   >>> str(a[2])
>
>
> It turns out that an implementation with array.array will return 'C'
> while one with Numeric.array will return 'x'. Which is best?

I don't have a definitive answer for this, but can probably add to the
confusion. I'm not really a fan of the Numeric way of doing this,
because it breaks the usual python idiom where a[:] creates a copy of
the list a. However, the Numeric way does save a lot of memory when
accessing just a region of a large matrix (or DNA sequence).

> 1.7) If the sequence elements are meant to be viewable, their string
> value is found through the normal stringification operation.
>
> In the simplest case, str(seq[i]) should return the single (or three)
> letter code for the given element.
> Adding this clause makes it easier to support aligned displays of
> different sequence types.

Yes, but I'm not sure we need to allow this kind of flexibility. I
believe str should just return a human-readable string, and leave
specialized formatting to other functions.

> Note that there can be other ways to get a name for a residue. You
> may prefer the three letter name, or the english name, or the spanish
> name. There is a general need for looking up a property of a
> residue, which will be discussed in one of my future proposals.

OK.

> 1.8) If possible, the stringification of a sequence object must return
> the string value of the elements such that, when given a sequence of
> size N and some x in a valid subrange, then:
>
>   str(seq) == str(seq[:x]) + str(seq[x:])
>
>
> I'm not sure about this one, which is why I added the "if possible".
> The problem comes when there are separators. Suppose you use a three
> letter code, then you might want the stringification of a sequence to
> be: ALA-GLY-PRO-ASP instead of ALAGLYPROASP. But splitting the
> sequence in half and joining them may create
>
>   "ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"

From what I understand, str is just supposed to return a
human-readable string, useful for interactive mode or debugging. I'm
uncomfortable imposing other constraints on it.

> ===============================
>
> Proposal 2) Encodings
>
> My first proposal topic works fine for sequences which can be
> expressed as a set of letters/elements. In addition to the normal
> IUPAC protein/nucleotide usage, here are some of the ways it can be
> used (if you know of more, please let me know):
>
>  A) non-standard residues have their own characters. Some are
>     mentioned in the IUPAC definitions, like:
>       X = selenocysteine; for proteins
>       B = Asx = aspartic acid or asparagine; for protein
>       W = wyosine; for DNA
>  B) non-IUPAC usage
>     - perhaps the most common is using upper/lower case as an
>       indication of certainty.
>     - Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points
>       out that SEG can convert low-information residues to lower case
>       so they act like X in FASTA, but you can still see the original
>       characters in the output
>  C) non-residue information
>      (I especially want more information about these.)
>     - gaps, as in alignments (often with "-", "." or " ")
>     - stop codons; "*" is often used to show where a protein residue
>       would be, if the corresponding codon wasn't a stop codon.
>  D) extra-residue information, still with a single character
>     - show secondary structure prediction using "H"elix, "S"trand,
>       "C"oil, "T"urn
>     - digit to show % properties, like % conservation
>  E) non-single character definitions, still per-residue
>     - a prosite pattern match ("[FILAPVM]"), esp. as when aligned
>       to the matching protein,
>     - alignment of 3-character nucleotides to 1-character amino acids
>
> All of the above scenarios can be modeled with objects implementing
> the sequence protocol. That is, they are subscriptable, sliceable,
> stringifiable, etc. I know this is true for A)-D) because I've seen
> FASTA files used to store all of them, and the records of a FASTA file
> are easily mapped to the sequence protocol (trivially so, since the
> data is naturally stored as a string of characters). And E) works
> because I've restricted it to have a one-to-one mapping to a residue.
>
> When read from a generic FASTA file, the sequence isn't really
> anything other than a glorified string. In a formal sense, you can't
> calculate the molecular weight because you don't know if
> "CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a
> secondary structure prediction! (BTW, I consider the autoguessing
> requirement of bioperl to be a flaw because other encodings exist.)
>
> You can't even get the biological sequence length, because it can
> contain gap characters. Should the length of "A--T" be 2 or 4? I
> think 2, but that violates my "1.3)", above.

It depends on what you consider the sequence length.

I don't consider "A-T" to be a biological sequence. I think the
sequence is "AT", with extra information embedded within. B-D above
all describe cases in which information other than the sequence is
contained in the string representation of the sequence. This should
be stored in some other place, and the sequence protocol preserved for
the actual sequence.

For example:
  >>> seq = GappedSequence("AT-G--C")
  >>> seq[1:3]
  'TG'
  >>> seq.gapped[1:3]
  'T-G'

Here, the sequence slicing returns the actual biological sequence,
while the gapped representation is delegated to another object. The
actual storage of the gap information is unspecified.

This would solve the problem discussed on the bioperl list, where
doing upper or lower was destroying the information.

> The problem is two-fold; what's the per-residue alphabet, and what's
> the encoding for non-residue information? (The last part is C) in my
> list.)
>
>
> I got to thinking about this problem a lot. Starting from basics, if
> space and performance were of no concern, a sequence like "APGA..."
> could be represented as
>
>   from IUPAC.Protein import ALA, PRO, GLY
>   sequence = [ALA(), PRO(), GLY(), ALA(), ...]
>
>
> where PRO, GLY and ALA are constructors for a residue object. This
> would even solve the typing problem because each residue would be
> typed, and all you need to assert is that the element types are
> homogeneous. Also, in Python, a list (the "[]") implements the
> sequence interface.
>
> In the general case, each residue may be different. For example, some
> of the residues may be modified - perhaps methylated, either at
> specific sites or in a statistical sense - or you want to store the 3D
> coordinates for each one.
>
> If the residues are identical and position independent, there's no
> need to create new objects, so this could be turned into:
>
>   sequence = [ALA, PRO, GLY, ALA, ...]
>
> These are sometimes called fly-weight objects because the objects are
> reused. The new "objects" are really just the pointers/references to
> the real object. This, BTW, is the biojava approach.
>
> Space and performance are important, and there are fewer than 256 of
> them (okay, fewer than 27), so this is further reduced to:
>
>   sequence_type = "protein"
>   sequence = "APGA"
> or
>   sequence = Seq("APGA", moltype = "protein")
>
> This is the bioperl approach.
>
> This compression hinges on several predicates:
>   a) all residues are identical
>   b) there is a one-to-one mapping from letters+seqtype to the "real"
>      residue objects
>   c) residues are of homogeneous data types (can't mix protein and
>      rna types)
>
> Using a gap character or stop codon symbol violates c) because it puts
> non-homogeneous types in the container. A stop codon symbol violates
> b) because there is nothing real about it. It indicates the
> non-existence of an object.
>
> So those characters (and likely others in different contexts) do not
> belong as elements of a sequence. They really should be considered
> part of the sequence itself.
>
> Flipping things around, how would I define something which holds a
> gapped sequence? I would make a new class which has the sequence and
> the location/length of the gaps. It takes the sequence string and
> knows the appropriate way to interpret the gap character. Very much
> like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding
> only one sequence.
>
>
> There is an interesting question here - should GappedSeq be derived
> from Seq, or vice versa? If GappedSeq is derived from Seq, then by
> good class design, a GappedSeq is usable anywhere a Seq is usable, but
> this is false - I can't use the same algorithm to calculate the
> molecular weight.

Yes. I think GappedSeq will need to behave like Seq, with the
gap-specific stuff in methods specific to GappedSeq. Unless there's
something I'm missing... (It's getting late).

> 2.1) Sequences using a finite alphabet must contain an "alphabet"
> member, which is used to map the alphabet back to the appropriate
> per-residue type.

Seems reasonable. I've never had to use an alphabet, but I've not
been in a situation where the requirement would've gotten in my way,
either.

[...]

> There will be more in future emails about how the properties work
> together, so that you can preserve lower case letters if you want, or
> have lower case DNA go to upper case protein, or compute the molecular
> weight of a specialized encoding (eg, for selenocysteine). But I
> really need to implement it first to make sure it works.

Looking forward to them!

Jeff

>
>
>                     Andrew Dalke
>                     dalke@acm.org
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
>

From dalke@acm.org Thu, 30 Mar 2000 14:23:59 -0700
Date: Thu, 30 Mar 2000 14:23:59 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] proposal 3

This is a short one.

From what I can tell, a sequence is characterized by a list of
residues, which I'm encoding via a list of letters and an alphabet
description. Is that all that's needed for a minimal data structure?

I can think of one more - are the end points physical end points, or
parts of a larger structure?

The characters in a string are not exactly one-to-one equivalent to
the residues. Consider the carboxyl end of a protein. Because it's a
terminal, it contains the extra "O-H", so its mass and atom count
will be higher than any other residue in the middle of the sequence
with the same letter.

So my proposal is:

Proposal 3 - The ends of a sequence may correspond to physical ends
of the real sequence. This data is stored in the attribute "endings",
which has two elements, "left" and "right". (Left is position 0.)
The possible values for the elements are UNKNOWN, TERMINAL,
NONTERMINAL.
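
One way the "endings" attribute of Proposal 3 might be represented is
sketched below in present-day Python. The class name EndedSeq, the
subsequence() method and the mass table are assumptions made up for
the illustration; the residue masses are placeholders, and the H and
O-H masses are rounded to 1.0 and 17.0. UNKNOWN ends are treated as
contributing nothing, which is also an assumption.

```python
# Hypothetical sketch of Proposal 3; names and numbers are illustrative.
UNKNOWN, TERMINAL, NONTERMINAL = "unknown", "terminal", "nonterminal"

# Placeholder residue masses for illustration only (not reference values).
MASS_TABLE = {"A": 71.0, "G": 57.0}
H_MASS, OH_MASS = 1.0, 17.0  # rounded; together the mass of one water


class EndedSeq:
    def __init__(self, data, left=UNKNOWN, right=UNKNOWN):
        self.data = data
        # "left" describes position 0, "right" the last position.
        self.endings = {"left": left, "right": right}

    def subsequence(self, start, stop):
        """Slice with non-negative bounds; a newly cut end is NONTERMINAL."""
        left = self.endings["left"] if start == 0 else NONTERMINAL
        right = (self.endings["right"] if stop >= len(self.data)
                 else NONTERMINAL)
        return EndedSeq(self.data[start:stop], left, right)


def total_mass(seq, mass_table=MASS_TABLE):
    """Sum residue masses, adding H / O-H only at TERMINAL ends."""
    mass = sum(mass_table[c] for c in seq.data)
    if seq.endings["left"] == TERMINAL:
        mass += H_MASS
    if seq.endings["right"] == TERMINAL:
        mass += OH_MASS
    return mass
```

In this sketch the masses of two halves of a sequence add up to the
mass of the whole, because each half remembers which of its ends is a
real terminal. The price is that every operation producing a new
sequence has to set both endings correctly.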

The only results I can see it affecting are the atom count and mass
calculations, and only if the functions to calculate the count and
mass are accurate.

Let me explain that. An accurate mass calculation function might
look like:

  def total_mass(seq, mass_table):
      mass = 0.0
      for c in seq:
          mass = mass + mass_table[c]
      return mass + 18.0  # the extra H and O-H for the terminals

In this case,

  mass(seq) != mass(seq[:len(seq)/2]) + mass(seq[len(seq)/2:])

because the 18.0 is added twice on the right hand side. Thus,
slicing corresponds to a physical cut of the protein, as compared to
being a subsection of the string. It doesn't let you answer "what is
the mass contribution of the first half of the sequence?"

This information will only be used rarely (as proof, biojava and
bioperl don't track this data). Adding it means that every
constructor and factory function and generative function must do the
right thing for the ending. There are three possible values for each
end, which makes things complex. That code will be run often, but
since the data is almost never used, this work will be wasted.

There would also need to be a function other than subslicing used to
modify the ending. Eg,

  seq.cut(10, 40, chop = (TERMINAL, UNKNOWN))

Since I don't like the complexity and performance hits, I'm against
the proposal.

                    Andrew
                    dalke@acm.org

From dalke@acm.org Thu, 30 Mar 2000 14:24:09 -0700
Date: Thu, 30 Mar 2000 14:24:09 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] sequence proposals (long)

> Sure. I don't think it would be a failure if biopython were
> to make sequence classes that were biased (even heavily) toward
> python's way of doing things. I'd rather have something that
> works well here, rather than sequences that suck equally on
> all languages! ;)

Yes. But if there are three "natural" ways to do something in
Python, and one of them is common with the Perl and Java ways, then I
would rather choose the common one.

> Do you mean sequences should support the same slicing semantics?

Yes.

> After python 1.6, strings will become objects, with their own
> methods, and how string objects and biological sequences act
> will diverge.

Ohh, good point. Currently a string is little more than a byte
array, and I was thinking just of the list-like interfaces.

> [... I'm liberally cutting things from the email, for length and
> relevance reasons. I hope that I'm not leaving anything without
> the proper context. I apologize if I do!]

No problem. The extra text was commentary/justification meant to
back up my proposals. I was thinking of resending just the proposal
part; thanks for doing so.

> It feels to me like the semantics of the indexes for a method
> call is less stringently enforced than that of subscripting,
> where the syntax is built into the language.

I hadn't thought of that before. Sounds reasonable.

> What's a stride?

The step length in a slice. The default is 1, so [1:5] returns 4
characters. [1:5:2] returns 2 characters (at positions 1 and 3).

  >>> import string, Numeric
  >>> a = Numeric.array("This is a test.")
  >>> import string
  >>> string.join(a[1:12:2], "")
  'hsi  e'
  >>>

> I'm not sure if you've mentioned it explicitly, but
> we're going to need both mutable and immutable sequences.

I hadn't mentioned it, but it is true. I've got classes for both
types; the mutable one is based off of array.array.

> Because it would be so hairy otherwise, I propose that any
> annotated sequences must be immutable

Agreed.

> However, the Numeric way does save a lot of memory when accessing
> just a region of a large matrix (or DNA sequence).

I was thinking about that. It's also possible to have subsequences
return a proxy object, which references back to the main sequence
only when needed. There's a higher per-subsequence object cost, but
the original object was really big. It becomes rather more difficult
to implement and use these.
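
A minimal sketch of the proxy idea described above, in present-day
Python. SeqView is a made-up name, and the sketch is read-only and
ignores strides; it only shows that a subsequence can hold a reference
and two offsets instead of a copy.

```python
# Hypothetical sketch; SeqView is an illustrative name, not a biopython class.
class SeqView:
    """Read-only window onto a parent string; data is copied only on demand."""

    def __init__(self, parent, start, stop):
        self.parent = parent
        self.start = start
        self.stop = stop

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step != 1:
                raise ValueError("strides are not supported in this sketch")
            stop = max(stop, start)  # an empty slice keeps length zero
            # A slice of a view is another view on the same parent.
            return SeqView(self.parent, self.start + start, self.start + stop)
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("SeqView index out of range")
        return self.parent[self.start + index]

    def tostring(self):
        """Materialize the window as a real string (this is the copy)."""
        return self.parent[self.start:self.stop]
```

For example, SeqView("ACGTACGTAC", 2, 8)[1:4] is a second small object
that still points at the original string, and only tostring() copies
any characters. Whether a write through such a view should change the
parent is exactly the question raised under 1.6 e).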
What is a case where the subsequence copies must be nearly genome
sized?

> Yes, but I'm not sure we need to allow this kind of flexibility.
> I believe str should just return a human-readable string, and
> leave specialized formatting to other functions.

You are right. Using the stringification operator is not the right
choice. Looking at the other character array objects (Numeric.array
and array.array), the proper method is "tostring()". If Python 1.6
strings have a tostring() method, returning itself, then I would be
pleased. I'll ask about/for that on the Python list.

> It depends on what you consider the sequence length.
>
> I don't consider "A-T" to be a biological sequence.

Right. I've since changed my alphabet proposal so that gaps are not
types of physical alphabets, but are encodings around alphabets.

> For example:
>   >>> seq = GappedSequence("AT-G--C")
>   >>> seq[1:3]
>   'TG'
>   >>> seq.gapped[1:3]
>   'T-G'

I would have it the other way around, where the default subscript
contains the '-' and the ".ungapped" attribute yields the sequence.
This makes it easier to compare relative positions of a sequence with
a gapped sequence.

                    Andrew
                    dalke@acm.org

From jchang@SMI.Stanford.EDU Thu, 30 Mar 2000 14:00:53 -0800 (PST)
Date: Thu, 30 Mar 2000 14:00:53 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] sequence proposals (long)

> [Jeff]
> > For example:
> >   >>> seq = GappedSequence("AT-G--C")
> >   >>> seq[1:3]
> >   'TG'
> >   >>> seq.gapped[1:3]
> >   'T-G'

> [Andrew]
> I would have it the other way around, where the default subscript
> contains the '-' and the ".ungapped" attribute yields the sequence.
> This makes it easier to compare relative positions of a sequence
> with a gapped sequence.

It looks like there's 2 things going on here. In this example, one
is getting a display-able representation of the sequence, where you
can make inferences on character lengths and such, and the other is
accessing the biological sequence.
More generally, all sequences need to support some way of getting the
biological sequence, and possibly other access methods depending on
the requirements of the class.

Maybe all sequences will need to support at least biological sequence
access, in addition to displayable representation? I'm beginning to
worry about sliding down a slippery slope towards large classes,
though...

The disadvantage of having it the other way around is that people who
want to access the underlying biological sequence (without gap
characters) will need to do it a different way for every type of
sequence.

Jeff
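
The two access paths discussed in this thread can be sketched in
present-day Python as follows. The class below follows the variant in
which the default subscript keeps the gap characters and an "ungapped"
attribute yields the biological sequence; the class name, the gap_char
parameter and the ungapped_position() helper are assumptions made up
for the illustration, and plain Python slice semantics are used
throughout.

```python
# Hypothetical sketch; GappedSeq and its members are illustrative names.
class GappedSeq:
    def __init__(self, text, gap_char="-"):
        self.text = text
        self.gap_char = gap_char

    def __len__(self):
        # Length of the displayable form, gap characters included.
        return len(self.text)

    def __getitem__(self, index):
        # Default subscript: the displayable, gapped representation.
        return self.text[index]

    @property
    def ungapped(self):
        # The biological sequence, with every gap character removed.
        return self.text.replace(self.gap_char, "")

    def ungapped_position(self, column):
        """Map a display column to an ungapped index, or None at a gap."""
        if self.text[column] == self.gap_char:
            return None
        return column - self.text.count(self.gap_char, 0, column)
```

With seq = GappedSeq("AT-G--C"), seq[1:4] gives the display slice
'T-G', seq.ungapped[1:3] gives the biological slice 'TG', and
seq.ungapped_position(3) relates the two coordinate systems. Swapping
which view is the default only changes which of the two gets the
subscript syntax.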