From katel@worldpath.net Sun, 19 Mar 2000 22:59:43 -0800
Date: Sun, 19 Mar 2000 22:59:43 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] SwissProt
So far, my tests have shown these mismatches. I don't understand the
third. Is there a non-printing character in the text? The file is
o59832.sp. I've already posted it to
ftp://bio.perl.org/pub/katel/SwissProt/
C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_organelle 111
expected is ['Tight junction', 'Transmembrane.'] actual is ['Transmembrane.\012']
C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_keywords 135
expected is ['O59832'] actual is ['']
C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
SwissProtTestCase.py test_accessions 89
expected is Homo sapiens (Human). actual is Homo sapiens (Human).
Cayte
From jchang@SMI.Stanford.EDU Sun, 19 Mar 2000 22:08:57 -0800 (PST)
Date: Sun, 19 Mar 2000 22:08:57 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] SwissProt
On Sun, 19 Mar 2000, Cayte wrote:
> So far, my tests have shown these mismatches. I don't understand
> the third. Is there a non-printing character in the text? The file
> is o59832.sp. I've already posted it to
> ftp://bio.perl.org/pub/katel/SwissProt/
>
> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_organelle 111
> expected is ['Tight junction', 'Transmembrane.'] actual is ['Transmembrane.\012']
I'm not sure I understand this output. It seems like this goes with
test_keywords, and the next error message goes with test_accessions.
However, the groupings of the lines seem to suggest otherwise. Am I not
understanding things correctly?
Yep, this is definitely broken. I've made the fixes and checked them into
CVS.
> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_keywords 135
> expected is ['O59832'] actual is ['']
Yep, fixed this too.
> C:\BIOPYT~1\UnitTests\UnitTestCase.py assert_equals 72
> SwissProtTestCase.py test_accessions 89
> expected is Homo sapiens (Human). actual is Homo sapiens (Human).
Yeah, I'm not sure what's going on here. When reporting values of
variables, try using the repr function. This will output strings without
any character interpretation.
>>> a = 'hello world!\n'
>>> print repr(a)
'hello world!\012'
>>>
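The same advice carries over to later Python versions, where repr() shows a newline as \n rather than \012. A minimal sketch, assuming (hypothetically, since the thread never shows what actually differed) that the two organism strings differ only by a trailing newline:

```python
# Hypothetical values: the thread does not confirm what the real difference was.
expected = "Homo sapiens (Human)."
actual = "Homo sapiens (Human).\n"

# %r formats each value with repr(), so control characters show up as escapes
# instead of being interpreted by the terminal.
message = "expected is %r actual is %r" % (expected, actual)
print(message)
```

Printed this way, the two values no longer look identical in the test report.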
Another thing I've noticed from 095832 is that it has unusually
problematic database cross reference lines! I guess that's why you chose
this entry. I made some changes so the parser will handle these better.
I've also made some fixes so that the parser now strips the trailing
newlines at the end of some strings.
So when are you going to get write access so you can integrate your
testing code into the CVS tree? :)
Jeff
From chrisf@fagmed.uit.no Mon, 20 Mar 2000 12:32:30 +0100
Date: Mon, 20 Mar 2000 12:32:30 +0100
From: Chris Fenton chrisf@fagmed.uit.no
Subject: [BioPython] Hello and congrats on the marriage of python with biology
Firstly, I am new to the list, so hello.
Secondly, I am rather new to python, but am convinced it is a better
language for large co-op projects than perl (do not misunderstand I like
perl).
Thirdly, I am no coding guru, but am willing to help.
I am unsure as to what has already been done as far as a code base,
but I reckon if biopython is to grow it is dependent on people donating
code.
I am polishing some simple python code that allows users to download
sequences from the net (Entrez site):
Specify (query, db, max display) to retrieve a list of uid numbers,
retrieve records (uid, format) and save to file (file).
I do not know if you guys have similar code, or if this could help.
If not, is there something that needs to be done (testing or porting some
perl to python)? I will certainly give it a try.
At the moment I am looking at 'pythonizing' Aceperl but again am unsure if it
has not already been done.
From jchang@SMI.Stanford.EDU Mon, 20 Mar 2000 08:51:10 -0800 (PST)
Date: Mon, 20 Mar 2000 08:51:10 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] Hello and congrats on the marriage of python with biology
Hi Chris,
Welcome to biopython! There's certainly a lot of code to be written, so
we're accepting contributions. There's already some code in there to
query Entrez for medline entries. Perhaps your sequence retrieval code
could be merged into that.
Do you have access to CVS? If so, please download a copy of the
repository (instructions at http://cvs.biopython.org/) and take a browse
around. The code is still under heavy development, though, and stability
varies across files.
There's also a README file in there that outlines things that will need to
be done before a 0.1-alpha release.
Jeff
From katel@worldpath.net Sun, 26 Mar 2000 18:44:29 -0800
Date: Sun, 26 Mar 2000 18:44:29 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] SwissProt
under organism classification, my tests gave this result for o95832.
expected is ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Eutheria', 'Primates', 'Catarrhini', 'Hominidae', 'Homo.']
actual is ['Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Mammalia', 'Primates', 'Catarrhini', 'Hominidae', 'Homo']
The first item in each line is dropped.
Cayte
From jchang@SMI.Stanford.EDU Sun, 26 Mar 2000 17:45:54 -0800 (PST)
Date: Sun, 26 Mar 2000 17:45:54 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] SwissProt
It is, indeed. I've fixed this and checked the fixes into the repository.
Thanks,
Jeff
From katel@worldpath.net Sun, 26 Mar 2000 22:39:48 -0800
Date: Sun, 26 Mar 2000 22:39:48 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] PyUnit
I just fixed a bug in PyUnit and I added code to report success or
failure for each test. The new code can be found at:
ftp://bio.perl.org/pub/katel/biopython/UnitTests/
Cayte
From eugene.leitl@lrz.uni-muenchen.de Sun, 26 Mar 2000 23:16:20 -0800 (PST)
Date: Sun, 26 Mar 2000 23:16:20 -0800 (PST)
From: Eugene Leitl eugene.leitl@lrz.uni-muenchen.de
Subject: [BioPython] PyUnit
Please do not use HTML in mail messages. HTML is insecure (via
scripting languages), can be used to trace when you've read your
message and otherwise reveal your whereabouts.
Also, malicious HTML tags can simply crash your machine. After saving
your files, go to http://4.3.78.106/ for a demonstration. Strange how
a quite innocent tag can crash your *operating system*.
Cayte writes:
> I just fixed a bug in PyUnit and I added code to report success or failure for each test. The new code can be found at:
> ftp://bio.perl.org/pub/katel/biopython/UnitTests/
>
>
> Cayte
From dalke@acm.org Mon, 27 Mar 2000 04:05:56 -0700
Date: Mon, 27 Mar 2000 04:05:56 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] sequence proposals (long)
Hello,
I've been thinking about how to define the basic sequence classes for
biopython. I've got a set of proposals on the topic which I would
love to get feedback about. Since I have a tendency to write long
emails, I've broken them up into several messages, which I'll be
sending over the next week or so. This one contains my proposals for
the basic sequence protocol, and an idea of how to handle alphabets
and encodings. I have a bias towards Python, but I also compare the
Perl, Java and C++ ways of doing things.
Proposal 1) the basic sequence interface
Conceptually, people think about sequences as a string - a list of
residues. A residue can be an object (as with biojava's ResidueList)
or a character (as with bioperl's PrimarySeq). Since I am a fan of
generic programming, I want the basic sequence interface to act the
same as a string, when possible.
Here is the interface I think all sequences, and
sequence-like objects, should implement.
1.1) There is a way to iterate forward through the sequence, element
by element. When finished, all elements will have been visited in
order. Forward-only iterators should be rarely implemented.
(This requirement is here because it's the most minimal
"sequence-like" description I can think of.)
In C++ or Java, this means that sequences implement a forward
iterator. This is sufficient for many algorithms, like computing the
molecular weight or translating from DNA to protein.
Not all languages support different types of iterators as well as C++.
For example, Python only has a random access iterator. If the
underlying data structure is random access (eg, the residues are held
in an array), then this is not a problem. If the underlying data is
unidirectional (eg, a linked list), then there are problems.
Resolution of the problem is outside the scope of this proposal. For
Python, I *suggest* the following non-thread safe solution:
    def __init__(self, ...):
        self._pos = 0
    def __getitem__(self, i):
        assert i == self._pos, "forward iteration only"
        return self.next()
    def next(self):
        self._pos = self._pos + 1
        # do whatever is needed to get and return the next item,
        # or raise IndexError if not available
but better yet, don't use unidirectional data storage.
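The protocol this suggestion relies on (a for loop calling __getitem__ with 0, 1, 2, ... until IndexError is raised) still works in current Python, so the idea can be fleshed out into something runnable. A minimal sketch; the class name ForwardOnlySeq and the use of iter() as the one-way data source are illustrative assumptions, and like the suggestion above it is not thread safe:

```python
class ForwardOnlySeq:
    """Hypothetical wrapper that hands out residues strictly in order."""

    def __init__(self, source):
        self._source = iter(source)  # any one-way stream of residues
        self._pos = 0                # index of the next item we may hand out

    def __getitem__(self, i):
        # Only the next position may be requested; anything else is misuse.
        assert i == self._pos, "forward iteration only"
        return self.next()

    def next(self):
        try:
            item = next(self._source)
        except StopIteration:
            # IndexError is what ends a for-loop driven by __getitem__.
            raise IndexError("end of sequence") from None
        self._pos = self._pos + 1
        return item


residues = [r for r in ForwardOnlySeq("ACGT")]
print(residues)
```

Asking for any index other than the next one trips the assertion, which is the whole point: algorithms written against this class cannot quietly depend on random access.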
Perl doesn't really have an iterator over string characters. Instead,
explicit ones are implemented by conversion to list (via 'split'), or
via substr of an integer position. Implicit iterators exist in
functions like split, s///, tr//, etc.
It's hard for me to think of cases when only having a forward iterator
(as compared to random access) makes sense, implementation-wise. It's
needed to simplify algorithms.
1.2) Random access sequences must be integer subscriptable via the
appropriate means for a string for the given language.
C++ and Python let user classes redefine subscripting (via operator[]
or __getitem__).
Perl can also let user classes act like arrays, using TIEARRAY,
according to perltie. This should mean that element lookups can look
like $seq[5] instead of substr($seq->seq(), 5, 1). Bioperl does not
implement their class this way, preferring that people access the underlying
data object as a string. This is appropriate since strings are not
subscriptable in perl (I can't do: $a = "Perl"; print $a[1]).
If non-character based sequence classes are ever implemented in Perl
(eg, storing 3-letter codes instead of 1-letter, or used as a data
view to a 3D structure), then I suggest they look into tying element
lookup.
Java, from what I recall, doesn't have operator overloading, so the
"appropriate means" is to use a method. In my attempt to understand
how Java works, I see that ResidueList has a way to return the list of
residues as a List. Java's java.util.List uses "get(int index)" to
return the object at the given position, so I would think that
ResidueList implements a method named "get()" as well.
ResidueList, instead, uses "Residue residueAt(int index)" to return
the given position. Looking at my src.jar, I see that String has a
"charAt" and Vector has "elementAt", so I guess this determined the
appropriate naming scheme. Could someone explain to me the different
names, and the reason for the variation in prefixes (that is,
{residue,char,element}At, instead of just elementAt)?
1.3) If the sequence length is known, there must be a way to get
access to it in constant time. The returned value must be usable to
get access to the last element of the list.
That last part means that with a random access container, I should be
able to use the length (possibly +/- 1) to seek directly to the last
element. And with a forward iterator, I should be able to count
"length" (again, +/-1) times and be at the end.
I would like to add the requirement that access to the length be
consistent with other objects in the language. For example, in C++
STL containers, the standard method is "size()", and for
Python, "__len__()".
Looking at the standard Java containers, it looks like they took their
method names from STL, so the length of a container is "size()".
However, biojava uses "length()" .. and so does the Java String class.
He he, and GNU C++ defines length() *and* size() with identical
implementations. I'm getting the impression that access to a Java
String doesn't act like access to a list of characters.
In Perl, again, $# could be tied to the length of a Seq object, but $#
doesn't work on strings, only lists. The generic solution is to do
length($seq->seq()), which is actually what PrimarySeqI.pm does. It
is conceivable that a PrimarySeq implementation may work on genome
size data (eg, someone wants to find out how many bases are in
chromosome 11), with the interface talking to a database. Because
defining a true string-like object in perl seems hard, the length()
method can be used as a work-around to keep from loading the whole
sequence in memory.
Again, I am hard pressed to come up with a real use for an unspecified
length. The best I can think of is considering an interface to a
sequencing machine. Before you start you don't know how much data
will come out. This is also a possible real-life case for having a
forward-only iterator mentioned in 1.1).
1.4) Position, lengths and ranges should be used in language
appropriate form.
This is the standard "do sequences start at 1, like biologists think
about them or 0 like the language (except for Visual Basic and
Fortran)." I strongly believe if the language is 0 based, then the
sequence implementation is 0 based. Output meant for a biologist
should be converted to 1 based, but not the internals.
I say "should" in the proposal instead of a "must" because there is
much contention about the subject. Biojava has sequences starting at 1
even though Java is a 0 based language.
The part about "lengths" is because some languages have the length of
a container be the number of elements in it, while others have it be
the offset to the last element.
The part about ranges is because the string "ABCD" when sliced by
[1:3] can be:
"BC" - Python and C++
"BCD" - Perl
"ABC" - biologist
"AB" - CORBA/LSR
Also, some languages allow negative subscripts (in Python, seq[-1]
refers to the last element of the sequence) and some define
out-of-bound semantics (in Python, slices can be out-of-bound, but
not indices.)
In my mind, consistency with the language is more important than
consistency with the domain, if all such matters can be solved purely
as an I/O translation.
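The four readings of "ABCD" sliced by [1:3] listed in 1.4 can be reproduced with plain Python slices. This is only a sketch of the conventions; in particular the CORBA/LSR line is an assumed reading (1-based start, end excluded) chosen because it reproduces the "AB" given above, not a statement of that specification.

```python
s = "ABCD"
start, end = 1, 3

python_style = s[start:end]             # 0-based, end excluded
perl_style = s[start:start + end]       # like substr($s, 1, 3): 0-based offset, then a length
biologist_style = s[start - 1:end]      # 1-based, both ends included
corba_lsr_style = s[start - 1:end - 1]  # assumed reading: 1-based start, end excluded

for name, value in [("Python and C++", python_style), ("Perl", perl_style),
                    ("biologist", biologist_style), ("CORBA/LSR", corba_lsr_style)]:
    print(repr(value), "-", name)
```

Each line is a pure index translation of the same underlying string, which is the sense in which the differences can be handled at the I/O boundary.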
1.5) Subsequences can be extracted from sequences given a range, and
the subsequence implements the sequence interface.
Again, this is all language dependent. Java containers use
"subList(int fromIndex, int toIndex)", and so does biojava (though with
a base of 1).
C++ would have a constructor taking iterators for the first and last
positions. This is nice because the programmer has a chance to define
the proper return type - it's the constructor.
Python has a slice operator, so the standard version could probably be
something like:
    def __getslice__(self, i, j):
        return self.__class__(self.data[i:j])
(should strides be supported?)
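In later Python versions __getslice__ is gone and every slice, strided or not, arrives at __getitem__ as a slice object, so supporting strides costs nothing extra. A minimal sketch under that assumption; the Seq class here is a stand-in for illustration, not an agreed biopython design:

```python
class Seq:
    """Stand-in sequence class wrapping a string (or any sliceable data)."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Covers seq[i:j] and strided seq[i:j:k]; keeps the sequence type.
            return self.__class__(self.data[index])
        return self.data[index]  # a single element

    def __str__(self):
        return str(self.data)


seq = Seq("ABCDE")
sub = seq[1:3]          # a Seq, so it implements the sequence interface again
every_other = seq[::2]  # strided slice
print(str(sub), str(every_other))
```

Using self.__class__ rather than Seq means a subclass gets its own type back from a slice, as in the __getslice__ version above.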
Again, Perl can tie to act like an array, but because the
implementation assumes a string, there needs to be some way to get a
substring without accessing the full string, and it needs to preserve
the interface, so they use a method called "subseq". The name is
analogous to the name "substr."
1.6) Mutability - if the sequence class is mutable, then
a) elements must be changed by a means equivalent to subscripts
b) ranges must be changed by a means equivalent to subslicing
At the implementor's discretion, neither or both of the following
are implemented:
c) assignment to slices may change the length of the list
(eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
d) elements may be able to be added and removed
At the implementor's discretion:
e) modification to slices may affect the original sequence
This part of my proposal is meant mostly as a guideline to show that
mutability is a complex problem.
The phrase "by a means equivalent to" means to use the language
appropriate complement to the subscripting of 1.2) and subslicing of
1.5). In C++ and Python, it's via []. In Perl it's substr. In
biojava, I'm not sure of the method names.
The discretionary parts are present because they can be hard to
implement. In Python, there are 4 possible sequence-of-character
structures: string, list of characters, array.array, and
Numeric.array, with different trade offs:
i) strings cannot be edited in place, so even non-size changing
modifications require building new strings
ii) "list of characters" uses the most memory and slices copy the
original sequence (so can be slower and use even more memory)
iii) array.array of characters also copies from the original sequence
iv) Numeric.array cannot change size in-place but does e). It isn't
installed on every machine.
As an example of e), should the last line of following return 'C' or
'x'?
>>> a = Seq("ABCDE")
>>> b = a[2:4]
>>> str(b[0])
'C'
>>> str(a[2])
'C'
>>> b[0] = "x"
>>> b[0]
'x'
>>> str(a[2])
It turns out that an implementation with array.array will return 'C'
while one with Numeric.array will return 'x'. Which is best?
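The difference between the two behaviours can be reproduced with the standard library alone. This sketch uses array.array for the copying case and, purely as a stand-in for Numeric.array (an assumption, since Numeric may not be installed), a memoryview for the shared-storage case:

```python
from array import array

a = array("B", b"ABCDE")     # unsigned bytes holding the residue characters

b = a[2:4]                   # slicing an array.array copies the elements
b[0] = ord("x")
after_copy_edit = chr(a[2])  # the original is untouched

v = memoryview(a)[2:4]       # slicing a memoryview shares storage with a
v[0] = ord("x")
after_view_edit = chr(a[2])  # the original now sees the change

print(after_copy_edit, after_view_edit)
```

Neither answer is wrong; the point of e) is that a sequence class has to pick one and document it.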
1.7) If the sequence elements are meant to be viewable, their string
value is found through the normal stringification operation.
In the simplest case, str(seq[i]) should return the single, (or three
letter) code for the given element. Adding this clause makes it
easier to support aligned displays of different sequence types.
For example, consider the example where I have a sequence, and an
object (a Prosite match object from PS00028) implementing the sequence
protocol.
    seq = Seq("KPYECSECRKAFRERSSLINHQRTHTGE")
    pattern = Prosite.compile("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
    match = pattern.search(seq)
    # finds "CSECRKAFRERSSLINHQRTH"
    offset = match.start()
    # the match object implements the sequence protocol, so index it directly
    for i in range(len(match)):
        print str(match[i]), "\t", str(seq[offset+i])

    C            C
    x            S
    x            E
    C            C
    x            R
    x            K
    x            A
    [LIVMFYWC]   F
    x            R
    (you get the idea)
Since the pattern match acts like a sequence object, it can be used
anywhere the concept of "show me each element" can work; with the
proviso that the display may need to be able to show more than one
character.
Note that there can be other ways to get a name for a residue. You
may prefer the three letter name, or the English name, or the Spanish
name. There is a general need for looking up a property of a
residue, which will be discussed in one of my future proposals.
1.8) If possible, the stringification of a sequence object must return
the string value of the elements such that, when given a sequence of
size N and some x in a valid subrange, then:
str(seq) == str(seq[:x]) + str(seq[x:])
I'm not sure about this one, which is why I added the "if possible".
The problem comes when there are separators. Suppose you use a three
letter code, then you might want the stringification of a sequence to
be: ALA-GLY-PRO-ASP instead of ALAGLYPROASP. But splitting the
sequence in half and joining them may create
"ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"
There are a couple of solutions to this. Track if the end-point is
really a terminal, and add the appropriate separator:
"ALA-GLY-" + "PRO-ASP" -> "ALA-GLY-PRO-ASP"
Use lower cases, so the sequence is
"AlaGly" + "ProAsp" -> "AlaGlyProAsp"
(blech; I can't read that easily)
Require that sequence objects implement a "join" method to join the
strings in the right fashion. This is also ugly.
This method will only be used when converting the sequence to a string
for use with, eg, a regular expression engine, or for performance
reasons. (In Python, it's faster to iterate over string elements than
having the __getitem__ overhead for each element.)
This is not a problem in regular strings, since they only have a
single character. Thus, the "if possible" phrase probably means "if
the underlying encoding is from a single letter alphabet".
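The invariant in 1.8 and the way separators break it can be checked mechanically. A minimal sketch; ThreeLetterSeq and splits_cleanly are hypothetical names made up for this illustration:

```python
class ThreeLetterSeq:
    """Hypothetical sequence of three-letter codes, joined by a separator."""

    def __init__(self, codes, sep="-"):
        self.codes = list(codes)
        self.sep = sep

    def __getitem__(self, index):
        if isinstance(index, slice):
            return ThreeLetterSeq(self.codes[index], self.sep)
        return self.codes[index]

    def __str__(self):
        return self.sep.join(self.codes)


def splits_cleanly(seq, x):
    """True if str(seq) == str(seq[:x]) + str(seq[x:]) holds at position x."""
    return str(seq) == str(seq[:x]) + str(seq[x:])


plain = "APGD"                                               # single letter alphabet
dashed = ThreeLetterSeq(["ALA", "GLY", "PRO", "ASP"])        # "ALA-GLY-PRO-ASP"
undashed = ThreeLetterSeq(["ALA", "GLY", "PRO", "ASP"], "")  # "ALAGLYPROASP"

print(splits_cleanly(plain, 2), splits_cleanly(dashed, 2), splits_cleanly(undashed, 2))
```

Only the separator version fails, which matches the reading that "if possible" means a fixed-width encoding without separators.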
===============================
Proposal 2) Encodings
My first proposal topic works fine for sequences which can be
expressed as a set of letters/elements. In addition to the normal
IUPAC protein/nucleotide usage, here are some of the ways it can be
used (if you know of more, please let me know):
A) non-standard residues have their own characters. Some are
mentioned in the IUPAC definitions, like:
X = selenocysteine; for proteins
B = Asx = aspartic acid or asparagine; for protein
W = wyosine; for DNA
B) non-IUPAC usage
- perhaps the most common is using upper/lower case as an
indication of certainty.
- Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points
out that SEG can convert low-information residues to lower case
so they act like X in FASTA, but you can still see the original
characters in the output
C) non-residue information
(I especially want more information about these.)
- gaps, as in alignments (often with "-", "." or " ")
- stop codons; "*" is often used to show where a protein residue
would be, if the corresponding codon wasn't a stop codon.
D) extra-residue information, still with a single character
- show secondary structure prediction using "H"elix, "S"trand,
"C"oil, "T"urn
- digit to show % properties, like % conservation
E) non-single character definitions, still per-residue
- a prosite pattern match ("[FILAPVM]"), esp. as when aligned
to the matching protein,
- alignment of 3-character nucleotides to 1-character amino acids
All of the above scenarios can be modeled with objects implementing
the sequence protocol. That is, they are subscriptable, sliceable,
stringifiable, etc. I know this is true for A)-D) because I've seen
FASTA files used to store all of them, and the records of a FASTA file
are easily mapped to the sequence protocol (trivially so, since the
data is naturally stored as a string of characters). And E) works
because I've restricted it to have a one-to-one mapping to a residue.
When read from a generic FASTA file, the sequence isn't really
anything other than a glorified string. In a formal sense, you can't
calculate the molecular weight because you don't know if
"CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a
secondary structure prediction! (BTW, I consider the autoguessing
requirement of bioperl to be a flaw because other encodings exist.)
You can't even get the biological sequence length, because it can
contain gap characters. Should the length of "A--T" be 2 or 4? I
think 2, but that violates my "1.3)", above.
The problem is two-fold; what's the per-residue alphabet, and what's
the encoding for non-residue information? (The last part is C) in my
list.)
I got to thinking about this problem a lot. Starting from basics, if
space and performance were of no concern, a sequence like "APGA..."
could be represented as
from IUPAC.Protein import ALA, PRO, GLY
sequence = [ALA(), PRO(), GLY(), ALA(), ...]
where PRO, GLY and ALA are constructors for a residue object. This
would even solve the typing problem because each residue would be
typed, and all you need to assert is that the element types are
homogeneous. Also, in Python, a list (the "[]") implements the
sequence interface.
In the general case, each residue may be different. For example, some
of the residues may be modified - perhaps methylated, either at
specific sites or in a statistical sense - or you want to store the 3D
coordinates for each one.
If the residues are identical and position independent, there's no
need to create new objects, so this could be turned into:
sequence = [ALA, PRO, GLY, ALA, ...]
These are sometimes called fly-weight objects because the objects are
reused. The new "objects" are really just the pointers/references to
the real object. This, BTW, is the biojava approach.
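The fly-weight idea can be sketched in a few lines. The Residue class and its fields are hypothetical stand-ins for whatever IUPAC.Protein would export; the point is only that the list holds references to three shared objects rather than four separate ones:

```python
class Residue:
    """Hypothetical residue type; one shared instance per kind of residue."""

    def __init__(self, letter, name):
        self.letter = letter
        self.name = name


# One object per residue kind, created once and reused everywhere.
ALA = Residue("A", "alanine")
PRO = Residue("P", "proline")
GLY = Residue("G", "glycine")

sequence = [ALA, PRO, GLY, ALA]  # a plain list already implements the sequence interface

as_string = "".join(r.letter for r in sequence)
print(as_string, sequence[0] is sequence[3])
```

Because the first and last elements are the same object, per-position data such as a modification or 3D coordinates cannot be stored on them, which is the restriction stated above.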
Space and performance are important, and there are fewer than 256
residue types (okay, fewer than 27), so this is further reduced to:
sequence_type = "protein"
sequence = "APGA"
or
sequence = Seq("APGA", moltype = "protein")
This is the bioperl approach.
This compression hinges on several predicates:
a) all residues are identical
b) there is one-to-one mapping from letters+seqtype to the "real"
residue objects
c) residues are of homogeneous data types (can't mix protein and
rna types)
Using a gap character or stop codon symbol violates c) because it puts
non-homogeneous types in the container. A stop codon symbol violates
b) because there is nothing real about it. It indicates the
non-existence of an object.
So those characters (and likely others in different contexts) do not
belong as elements of a sequence. They really should be considered
part of the sequence itself.
Flipping things around, how would I define something which holds a
gapped sequence? I would make a new class which has the sequence and
the location/length of the gaps. It takes the sequence string and
knows the appropriate way to interpret the gap character. Very much
like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding
only one sequence.
There is an interesting question here - should GappedSeq be derived
from Seq, or vice versa? If GappedSeq is derived from Seq, then by
good class design, a GappedSeq is usable anywhere a Seq is usable, but
this is false - I can't use the same algorithm to calculate the
molecular weight.
Should Seq be derived from GappedSeq? No, since then Seq would also
need to be derived from any other possible encoding property, like the
"contains stop codons" class.
Should they be the same class? No, since that's just too heavyweight-
the Seq must implement every property.
So GappedSeq is not a type of Seq, although there is a data view of a
GappedSeq which acts like a Seq. (That's the one encoded in the FASTA
file.)
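One way to picture a class that "has the sequence and the location/length of the gaps" is sketched below. The method names (from_string, ungapped, to_string) and the (position, length) representation are assumptions made for this illustration, not a settled design:

```python
class GappedSeq:
    """Sketch: an ungapped sequence plus a list of (position, length) gaps."""

    def __init__(self, residues, gaps):
        self.residues = residues  # the real residues, no gap characters
        self.gaps = gaps          # each gap sits before residues[position]

    @classmethod
    def from_string(cls, text, gap_char="-"):
        residues, gaps, run = [], [], 0
        for ch in text:
            if ch == gap_char:
                run += 1
            else:
                if run:
                    gaps.append((len(residues), run))
                    run = 0
                residues.append(ch)
        if run:  # trailing gap
            gaps.append((len(residues), run))
        return cls("".join(residues), gaps)

    def ungapped(self):
        """The data a plain Seq would hold."""
        return self.residues

    def to_string(self, gap_char="-"):
        """The string view, as it would appear in a FASTA file."""
        out, prev = [], 0
        for pos, length in self.gaps:
            out.append(self.residues[prev:pos])
            out.append(gap_char * length)
            prev = pos
        out.append(self.residues[prev:])
        return "".join(out)


g = GappedSeq.from_string("A--T")
print(len(g.ungapped()), len(g.to_string()))
```

Here the biological length (2) and the display length (4) are both available without either one pretending to be the length of a Seq.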
Pause for a moment and reconsider the generic FASTA file reader. What
does it return? Assuming you know nothing about what's in the file,
it can only be a set of sequence-like objects, with some unknown
encoding scheme. At some point, I have to specify the encoding. Only
then can I make the "real" object, whether it be a protein, dna,
secondary structure, or whatever.
2.1) Sequences using a finite alphabet must contain an "alphabet"
member, which is used to map the alphabet back to the appropriate
per-residue type.
Bioperl does not do this - for the most part they assume IUPAC-encoded
sequences.
Biojava does store the alphabet as part of the sequence. Even after
looking through the code, I can't tell how they know if the sequence
type is protein, dna, or whatever. Grepping through everything, the
only hits to "protein" are in these two files:
Annotator.java:
* domains in proteins, genes in genomes and all sorts of other things.
/Biocorba/Seqcore/SeqType.java: public static final int _PROTEIN = 0,
Here's how I want to define the alphabet for biopython:
    class Alphabet:
        size = None  # unspecified/non-constant size per sequence element

    class SingleLetterAlphabet(Alphabet):
        size = 1
        letters = None  # No restriction on the alphabet

    single_letter_alphabet = SingleLetterAlphabet()

    class Protein(SingleLetterAlphabet):
        pass

    class IUPACProtein(Protein):
        letters = "ACDEFGHIKLMNPQRSTVWY"

    class DNA(SingleLetterAlphabet):
        pass

    class IUPACAmbiguousDNA(DNA):
        letters = "GATCRYWSMKHBVDN"

    class IUPACUnambiguousDNA(IUPACAmbiguousDNA):
        letters = "GATC"  # only the four unambiguous bases

    class HasStopCodon:
        stop_codon = "*"
        def __init__(self, stop_codon = stop_codon):
            self.stop_codon = stop_codon

    class ProteinWithStopCodon(IUPACProtein, HasStopCodon):
        def __init__(self, stop_codon = HasStopCodon.stop_codon):
            HasStopCodon.__init__(self, stop_codon)

    class SecondaryStructure(SingleLetterAlphabet):
        letters = "HTSC"

    class Percentage(SingleLetterAlphabet):
        letters = "0123456789"

    class IgnoreLowercaseProteinAlphabet(Protein):
        pass
This approach heavily uses multiple inheritance, which Java won't
like, and which I've rarely had to use. Another way is to have a
generic associative array, and do lookups. The problem is, I haven't
yet implemented this to see how well it will work.
By tagging a sequence with an alphabet, the generic FASTA file parser
might look like:
    class EncodedSeq:
        def __init__(self, seq, alphabet):
            self.seq = seq
            self.alphabet = alphabet

    class FastaRecord:
        def __init__(self, desc, seq):
            self.desc = desc
            self.seq = seq

    def read_fasta_record(infile, alphabet = single_letter_alphabet):
        # get the description line (into desc)
        # get the sequence lines
        # merge sequence lines into one (sequence_string)
        return FastaRecord(desc, seq = EncodedSeq(sequence_string, alphabet))
and used like:
a_fasta_record = read_fasta_record(open("file.aa"))
This input data is untyped, because I didn't specify a type, and it
could be something non-DNA/RNA/protein, like secondary structure
labels. However, adding guess support would be
a_fasta_record = guess_encoding(read_fasta_record(open("file.aa")))
where guess_encoding() would modify the "encoding" attribute of the
sequence, as need be. Better yet, if I knew that the file contained
standard IUPAC defined protein sequences, then I would do
a_fasta_record = read_fasta_record(open("file.aa"), protein_alphabet)
(One of my future proposals will be having typed Seq classes, like
ProteinSeq, DNASeq, and RNASeq. The generic FASTA parser then
wouldn't take the alphabet type, but rather it would get an
appropriate factory object.)
But even without guessing, there are still things I can do with a
generic alphabet encoding, like display the sequence, or save it back
as a FASTA file.
This same framework also lets me do some normalizations, like force
sequence to uppercase using a transformation done after the read. I
could also call "verify_encoding" which would check that all
characters are appropriately defined.
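To make the read / normalize / verify chain concrete, here is a small
runnable sketch. It is only an assumption of how the pieces could fit
together: the alphabet is simplified to a string of allowed letters (None
meaning untyped), the reader handles only the first record of a file, and
normalize_upper() is a name invented for this example.

```python
from io import StringIO

class EncodedSeq:
    def __init__(self, seq, alphabet):
        self.seq = seq
        self.alphabet = alphabet   # here: a string of letters, or None

class FastaRecord:
    def __init__(self, desc, seq):
        self.desc = desc
        self.seq = seq

def read_fasta_record(infile, alphabet=None):
    """Read the first FASTA record; alphabet=None means 'untyped'."""
    desc = infile.readline().rstrip("\n")
    if not desc.startswith(">"):
        raise ValueError("expected a '>' description line")
    parts = []
    for line in infile:
        if line.startswith(">"):
            break   # next record starts; this sketch reads only one
        parts.append(line.strip())
    return FastaRecord(desc[1:], EncodedSeq("".join(parts), alphabet))

def normalize_upper(record):
    """Transformation applied after the read: force upper case."""
    record.seq.seq = record.seq.seq.upper()
    return record

def verify_encoding(record):
    """True if every character is allowed by the record's alphabet."""
    letters = record.seq.alphabet
    return letters is None or all(c in letters for c in record.seq.seq)

infile = StringIO(">test\nacde\nfghi\n")
record = normalize_upper(read_fasta_record(infile, "ACDEFGHIKLMNPQRSTVWY"))
print(record.desc, record.seq.seq, verify_encoding(record))
```

Note that verification only succeeds here because the normalization ran
first; the lower-case input would fail against the upper-case letter set.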
For sequence records which specify type, (SWISS-PROT always contains
proteins), then there's no need to specify the encoding type on the
call.
Consider once again the "A--T" question. I've read in that record
from my generic FASTA reader, and tied it to a "standard DNA sequence
with '-' encoded gaps." I still have the problem that the length of
that sequence is 4, since nothing has been done to treat it differently
from a normal Seq.
Instead, it has to have one more transformation to turn it into the
"GappedSeq" class mentioned earlier. The way to think of it is as a
GappedSeq which is encoded as a Seq. The problem with the class
definition earlier was the confusion in that normal sequences have no
special encoding, or rather, a Seq class is encoded as itself.
So if I want to read a gapped sequence from a file, I could do
something like:
gapped_seq = to_gapped(read_fasta_record(StringIO(">test\nA--T\n")))
This would get the sequence data from the record, find the encoding
for the gap character (if any), and use it to make the GappedSeq
object. I can then do:
len(gapped_seq.sequence)
to find that the unencoded length is indeed 2.
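A minimal sketch of that transformation, under a few assumptions: this
to_gapped() takes the encoded string directly rather than a FASTA record,
the gap character is passed in instead of being looked up from the
encoding, and gaps are stored as (position, length) pairs. to_encoded() is
a hypothetical inverse added to show that no information is lost.

```python
class GappedSeq:
    """Hypothetical: the ungapped residues plus where the gaps were."""
    def __init__(self, sequence, gaps):
        self.sequence = sequence   # residues only, no gap characters
        self.gaps = gaps           # list of (position_in_sequence, length)

def to_gapped(encoded, gap_char="-"):
    """Decode a gap-encoded string into a GappedSeq."""
    residues = []
    gaps = []
    run = 0
    for c in encoded:
        if c == gap_char:
            run += 1
        else:
            if run:
                gaps.append((len(residues), run))
                run = 0
            residues.append(c)
    if run:   # trailing gap
        gaps.append((len(residues), run))
    return GappedSeq("".join(residues), gaps)

def to_encoded(gapped, gap_char="-"):
    """Re-encode a GappedSeq as a plain string with gap characters."""
    gap_at = dict(gapped.gaps)
    out = []
    for i, c in enumerate(gapped.sequence):
        out.append(gap_char * gap_at.get(i, 0))
        out.append(c)
    out.append(gap_char * gap_at.get(len(gapped.sequence), 0))
    return "".join(out)

gapped_seq = to_gapped("A--T")
print(len(gapped_seq.sequence), gapped_seq.gaps, to_encoded(gapped_seq))
```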
Also, for performance reasons there may be specialized parsers which
implement the whole call chain of
file -> fasta record + encoding -> to_gapped()
(A biopython project brought up last year was the possibility of
generating some of these specialized parsers automatically.)
The end result of all this is that almost nothing changes from the
current bioperl data structure, except that "moltype" becomes
"encoding" and takes on many more properties.
There will be more in future emails about how the properties work
together, so that you can preserve lower case letters if you want, or
have lower case DNA go to upper case protein, or compute the molecular
weight of a specialized encoding (eg, for selenocysteine). But I
really need to implement it first to make sure it works.
Andrew Dalke
dalke@acm.org
From katel@worldpath.net Mon, 27 Mar 2000 23:59:23 -0800
Date: Mon, 27 Mar 2000 23:59:23 -0800
From: Cayte katel@worldpath.net
Subject: [BioPython] PyUnit
----- Original Message -----
From: Eugene Leitl
To: Cayte
Cc:
Sent: Sunday, March 26, 2000 11:16 PM
Subject: [BioPython] PyUnit
>
> Please do not use HTML in mail messages. HTML is insecure (via
> scripting languages), can be used to trace when you've read your
> message and otherwise reveal your whereabouts.
>
Thank you for the info. Is there a safe way to send web sites?
Cayte
From jchang@SMI.Stanford.EDU Thu, 30 Mar 2000 01:50:07 -0800 (PST)
Date: Thu, 30 Mar 2000 01:50:07 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] sequence proposals (long)
On Mon, 27 Mar 2000, Andrew Dalke wrote:
> Hello,
>
> I've been thinking about how to define the basic sequence classes for
> biopython. I've got a set of proposals on the topic which I would
> love to get feedback about. Since I have a tendency to write long
> emails, I've broken them up into several messages, which I'll be
> sending over the next week or so.
Yikes! And this is the broken-up one!
> This one contains my proposals for the basic sequence protocol, and an
> idea of how to handle alphabets and encodings. I have a bias towards
> Python, but I also compare the Perl, Java and C++ ways of doing
> things.
Sure. I don't think it would be a failure if biopython were to make
sequences classes that were biased (even heavily) toward python's way of
doing things. I'd rather have something that works well here, rather than
sequences that suck equally on all languages! ;)
> Proposal 1) the basic sequence interface
>
> Conceptually, people think about sequences as a string - a list of
> residues. A residue can be an object (as with biojava's ResidueList)
> or a character (as with bioperl's PrimarySeq). Since I am a fan of
> generic programming, I want the basic sequence interface to act the
> same as a string, when possible.
Do you mean sequences should support the same slicing semantics? After
python 1.6, strings will become objects, with their own methods, and how
string objects and biological sequences act will diverge.
> Here is a list of the interface I think all sequences, and
> sequence-like objects, should implement.
>
> 1.1) There is a way to iterate forward through the sequence, element
> by element. When finished, all elements will have been visited in
> order. Forward-only iterators should be rarely implemented.
[discussion of forward iterators]
> 1.2) Random access sequences must be integer subscriptable via the
> appropriate means for a string for the given language.
[... I'm liberally cutting things from the email, for length and
relevance reasons. I hope that I'm not leaving anything without the
proper context. I apologize if I do!]
> 1.3) If the sequence length is known, there must be a way to get
> access to it in constant time. The returned value must be usable to
> get access to the last element of the list.
[...]
> 1.4) Position, lengths and ranges should be used in language
> appropriate form.
>
> This is the standard "do sequences start at 1, like biologists think
> about them or 0 like the language (except for Visual Basic and
> Fortran)." I strongly believe if the language is 0 based, then the
> sequence implementation is 0 based. Output meant for a biologist
> should be converted to 1 based, but not the internals.
Yes, this has been covered here before and, IIRC, that was the consensus.
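For what it's worth, the conversion at the output boundary is tiny. A
sketch, assuming Python-style 0-based half-open ranges internally and
1-based inclusive ranges for display (the function names are invented for
this example):

```python
def to_display_range(start, stop):
    """Convert a 0-based half-open range to 1-based inclusive."""
    return start + 1, stop

def from_display_range(first, last):
    """Convert a 1-based inclusive range to 0-based half-open."""
    return first - 1, last

seq = "ABCDE"
start, stop = 2, 4                      # internal: seq[2:4] is "CD"
first, last = to_display_range(start, stop)
print("residues %d-%d: %s" % (first, last, seq[start:stop]))
```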
> I say "should" in the proposal instead of a "must" because there is
> much contention about the subject. Biojava has sequences starting at 1
> even though Java is a 0 based language.
In defense of the biojava people, having 1-based sequences seems less
offensive in java than it would in python, where the subscripting is
overloadable. It feels to me like the semantics of the indexes for a
method call is less stringently enforced than that of subscripting, where
the syntax is built into the language.
> 1.5) Subsequences can be extracted from sequences given a range, and
> the subsequence implements the sequence interface.
[...]
> (should strides be supported?)
What's a stride?
> 1.6) Mutability - if the sequence class is mutable, then
> a) elements must be changed by a means equivalent to subscripts
> b) ranges must be changed by a means equivalent to subslicing
> At the implementor's discretion, neither or both of the following
> are implemented:
> c) assignment to slices may change the length of the list
> (eg, seq[1:5] = "AAAAAAAAAAAAAAAAAAAAAAAAA")
> d) elements may be able to be added and removed
> At the implementor's discretion:
> e) modification to slices may affect the original sequence
>
> This part of my proposal is meant mostly as a guideline to show that
> mutability is a complex problem.
Agreed.
(Slight tangent) I'm not sure if you've mentioned it explicitly, but
we're going to need both mutable and immutable sequences. Immutable
sequences are necessary in order to guarantee the consistency between the
sequence information and any annotations that may have been carried with
it. Because it would be so hairy otherwise, I propose that any annotated
sequences must be immutable.
> As an example of e), should the last line of following return 'C' or
> 'x'?
>
> >>> a = Seq("ABCDE")
> >>> b = a[2:4]
> >>> str(b[0])
> 'C'
> >>> str(a[2])
> 'C'
> >>> b[0] = "x"
> >>> b[0]
> 'x'
> >>> str(a[2])
>
>
> It turns out that an implementation with array.array will return 'C'
> while one with Numeric.array will return 'x'. Which is best?
I don't have a definitive answer for this, but can probably add to the
confusion. I'm not really a fan of the Numeric way of doing this, because
it breaks the usual python idiom where a[:] creates a copy of the list a.
However, the Numeric way does save a lot of memory when accessing just a
region of a large matrix (or DNA sequence).
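The two behaviours can be seen side by side with only the standard
library. This is a sketch, not the Numeric implementation: a memoryview
over an array.array is used here as a stand-in for Numeric's shared-storage
slices, and the sequence is stored as byte values.

```python
from array import array

a = array("B", b"ABCDE")

# Slicing an array.array copies, so the original is untouched.
copied = a[2:4]
copied[0] = ord("x")
after_copy = chr(a[2])

# Slicing a memoryview shares storage, so the original changes.
view = memoryview(a)[2:4]
view[0] = ord("x")
after_view = chr(a[2])

print(after_copy, after_view)
```

The view costs no extra memory for the slice, which is the saving mentioned
above, but every holder of a view can change the parent sequence.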
> 1.7) If the sequence elements are meant to be viewable, their string
> value is found through the normal stringification operation.
>
> In the simplest case, str(seq[i]) should return the single, (or three
> letter) code for the given element. Adding this clause makes it
> easier to support aligned displays of different sequence types.
Yes, but I'm not sure we need to allow this kind of flexibility. I
believe str should just return a human-readable string, and leave
specialized formatting to other functions.
> Note that there can be other ways to get a name for a residue. You
> may prefer the three letter name, or the english name, or the spanish
> name. There is a general need for looking up a property of a
> residue, which will be discussed in one of my future proposals.
OK.
> 1.8) If possible, the stringification of a sequence object must return
> the string value of the elements such that, when given a sequence of
> size N and some x in a valid subrange, then:
>
> str(seq) == str(seq[:x]) + str(seq[x:])
>
>
> I'm not sure about this one, which is why I added the "if possible".
> The problem comes when there are separators. Suppose you use a three
> letter code, then you might want the stringification of a sequence to
> be: ALA-GLY-PRO-ASP instead of ALAGLYPROASP. But splitting the
> sequence in half and joining them may create
>
> "ALA-GLY" + "PRO-ASP" -> "ALA-GLYPRO-ASP"
From what I understand, str is just supposed to return a human-readable
string, useful for interactive mode or debugging. I'm uncomfortable
imposing other constraints on it.
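The separator problem is easy to reproduce. A small sketch, where
ThreeLetterSeq is a class invented for this example whose str() joins
three-letter codes with "-":

```python
class ThreeLetterSeq:
    """Hypothetical sequence of three-letter residue codes."""
    def __init__(self, residues):
        self.residues = list(residues)

    def __len__(self):
        return len(self.residues)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return ThreeLetterSeq(self.residues[index])
        return self.residues[index]

    def __str__(self):
        return "-".join(self.residues)

seq = ThreeLetterSeq(["ALA", "GLY", "PRO", "ASP"])
whole = str(seq)
rejoined = str(seq[:2]) + str(seq[2:])
print(whole, rejoined, whole == rejoined)
```

With a human-oriented str(), the concatenation rule from 1.8 fails, which
is one more reason to keep str() free of such guarantees.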
> ===============================
>
> Proposal 2) Encodings
>
> My first proposal topic works fine for sequences which can be
> expressed as a set of letters/elements. In addition to the normal
> IUPAC protein/nucleotide usage, here are some of the ways it can be
> used (if you know of more, please let me know):
>
> A) non-standard residues have their own characters. Some are
> mentioned in the IUPAC definitions, like:
> X = selenocysteine; for proteins
> B = Asx = aspartic acid or asparagine; for protein
> W = wyosine; for DNA
> B) non-IUPAC usage
> - perhaps the most common is using upper/lower case as an
> indication of certainty.
> - Aaron J Mackey (ajm6q@virginia.edu) on bioperl-guts points
> out that SEG can convert low-information residues to lower case
> so they act like X in FASTA, but you can still see the original
> characters in the output
> C) non-residue information
> (I especially want more information about these.)
> - gaps, as in alignments (often with "-", "." or " ")
> - stop codons; "*" is often used to show where a protein residue
> would be, if the corresponding codon wasn't a stop codon.
> D) extra-residue information, still with a single character
> - show secondary structure prediction using "H"elix, "S"trand,
> "C"oil, "T"urn
> - digit to show % properties, like % conservation
> E) non-single character definitions, still per-residue
> - a prosite pattern match ("[FILAPVM]"), esp. as when aligned
> to the matching protein,
> - alignment of 3-character nucleotides to 1-character amino acids
>
> All of the above scenarios can be modeled with objects implementing
> the sequence protocol. That is, they are subscriptable, sliceable,
> stringifiable, etc. I know this is true for A)-D) because I've seen
> FASTA files used to store all of them, and the records of a FASTA file
> are easily mapped to the sequence protocol (trivially so, since the
> data is naturally stored as a string of characters). And E) works
> because I've restricted it to have a one-to-one mapping to a residue.
>
> When read from a generic FASTA file, the sequence isn't really
> anything other than a glorified string. In a formal sense, you can't
> calculate the molecular weight because you don't know if
> "CCCCCHHHHHHHTCCC" is a protein or nucleotide sequence, or maybe a
> secondary structure prediction! (BTW, I consider the autoguessing
> requirement of bioperl to be a flaw because other encodings exist.)
>
> You can't even get the biological sequence length, because it can
> contain gap characters. Should the length of "A--T" be 2 or 4? I
> think 2, but that violates my "1.3)", above.
It depends on what you consider the sequence length.
I don't consider "A-T" to be a biological sequence. I think the sequence
is "AT", with extra information embedded within. B-D above all describe
cases in which information other than the sequence is contained in the
string representation of the sequence. This should be stored in some
other place, and the sequence protocol preserved for the actual sequence.
For example:
>>> seq = GappedSequence("AT-G--C")
>>> seq[1:3]
'TG'
>>> seq.gapped[1:3]
'T-G'
Here, the sequence slicing returns the actual biological sequence, while
the gapped representation is delegated to another object. The actual
storage of the gap information is unspecified.
This would solve the problem discussed on the bioperl list, where doing
upper or lower was destroying the information.
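Here is one way the example could be made to work. It is only a sketch
under my own assumptions: the gap information is kept as a list mapping
each biological position to its position in the gapped string, and
GappedView is a helper class invented for this example.

```python
class GappedView:
    """Slices in biological coordinates, returns gapped text."""
    def __init__(self, text, index_map):
        self._text = text
        self._index_map = index_map   # ungapped position -> gapped position

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self._index_map))
            if step != 1:
                raise ValueError("strides are not supported in this sketch")
            if start >= stop:
                return ""
            first = self._index_map[start]
            last = self._index_map[stop - 1]
            return self._text[first:last + 1]
        return self._text[self._index_map[index]]

class GappedSequence:
    def __init__(self, text, gap_char="-"):
        index_map = [i for i, c in enumerate(text) if c != gap_char]
        self._residues = "".join(text[i] for i in index_map)
        self.gapped = GappedView(text, index_map)

    def __len__(self):
        return len(self._residues)

    def __getitem__(self, index):
        return self._residues[index]

seq = GappedSequence("AT-G--C")
print(seq[1:3], seq.gapped[1:3], len(seq))
```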
> The problem is two-fold; what's the per-residue alphabet, and what's
> the encoding for non-residue information? (The last part is C) in my
> list.)
>
>
> I got to thinking about this problem a lot. Starting from basics, if
> space and performance were of no concern, a sequence like "APGA..."
> could be represented as
>
> from IUPAC.Protein import ALA, PRO, GLY
> sequence = [ALA(), PRO(), GLY(), ALA(), ...]
>
>
> where PRO, GLY and ALA are constructors for a residue object. This
> would even solve the typing problem because each residue would be
> typed, and all you need to assert is that the element types are
> homogenous. Also, in Python, a list (the "[]") implements the
> sequence interface.
>
> In the general case, each residue may be different. For example, some
> of the residues may be modified - perhaps methylated, either at
> specific sites or in a statistical sense - or you want to store the 3D
> coordinates for each one.
>
> If the residues are identical and position independent, there's no
> need to create new objects, so this could be turned into:
>
> sequence = [ALA, PRO, GLY, ALA, ...]
>
> These are sometimes called fly-weight objects because the objects are
> reused. The new "objects" are really just the pointers/references to
> the real object. This, BTW, is the biojava approach.
>
> Space and performance are important, and there are fewer than 256 of
> them (okay, fewer than 27) , so this is further reduced to:
>
> sequence_type = "protein"
> sequence = "APGA"
> or
> sequence = Seq("APGA", moltype = "protein")
>
> This is the bioperl approach.
>
> This compression hinges on several predicates:
> a) all residues are identical
> b) there is one-to-one mapping from letters+seqtype to the "real"
> residue objects
> c) residues are of homogenous data types (can't mix protein and
> rna types)
>
> Using a gap character or stop codon symbol violates c) because there
> are non-homogeneous types in the container. A stop codon symbol violates
> b) because there is nothing real about it. It indicates the
> non-existence of an object.
>
> So those characters (and likely others in different contexts) do not
> belong as elements of a sequence. They really should be considered
> part of the sequence itself.
>
> Flipping things around, how would I define something which holds a
> gapped sequence? I would make a new class which has the sequence and
> the location/length of the gaps. It takes the sequence string and
> knows the appropriate way to interpret the gap character. Very much
> like what bioperl's SimpleAlign.pm and UnivAln.pm do, except holding
> only one sequence.
>
>
> There is an interesting question here - should GappedSeq be derived
> from Seq, or vice versa? If GappedSeq is derived from Seq, then by
> good class design, a GappedSeq is usable anywhere a Seq is usable, but
> this is false - I can't use the same algorithm to calculate the
> molecular weight.
Yes. I think GappedSeq will need to behave like Seq, with the
gap-specific stuff in methods specific to GappedSeq. Unless there's
something I'm missing... (It's getting late).
> 2.1) Sequences using a finite alphabet must contain an "alphabet"
> member, which is used to map the alphabet back to the appropriate
> per-residue type.
Seems reasonable. I've never had to use an alphabet, but I've not been in
a situation where the requirement would've gotten in my way, either.
[...]
> There will be more in future emails about how the properties work
> together, so that you can preserve lower case letters if you want, or
> have lower case DNA go to upper case protein, or compute the molecular
> weight of a specialized encoding (eg, for selenocysteine). But I
> really need to implement it first to make sure it works
Looking forward to them!
Jeff
>
>
> Andrew Dalke
> dalke@acm.org
>
>
>
> _______________________________________________
> BioPython mailing list - BioPython@biopython.org
> http://biopython.org/mailman/listinfo/biopython
>
From dalke@acm.org Thu, 30 Mar 2000 14:23:59 -0700
Date: Thu, 30 Mar 2000 14:23:59 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] proposal 3
This is a short one.
From what I can tell, a sequence is characterized by a list
of residues, which I'm encoding via a list of letters and an
alphabet description.
Is that all that's needed for a minimal data structure?
I can think of one more - are the end points physical end
points, or parts of a larger structure? The characters in
a string are not exactly one-to-one equivalent to the residues.
Consider the carboxyl end of a protein. Because it's a
terminal, it contains the extra "O-H", so its mass and atom
count will be higher than any other residue in the middle of
the sequence with the same letter.
So my proposal is:
Proposal 3 - The ends of a sequence may correspond to physical
ends of the real sequence. This data is stored in the attribute
"endings", which has two elements, "left" and "right". (Left
is position 0.) The possible values for the elements are
UNKNOWN, TERMINAL, NONTERMINAL.
The only results I can see it affecting are the atom count
and mass calculations, and only if the functions to calculate
the count and mass are accurate.
Let me explain that. An accurate mass calculation function
might look like:
    def total_mass(seq, mass_table):
        mass = 0.0
        for c in seq:
            mass = mass + mass_table[c]
        return mass + 18.0  # the extra H and O-H for the terminals
In this case,
mass(seq) != mass(seq[:len(seq)/2]) + mass(seq[len(seq)/2:])
because the 18.0 is added twice on the right hand side.
Thus, slicing corresponds to a physical cut of the protein,
as compared to being a subsection of the string. It doesn't
let you answer "what is the mass contribution of the first half
of the sequence?"
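To show what tracking the endings would buy, here is a sketch of a mass
function that honours them. The mass values are approximate average masses
used purely for illustration, the endings are passed as arguments rather
than stored on the sequence, and UNKNOWN is arbitrarily treated like
NONTERMINAL.

```python
UNKNOWN, TERMINAL, NONTERMINAL = "unknown", "terminal", "nonterminal"

# Approximate average masses in daltons, for illustration only.
RESIDUE_MASS = {"G": 57.05, "A": 71.08}
H_MASS = 1.008     # extra H on a free left (amino) end
OH_MASS = 17.007   # extra O-H on a free right (carboxyl) end

def total_mass(seq, left=TERMINAL, right=TERMINAL):
    """Sum residue masses, adding end groups only for physical ends."""
    mass = sum(RESIDUE_MASS[c] for c in seq)
    if left == TERMINAL:
        mass += H_MASS
    if right == TERMINAL:
        mass += OH_MASS
    return mass

seq = "GAGA"
half = len(seq) // 2
whole = total_mass(seq)

# Physical cut: both pieces get fresh ends, so the sum is too large.
cut = total_mass(seq[:half]) + total_mass(seq[half:])

# String subsection: the inner ends are marked NONTERMINAL.
parts = (total_mass(seq[:half], right=NONTERMINAL)
         + total_mass(seq[half:], left=NONTERMINAL))

print(round(cut - whole, 3), round(parts - whole, 3))
```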
This information will only be used rarely (as proof, biojava
and bioperl don't track this data). Adding it means that every
constructor and factory function and generative function must
do the right thing for the ending. There are three possible
values for each end, which makes things complex. It will be
run often, but since the data is almost never used, this work
will be wasted.
There would also need to be a function other than subslicing
used to modify the ending. Eg,
seq.cut(10, 40, chop = (TERMINAL, UNKNOWN))
Since I don't like the complexity and performance hits,
I'm against the proposal.
Andrew
dalke@acm.org
From dalke@acm.org Thu, 30 Mar 2000 14:24:09 -0700
Date: Thu, 30 Mar 2000 14:24:09 -0700
From: Andrew Dalke dalke@acm.org
Subject: [BioPython] sequence proposals (long)
> Sure. I don't think it would be a failure if biopython were
> to make sequences classes that were biased (even heavily) toward
> python's way of doing things. I'd rather have something that
> works well here, rather than sequences that suck equally on
> all languages! ;)
Yes. But if there are three "natural" ways to do something
in Python, and one of them is common with the Perl and Java
ways, then I would rather choose the common one.
> Do you mean sequences should support the same slicing semantics?
Yes.
> After python 1.6, strings will become objects, with their own
> methods, and how string objects and biological sequences act
> will diverge.
Ohh, good point. Currently a string is little more than a byte
array, and I was thinking just of the list-like interfaces.
> [... I'm liberally cutting things from the email, for length and
> relevance reasons. I hope that I'm not leaving anything without
> the proper context. I apologize if I do!]
No problem. The extra text was commentary/justification meant
to back up my proposals. I was thinking of resending just the
proposal part; thanks for doing so.
> It feels to me like the semantics of the indexes for a method
> call is less stringently enforced than that of subscripting,
> where the syntax is built into the language.
I hadn't thought of that before. Sounds reasonable.
> What's a stride?
The step length in a slice. The default is 1, so [1:5] returns
4 characters. [1:5:2] returns 2 characters (at positions 1 and 3).
>>> import string, Numeric
>>> a = Numeric.array("This is a test.")
>>> string.join(a[1:12:2], "")
'hsi  e'
>>>
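The same thing can be sketched without Numeric, assuming a Python version
in which plain strings accept the three-part slice (that was not the case
when this was written):

```python
text = "This is a test."

every_other = text[1:12:2]            # start 1, stop before 12, step 2
positions = list(range(1, 12, 2))     # the indices that slice visits

print(repr(every_other), positions)
print("ABCDEFG"[1:5], "ABCDEFG"[1:5:2])
```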
> I'm not sure if you've mentioned it explicitly, but
> we're going to need both mutable and immutable sequences.
I hadn't mentioned it, but it is true. I've got classes for
both types; the mutable one is based off of array.array.
> Because it would be so hairy otherwise, I propose that any
> annotated sequences must be immutable
Agreed.
> However, the Numeric way does save a lot of memory when accessing
> just a region of a large matrix (or DNA sequence).
I was thinking about that. It's also possible to have subsequences
return a proxy object, which references back to the main sequence
only when needed. There's a higher per-subsequence object cost,
but the original object was really big.
It becomes rather more difficult to implement and use these.
What is a case where the subsequence copies must be nearly genome
sized?
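A rough sketch of such a proxy, to give an idea of the extra machinery
involved. SubSeqProxy is a name invented for this example, the parent is
assumed to be any object supporting integer and slice subscripts, and
strides are left out.

```python
class SubSeqProxy:
    """Hypothetical subsequence that refers back to its parent."""
    def __init__(self, parent, start, stop):
        self._parent = parent
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            if step != 1:
                raise ValueError("strides are not supported in this sketch")
            stop = max(stop, start)
            # A slice of a proxy is another proxy on the same parent.
            return SubSeqProxy(self._parent,
                               self._start + start, self._start + stop)
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._parent[self._start + index]

    def tostring(self):
        """Copy the data out of the parent only when asked."""
        return self._parent[self._start:self._stop]

genome = "ACGT" * 1000
sub = SubSeqProxy(genome, 2, 10)
print(len(sub), sub[0], sub[2:4].tostring())
```

Each proxy holds three references regardless of its length, so the cost is
per subsequence rather than per residue.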
> Yes, but I'm not sure we need to allow this kind of flexibility.
> I believe str should just return a human-readable string, and
> leave specialized formatting to other functions.
You are right. Using the stringification operator is not the
right choice. Looking at the other character array objects
(Numeric.array and array.array), the proper method is "tostring()".
If Python 1.6 strings have a tostring() method, returning itself,
then I would be pleased. I'll ask about/for that on the Python
list.
> It depends on what you consider the sequence length.
>
> I don't consider "A-T" to be a biological sequence.
Right. I've since changed my alphabet proposal so that
gaps are not types of physical alphabets, but are encodings
around alphabets.
> For example:
> >>> seq = GappedSequence("AT-G--C")
> >>> seq[1:3]
> 'TG'
> >>> seq.gapped[1:3]
> 'T-G'
I would have it the other way around, where the default subscript
contains the '-' and the ".ungapped" attribute yields the sequence.
This makes it easier to compare relative positions of a sequence
with a gapped sequence.
Andrew
dalke@acm.org
From jchang@SMI.Stanford.EDU Thu, 30 Mar 2000 14:00:53 -0800 (PST)
Date: Thu, 30 Mar 2000 14:00:53 -0800 (PST)
From: Jeffrey Chang jchang@SMI.Stanford.EDU
Subject: [BioPython] sequence proposals (long)
> [Jeff]
> > For example:
> > >>> seq = GappedSequence("AT-G--C")
> > >>> seq[1:3]
> > 'TG'
> > >>> seq.gapped[1:3]
> > 'T-G'
>
[Andrew]
> I would have it the other way around, where the default subscript
> contains the '-' and the ".ungapped" attribute yields the sequence.
> This makes it easier to compare relative positions of a sequence
> with a gapped sequence.
It looks like there's 2 things going on here. In this example, one is
getting a display-able representation of the sequence, where you can make
inferences on character lengths and such, and the other is accessing the
biological sequence.
More generally, all sequences need to support some way of getting the
biological sequence, and possibly other access methods depending on the
requirements of the class.
Maybe all sequences will need to support at least biological sequence
access, in addition to displayable representation? I'm beginning to worry
about sliding down a slippery slope towards large classes, though...
The disadvantage of having it the other way around, is that people who
want to access the underlying biological sequence (without gap characters)
will need to do it a different way for every type of sequence.
Jeff