Sequence numbering (was: Re: [BioPython] bioperl idl)

Jeffrey Chang jchang@SMI.Stanford.EDU
Wed, 22 Sep 1999 10:31:27 -0700 (PDT)


On Wed, 22 Sep 1999, Iddo Friedberg wrote:

> : > [Andrew Dalke]
> : > Therefore, the answer I've said to this argument before is that
> : > only the top-level "talking to the biologist" (or the outside
> : > world) layer does the translation to the internal representation.
> : > Everything else, libraries included, should and must use the
> : > system's representation.

Sigh.  Using the system's representation would obviously increase
compatibility with the language and reduce the number of layers of code
called between library functions.  But I still worry about the
ramifications of doing this when the system's representation differs
from that of the data.

Our library code will typically operate on data gathered from biological
databases, which number sequences from 1 to n.  Thus, when we do
operations based on that numbering, it would be most convenient if the
library worked that way too.

For example, the Swiss-Prot feature table entries specify sequence
features based on a strict ordinal numbering of the residues starting from
one.  If our sequence indexing were 1-based and end-inclusive, we could
do:
   seq[feature.start:feature.end]

rather than (for Python-style indexing):
   # Python's start index is 0-based, so subtract 1 from start
   # don't subtract 1 from end, because Python's end index is exclusive
   seq[feature.start-1:feature.end]
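
The conversion can be kept in one place rather than repeated at every
call site.  A minimal sketch, assuming the sequence is a plain Python
string; the function name feature_slice and the example residues are
made up for illustration and are not part of any existing library:

```python
def feature_slice(seq, start, end):
    """Return residues start..end, both 1-based and inclusive.

    Hypothetical helper: converts database-style coordinates to a
    Python slice (0-based start, exclusive end).
    """
    if not 1 <= start <= end <= len(seq):
        raise IndexError("feature coordinates out of range")
    # start - 1 converts to 0-based; end is left alone because the
    # exclusive end of a Python slice cancels the 1-based offset.
    return seq[start - 1:end]


example = "MKTAYIAKQR"  # made-up residues, numbered 1..10
print(feature_slice(example, 2, 4))  # residues 2, 3 and 4
```

The range check matters here: without it, a start of 0 would become -1
and silently slice from the end of the string.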



> 2) However, I do believe that we will require classes holding different
> residue number attributes for the same residue, based on the different
> databases these residues exist in. A sort of a "translation table" such as
> exists in FSSP.

Yep, I agree.  For some sequence class, we will need to have some way of
specifying alternate numbering of sequences.  I proposed a possible
solution to the bioperl-guts mailing list:
http://www.uni-bielefeld.de/mailinglists/BCD/vsns-bcd-perl-guts/9908/0012.html
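
To make the "translation table" idea concrete, here is one way it could
look in Python.  This is only a sketch and not the proposal linked
above: the class NumberedSequence, its methods, the scheme names, and
the residue labels are all invented for illustration.

```python
class NumberedSequence:
    """A sequence that carries several alternate residue numberings.

    Hypothetical sketch: each scheme maps a residue label (an int or a
    string such as "11A") to a 0-based position in the sequence.
    """

    def __init__(self, residues):
        self.residues = residues
        self.numberings = {}  # scheme name -> {label: 0-based index}

    def add_numbering(self, scheme, labels):
        # One label per residue, in sequence order.
        if len(labels) != len(self.residues):
            raise ValueError("need exactly one label per residue")
        self.numberings[scheme] = {
            label: index for index, label in enumerate(labels)
        }

    def residue(self, scheme, label):
        # Translate the scheme's label to the internal 0-based index.
        return self.residues[self.numberings[scheme][label]]


protein = NumberedSequence("MKTAY")  # made-up residues
protein.add_numbering("ordinal", [1, 2, 3, 4, 5])
protein.add_numbering("structure", ["10", "11", "11A", "12", "13"])
print(protein.residue("ordinal", 3), protein.residue("structure", "11A"))
```

Both lookups refer to the same residue, so the internal representation
stays 0-based while each database keeps its own labels.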


Jeff