Sequence numbering (was: Re: [BioPython] bioperl idl)

Iddo Friedberg
Wed, 22 Sep 1999 09:51:48 +0200 (GMT+0200)

On Tue, 21 Sep 1999, Bradley Marshall wrote:

: > Therefore, the answer I've said to this argument before is that only
: > the top-level "talking to the biologist" (or the outside world) layer
: > does the translation to the internal representation. Everything else,
: > libraries included, should and must use the system's representation.
: Agreed.

Same here. However, the sequence numbering problem does not begin and end
with the application language's ordinality disagreeing with the user's idea.
There are also different ordinality schemes in different databases. Most
notably, within PDB there is no fixed residue numbering scheme. Residue
numbers may start at 0, 1, 2 or even -2. Due to gaps within the
structure, there might be gaps in the numbering of residues; also, there
is the problem of the so-called "insertion codes": two consecutive
residues might have the same residue number (yes!), with an "A" and a "B"
suffix to distinguish between them.
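
To make the insertion-code problem concrete, here is a minimal Python
sketch. It assumes residue labels arrive as plain strings such as "27A"
(in actual PDB files the number and the insertion code sit in separate
fixed columns), and the names ResidueId and parse_residue_id are made up
for illustration. The point is that only the position in the chain, not
the label, can serve as a reliable ordinal.

```python
import re
from typing import NamedTuple


class ResidueId(NamedTuple):
    number: int          # may be zero or negative
    insertion_code: str  # empty string when there is no insertion code


# Hypothetical label format: optional minus sign, digits, optional letter.
_RESIDUE_LABEL = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def parse_residue_id(label: str) -> ResidueId:
    """Split a label such as '27A' into its number and insertion code."""
    match = _RESIDUE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a residue label: {label!r}")
    return ResidueId(int(match.group(1)), match.group(2))


# A made-up chain: starts at -2, has insertion codes, and a gap after 28.
labels = ["-2", "-1", "0", "1", "27", "27A", "27B", "28", "31"]
residues = [parse_residue_id(label) for label in labels]

# Map each residue identifier to its 0-based position in the chain.
index_of = {residue: i for i, residue in enumerate(residues)}
```

With this, index_of[parse_residue_id("27B")] gives the position of the
residue in the chain regardless of how odd its label looks.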

This becomes a rather tough problem when doing sequence-structure studies,
and when one wishes to, for example, locate a given residue position from
a sequence database within the structural database.

This brings me to another, much wider inconsistency, which exists among
_sequence_ databases, and which requires a solution from application
programmers. Namely, the same sequence, in part or in whole, may exist in
different databases, with different residue numberings.

What am I trying to get at? 

1) Residue ordinality at the application level is a major headache. There
is no need to compound it by migrating it to a system representation that
requires a wrapper. Use the Python scheme of "Andrew"[1:3] == "nd" when in
the coding layer.

2) However, I do believe that we will require classes holding different
residue number attributes for the same residue, based on the different
databases these residues exist in. A sort of a "translation table" such as
exists in FSSP. In that database, based on PDB, each residue has both its
PDB number and a running ordinal number, which don't necessarily
coincide. This is one form of migration scheme. Others can be used as
well, I imagine.
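
The two points above could be sketched roughly as follows. This is only
an illustration under stated assumptions: NumberingTable, to_slice and
the scheme names "pdb" and "ordinal" are hypothetical, the labels are
invented, and this is not the FSSP file format. The internal index is
Python's own 0-based one; each database scheme is just a list of labels
aligned to it.

```python
from typing import Dict, List


def to_slice(first: int, last: int) -> slice:
    """Convert a 1-based, inclusive range into a Python slice."""
    if first < 1 or last < first:
        raise ValueError(f"bad 1-based range: {first}..{last}")
    return slice(first - 1, last)


class NumberingTable:
    """Translate residue labels between schemes via a 0-based index."""

    def __init__(self) -> None:
        self._labels: Dict[str, List[str]] = {}
        self._index: Dict[str, Dict[str, int]] = {}

    def add_scheme(self, name: str, labels: List[str]) -> None:
        # Every scheme must label the same residues, in the same order.
        for other in self._labels.values():
            if len(other) != len(labels):
                raise ValueError("schemes must cover the same residues")
        self._labels[name] = list(labels)
        self._index[name] = {label: i for i, label in enumerate(labels)}

    def to_internal(self, scheme: str, label: str) -> int:
        """Return the 0-based internal index for a label."""
        return self._index[scheme][label]

    def translate(self, label: str, source: str, target: str) -> str:
        """Return the label in `target` for a residue named in `source`."""
        return self._labels[target][self.to_internal(source, label)]


table = NumberingTable()
table.add_scheme("pdb", ["-2", "-1", "0", "1", "27", "27A"])
table.add_scheme("ordinal", ["1", "2", "3", "4", "5", "6"])
```

Here table.translate("27A", "pdb", "ordinal") looks up the running
ordinal for a PDB label, and "Andrew"[to_slice(2, 3)] does the boundary
conversion once, at the edge, so everything inside can keep using plain
Python slices.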

Bottom line: a migration scheme for sequence numbering is needed. Since
this particular problem takes up a lot of my time, I'd be happy to hear
from people wrestling with the same problem, and what type of solutions
people may come up with.



/* --- */main(c){float t,x,y,b=-2,a=b;for(;b-=a>2?.1/(a=-2):0,b<2;
/*  |  */putchar(30+c),a+=.0503) for(x=y=c=0;++c<90&x*x+y*y<4;y=2*
/*  |  */x*y+b,x=t)t=x*x-y*y+a;}
/* --- ddo Friedberg */