[BioPython] Sequence Annotation: sequence numbering

Tue, 26 Jun 2001 11:51:11 +0100

At 12:21 26/06/01, Iddo wrote:
>I would like to start a discussion about the annotation of protein
>sequence numbering in Biopython. You are probably all aware of the fact
>that, given a protein sequence, each position in that sequence can be
>overloaded with several ordinal numbering schemes. This usually arises
>when doing cross-database work. For example, I extracted a SwissProt
>sequence, now I would like to look it up in PDB. So positions along the SP
>sequence will now have two sets of numberings: the SP one, and the PDB
>one. [...]
>
>Example:
>
>FSSP:             1   2   3       4   5   6
>Sequence:     A   G   C   V       S   L   F
>
>PDB:              23  24  25      26  27  27A
>Sequence:     A   G   C   V       S   L   F

[...]

>o Need?
>o Suggested implementation?
>o Anything else?

Hi all,

There are other problems with numbering in structural databases. For
example, PDB entries 1bbc and 1pod, both of which are human synovial fluid
phospholipase A2 and share the same protein sequence, are numbered
differently. The fragment:

SEQRES   2   124  GLY LYS GLU ALA ALA LEU SER TYR GLY PHE TYR GLY CYS  1BBC  79
SEQRES   2   124  GLY LYS GLU ALA ALA LEU SER TYR GLY PHE TYR GLY CYS  1POD  29

is numbered:

1bbc:      14  15  16  17  18  19  20  22  23
1pod:      14  15  16  17  18  19  20  21  22
Sequence:   G   K   E   A   A   L   S   Y   G

and it gets worse - the fragment:

SEQRES   5   124  ARG LEU GLU LYS ARG GLY CYS GLY THR LYS PHE LEU SER  1BBC  82
SEQRES   5   124  ARG LEU GLU LYS ARG GLY CYS GLY THR LYS PHE LEU SER  1POD  32

goes:

1bbc:      59  61  67  68  69  70  71  72
1pod:      58  59  60  61  62  63  64  65
Sequence:   G   C   G   T   K   F   L   S

... and there are further disparities later in the sequence.

These two structures are of the same protein, with the same primary 
sequence, yet are numbered differently for >80% of their sequence.

I don't know how widespread this problem is in the PDB as a whole, but for
the sPLA2 family (to which these examples belong) it is common, and appears
to have a historical, rather than rational, basis. When composing their
structure files, some groups chose to number sites sequentially from the
N-terminus to the C-terminus, while others chose to number the sites in
identical proteins according to a sequence alignment with the structure 1p2p.

Not only do the inconsistent numberings make direct relation of structures 
to their Swiss-Prot sequences awkward, but in cases such as 1bbc, where a 
contiguous sequence is numbered discontinuously, even simple calculations 
of residue chain separation often require manual intervention as it becomes 
difficult to distinguish automatically between a missing chain segment and 
numbering by homology.
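
A minimal sketch of the index-based alternative, assuming the residue
labels of a chain are available in order and the chain is already known to
be contiguous (for example, because it matches the SEQRES records). The
function name chain_separation is hypothetical and not part of Biopython.
Labels are kept as strings so that insertion codes such as 27A still work.

```python
# Hypothetical helper, not a Biopython API.
# Assumption: `labels` holds the residue labels of ONE chain, in chain order,
# and the chain is known to be contiguous (no missing segments).

def chain_separation(labels, a, b):
    """Return the separation of labels a and b in chain positions."""
    # Map each label to its 0-based position along the chain.
    index = {label: i for i, label in enumerate(labels)}
    return abs(index[b] - index[a])


# 1bbc labels for the first fragment quoted above (label 21 is never used).
labels_1bbc = ["14", "15", "16", "17", "18", "19", "20", "22", "23"]

naive = int("23") - int("14")                          # subtracts labels
by_index = chain_separation(labels_1bbc, "14", "23")   # counts positions

print(naive, by_index)
```

Subtracting the labels gives 9, while counting positions gives 8. This does
not solve the harder problem of deciding whether a jump in the labels is a
missing segment or numbering by homology; it only avoids relying on the
labels once that has been decided.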

As for solutions? I've been thinking about the problem intermittently for a 
wee while, and haven't got anything robust. I'd be glad for others' input, 
though.

My own opinion leans towards numbering all PDB/FSSP submissions in line
with their Swiss-Prot sequences, but that doesn't exactly give us a quick
fix, does it?
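
In the meantime, one possible stopgap is to carry several numberings side
by side for the same positions and translate between them. The following is
only a sketch: the class MultiNumbering and its methods are hypothetical,
not an existing Biopython interface, and it assumes the schemes have already
been aligned position by position. The data are the 1bbc and 1pod labels
from the first fragment quoted above.

```python
# Hypothetical container, not a Biopython API.
# Assumption: every scheme lists one label per position, in the same order.

class MultiNumbering:
    """Hold several numbering schemes for the same sequence positions."""

    def __init__(self):
        self._schemes = {}

    def add_scheme(self, name, labels):
        labels = list(labels)
        # All schemes must describe the same number of positions.
        for other in self._schemes.values():
            if len(other) != len(labels):
                raise ValueError("schemes must cover the same positions")
        self._schemes[name] = labels

    def translate(self, label, source, target):
        """Return the label in `target` for the position called `label` in `source`."""
        position = self._schemes[source].index(label)
        return self._schemes[target][position]


numbering = MultiNumbering()
numbering.add_scheme("1bbc", ["14", "15", "16", "17", "18", "19", "20", "22", "23"])
numbering.add_scheme("1pod", ["14", "15", "16", "17", "18", "19", "20", "21", "22"])

print(numbering.translate("22", "1bbc", "1pod"))
```

Here residue 22 of 1bbc comes out as residue 21 of 1pod. A Swiss-Prot
numbering could be added as a third scheme in the same way.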

Leighton

-- 
Dr Leighton Pritchard GRSC
T44, Cledwyn Building
Institute of Biological Sciences
University of Wales, Aberystwyth, SY23 3DD
Tel 01970 622353    ext. 2353
PGP public key - http://www.keyserver.net (0x47B4A485)