[BioPython] Sequence Annotation: sequence numbering
Leighton Pritchard
lep@aber.ac.uk
Tue, 26 Jun 2001 11:51:11 +0100
At 12:21 26/06/01, Iddo wrote:
>I would like to start a discussion about the annotation of protein
>sequence numbering in Biopython. You are probably all aware of the fact
>that, given a protein sequence, each position in that sequence can be
>overloaded with several ordinal numbering schemes. This usually arises
>when doing cross-database work. For example, I extracted a SwissProt
>sequence, now I would like to look it up in PDB. So positions along the SP
>sequence will now have two sets of numberings: the SP one, and the PDB
>one. [...]
>
>Example:
>
>FSSP: 1 2 3 4 5 6
>Sequence: A G C V S L F
>
>PDB: 23 24 25 26 27 27A
>Sequence: A G C V S L F
[...]
>o Need?
>o Suggested implementation?
>o Anything else?
Hi all,
There are other problems with numbering in structural databases. For
example, PDB entries 1bbc and 1pod, each of which is human synovial fluid
phospholipase A2 and shares the same protein sequence, are numbered
differently. Take the fragment:
SEQRES 2 124 GLY LYS GLU ALA ALA LEU SER TYR GLY PHE TYR GLY CYS 1BBC 79
SEQRES 2 124 GLY LYS GLU ALA ALA LEU SER TYR GLY PHE TYR GLY CYS 1POD 29
is numbered:
1bbc:     14 15 16 17 18 19 20 22 23
1pod:     14 15 16 17 18 19 20 21 22
Sequence:  G  K  E  A  A  L  S  Y  G
and it gets worse - the fragment:
SEQRES 5 124 ARG LEU GLU LYS ARG GLY CYS GLY THR LYS PHE LEU SER 1BBC 82
SEQRES 5 124 ARG LEU GLU LYS ARG GLY CYS GLY THR LYS PHE LEU SER 1POD 32
goes:
1bbc:     59 61 67 68 69 70 71 72
1pod:     58 59 60 61 62 63 64 65
Sequence:  G  C  G  T  K  F  L  S
... and there are further disparities later in the sequence.
These two structures are of the same protein, with the same primary
sequence, yet are numbered differently for >80% of their sequence.
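To put a figure on the disagreement between two numberings of the same sequence, something along these lines might do. It is only a sketch: the function name is made up for illustration, the residue numbers are just those of the first fragment quoted above, and it assumes the two lists are already position-for-position equivalent.

```python
def fraction_renumbered(numbers_a, numbers_b):
    """Return the fraction of equivalent positions whose labels differ.

    Both lists must describe the same residues in the same order;
    that equivalence is assumed here, not checked.
    """
    if len(numbers_a) != len(numbers_b):
        raise ValueError("numberings must cover the same residues")
    if not numbers_a:
        return 0.0
    differing = sum(1 for a, b in zip(numbers_a, numbers_b) if a != b)
    return differing / len(numbers_a)


# Residue numbers for the first fragment quoted above.
bbc = [14, 15, 16, 17, 18, 19, 20, 22, 23]
pod = [14, 15, 16, 17, 18, 19, 20, 21, 22]
print(fraction_renumbered(bbc, pod))
```

Run over the whole chain rather than one fragment, the same comparison would give the overall proportion of residues that are numbered differently.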
I don't know how widespread this problem is in the PDB as a whole, but for
the sPLA2 family (to which these examples belong) it is common, and appears
to have a historical, rather than rational, basis. When composing their
structure files, some groups chose to number sites sequentially from the
N- to the C-terminus, while others chose to number the sites in identical
proteins according to a sequence alignment with the structure 1p2p.
Not only do the inconsistent numberings make it awkward to relate
structures directly to their Swiss-Prot sequences, but in cases such as
1bbc, where a contiguous sequence is numbered discontinuously, even simple
calculations of the separation between residues along the chain often
require manual intervention, because it is difficult to distinguish
automatically between a missing chain segment and numbering by homology.
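One heuristic that might help, offered as a sketch and not a tested solution: where the numbering jumps, check whether the two residues are still physically adjacent. Consecutive C-alpha atoms usually sit about 3.8 Angstroms apart, so a numbering jump across a normal C-alpha distance suggests numbering by homology, while a large distance suggests a real missing segment. The cutoff, the function name and the coordinates below are all assumptions for illustration, and insertion codes are ignored.

```python
import math

# Assumed cutoff in Angstroms; consecutive C-alpha atoms are typically
# about 3.8 A apart, so this leaves a little slack.
CA_CUTOFF = 4.2


def classify_numbering_gaps(residues, cutoff=CA_CUTOFF):
    """Classify every jump in residue numbering.

    residues: list of (residue_number, (x, y, z)) in file order, where the
    coordinates are those of the C-alpha atom.
    Returns a list of (previous_number, next_number, verdict).
    """
    gaps = []
    for (num_a, ca_a), (num_b, ca_b) in zip(residues, residues[1:]):
        if num_b - num_a == 1:
            continue  # numbering is consecutive, nothing to classify
        distance = math.dist(ca_a, ca_b)
        verdict = "numbering only" if distance <= cutoff else "missing segment"
        gaps.append((num_a, num_b, verdict))
    return gaps


# Made-up coordinates, not taken from any PDB entry.
toy_residues = [
    (20, (0.0, 0.0, 0.0)),
    (22, (3.8, 0.0, 0.0)),   # number jumps, atoms are adjacent
    (23, (7.6, 0.0, 0.0)),
    (30, (30.0, 0.0, 0.0)),  # number jumps, atoms are far apart
]
print(classify_numbering_gaps(toy_residues))
```

This says nothing about how many residues are missing in a real break; for that the coordinates would still need to be reconciled with the SEQRES records.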
As for solutions? I've been thinking about the problem intermittently for a
wee while, and haven't got anything robust. I'd be glad for others' input,
though.
My own preference would be to number all PDB/FSSP submissions in line with
their Swiss-Prot sequences, but that doesn't exactly give us a quick fix,
does it?
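As a rough illustration of what numbering in line with a reference sequence could look like in code, here is a minimal sketch. It assumes the structure's sequence occurs exactly, without gaps or substitutions, inside the reference sequence (a real case would need a proper alignment). The function name and the toy sequences are hypothetical; only the style of label, including an insertion code such as 27A, follows the example in Iddo's message.

```python
def map_to_reference(pdb_labels, pdb_sequence, reference_sequence):
    """Map each PDB residue label to a 1-based reference position.

    Assumes pdb_sequence is an exact substring of reference_sequence;
    the first occurrence is used.
    """
    if len(pdb_labels) != len(pdb_sequence):
        raise ValueError("need exactly one label per residue")
    start = reference_sequence.find(pdb_sequence)
    if start < 0:
        raise ValueError("structure sequence not found in reference")
    return {label: start + i + 1 for i, label in enumerate(pdb_labels)}


# Toy data: labels are kept as strings so insertion codes survive.
labels = ["23", "24", "25", "26", "27", "27A"]
mapping = map_to_reference(labels, "AGCVSL", "MKAGCVSLF")
print(mapping)
```

Keeping the original labels as dictionary keys means the author's numbering is never lost; the reference positions are simply carried alongside it.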
Leighton
--
Dr Leighton Pritchard GRSC
T44, Cledwyn Building
Institute of Biological Sciences
University of Wales, Aberystwyth, SY23 3DD
Tel 01970 622353 ext. 2353
PGP public key - http://www.keyserver.net (0x47B4A485)