[BioPython] Sequence Annotation: sequence numbering

Tue, 26 Jun 2001 12:21:47 +0300 (GMT+0300)

Hi all,

I would like to start a discussion about annotating protein sequence
numbering in Biopython. As you probably know, each position in a protein
sequence can carry several different ordinal numbers, one per numbering
scheme. This usually arises in cross-database work. For example, having
extracted a SwissProt (SP) sequence, I would like to look it up in PDB.
Positions along the SP sequence then have two sets of numbers: the SP
one and the PDB one. Going on from PDB to FSSP adds a third numbering
system.

Example:

FSSP:             1   2   3       4   5   6
Sequence:     A   G   C   V       S   L   F

PDB:              23  24  25      26  27  27A
Sequence:     A   G   C   V       S   L   F

SP:           1   2   3   4   5   6   7   8
Sequence:     A   G   C   V   T   S   L   F

Note that A1 and T5 (SP numbering) are missing from the structural data,
and that F8 carries a PDB insertion code ('27A'). Both are quite typical
phenomena.
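
To make the example concrete, here is a minimal Python sketch of one way
to hold several numberings on the same sequence. It is an illustration
only, not an existing Biopython API: the names Position, NumberedSequence,
annotate and translate are hypothetical. Labels are kept as strings so
that insertion codes such as '27A' survive, and None marks a residue that
a scheme does not number.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Position:
    """One residue plus its label in each numbering scheme (hypothetical)."""
    residue: str
    labels: Dict[str, str] = field(default_factory=dict)


class NumberedSequence:
    """Hypothetical container: one sequence, several parallel numberings."""

    def __init__(self, residues: str) -> None:
        self.positions: List[Position] = [Position(r) for r in residues]

    def annotate(self, scheme: str, labels: List[Optional[str]]) -> None:
        """Attach one label per residue; None means 'not numbered here'."""
        if len(labels) != len(self.positions):
            raise ValueError("need one label (or None) per residue")
        for pos, label in zip(self.positions, labels):
            if label is not None:
                pos.labels[scheme] = label

    def translate(self, source: str, label: str, target: str) -> Optional[str]:
        """Map a label in one scheme to the same residue's label in another."""
        for pos in self.positions:
            if pos.labels.get(source) == label:
                return pos.labels.get(target)  # None if target skips it
        raise KeyError(f"no position labelled {label!r} in scheme {source!r}")


# The example from this message, using the SP sequence as the reference.
seq = NumberedSequence("AGCVTSLF")
seq.annotate("SP", ["1", "2", "3", "4", "5", "6", "7", "8"])
seq.annotate("PDB", [None, "23", "24", "25", None, "26", "27", "27A"])
seq.annotate("FSSP", [None, "1", "2", "3", None, "4", "5", "6"])

print(seq.translate("SP", "8", "PDB"))     # 27A
print(seq.translate("SP", "5", "PDB"))     # None (T5 is not in the structure)
print(seq.translate("PDB", "27A", "FSSP")) # 6
```

Storing labels per position keeps lookups simple, at the cost of a linear
scan; a per-scheme dictionary index would be the obvious refinement if
sequences are long.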

I have addressed proteins here, but I believe there is a similar need
for nucleic acid sequences, the most immediate example being the gDNA
<--> cDNA transition.
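
As a rough illustration of the gDNA <--> cDNA case, the sketch below maps
positions through a list of exons. Everything in it is an assumption made
for the example: the exon coordinates are invented, positions are 1-based
and inclusive, the gene is taken to lie on the forward strand, and the
function names cdna_to_gdna and gdna_to_cdna are hypothetical.

```python
from typing import List, Optional, Tuple

# Invented exon layout: (start, end) in gDNA coordinates, 1-based inclusive.
EXONS: List[Tuple[int, int]] = [(101, 150), (301, 340), (501, 560)]


def cdna_to_gdna(cdna_pos: int, exons: List[Tuple[int, int]]) -> int:
    """Return the gDNA coordinate of a 1-based cDNA position."""
    if cdna_pos < 1:
        raise ValueError("cDNA positions are 1-based")
    offset = 0  # number of cDNA bases covered by earlier exons
    for start, end in exons:
        length = end - start + 1
        if cdna_pos <= offset + length:
            return start + (cdna_pos - offset - 1)
        offset += length
    raise ValueError("position lies beyond the end of the cDNA")


def gdna_to_cdna(gdna_pos: int, exons: List[Tuple[int, int]]) -> Optional[int]:
    """Return the cDNA position of a gDNA coordinate, or None if intronic."""
    offset = 0
    for start, end in exons:
        if start <= gdna_pos <= end:
            return offset + (gdna_pos - start) + 1
        offset += end - start + 1
    return None


print(cdna_to_gdna(51, EXONS))   # 301 (first base of the second exon)
print(gdna_to_cdna(200, EXONS))  # None (inside the first intron)
```

The two directions are not symmetric: every cDNA position has a gDNA
coordinate, but intronic gDNA positions have no cDNA counterpart, much
like residues that are missing from a structure.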

Would anybody care to comment in terms of:

o Need?
o Suggested implementation?
o Anything else?

Iddo

--

Iddo Friedberg                                  | Tel: +972-2-6758647
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/