[BioPython] Sequence Annotation: sequence numbering

Wed, 27 Jun 2001 22:54:08 -0700

At 4:51 PM +0300 6/26/01, Iddo Friedberg wrote:
>: [Iddo]
>: >I would like to start a discussion about the annotation of protein
>: >sequence numbering in Biopython. You are probably all aware of the fact

>[Leighton]
>: My own opinion tends to numbering all PDB/FSSP submissions in line with
>: their Swiss-Prot sequences, but that doesn't exactly give us a quick fix,
>: does it?
>
>Yes, that would be a good solution for the positional numbering problem.
>And a SwissProt - PDB mapper will be extremely useful to the
>sequence-structure community.
>
>However, it is not really within Biopython's scope to do so. (If anyone
>knows of such a database, please let us know!)

>Given the following sequence & numberings:
>
>sequence     A  C  R  L  M  P
>PDB          1  2  -  4  5  5A
>SwissProt    1  2  3  4  5  6
>
>A possible implementation would be:
>
>from Bio import SeqRecord, Seq
>from Bio.Alphabet import Alphabet
>
>my_seq = Seq.Seq('ACRLMP', Alphabet.ProteinAlphabet())
>pdb_positions = [(1,''), (2,''), (None,''), (4,''), (5,''), (5,'A')]
>sp_positions = [1, 2, 3, 4, 5, 6]
>my_seq_rec = SeqRecord.SeqRecord(my_seq)
>my_seq_rec.annotations['pdb_pos'] = pdb_positions
>my_seq_rec.annotations['sp_pos'] = sp_positions

Something like this would work, but it would also be nice to be able 
to retrieve sequences based on specific nomenclatures.  For example, 
I'd expect something like:
my_seq_rec.sp_pos[1]
my_seq_rec.sp_pos[4:6]
to work.

This, however, brings up semantic issues of how to deal with 
residues that have no number in a given scheme:
my_seq_rec.pdb_pos[(1,''):(4, '')]
  (or my_seq_rec.pdb_pos["1":"4"])

Would this return AC or ACR?
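
One way to make the question concrete is a rough sketch in plain Python, 
with no Biopython involved. The class name NumberedView and the 
keep_unnumbered flag are hypothetical names made up for illustration, not 
an existing or proposed Biopython API. The sketch assumes that slice 
endpoints are position labels, that the stop label is exclusive as in 
ordinary Python slicing, and that the caller decides whether unnumbered 
residues between the endpoints are returned.

```python
class NumberedView:
    """Hypothetical helper: index a sequence by the labels of one numbering scheme."""

    def __init__(self, sequence, positions, keep_unnumbered=True):
        if len(sequence) != len(positions):
            raise ValueError("need exactly one position label per residue")
        self.sequence = sequence
        self.keep_unnumbered = keep_unnumbered
        # A residue is unnumbered if its label is None or a tuple starting with None.
        self.numbered = [
            not (label is None or (isinstance(label, tuple) and label[0] is None))
            for label in positions
        ]
        # Map each real label to its 0-based offset in the sequence.
        self.offsets = {
            label: offset
            for offset, label in enumerate(positions)
            if self.numbered[offset]
        }

    def __getitem__(self, key):
        if not isinstance(key, slice):
            return self.sequence[self.offsets[key]]
        if key.step is not None:
            raise ValueError("slice steps are not supported in this sketch")
        start = 0 if key.start is None else self.offsets[key.start]
        stop = len(self.sequence) if key.stop is None else self.offsets[key.stop]
        if self.keep_unnumbered:
            return self.sequence[start:stop]
        # Drop residues that have no number in this scheme.
        return "".join(
            self.sequence[i] for i in range(start, stop) if self.numbered[i]
        )


pdb_positions = [(1, ''), (2, ''), (None, ''), (4, ''), (5, ''), (5, 'A')]
sp_positions = [1, 2, 3, 4, 5, 6]

pdb_view = NumberedView("ACRLMP", pdb_positions)
pdb_strict = NumberedView("ACRLMP", pdb_positions, keep_unnumbered=False)
sp_view = NumberedView("ACRLMP", sp_positions)
```

Under these assumptions pdb_view[(1, ''):(4, '')] gives ACR and 
pdb_strict[(1, ''):(4, '')] gives AC, so the answer becomes an explicit 
choice rather than a hidden convention. Whether the stop label should be 
inclusive instead is a separate decision that the sketch does not settle.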

>Comments on this? General comments? Can this be adapted to the genomic DNA
><--> cDNA problem?

Hmmm...  This is tricky.  At first, I thought no, because cDNA and 
genomic DNA are different biological entities.  However, they are 
mappable to one another and could probably be considered a sequence 
mapping problem.  In that case, you could also make an argument that 
something similar could be done for DNA<->protein as well.
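
To illustrate what treating this as a mapping problem could look like, 
here is a small sketch in plain Python. Everything in it is an assumption 
made for illustration: the function names are hypothetical, exons are 
given as half-open (start, end) pairs of 0-based genomic offsets, they 
are sorted, and the transcript lies on the forward strand.

```python
def cdna_to_genomic(exons, cdna_offset):
    """Translate a 0-based cDNA offset into a 0-based genomic offset."""
    if cdna_offset < 0:
        raise IndexError("cDNA offset must be non-negative")
    remaining = cdna_offset
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining
        # Skip this exon and keep counting into the next one.
        remaining -= length
    raise IndexError("cDNA offset lies beyond the last exon")


def genomic_to_cdna(exons, genomic_offset):
    """Translate a genomic offset into a cDNA offset, or None if it is not exonic."""
    consumed = 0
    for start, end in exons:
        if start <= genomic_offset < end:
            return consumed + (genomic_offset - start)
        consumed += end - start
    return None


# Two exons covering genomic offsets 10-12 and 20-23.
example_exons = [(10, 13), (20, 24)]
```

With example_exons, cDNA offset 3 is the first base of the second exon, 
so it maps to genomic offset 20, while a genomic offset inside the 
intron (15, say) has no cDNA counterpart and maps to None. That second 
case is the same "position with no number" question as the PDB gap above.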

Biopython certainly needs a way to handle multiple sequence 
numberings.  Being able to handle mappings in general would be the 
icing on the cake.

Jeff