[BioPython] Sequence Annotation: sequence numbering

Mon, 2 Jul 2001 11:15:25 +0300 (GMT+0300)

Hi,

[Iddo]
: >Given the following sequence & numberings:
: >
: >sequence   A  C  R  L  M  P
: >PDB        1  2  -  4  5  5A
: >SwissProt  1  2  3  4  5  6
: >
: >A possible implementation would be:
: >
: >from Bio import SeqRecord, Seq
: >from Bio.Alphabet import Alphabet
: >
: >my_seq = Seq.Seq('ACRLMP', Alphabet.ProteinAlphabet())
: >pdb_positions = [(1,''), (2,''), (None,''), (4,''), (5,''), (5,'A')]
: >sp_positions = [1, 2, 3, 4, 5, 6]
: >my_seq_rec = SeqRecord.SeqRecord(my_seq)
: >my_seq_rec.annotations['pdb_pos'] = pdb_positions
: >my_seq_rec.annotations['sp_pos'] = sp_positions
:

[Jeff:]

: Somethinglike this would work, but it would also be nice to be able
: to retrieve sequences based on specific nomenclatures.For example,
: I'd expect something like:
: my_seq_rec.sp_pos[1]
: my_seq_rec.sp_pos[4:6]
: to work.
:
: This, however, brings up semantic issues of how to deal with
: sequences without numbers:
: my_seq_rec.pdb_pos[(1,''):(4, '')]
: (or my_seq_rec.pdb_pos["1":"4"])
:
: Would this return AC or ACR?

Direct answer: "AC". It's not that "R" is unnumbered in this example, it
simply does not exist in the PDB file. Say, because the crystallographer
couldn't see it.

The problem with obtaining slices, is that the PDB sequence numbering can
be discontinuous. This may be the result of:

1) True gaps within the sequence (like the one in the example above). The
PDB sequence simply does not contain the entire swissprot sequence. This
is quite common. To whit: "true gaps".

2) The sequence itself is continuous, but the numbering isn't. Less
common, but definitely there. "Artificial gaps". insertion codes (5,5A,5B)
is the flipside of that.

I cannot think right now of an elegant solution to this problem. In terms
of slices. Any takers?

:
: >Comments on this? General comments? Can this be adapted to the genomic DNA
: ><--> cDNA problem?
:

[Jeff:]

: Hmmm...This is tricky.  At first, I thought no, because cDNA and
: genomic DNA's are different biological entities.However, they are
: mappable to one another and could probably be considered a sequence
: mapping problem.In that case, you could also make an argument that
: something similar could be done for DNA<->protein as well.

Getting the coding sequence for any given protein sequence (and especially
a PDB sequence) is sorely needed. SwissProt provides this via links on the
NiceProt display, but I don't know of any other such mapping facility.

:
: Biopython certainly needs a way to handle multiple sequence
: numberings.Being able to handle mappings in general would be the
: icing on the cake.

Yes, I quite agree. I haven't thought this through in terms of mapping.

Iddo

--

Iddo Friedberg                                  | Tel: +972-2-6758647
Dept. of Molecular Genetics and Biotechnology   | Fax: +972-2-6757308
The Hebrew University - Hadassah Medical School | email: idoerg@cc.huji.ac.il
POB 12272, Jerusalem 91120                      |
Israel                                          |
http://bioinfo.md.huji.ac.il/marg/people-home/iddo/