[BioPython] Interface to sequence information in PDB Files?

Thu Jan 18 17:10:27 UTC 2007

Peter wrote:

> This was something I was thinking about doing using Bio.PDB for the new
> Bio.SeqIO code that I've been working on:
> 
> http://www.biopython.org/wiki/SeqIO
> 
> I haven't written anything yet specifically for PDB files, but my idea
> was to produce a SeqRecord for each peptide chain in the PDB file -
> based on the residues in the 3D structure, not the stated sequence in
> the header of the PDB file.
> 
> Does this sound close to what you had in mind?
> 
> One big question I was thinking about is how would it be best to handle
> chains with breaks in them (e.g. residues missing from the PDB file
> because they were not solved).  Simply skipping them in the sequence and
> returning a single continuous amino acid sequence would be misleading,
> so perhaps including a single gap character would suffice?

Yes, that's more or less the functionality that I was hoping to find.  I would
have been happy to have the SEQRES records show up as a sequence object, but
actually reading the structure is probably the right approach.  I think that
putting a single gap character in place of the unsolved residues is the right
thing to do by default.
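
To make the gap idea concrete, here is a minimal sketch in plain Python.  It
does not use Bio.PDB or the proposed Bio.SeqIO interface; it just reads the
fixed columns of ATOM records directly.  The names chain_sequences and
make_atom are hypothetical, and treating any jump in residue numbering as a
break is an assumption (numbering can be discontinuous for other reasons), so
take it as an illustration only.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines, gap="-"):
    """Return {chain_id: sequence} built from the ATOM records.

    A single gap character is inserted wherever the residue number jumps
    by more than one within a chain (assumed to mean unsolved residues).
    Only the first model is read; unknown residue names become "X".
    """
    seqs = {}
    last_seen = {}  # chain_id -> (residue number, insertion code)
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break
        if not line.startswith("ATOM"):
            continue
        resname = line[17:20].strip()
        chain = line[21]
        resseq = int(line[22:26])
        icode = line[26:27]
        key = (resseq, icode)
        prev = last_seen.get(chain)
        if prev == key:
            continue  # another atom of the residue we already recorded
        if prev is not None and resseq - prev[0] > 1:
            seqs[chain].append(gap)
        seqs.setdefault(chain, []).append(THREE_TO_ONE.get(resname, "X"))
        last_seen[chain] = key
    return {chain: "".join(seq) for chain, seq in seqs.items()}


def make_atom(serial, atom, resname, chain, resseq):
    """Hypothetical helper: format just enough of an ATOM line for a demo."""
    return "ATOM  %5d %-4s %3s %1s%4d " % (serial, atom, resname, chain, resseq)


example = [
    make_atom(1, "N", "ALA", "A", 1),
    make_atom(2, "CA", "ALA", "A", 1),
    make_atom(3, "N", "GLY", "A", 2),
    make_atom(4, "N", "SER", "A", 5),   # residues 3 and 4 are missing
    make_atom(5, "N", "LYS", "B", 10),
]
print(chain_sequences(example))  # {'A': 'AG-S', 'B': 'K'}
```

Note that two missing residues still give one gap character, which matches
the "single gap" suggestion but loses the length of the break.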

It might not be bad to provide an option either to parse only the SEQRES
records in the file, or to use the data there to fill in the gaps when the
depositor included the sequence data for disordered residues.  I am not enough
of a standards lawyer to know how common that is in PDB entries, or even
whether it is allowed, required, or forbidden, but if it does happen, being
able to take advantage of it would be nice.
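
For the SEQRES-only option, something along these lines might be enough.
Again this is a plain-Python sketch, not an existing Biopython function; the
names seqres_sequences and seqres_line are hypothetical, and it assumes the
usual fixed-column layout of SEQRES records (chain ID in column 12, residue
names from column 20 onward).

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_lines):
    """Return {chain_id: sequence} as stated in the SEQRES records.

    Continuation lines for the same chain are concatenated in file order.
    Residue names outside the 20 standard amino acids become "X".
    """
    seqs = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        chain = line[11]
        for resname in line[19:70].split():
            seqs.setdefault(chain, []).append(THREE_TO_ONE.get(resname, "X"))
    return {chain: "".join(seq) for chain, seq in seqs.items()}


def seqres_line(ser_num, chain, num_res, residues):
    """Hypothetical helper: format a SEQRES line for a demo."""
    return "SEQRES %3d %1s %4d  %s" % (ser_num, chain, num_res, residues)


example = [
    seqres_line(1, "A", 5, "ALA GLY THR LEU SER"),
    seqres_line(1, "B", 1, "LYS"),
]
print(seqres_sequences(example))  # {'A': 'AGTLS', 'B': 'K'}
```

Using this to fill in gaps in the structure-derived sequence would still need
the two sequences lined up against each other, which this sketch does not
attempt.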

Andy

-- 
Andrew Fant    | And when the night is cloudy    | This space to let
Molecular Geek | There is still a light          |----------------------
fant at pobox.com | That shines on me               | Disclaimer:  I don't
Boston, MA     | Shine until tomorrow, Let it be | even speak for myself