[Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML)

Fri Jun 19 16:03:46 UTC 2009

On Fri, Jun 19, 2009 at 5:18 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

>
>
>
> OK - that is useful feedback. I will try and clarify that, but in essence:
>
> * letter_annotations - where you have a bit of information for each letter
>   (i.e. amino acid or nucleotide) in the sequence, such as a list of quality
>   scores or secondary structure predictions.
> * features - where you have annotation associated with a particular
>   region of the sequence (e.g. a gene).
> * annotations - things that apply to the whole sequence, like organism.

Thanks.
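
For reference, here is a minimal sketch of the three kinds of annotation
listed above, all attached to one SeqRecord. The sequence, the ID, the
organism value and the "secondary_structure" key are made up for
illustration; only the attribute names come from the discussion.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical nine-residue protein, used only as an illustration.
record = SeqRecord(Seq("MKVLAAGIV"), id="example_protein")

# annotations: applies to the whole sequence (value is made up).
record.annotations["organism"] = "Escherichia coli"

# letter_annotations: one item per letter, so the length must match the sequence.
record.letter_annotations["secondary_structure"] = "CHHHHHHCC"

# features: annotation tied to a region (0-based, end-exclusive location).
record.features.append(SeqFeature(FeatureLocation(1, 7), type="helix"))
```

Note that letter_annotations rejects values whose length differs from the
sequence length, which is what makes it suitable for per-residue data.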

> Another odd one is any references, which in GenBank files may apply
> to a particular region of the sequence (but in normal usage seem to
> apply to the whole thing). These get stored separately in BioSQL, which
> to me makes sense. At the moment in the SeqRecord they are stored
> in the annotations dictionary (as a list of reference objects under the
> key "references"). I've been thinking about upgrading this to a new
> SeqRecord property (a list of reference objects) but as I have never
> actually needed to access this information it hasn't been a high priority.
>

Good to know. I'll be careful with SeqRecord.annotations['references'] for now.

> >
> > If secondary structure or miscellaneous information is listed in the
> > PDB header, then parse_pdb_header could produce SeqFeatures
> > from that. Right now it doesn't build any Biopython objects at all.
>
> I see. Yes, the header parsing in Bio.PDB is very limited at the
> moment, and even sticking to well defined line types (and ignoring
> many or most of the REMARK lines) there is room for improvement.
>
> For the secondary structure, this is given as a string with one letter
> for each residue - I see this as a more natural match to SeqRecord
> letter_annotations rather than a SeqFeature, but giving a list of
> SeqFeatures for the helices, beta sheets, coils etc would also work.
> Of course, you might also want a Seq object to relate them to (to
> give the locations meaning).
>
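
The two representations mentioned above carry the same information: a
per-residue string can be collapsed into a list of regions. A rough sketch
in plain Python, assuming one-letter structure codes (H, E, C here) and
0-based, end-exclusive coordinates; the function name structure_runs is
hypothetical and not part of Biopython.

```python
from itertools import groupby


def structure_runs(per_residue):
    """Collapse a per-residue string into (start, end, code) runs.

    Coordinates are 0-based and end-exclusive, like Python slices.
    """
    runs = []
    position = 0
    for code, group in groupby(per_residue):
        length = len(list(group))
        runs.append((position, position + length, code))
        position += length
    return runs


# Made-up secondary structure string, one letter per residue.
example = "CCHHHHHCCEEEEC"
for start, end, code in structure_runs(example):
    print(start, end, code)
```

Each tuple could then be turned into a SeqFeature, while the original
string could be stored unchanged in letter_annotations.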
> One idea I have toyed with is a Bio.SeqIO parser for PDB files, which
> would focus on the sequence information in the headers (and probably
> ignore the ATOM lines completely). I would like to keep the core of
> Biopython independent of NumPy (and I see Bio.SeqIO as part of the
> core), so this wouldn't depend on Bio.PDB. I'm not sure this idea
> would actually be useful so haven't worked on it.
>
>
I'll have a real use for this in the fall, once GSoC is done. It would be
nice to link a set of parsed PDB objects to a multiple alignment of protein
sequences, but I think I'd always want to have the 3D structure information
close at hand. The other use case I've mentioned before is to verify and fix
existing PDB files from Biopython, rather than manually -- 3D coordinates
would probably be useful here, too, for checking collisions and such.
Eventually I'll resurrect my pdbtidy branch and make the parser emit a
SeqRecord or whatever's most appropriate.
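
As a rough sketch of the header-only idea discussed above (sequence from the
header, ATOM lines ignored, no NumPy), the following reads SEQRES lines using
the fixed column layout of the PDB format. The function name
sequences_from_seqres and the sample lines are illustrative assumptions; this
is not an existing Bio.SeqIO parser, and unknown residue names are simply
mapped to "X".

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequences_from_seqres(lines):
    """Return {chain_id: one-letter sequence} built from SEQRES lines only."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue  # ATOM, REMARK and all other records are ignored
        chain_id = line[11]              # column 12: chain identifier
        residues = line[19:70].split()   # columns 20-70: residue names
        letters = "".join(THREE_TO_ONE.get(name, "X") for name in residues)
        chains[chain_id] = chains.get(chain_id, "") + letters
    return chains


# Illustrative lines laid out in SEQRES columns.
example_lines = [
    "SEQRES   1 A   21  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
    "SEQRES   2 A   21  TYR GLN LEU GLU ASN TYR CYS ASN",
]
print(sequences_from_seqres(example_lines))
```

Each chain's string could then be wrapped in a Seq and a SeqRecord, with the
chain identifier used as the record ID.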

-Eric