[Biopython-dev] SeqRecord in Tutorial (was GSoC and PhyloXML)

Fri Jun 19 05:18:40 EDT 2009

On Fri, Jun 19, 2009 at 2:57 AM, Eric Talevich<eric.talevich at gmail.com> wrote:
> On Thu, Jun 18, 2009 at 6:05 PM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>>
>> On Thu, Jun 18, 2009 at 10:49 PM, Eric Talevich<eric.talevich at gmail.com>
>> wrote:
>> >
>> > I didn't notice any typos other than Python being consistently
>> > lowercase,
>> > which I assume is how the author likes it.
>>
>> I was aiming for consistency, with no strong preference - at the time
>> there were more uses of "python" than "Python" so I picked that. We
>> can change it easily enough - does anyone care either way?
>
> Python.org capitalizes it. Shrug.

Maybe we should use "Python" then.

>> > The ordering is good -- the SeqIO chapter makes more advanced use
>> > of sequences of SeqRecord objects, so it's good to be familiar with the
>> > basic objects first. In general, I like the organization of covering
>> > fundamental types first, then moving on to larger collections, rather
>> > than covering the majority of a big collection in one shot and leaving
>> > the tricky parts unaddressed.
>>
>> There is a case for leaving messy corner cases to the end, as long as
>> the main chapters cover the core.
>
> Agreed. In the SeqRecord chapter, I was looking for a paragraph or so on
> what sort of information goes into a SeqFeature to see whether it would be a
> suitable stand-in for PhyloXML's DomainArchitecture. From the initial
> description I wasn't sure if annotations or letter_annotations would be more
> appropriate, and the other mentions are basically "here be dragons"...
> which is true, but a quick example would be helpful. The GenBank parsing
> section would be a good place for that.

OK - that is useful feedback. I will try to clarify that, but in essence:

* letter_annotations - where you have a piece of information for each
  letter (i.e. amino acid or nucleotide) in the sequence, such as a list
  of quality scores or secondary structure predictions.
* features - where you have annotation associated with a particular
  region of the sequence (e.g. a gene).
* annotations - things that apply to the whole sequence, like the
  organism.
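
As a rough sketch of those three slots on a single SeqRecord (the
sequence, ID, organism and quality values here are made up purely for
illustration, and this assumes a Biopython recent enough to have
letter_annotations):

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical 21-letter nucleotide sequence, for illustration only.
record = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"), id="example1")

# annotations: applies to the whole sequence.
record.annotations["organism"] = "Escherichia coli"

# letter_annotations: one value per letter, so the list length
# must match the sequence length.
record.letter_annotations["phred_quality"] = [30] * len(record.seq)

# features: annotation tied to a region (here the first nine letters).
gene = SeqFeature(FeatureLocation(0, 9), type="gene")
record.features.append(gene)

# The feature's location can be used to pull out the region it covers.
gene_seq = gene.extract(record.seq)
```

The FeatureLocation uses Python-style coordinates (start included, end
excluded), so FeatureLocation(0, 9) covers the first nine letters.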

There are some odd cases, like the GenBank source feature, which
covers the whole of the sequence but is listed in the feature table just
like a gene, etc. (you'd have to ask the NCBI why they did it this way).
In Biopython, these source features get stored as a SeqFeature for
consistency with the rest of the GenBank feature table entries.

Another odd case is references, which in GenBank files may apply
to a particular region of the sequence (but in normal usage seem to
apply to the whole thing). These get stored separately in BioSQL, which
makes sense to me. At the moment in the SeqRecord they are stored
in the annotations dictionary (as a list of reference objects under the
key "references"). I've been thinking about upgrading this to a new
SeqRecord property (a list of reference objects), but as I have never
actually needed to access this information it hasn't been a high priority.
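
For what it's worth, the current arrangement looks roughly like this.
This is only a sketch: the record and the reference details are
placeholders rather than a real citation, and the Reference object
is built by hand here instead of coming from the GenBank parser.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import Reference
from Bio.SeqRecord import SeqRecord

# Hypothetical record, for illustration only.
record = SeqRecord(Seq("ATGGCCATTGTAATGGGCCGC"), id="example1")

# Build a reference by hand; the values are placeholders.
ref = Reference()
ref.title = "Placeholder title"
ref.authors = "Placeholder authors"

# References live in the annotations dictionary as a list
# under the key "references".
record.annotations.setdefault("references", []).append(ref)

# Reading them back; .get() copes with records that have none.
titles = [r.title for r in record.annotations.get("references", [])]
```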

>> > -- maybe integrating it a little more comfortably into other
>> > modules like PDB would help with that.
>>
>> I don't see how SeqFeature objects and their FeatureLocations
>> relate to PDB. Could you elaborate?
>
> If secondary structure or miscellaneous information is listed in the
> PDB header, then parse_pdb_header could produce SeqFeatures
> from that. Right now it doesn't build any Biopython objects at all.

I see. Yes, the header parsing in Bio.PDB is very limited at the
moment, and even sticking to well-defined line types (and ignoring
many or most of the REMARK lines) there is room for improvement.

For the secondary structure, this is given as a string with one letter
for each residue - I see this as a more natural match to SeqRecord
letter_annotations rather than a SeqFeature, but giving a list of
SeqFeatures for the helices, beta sheets, coils, etc. would also work.
Of course, you might also want a Seq object to relate them to (to
give the locations meaning).
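
To show that the two representations carry the same information,
here is a plain Python sketch that collapses a per-residue string into
regions. The string and the H/E/C codes (helix, strand, coil) are
assumptions for illustration, and the helper name is made up.

```python
from itertools import groupby


def structure_regions(ss):
    """Collapse a per-residue string into (start, end, code) runs.

    Coordinates are Python style: start included, end excluded.
    """
    regions = []
    pos = 0
    for code, group in groupby(ss):
        length = len(list(group))
        regions.append((pos, pos + length, code))
        pos += length
    return regions


# Hypothetical secondary structure string, one letter per residue.
ss = "CCHHHHHCCEEEEC"
regions = structure_regions(ss)
# regions is now:
# [(0, 2, 'C'), (2, 7, 'H'), (7, 9, 'C'), (9, 13, 'E'), (13, 14, 'C')]
```

The string itself could go straight into letter_annotations (it has one
entry per residue), while each tuple could be turned into a SeqFeature
with a FeatureLocation(start, end) if the region view is preferred.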

One idea I have toyed with is a Bio.SeqIO parser for PDB files, which
would focus on the sequence information in the headers (and probably
ignore the ATOM lines completely). I would like to keep the core of
Biopython independent of NumPy (and I see Bio.SeqIO as part of the
core), so this wouldn't depend on Bio.PDB. I'm not sure this idea
would actually be useful, so I haven't worked on it.

Peter