[Biopython-dev] upcoming Bio.PDB enhancements - RNA

Wed Jun 2 08:17:01 UTC 2010

Hi,

>> >>> from Bio.Struct import RNA
..
>> > I don't necessarily see a problem with Bio.Struct or Bio.Structure
>> > covering both 2D and 3D structures.

Eric, I agree with you - the secondary structure of RNA maps nicely to 3D
space. Generally, I think working with 2D structures in the absence of 3D
information is a little more common for RNA than for proteins - perhaps
RNA 2D prediction is simply a less nasty target.

Eric wrote:

> I could be totally wrong here, but I think it's useful to lay out some
> assumptions and intuitions explicitly.
>
> To me, secondary structure is not really a separate dimension in its own
> right, the way tertiary structure corresponds to 3D space and primary
> structure corresponds to a linear sequence. Instead, secondary structure
> has meaning in 3D space, but is usually serialized as a linear sequence.
> That is, we want to parse something that resembles a sequence, but be able
> to map it onto a 3D structure. (More for proteins than for RNA, usually.)
>
> (For non-RNA folk, here's an example of RNA secondary structure:
> http://github.com/krother/biopython/blob/rna/Tests/RNA/sample.vienna
> )
>
> For instance, the output of DSSP and Jpred describes a protein's secondary
> structure, but the input to DSSP is a 3D structure, while Jpred accepts a
> protein sequence. The representation of secondary structure isn't distinct
> from either of these. I'd want both of these available in Bio.Struct
> (eventually).
>
> This means that some interaction between Bio.Struct and SeqIO is
> necessary. It would be neat if secondary structure regions were
> represented as SeqFeature instances, and secondary-structure parsers
> returned some kind of subclass of SeqRecord -- or a standard SeqRecord
> containing a special kind of Seq.

So far the Secstruc parsers I've implemented just return
(sequence, secstruc) tuples. But putting this into a SeqRecord makes sense
- I understand this fits the Biopython architecture better.
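
A minimal sketch of that step, going from a (sequence, secstruc) tuple to a
record-like object, might look as follows. It assumes a three-line
Vienna-style input (header, sequence, dot-bracket string); the names
parse_vienna, to_record and AnnotatedRecord are hypothetical stand-ins, not
existing Biopython API, and the real SeqRecord is not used here.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedRecord:
    """Hypothetical stand-in for a SeqRecord carrying a secstruc field."""
    id: str
    seq: str
    letter_annotations: dict = field(default_factory=dict)


def parse_vienna(text):
    """Return (name, sequence, secstruc) from a three-line Vienna-style entry.

    Assumed layout: '>name', then the sequence, then the dot-bracket string,
    optionally followed by an energy value on the same line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or not lines[0].startswith(">"):
        raise ValueError("expected header, sequence and structure lines")
    name = lines[0][1:].strip()
    seq = lines[1]
    secstruc = lines[2].split()[0]  # drop a trailing energy, if present
    if len(seq) != len(secstruc):
        raise ValueError("sequence and structure differ in length")
    return name, seq, secstruc


def to_record(name, seq, secstruc):
    """Wrap the parser output so the structure travels with the sequence."""
    return AnnotatedRecord(
        id=name, seq=seq, letter_annotations={"secstruc": secstruc}
    )


name, seq, secstruc = parse_vienna(">hairpin\nGGGAAACCC\n(((...))) (-1.20)\n")
record = to_record(name, seq, secstruc)
```

Keeping the structure as a per-letter annotation means it stays aligned
with the sequence position by position.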

Maybe instead of a Seq or SeqRecord subclass we could use the decorator
pattern (decorating a class, not the Python decorator function syntax).
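
As a rough illustration of the decorator pattern in this sense, the sketch
below wraps an arbitrary record object and adds a secstruc attribute plus a
base-pair helper, delegating everything else to the wrapped object. The
class name SecstrucDecorator and the method base_pairs are hypothetical, and
a SimpleNamespace stands in for a real SeqRecord.

```python
from types import SimpleNamespace


class SecstrucDecorator:
    """Wrap a record and add secondary-structure behaviour (hypothetical)."""

    def __init__(self, record, secstruc):
        if len(secstruc) != len(record.seq):
            raise ValueError("sequence and structure differ in length")
        self._record = record
        self.secstruc = secstruc

    def __getattr__(self, name):
        # Only called when normal lookup fails: delegate to the wrapped record.
        if name == "_record":
            raise AttributeError(name)
        return getattr(self._record, name)

    def base_pairs(self):
        """Return sorted (i, j) index pairs from a dot-bracket string."""
        stack, pairs = [], []
        for i, char in enumerate(self.secstruc):
            if char == "(":
                stack.append(i)
            elif char == ")":
                if not stack:
                    raise ValueError("unbalanced closing bracket")
                pairs.append((stack.pop(), i))
        if stack:
            raise ValueError("unbalanced opening bracket")
        return sorted(pairs)


plain = SimpleNamespace(id="hairpin", seq="GGGAAACCC")  # stand-in record
decorated = SecstrucDecorator(plain, "(((...)))")
```

The wrapped object keeps its own class, so code that only knows about plain
records can still read decorated.id and decorated.seq.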

A potential problem that I'd like to point out early is that we work with
modified RNA nucleotides a lot (up to 20% of the residues in a tRNA). This
would require extending the RNA alphabet (which is currently just "AGCU")
- but I see this as separate from the Bio.XXXX.read() thread.
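
To make the alphabet issue concrete, here is a small sketch of checking a
sequence against a plain and an extended letter set. The one-letter codes
chosen for the modified residues are illustrative assumptions, not something
settled in this thread, and the helper names are hypothetical.

```python
STANDARD_RNA = frozenset("AGCU")

# Illustrative one-letter codes for modified residues (an assumption made
# for this example only).
MODIFIED_RNA = {
    "P": "pseudouridine",
    "D": "dihydrouridine",
    "T": "ribothymidine",
}

EXTENDED_RNA = STANDARD_RNA | frozenset(MODIFIED_RNA)


def invalid_letters(seq, alphabet=EXTENDED_RNA):
    """Return the sorted letters of seq that are not in the alphabet."""
    return sorted(set(seq.upper()) - alphabet)


def modified_fraction(seq):
    """Return the fraction of residues that are modified nucleotides."""
    if not seq:
        return 0.0
    return sum(char in MODIFIED_RNA for char in seq.upper()) / len(seq)
```

With the plain "AGCU" alphabet, a sequence containing any of the extra
letters is reported as invalid; with the extended set it passes.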

> The secondary-structure parsers for RNA and proteins should be separate,
> too, since the annotated features are different. So the function
> Bio.Struct.read() can apply exclusively to 3D structures. Would it be
> reasonable for Bio.Struct.RNA.read() to apply exclusively to RNA secondary
> structures -- assuming that anything that's not a secondary structure, 3D
> structure, or nucleotide sequence is something special that belongs in its
> own module?

To summarize, we could use:

1) protein 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

2) RNA 3D structures:
   Bio.Struct.read() --> Bio.PDB.Structure

3) RNA 2D structures:
   Bio.Struct.RNA.read() --> Bio.SeqRecord (extended/decorated by a
   secstruc field)

4) protein 2D structures: use a special parser module?

5) plain sequences:
   Bio.read() --> Bio.SeqRecord
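
The same mapping written as a lookup table, purely as a sketch of the
proposal: the entry-point names are the ones suggested in this thread, none
of them is existing Biopython API, and the helper entry_point is
hypothetical.

```python
# (molecule, level) -> (proposed entry point, proposed return type).
# None marks the case that is still undecided.
PROPOSED_ENTRY_POINTS = {
    ("protein", "3D"): ("Bio.Struct.read", "Bio.PDB.Structure"),
    ("RNA", "3D"): ("Bio.Struct.read", "Bio.PDB.Structure"),
    ("RNA", "2D"): ("Bio.Struct.RNA.read", "Bio.SeqRecord + secstruc field"),
    ("protein", "2D"): None,
    ("sequence", "1D"): ("Bio.read", "Bio.SeqRecord"),
}


def entry_point(molecule, level):
    """Look up the proposed reader; raise KeyError for unknown cases."""
    return PROPOSED_ENTRY_POINTS[(molecule, level)]
```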

Eric, does this summarize your thoughts correctly?

This would work for me. Any comments from the others?

Best,
   Kristian