[Biopython-dev] Biopython 1.60 plans and beyond

Mon Feb 20 16:20:33 UTC 2012

Hey João,

On Mon, Feb 20, 2012 at 9:30 AM, João Rodrigues <anaryin at gmail.com> wrote:

> Hi all,
>
> Answering what "concerns" me :)
>
>>
>> > If there are disordered regions (very common), the missing residues are
>> > replaced with 'X' characters. These residues can be listed in the SEQRES
>> > lines of the PDB header, if it's available, but they're not included with
>> > the atomic coordinates, so PdbIO can't reliably fill in these disordered
>> > residues for all PDB files. This matches the behavior of the tool I was
>> > using before (which is non-free and not widely used).
>>
>
> The SEQRES contains the sequence used in the construct expressed and
> crystallized so it's never incomplete. What I've done in the past in these
> situations is iterate over the SEQRES and fill as '-' those residues that
> do not have coordinates.
>

OK, we should implement that then. Perhaps we can avoid both the
conditional numpy/PDB import and code duplication if we let
parse_pdb_header call SeqIO.PdbIO for SEQRES lines.
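
A minimal sketch of the gap-filling idea, in plain Python so it needs neither numpy nor Bio.PDB. The function name seqres_with_gaps is hypothetical, and it leans on a big assumption: that ATOM residue numbers count from 1 along the SEQRES sequence. Many real files break that assumption (offsets, insertion codes), in which case an alignment step would be needed instead.

```python
# Three-letter to one-letter codes for the 20 standard residues.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_with_gaps(pdb_lines, chain_id, gap="-"):
    """Return the SEQRES sequence of one chain, gapped where no ATOM exists.

    ASSUMPTION (hypothetical helper): ATOM residue numbers start at 1 and
    index directly into the SEQRES sequence of the same chain.
    """
    full = []         # residue names listed in SEQRES, in order
    observed = set()  # residue numbers that have atomic coordinates
    for line in pdb_lines:
        if line.startswith("SEQRES") and line[11] == chain_id:
            # Residue names occupy columns 20-70 of a SEQRES record.
            full.extend(line[19:70].split())
        elif line.startswith("ATOM") and line[21] == chain_id:
            # Residue sequence number occupies columns 23-26.
            observed.add(int(line[22:26]))
    return "".join(
        THREE_TO_ONE.get(name, "X") if number in observed else gap
        for number, name in enumerate(full, start=1)
    )
```

Calling seqres_with_gaps(open("bar.pdb"), "A") would then give the full construct sequence for chain A, with '-' wherever a residue has no coordinates and 'X' for non-standard residue names.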

What about PDB files that don't have SEQRES lines? Should we...
- Fall back to ATOM parsing automatically
- Allow a flag for fallback (use_atoms_if_absolutely_must=False)
- Require the user to specify whether to use SEQRES or ATOMs
(use_seqres=True)
- Use different format names, e.g. "pdb-seqres" and "pdb"/"pdb-atom"?

Keeping in mind that secondary structure is also best represented as a
SeqRecord, we could use "pdb-ss" or similar as another format eventually.
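
To make the format-name option concrete, here is a rough sketch of how the dispatch could look. Everything in it is hypothetical (read_pdb_residues, the PDB_FORMATS table, and the choice to let plain "pdb" fall back from SEQRES to ATOM records); it only illustrates the trade-off, not how SeqIO would actually register formats.

```python
def seqres_residues(lines):
    """Residue names per chain, taken from SEQRES records."""
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            chains.setdefault(line[11], []).extend(line[19:70].split())
    return chains


def atom_residues(lines):
    """Residue names per chain, one per distinct residue in ATOM records."""
    chains = {}
    seen = set()
    for line in lines:
        if line.startswith("ATOM"):
            # Chain ID plus residue number and insertion code identify a residue.
            key = (line[21], line[22:27])
            if key not in seen:
                seen.add(key)
                chains.setdefault(line[21], []).append(line[17:20])
    return chains


# Hypothetical format names, as floated above.
PDB_FORMATS = {"pdb-seqres": seqres_residues, "pdb-atom": atom_residues}


def read_pdb_residues(lines, fmt="pdb"):
    """Dispatch on the format name; plain "pdb" falls back to ATOM records."""
    lines = list(lines)  # the fallback may need a second pass
    if fmt == "pdb":
        return seqres_residues(lines) or atom_residues(lines)
    if fmt not in PDB_FORMATS:
        raise ValueError("Unknown PDB format name: %r" % fmt)
    return PDB_FORMATS[fmt](lines)
```

With explicit names the user always knows which records the sequence came from; the automatic fallback is more convenient but hides that difference.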

>  I don't know if I have any decent version of my MODELLER PIR format SeqIO
> stuff on github, but maybe we could work together to make it consistent
> (since what I wanted was PDB to seq essentially) ? Or maybe these are two
> different points of view for the same problem and need different
> solutions...
>
> https://github.com/JoaoRodrigues/biopython/tree/modeller-pirIO
>

Let's try to decouple these. I remember the original use case -- our goal
would be to create Modeller-ready files with code like:

from Bio import AlignIO, SeqIO

target = SeqIO.read("foo.fa", "fasta")
template = SeqIO.read("bar.pdb", "pdb")  # "pdb" is the proposed format name
aln = ...  # Pairwise alignment of target and template
AlignIO.write(aln, "foobar.pir", "pir")

How much more information would we need to extract from the PDB file (that
isn't normally in a SeqRecord) to satisfy Modeller?
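
As a rough illustration of the extra fields involved, assuming Modeller's usual PIR layout (a ">P1;code" line, then a header of ten colon-separated fields, then the sequence terminated by "*"), a writer would need at least the first and last residue numbers and chain IDs on top of the sequence. The helper name modeller_pir_entry and its parameters are made up for this sketch.

```python
def modeller_pir_entry(code, sequence, start, start_chain, end, end_chain,
                       name="", source="", resolution="", rfactor=""):
    """Format one template entry in Modeller-style PIR (hypothetical helper).

    Header fields, in order: structure type, PDB code, first residue number,
    first chain ID, last residue number, last chain ID, protein name,
    source organism, resolution, R-factor. The last four are left empty
    here unless the caller supplies them.
    """
    header = ":".join(str(field) for field in (
        "structureX", code, start, start_chain, end, end_chain,
        name, source, resolution, rfactor,
    ))
    # The sequence block ends with "*" in PIR.
    return ">P1;%s\n%s\n%s*\n" % (code, header, sequence)
```

So beyond what a SeqRecord normally carries, the residue range and chain IDs look like the essential extras; name, source, resolution and R-factor could come from the PDB header when present.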

>> Rather than literally copying it, do you think it is realistic to make
>> some of Bio.PDB work without NumPy? e.g. fall back on tuples
>> of floats (x,y,z) for atom co-ordinates. Just brainstorming - this
>> might be a horrible idea?
>>
>
> I kind of disagree because otherwise we'd have to convert them to numpy
> arrays every time we need them.
>

For atomic coordinates, I don't think there's a pressing need to make numpy
optional, but perhaps we could refactor parse_pdb_header to work without
loading numpy. That would give us access to SEQRES lines, secondary
structure, PDB ID, deposition date, etc. if they're specified in the header.
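
For a sense of how little is needed, here is a numpy-free sketch that pulls those items out by fixed column positions from the PDB format. The function name parse_header_records and the layout of the returned dict are assumptions for illustration, not the existing parse_pdb_header API.

```python
def parse_header_records(lines):
    """Collect a few header-level records using only string slicing.

    Hypothetical helper: returns a dict with the PDB ID, deposition date,
    SEQRES residue names per chain, and helix/strand residue ranges.
    """
    info = {
        "idcode": None,
        "deposition_date": None,
        "seqres": {},    # chain ID -> list of three-letter residue names
        "helices": [],   # (chain ID, first residue number, last residue number)
        "strands": [],   # (chain ID, first residue number, last residue number)
    }
    for line in lines:
        record = line[:6]
        if record == "HEADER":
            info["deposition_date"] = line[50:59].strip()  # columns 51-59
            info["idcode"] = line[62:66].strip()           # columns 63-66
        elif record == "SEQRES":
            info["seqres"].setdefault(line[11], []).extend(line[19:70].split())
        elif record == "HELIX ":
            info["helices"].append(
                (line[19], int(line[21:25]), int(line[33:37])))
        elif record == "SHEET ":
            info["strands"].append(
                (line[21], int(line[22:26]), int(line[33:37])))
    return info
```

Since this never touches ATOM records, nothing in it needs coordinates or arrays, which is the part of Bio.PDB that really depends on numpy.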

> Regarding my own work, I've been slowly working on cleaning a bit Bio.PDB
> (for example, all those get_X methods that just return class attributes)
> and organising my own GSoC code into it and in Bio.Struct. I don't know
> when I have this even "alpha"-testable, it's been a long road and I had a
> couple of computer crashes that made me lose my data so.. When would there
> be a soft deadline for 1.60?
>
>

Cool, no worries about the timeline. I think it's generally best if major
new feature sets are merged shortly after a stable release, so
bleeding-edge users (like us) have time to use the new code in a variety of
situations and find bugs and design issues.

However, if you have a stub of Bio/Struct/__init__.py that you feel is
ready to merge right after this week's release, I think we could start
there and add new features under that namespace in the coming months.

Cheers,
Eric