[Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas

Thu Mar 1 18:03:49 UTC 2012

2012/3/1 Eric Talevich <eric.talevich at gmail.com>:
>
> Here's one semi-coherent project idea that could fly:
>
> Overhaul Biopython's parsing infrastructure for protein
> primary, secondary and tertiary structures
>
> - Refactor PDBParser and parse_pdb_header to allow parsing
>   amino-acid sequences from SEQRES lines (header) and ATOM
>   records (body) without building the PDB structure object,
>   i.e. without using numpy
> - Write a pure-Python replacement for parsing mmCIF files.
>   (The module MMCIF2Dict already does almost all the work;
>   lex+yacc just manages a fairly simple state machine for
>   recognizing comments, special sub-sections, etc.)
> - Wrap the parsers for PDB, PDBML and mmCIF under a common
>   I/O interface under the Bio.Struct namespace
> - Add parsing support for protein secondary structures,
>   based on the relevant PDB records or (perhaps) DSSP
>   output. (Note that João did some work on this already.)

Do you think you could mentor that? One serious downside
would be even more work on PDB-related code, which will
make future merging even harder. We do need to tackle the
GSoC backlog as a priority.
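
To make the first bullet concrete, here is a rough sketch of pulling
chain sequences out of SEQRES records in pure Python, without building
a structure object or touching numpy. The function name
seqres_sequences and the lookup table are made up for illustration and
are not existing Bio.PDB API; the column positions assume the standard
fixed-width PDB layout (chain ID in column 12, residue names from
column 20 onwards).

```python
# Hypothetical helper, not part of Bio.PDB: reads SEQRES records only.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(lines):
    """Return {chain_id: one-letter sequence} from SEQRES records.

    Assumes the fixed-width PDB layout: chain ID in column 12 and
    three-letter residue names from column 20 onwards. Residue names
    missing from THREE_TO_ONE are reported as "X".
    """
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue  # skip every other record type (ATOM, HETATM, ...)
        chain_id = line[11]
        residues = line[19:].split()
        chains.setdefault(chain_id, []).extend(
            THREE_TO_ONE.get(name, "X") for name in residues
        )
    return {chain: "".join(seq) for chain, seq in chains.items()}
```

Something like this could be called on an open file handle directly,
e.g. seqres_sequences(open("example.pdb")), where the filename is just
a placeholder.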

> Variants
> --------
>
> So, from the Biopython 1.60 thread:
>
> - James Casbon has offered to merge PyVCF into Biopython, right?
> - BCF, the binary form of VCF (via blocked gzip), may also
>   be worthwhile to support
> - GVF, the Genome Variation Format, appears to be intended
>   to be competitive with VCF. It's probably at least as well
>   thought-out as VCF, sight unseen. It's based on GFF.
>
> Synthesizing the above, we have a GSoC project that looks like:
>
> - Help merge PyVCF into Biopython (w/ James's support -- I
>   don't mean to volunteer him for this in absentia)?
> - Write a GVF parser that emits the same object type as
>   PyVCF, potentially also using existing GFF code
> - Time permitting, look into blocked gzip support for VCF
>   (BCF), also looking at SAM/BAM for inspiration and
>   reusable code.

Sounds interesting - who might be willing to mentor it?
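
For the GVF bullet, a minimal sketch of what such a parser might look
like, given that GVF reuses the nine tab-separated GFF3 columns with
Reference_seq and Variant_seq attributes. The Variant tuple here is a
made-up stand-in; it is NOT the PyVCF record class, which a real
implementation would presumably emit instead.

```python
from collections import namedtuple
from urllib.parse import unquote

# Hypothetical record type for illustration only (not PyVCF's).
Variant = namedtuple("Variant", "chrom start end type ref alts id")


def parse_gvf_line(line):
    """Turn one GVF feature line (nine GFF3-style columns) into a Variant."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, _source, type_, start, end, _score, _strand, _phase, attrs = cols
    attributes = {}
    for field in attrs.split(";"):
        if field:
            key, _, value = field.partition("=")
            # GFF3 attribute values are comma-separated and URL-escaped
            attributes[key] = [unquote(v) for v in value.split(",")]
    return Variant(
        chrom=seqid,
        start=int(start),
        end=int(end),
        type=type_,
        ref=attributes.get("Reference_seq", [None])[0],
        alts=attributes.get("Variant_seq", []),
        id=attributes.get("ID", [None])[0],
    )


def parse_gvf(handle):
    """Iterate over Variant records, skipping pragma/comment and blank lines."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_gvf_line(line)
```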

>> SearchIO?
>> ---------
>>
>> I'm wondering if a Biopython SearchIO would make a good project,
>> that I might supervise. This name is obviously based on BioPerl. I
>> would be aiming for an iterator-based parser/writer framework (like
>> SeqIO and AlignIO) for pairwise 'sequence' searches initially, but have
>> also been thinking about indexing - at least by query, ideally also by
>> match, to allow random access akin to what Bio.SeqIO.index offers.
>>
>> In some cases the results would also be pairwise sequence alignments,
>> in which case some code can be shared/linked with AlignIO. In other
>> cases all you get is co-ordinates of the query and match plus some
>> kind of score. Therefore this could include a hierarchical SearchIO
>> result object structure for minimal matches up to full pairwise
>> alignments.
>>
>> I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not
>> really sequence vs sequence, but HMM vs sequence), RPS-BLAST
>> (again not really sequence vs sequence). Perhaps this could also tie
>> into the Bio.Motif code as well (if we consider things like PSSM vs
>> sequence in the same framework).
>>
>> You can already do some of this in Biopython (e.g. BLAST XML
>> parsing, and there is some HMMER work on branches), but I'm
>> hoping for a unified API here.
>>
>
> Interesting. It would be very nice if the objects emitted by SearchIO
> could be easily fed into GenomeDiagram.

Funnily enough, that is one of my motivations - specifically for doing
ACT-style diagrams comparing multiple genomes to each other. I've
just started putting some examples into the Tutorial on this today,
where I say ideally you'd parse some BLAST output or whatever,
but here I'm manually typing in a list of links to draw ;)
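
As a very rough illustration of the iterator idea for the simplest
case, here is a sketch that walks standard 12-column BLAST tabular
output and yields the hits one query at a time. The names Hit and
parse_blast_tabular are invented for this example and are not an
existing Biopython API; it also assumes all hits for a query sit on
consecutive lines, as BLAST writes them.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical result object: one row of 12-column BLAST tabular output.
Hit = namedtuple(
    "Hit",
    "query match identity length mismatches gap_opens "
    "query_start query_end match_start match_end evalue bitscore",
)


def _parse_hit(line):
    f = line.rstrip("\n").split("\t")
    return Hit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


def parse_blast_tabular(handle):
    """Yield (query_id, [Hit, ...]) pairs, one pair per query.

    Comment lines (starting with '#') and blank lines are skipped.
    Only one query's hits are held in memory at a time.
    """
    hits = (
        _parse_hit(line)
        for line in handle
        if line.strip() and not line.startswith("#")
    )
    for query, group in groupby(hits, key=lambda hit: hit.query):
        yield query, list(group)
```

The query and match coordinates on each hit are the sort of thing that
could then be turned into the links between tracks, rather than typing
them in by hand.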

Peter