[Biopython-dev] Project ideas for GSoC (or other student projects)

Thu Mar 21 12:29:29 EDT 2013

On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Right now we need to put this list of ideas on the wiki page (ready
> for combining into the OBF page which will be shown to Google
> to make our case for taking part in the GSoC 2013 program).
> http://biopython.org/wiki/Google_Summer_of_Code
>
> If any of you as a potential mentor want to put up an outline
> proposal, even better.
>

I've been wondering about potential GSoC projects which I'd
be interested in mentoring (or co-mentoring), and thus far I've
only got one outline idea.

I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
functionality (which does whole record parsing on demand)
and extending this with lazy-loading or lazy-parsing (which
has precedent in our BioSQL wrappers). For example, with
whole genome FASTA files you may never need to load the
entire sequence, but using an index system like tabix (or
even actually using a tabix index) Biopython could provide
a lazy-loading Seq object which extracts only the sequence
region of interest on demand.
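
As a rough sketch of the idea (this is not existing Biopython
code; build_fasta_index and fetch are hypothetical names, and the
index layout is loosely modelled on a samtools-style .fai file),
assuming every sequence line within a record has the same width
and there are no blank lines:

```python
import io


def build_fasta_index(handle):
    """Scan a binary FASTA handle once.

    Returns {name: [length, offset, bases_per_line, bytes_per_line]},
    where offset is the file position of the first base.
    """
    index = {}
    name = None
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = [0, handle.tell(), 0, 0]
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            if bases == 0:
                continue
            entry = index[name]
            if entry[2] == 0:
                # First sequence line defines the line geometry.
                entry[2] = bases
                entry[3] = len(line)
            entry[0] += bases
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) without reading the whole record."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    last = offset + (end // bases_per_line) * bytes_per_line + end % bases_per_line
    handle.seek(first)
    raw = handle.read(last - first)
    # Drop the line breaks that fall inside the requested region.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


demo = io.BytesIO(b">chr1 test\nACGT\nTTGG\nCC\n>chr2\nGGGG\nAA\n")
demo_index = build_fasta_index(demo)
print(fetch(demo, demo_index, "chr1", 2, 6))
```

A lazy-loading Seq object could then call something like fetch()
from its slicing method, so only the requested region is read.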

The same idea applies to richer file formats too, like EMBL
and GenBank. Here lazy loading the sequence is actually
easier (the number of bases per line is strictly defined),
but you can apply the same ideas to lazy loading features
too. This means indexing both the sequence and the feature
table.
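
To illustrate why the fixed layout helps, here is a small sketch
for GenBank-style sequence lines. It assumes the usual layout of
a 9 character right-justified coordinate followed by six
space-separated blocks of 10 bases; format_origin and base_offset
are hypothetical helper names, and EMBL would need different
constants:

```python
def format_origin(seq):
    """Write a sequence in the assumed GenBank ORIGIN line layout."""
    lines = []
    for i in range(0, len(seq), 60):
        chunk = seq[i:i + 60]
        groups = " ".join(chunk[j:j + 10] for j in range(0, len(chunk), 10))
        lines.append("%9d %s\n" % (i + 1, groups))
    return "".join(lines)


def base_offset(i, newline_len=1):
    """Offset of 0-based base i, relative to the first sequence line."""
    line, within = divmod(i, 60)
    # A full line is 9 (number) + 6 * (1 space + 10 bases) = 75 characters.
    return line * (75 + newline_len) + 10 + within + within // 10


example_seq = "".join("acgt"[(i * i + i // 7) % 4] for i in range(125))
example_text = format_origin(example_seq)
print(example_text[base_offset(70)] == example_seq[70])
```

So once the offset of the first sequence line is indexed, any
base can be located with arithmetic alone, with no scanning.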

Likewise, this makes sense for GTF/GFF/GFF3, where you
would index the features and, if present, also index the
embedded FASTA sequence at the end of the file. Ideally
this would build on Lenna and Brad's work with the
underlying parser.

There are two technical sides to what I have in mind.
First, the index format (binning strategies etc.), for
which we should review tabix, BAM's indexing, and CSI, the
planned replacement for BAM's index (able to handle longer
references).
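
For reference, the binning scheme used by the BAM index and by
tabix can be sketched in a few lines of Python. This is a
transcription of the reg2bin calculation given in the SAM
specification (0-based, half-open coordinates); it is shown only
to make "binning strategy" concrete, not as a proposal for what
Biopython should adopt:

```python
def reg2bin(beg, end):
    """Smallest bin fully containing [beg, end), per the SAM spec scheme.

    Bins are nested in six levels, from one 512 Mbp bin (bin 0)
    down to 16 kbp bins (bins 4681 onwards).
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


print(reg2bin(0, 100), reg2bin(16383, 16385))
```

The fixed 2^29 upper limit in this scheme is the restriction on
reference length that CSI is meant to lift.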

Second, to avoid code duplication, this would mean some
re-factoring of the existing parser code to ensure that if
a record is loaded in full via the traditional API, it would
go through the same code as if it were loaded via the new
lazy loading approach. Potentially the existing parsers
could optionally also become lazy loaders (provided they
own the file handle, since they would use seek and tell
to move the file pointer). That in theory
could make our parsers much faster (depending on the
overheads) for tasks where only a minority of the data
is ever used. I've had some fun chats with Pjotr Prins
from BioRuby about this at a CodeFest/BOSC meeting.
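
To make the seek/tell point concrete, here is a minimal sketch
using plain FASTA. LazyFastaRecord and iter_lazy are made-up
names for illustration, not part of Biopython. The iterator
notes each record's offset with tell(), and the record only
seeks back and parses its sequence on first access:

```python
import io


class LazyFastaRecord:
    """Hypothetical proxy: remembers where a record starts, parses on demand."""

    def __init__(self, handle, offset):
        self._handle = handle
        self._offset = offset
        self._seq = None

    @property
    def seq(self):
        if self._seq is None:
            self._handle.seek(self._offset)
            self._handle.readline()  # skip the ">" header line
            parts = []
            for line in iter(self._handle.readline, b""):
                if line.startswith(b">"):
                    break
                parts.append(line.strip())
            self._seq = b"".join(parts).decode()
        return self._seq


def iter_lazy(handle):
    """Yield (name, LazyFastaRecord) pairs without parsing any sequence."""
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            resume = handle.tell()
            yield line[1:].split()[0].decode(), LazyFastaRecord(handle, offset)
            # The caller may have moved the pointer via .seq, so restore it.
            handle.seek(resume)


data = io.BytesIO(b">a first\nAC\nGT\n>b\nTTTT\n")
records = dict(iter_lazy(data))
print(records["b"].seq)
```

The seek(resume) after the yield is the handle ownership issue
in miniature: the parser must be free to move the file pointer
and put it back, which it cannot safely do if other code is
sharing the handle.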

Brad and Lenna, I've CC'd you explicitly as I'm guessing
from the GFF work you are most likely to have considered
some of these issues.

Does this sound like something worth exploring further,
and worth proposing as an outline GSoC project? I think
it would be quite a challenging project - but like last year,
it is something I would like to try myself if I had the time.

Regards,

Peter