[Biopython-dev] Project ideas for GSoC (or other student projects)

Fri Mar 22 12:48:34 UTC 2013

Peter;

> I've been wondering about potential GSoC projects which I'd
> be interested in mentoring (or co-mentoring), and thus far I've
> only got one outline idea.
>
> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
> functionality (which does whole record parsing on demand)
> and extending this with lazy-loading or lazy-parsing (which
> has precedent in our BioSQL wrappers). For example, with
> whole genome FASTA files you may never need to load the
> entire sequence, but using an index system like tabix (or
> even actually using a tabix index) Biopython could provide
> a lazy-loading Seq object which extracts only the sequence
> region of interest on demand.

This sounds incredibly useful. It's definitely worth writing up if
you'll have time to mentor it this summer.
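
To make the lazy-loading idea concrete, here is a minimal pure-Python
sketch. It is not Bio.SeqIO.index and it is not tabix: it swaps in a
simpler faidx-style offset index, and the names build_fasta_index and
fetch_region are hypothetical. It assumes every sequence line within a
record has the same width, except possibly the last one.

```python
def build_fasta_index(path):
    """Scan a FASTA file once, recording where each sequence starts
    and how its lines are laid out (hypothetical helper)."""
    index = {}
    with open(path, "rb") as handle:
        name = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record ID is the first word of the header line.
                name = line[1:].split()[0].decode()
                index[name] = {
                    "offset": handle.tell(),  # byte offset of first base
                    "length": 0,              # total number of bases
                    "line_bases": 0,          # bases per full line
                    "line_width": 0,          # bytes per full line, with newline
                }
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["line_bases"] == 0:
                    entry["line_bases"] = bases
                    entry["line_width"] = len(line)
                entry["length"] += bases
    return index


def fetch_region(path, index, name, start, end):
    """Return bases [start, end) (0-based, end exclusive) of one record,
    reading only the bytes that cover that region."""
    entry = index[name]
    start = max(0, start)
    end = min(end, entry["length"])
    if start >= end:
        return ""
    bases, width = entry["line_bases"], entry["line_width"]
    # Convert base coordinates to byte offsets, allowing for newlines.
    first = entry["offset"] + (start // bases) * width + start % bases
    last = entry["offset"] + ((end - 1) // bases) * width + (end - 1) % bases
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

A lazy Seq-like object could hold just the path, record name and index
entry, and call something like fetch_region only when it is sliced.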

> Likewise, this makes sense for GTF/GFF/GFF3 where you
> would index the features, and also if present index the
> embedded FASTA sequence at the end of the file.

I'm cc'ing Ryan, who has been thinking about similar work as part of
gffutils. The current plan is to roll the BCBio.GFF parsing into
gffutils so that we can parse GFF files, index the features in a
SQLite database, and expose them as Biopython objects. Here is some
initial discussion and planning:

https://github.com/daler/gffutils/issues/2
https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing
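
To give a flavour of the parse-then-index step, here is a rough sketch
using only the standard library sqlite3 module. The table layout and
the names index_gff and features_in_region are assumptions made up for
this example; they are not the gffutils schema or API, and the
attributes column is stored as unparsed text.

```python
import sqlite3


def index_gff(lines, db_path=":memory:"):
    """Load GFF3 feature lines into a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE features (seqid TEXT, source TEXT, ftype TEXT, "
        "feat_start INTEGER, feat_end INTEGER, strand TEXT, attributes TEXT)"
    )
    conn.execute(
        "CREATE INDEX features_region ON features (seqid, feat_start, feat_end)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
        rows.append((seqid, source, ftype, int(start), int(end), strand, attrs))
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def features_in_region(conn, seqid, start, end):
    """Return (ftype, start, end, attributes) tuples for features that
    overlap the 1-based inclusive interval [start, end] on seqid."""
    cursor = conn.execute(
        "SELECT ftype, feat_start, feat_end, attributes FROM features "
        "WHERE seqid = ? AND feat_start <= ? AND feat_end >= ? "
        "ORDER BY feat_start, feat_end",
        (seqid, end, start),
    )
    return cursor.fetchall()
```

Each returned row could then be wrapped in a Biopython SeqFeature on
demand, so nothing beyond the queried region needs to be in memory.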

Brad