[Biopython-dev] Project ideas for GSoC (or other student projects)

Fri Mar 22 16:20:45 UTC 2013

Hi Brad & Peter -

On 03/22/2013 08:48 AM, Brad Chapman wrote:
> Peter;
>
>> I've been wondering about potential GSoC projects which I'd
>> be interested in mentoring (or co-mentoring), and thus far I've
>> only got one outline idea.
>>
>> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
>> functionality (which does whole record parsing on demand)
>> and extending this with lazy-loading or lazy-parsing (which
>> has precedent in our BioSQL wrappers). For example, with
>> whole genome FASTA files you may never need to load the
>> entire sequence, but using an index system like tabix (or
>> even actually using a tabix index) Biopython could provide
>> a lazy-loading Seq object which extracts only the sequence
>> region of interest on demand.
> This sounds incredibly useful. It's definitely worthwhile writing up if
> you'll have time this summer to mentor it.

Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for
accessing data in annotation-like file formats would be fantastic.
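
To make the lazy-loading idea concrete, here is a minimal sketch of a
faidx-style offset index over a FASTA file. It is not how Bio.SeqIO.index
or tabix work internally; the names build_index and LazySeq are
hypothetical, and it assumes every sequence line within a record has the
same length (except possibly the last). Only the bytes covering the
requested region are read from disk.

```python
def build_index(path):
    """Scan a FASTA file once and record where each sequence starts.

    Assumption (hypothetical layout, similar in spirit to a .fai index):
    all sequence lines in a record share one length, except the last.
    """
    index = {}
    with open(path, "rb") as handle:
        name = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record ID is the first word of the header line.
                name = line[1:].split()[0].decode()
                index[name] = {"offset": handle.tell(), "length": 0,
                               "bases_per_line": None,
                               "bytes_per_line": None}
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                if bases == 0:
                    continue  # ignore blank lines
                entry = index[name]
                if entry["bases_per_line"] is None:
                    entry["bases_per_line"] = bases
                    entry["bytes_per_line"] = len(line)
                entry["length"] += bases
    return index


class LazySeq:
    """Sequence-like object that reads only the requested slice from disk."""

    def __init__(self, path, entry):
        self._path = path
        self._entry = entry

    def __len__(self):
        return self._entry["length"]

    def __getitem__(self, region):
        if not isinstance(region, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = region.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports a step of 1")
        if stop <= start:
            return ""
        e = self._entry
        bpl, byl = e["bases_per_line"], e["bytes_per_line"]
        # Translate base coordinates into byte positions in the file.
        first = e["offset"] + (start // bpl) * byl + start % bpl
        last = e["offset"] + (stop // bpl) * byl + stop % bpl
        with open(self._path, "rb") as handle:
            handle.seek(first)
            chunk = handle.read(last - first)
        # Drop the line breaks that fall inside the region.
        return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With something like this, LazySeq(path, index["chr1"])[1000:1050] would
touch only a few dozen bytes of a whole-genome file rather than loading
the full chromosome.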

>> Likewise, this makes sense for GTF/GFF/GFF3 where you
>> would index the features, and also if present index the
>> embedded FASTA sequence at the end of the file.
> I'm cc'ing Ryan, who has been thinking about similar work as part of
> gffutils. We're planning now on an approach that takes the BCBio.GFF
> parsing and rolls it into gffutils so we can parse, index in a SQLite
> database and expose as Biopython objects. Here is some initial
> discussion and planning:
>
> https://github.com/daler/gffutils/issues/2
> https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing

As Peter pointed out on the GitHub issues page, what he has in mind is 
more general than just GFF/GTF, and I see gffutils as extending upon a 
specific subset of the functionality he proposes.

For example, there are common use-cases that I think make sense for a 
GFF/GTF-only library (say, adding new annotations for introns, as 
inferred from the isoform + exon annotations) that might not be readily 
generalizable to all annotation-like file formats. But if this general 
indexing approach were already available, then gffutils could just be a 
wrapper around that, adding the specific GFF/GTF functionality as 
another layer.

Then again . . . currently gffutils imports GFF data into a sqlite3 
database, so data are persistent and both read/write.  For the 
intron-inferring example, we simply add new records to the db, but with 
an indexing approach, the file would presumably have to be re-indexed 
before reading again.  So how you'd like to use your GFF files
(read-only vs read/write) would influence which strategy you'd choose.
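
As a rough illustration of the read/write case, here is a sketch using
only the standard-library sqlite3 module. The table layout and the
function names (create_db, add_introns) are hypothetical and are not
gffutils' actual schema or API; coordinates are 1-based and inclusive,
as in GFF. Inferred introns are simply inserted as new rows, so they
persist without any re-indexing step.

```python
import sqlite3


def create_db(path=":memory:"):
    """Create a toy feature table (hypothetical schema, not gffutils')."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "seqid TEXT, featuretype TEXT, start_pos INTEGER, "
        "end_pos INTEGER, strand TEXT, parent TEXT)"
    )
    return conn


def add_introns(conn, transcript_id):
    """Insert intron rows for the gaps between a transcript's exons."""
    exons = conn.execute(
        "SELECT seqid, start_pos, end_pos, strand FROM features "
        "WHERE featuretype = 'exon' AND parent = ? ORDER BY start_pos",
        (transcript_id,),
    ).fetchall()
    introns = []
    # Walk over consecutive exon pairs; the gap between them is an intron.
    for (seqid, _, prev_end, strand), (_, next_start, _, _) in zip(
        exons, exons[1:]
    ):
        if next_start - prev_end > 1:
            introns.append(
                (seqid, "intron", prev_end + 1, next_start - 1,
                 strand, transcript_id)
            )
    conn.executemany(
        "INSERT INTO features "
        "(seqid, featuretype, start_pos, end_pos, strand, parent) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        introns,
    )
    conn.commit()
    return len(introns)
```

With a flat-file index, the equivalent step would mean appending the new
records to the file and then rebuilding the index before the next read.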

So I think there's actually smaller-than-expected overlap between 
gffutils and Peter's general indexing idea, and in the context of GSoC, 
I'm not sure you'd have to take gffutils into account.  But gffutils 
would certainly benefit from general indexing, especially when 
retrieving sequences for features!

-ryan