[Biopython-dev] Project ideas for GSoC (or other student projects)
Ryan Dale
dalerr at niddk.nih.gov
Fri Mar 22 16:20:45 UTC 2013
Hi Brad & Peter -
On 03/22/2013 08:48 AM, Brad Chapman wrote:
> Peter;
>
>> I've been wondering about potential GSoC projects which I'd
>> be interested in mentoring (or co-mentoring), and thus far I've
>> only got one outline idea.
>>
>> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
>> functionality (which does whole record parsing on demand)
>> and extending this with lazy-loading or lazy-parsing (which
>> has precedent in our BioSQL wrappers). For example, with
>> whole genome FASTA files you may never need to load the
>> entire sequence, but using an index system like tabix (or
>> even actually using a tabix index) Biopython could provide
>> a lazy-loading Seq object which extracts only the sequence
>> region of interest on demand.
> This sounds incredibly useful. It's definitely worthwhile writing up if
> you'll have time this summer to mentor it.
Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for
accessing data in annotation-like file formats would be fantastic.
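
To make the lazy-loading idea concrete, here is a minimal sketch of region-on-demand access to a FASTA file. It is not Biopython's API and not tabix; it uses a simple offset index in the spirit of a samtools-style .fai file. The names build_index and fetch are hypothetical, and the sketch assumes every sequence line within a record (except the last) has the same width, with 0-based, end-exclusive, non-negative coordinates.

```python
def build_index(path):
    """Scan the FASTA file once and record, per sequence name:
    [sequence length, byte offset of first base, bases per line, bytes per line].
    Assumes uniform line width within each record (hypothetical helper)."""
    index = {}
    name = None
    with open(path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record name is the first word of the header line
                name = line[1:].split()[0].decode()
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # First sequence line defines the line geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of one record, reading only the bytes needed."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""

    def byte_position(pos):
        # Translate a base coordinate into a file offset, skipping line endings
        return offset + (pos // bases_per_line) * bytes_per_line + pos % bases_per_line

    first, last = byte_position(start), byte_position(end)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # Drop the line endings that fall inside the requested slice
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

A lazy Seq-like object could wrap fetch behind slicing, so the whole chromosome is never held in memory.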
>> Likewise, this makes sense for GTF/GFF/GFF3 where you
>> would index the features, and also if present index the
>> embedded FASTA sequence at the end of the file.
> I'm cc'ing Ryan, who has been thinking about similar work as part of
> gffutils. We're planning now on an approach that takes the BCBio.GFF
> parsing and rolls it into gffutils so we can parse, index in a SQLite
> database and expose as Biopython objects. Here is some initial
> discussion and planning:
>
> https://github.com/daler/gffutils/issues/2
> https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing
As Peter pointed out on the GitHub issues page, what he has in mind is
more general than just GFF/GTF, and I see gffutils as building on a
specific subset of the functionality he proposes.
For example, there are common use-cases that I think make sense for a
GFF/GTF-only library (say, adding new annotations for introns, as
inferred from the isoform + exon annotations) that might not be readily
generalizable to all annotation-like file formats. But if this general
indexing approach were already available, then gffutils could just be a
wrapper around that, adding the specific GFF/GTF functionality as
another layer.
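
As an illustration of the kind of GFF/GTF-specific operation meant here, the following is a minimal sketch of inferring introns from one transcript's exons. It is not the gffutils implementation; infer_introns is a hypothetical name, and it assumes GFF-style 1-based inclusive coordinates with all exons belonging to a single transcript on one sequence.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons as (start, end) intervals.

    exons: iterable of (start, end) tuples, 1-based and inclusive,
    all from the same transcript (an assumption of this sketch).
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when at least one base separates the exons
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns
```

For example, exons at 100-200, 300-400 and 500-600 would yield introns at 201-299 and 401-499.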
Then again . . . currently gffutils imports GFF data into a sqlite3
database, so data are persistent and both read/write. For the
intron-inferring example, we simply add new records to the db, but with
an indexing approach, the file would presumably have to be re-indexed
before reading again. So how you'd like to use your GFF files
(read-only vs read/write) would influence which strategy you'd choose.
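
The read/write side of that trade-off can be sketched with the standard sqlite3 module. This is only an illustration under assumed names: the table layout and the helpers make_db, add_features and features_of_type are hypothetical and do not reflect the actual gffutils schema or API.

```python
import sqlite3


def make_db(path=":memory:"):
    """Open (or create) a database with a single, simplified features table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "seqid TEXT, featuretype TEXT, "
        "feat_start INTEGER, feat_end INTEGER, parent TEXT)"
    )
    return conn


def add_features(conn, rows):
    """Insert (seqid, featuretype, start, end, parent) rows in one transaction."""
    with conn:
        conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?)", rows)


def features_of_type(conn, featuretype):
    """Return (seqid, start, end, parent) rows of one feature type, by start."""
    cursor = conn.execute(
        "SELECT seqid, feat_start, feat_end, parent FROM features "
        "WHERE featuretype = ? ORDER BY feat_start",
        (featuretype,),
    )
    return cursor.fetchall()
```

Derived records such as inferred introns can be added later with another add_features call and queried immediately, with no re-indexing step. With a file path instead of ":memory:", they also persist across sessions.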
So I think there's actually smaller-than-expected overlap between
gffutils and Peter's general indexing idea, and in the context of GSoC,
I'm not sure you'd have to take gffutils into account. But gffutils
would certainly benefit from general indexing, especially when
retrieving sequences for features!
-ryan