[Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many

Eric Talevich eric.talevich at gmail.com
Tue Dec 7 15:40:09 UTC 2010


On Tue, Dec 7, 2010 at 10:11 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Dec 7, 2010 at 1:59 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > Peter;
> >
> >> You may recall some previous discussion about extending the
> >> Bio.SeqIO.index functionality. I'm particularly interested in
> >> keeping the index on disk to reduce the memory overhead
> >> and thus support NGS files with many millions of reads. e.g.
> > [...]
> >> I've been working on the following idea on branches in github,
> >> and have something workable using SQLite3 to store a
> >> table of record identifiers, file offset, and file number
> >> (for where we have multiple files indexed together).
> > [...]
> >> https://github.com/peterjc/biopython/tree/index-many
> >
> > This is great and definitely needed. The implementation
> > looks nice and fits with the current index functionality,
> > and SQLite definitely seems like the right choice.
> > So a big +1 on all of this.
> >
> > My only suggestion would be the naming: index_file makes it a little
> > clearer about the intentions, instead of index_many (the best
> > naming would be 'index' for this functionality and 'index_memory' for
> > the in-memory indexing, but the ship has probably sailed on that).
>
> Yes, we've already used "index" for the in-memory index, and
> its API doesn't lend itself to being extended in this way. So too
> late now.
>
> What do you think of index_files (plural) rather than index_file?
>

How about index_db or index_sqlite? The fact that it uses a SQLite database
for storage seems significant enough to be noted in the name.
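
For illustration only, the kind of table Peter describes (record identifier,
file offset, file number) could be sketched with Python's standard sqlite3
module roughly as below. This is not the code on the index-many branch: the
table name "offsets", the helper names build_index and fetch_record, and the
FASTA-only parsing are all assumptions made for the example.

```python
import sqlite3


def build_index(db_path, fasta_paths):
    """Record (id, file number, byte offset) for every FASTA record.

    Hypothetical helper: only the offsets go into SQLite, so memory use
    stays small however many records the files hold.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER)"
    )
    for file_number, path in enumerate(fasta_paths):
        with open(path, "rb") as handle:
            while True:
                offset = handle.tell()  # position of the line about to be read
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    fields = line[1:].split(None, 1)
                    if fields:  # skip header lines with no identifier
                        con.execute(
                            "INSERT INTO offsets VALUES (?, ?, ?)",
                            (fields[0].decode(), file_number, offset),
                        )
    con.commit()
    return con


def fetch_record(con, fasta_paths, key):
    """Look up one record's raw text by seeking to its stored offset."""
    row = con.execute(
        "SELECT file_number, offset FROM offsets WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    file_number, offset = row
    with open(fasta_paths[file_number], "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]  # the ">" header line
        for line in handle:
            if line.startswith(b">"):  # start of the next record
                break
            lines.append(line)
    return b"".join(lines).decode()
```

Passing a real filename as db_path keeps the index on disk between runs;
":memory:" is handy for a quick test. Duplicate identifiers across files
would violate the primary key here, which a real implementation would need
to report more helpfully.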

Thanks for adding this feature, it will be very useful!

-Eric


