[Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many

Peter biopython at maubp.freeserve.co.uk
Tue Nov 30 18:24:35 EST 2010


Hi all,

You may recall some previous discussion about extending the
Bio.SeqIO.index functionality. I'm particularly interested in
keeping the index on disk to reduce the memory overhead
and thus support NGS files with many millions of reads. e.g.

http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006713.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006716.html

I'd also like to index multiple files (e.g. a folder of GenBank
files for different chromosomes), functionality we used to
have with the OBDA style index (using BDB or a flat file)
and Martel/Mindy (deprecated and removed some time ago
due to problems with 3rd party libraries, scaling problems
when parsing, and ultimately no one familiar enough with
the code to try and fix it). See also:

http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006704.html

I've been working on the following idea on branches on
GitHub, and have something workable using SQLite3 to
store a table of record identifiers, file offsets, and file
numbers (for when we have multiple files indexed together).
Following the OBDA standard, I extended this to
also (optionally) store the record length on disk.
This allows the get_raw method to be much faster,
but may not be possible on all file formats.

[Currently I get the length when building the index
on all supported file formats except SFF. Here we
normally use the Roche index, and that doesn't
have the raw record lengths.]
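As a rough sketch of what such an index looks like on disk, something along these lines (the table and column names below are my illustration, not necessarily what the branch actually uses):

```python
import sqlite3

# Illustrative schema only - the real index would live in
# index_filename on disk, not in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE offset_data (key TEXT PRIMARY KEY, "
            "file_number INTEGER, offset INTEGER, length INTEGER)")

# One row per record: identifier, which file it lives in, the byte
# offset of the record, and (optionally) the raw record length.
con.execute("INSERT INTO offset_data VALUES (?, ?, ?, ?)",
            ("read_001", 0, 0, 120))
con.execute("INSERT INTO offset_data VALUES (?, ?, ?, ?)",
            ("read_002", 0, 120, 95))
con.commit()

# Lookup by identifier - roughly what __getitem__ or get_raw needs:
row = con.execute("SELECT file_number, offset, length FROM offset_data "
                  "WHERE key=?", ("read_002",)).fetchone()
print(row)  # (0, 120, 95)
```

With the length column populated, get_raw only needs a single seek plus one read of that many bytes, rather than re-parsing to find the end of the record.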

Note that using SQLite seems sensible to me as
it is included with Python 2.5+ (including Python 3),
while BDB, the other candidate from the standard
library, has been deprecated.

The current API is as follows, a new function:

def index_many(index_filename, filenames=None,
               format=None, alphabet=None,
               key_function=None)

This is similar to the existing index function, although
here the key_function must return a string for use as
the key in the SQLite database.

The idea is that you call index_many to build a new
index (if the index file does not exist) or reload an
existing index (if the index file does exist). If you
are reloading an existing index, you can omit the
filenames and format.

The index_many function returns a read-only dictionary-
like object - very much like the existing index function.
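The retrieval behind that dictionary-like object boils down to a seek-and-read on the right file. Here is a minimal stdlib sketch of the mechanism only (in-memory handles and a plain dict standing in for the SQLite table - not the branch's actual code):

```python
import io

# Two fake sequence "files" (in-memory for the sketch; real code
# would keep an open handle per indexed file).
files = [io.BytesIO(b">chr1\nACGT\n>chr2\nTTTT\n"),
         io.BytesIO(b">plasmid\nGGCC\n")]

# Stand-in for the SQLite table: key -> (file_number, offset, length)
index = {"chr1": (0, 0, 11),
         "chr2": (0, 11, 11),
         "plasmid": (1, 0, 14)}

def get_raw(key):
    # With the record length stored, this is one seek and one read.
    file_number, offset, length = index[key]
    handle = files[file_number]
    handle.seek(offset)
    return handle.read(length)

print(get_raw("chr2"))  # b'>chr2\nTTTT\n'
```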

Although not (currently) exposed by this API, the code
allows a configurable limit on the number of handles
(since these are a finite resource limited by the OS).
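A handle cap like that can be done with a simple least-recently-used scheme; the following is a hedged sketch of one way to do it (class and method names are mine, not the branch's):

```python
from collections import OrderedDict

class HandlePool:
    """Keep at most max_open file handles open, closing the least
    recently used handle when the limit is reached."""

    def __init__(self, filenames, max_open=10):
        self.filenames = filenames
        self.max_open = max_open
        self._open = OrderedDict()  # file_number -> open handle

    def get(self, file_number):
        if file_number in self._open:
            # Mark as most recently used and reuse the handle.
            self._open.move_to_end(file_number)
            return self._open[file_number]
        if len(self._open) >= self.max_open:
            # Evict the least recently used handle.
            _, old_handle = self._open.popitem(last=False)
            old_handle.close()
        handle = open(self.filenames[file_number], "rb")
        self._open[file_number] = handle
        return handle
```

The dictionary-like object would then ask the pool for the right handle before seeking, so indexing a folder of hundreds of files never exceeds the OS limit on open descriptors.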

I've put a branch up for comment:
https://github.com/peterjc/biopython/tree/index-many

I hope the docstring text and embedded doctest
examples are clear. You can read them here:
https://github.com/peterjc/biopython/blob/index-many/Bio/SeqIO/__init__.py

What do people think?

One thing I haven't done yet (any volunteers?) is any
benchmarking - for example, comparing the index
build and retrieval times for some large files using
Biopython 1.55 (recent baseline), Biopython 1.56
(should be faster on retrieval) and the branch, to
check for any regressions in Bio.SeqIO.index(), and
to compare this with Bio.SeqIO.index_many(), which,
being disk based, will be slower but require much less RAM.

Peter

P.S. This was based on the following branch, which
proved non-trivial to merge since in the meantime I'd
made separate tweaks to the index code on the trunk:
https://github.com/peterjc/biopython/tree/index-many-length

I didn't propose merging this back then because it
absolutely requires SQLite, and thus Python 2.5+,
and we wanted Biopython 1.56 to support Python 2.4.

