[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Peter biopython at maubp.freeserve.co.uk
Mon Aug 31 12:42:52 UTC 2009


On Thu, Aug 20, 2009 at 12:28 PM, Peter wrote:
> Hi all,
>
> You may recall a thread back in June with Cedar Mckay (cc'd - not
> sure if he follows the dev list or not) about indexing large sequence
> files - specifically FASTA files but any sequential file format. I posted
> some rough code which did this building on Bio.SeqIO:
> http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html

The Bio.SeqIO.indexed_dict() functionality is in CVS/github now,
and I would like some wider testing. My earlier email explained
the implementation approach and gave some example code:
http://lists.open-bio.org/pipermail/biopython-dev/2009-August/006654.html

This aims to solve a fairly narrow problem - dictionary-like
random access to any record in a sequence file as a SeqRecord,
using the record id string as the key. It should work on any
sequential file format, and can even work on binary SFF files
(that code is still on a branch on github).
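
For anyone who wants to try it, usage is along these lines (the
file name and record id here are made up for illustration - see
the earlier email for real examples):

from Bio import SeqIO
# Build the dictionary-like index; only the keys and their file
# offsets are held in memory, the records stay on disk:
reads = SeqIO.indexed_dict("example.fasta", "fasta")
# Random access by record id parses just that one record:
record = reads["alpha"]
print(record.id)
print(len(record.seq))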

Bio.SeqIO.to_dict() has always offered a very simple in-memory
solution (a Python dictionary of SeqRecord objects) which is fine
for small files (e.g. a few thousand FASTA entries), but won't
scale much beyond that.
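
For comparison, the in-memory approach is just (same hypothetical
file as above):

from Bio import SeqIO
# Parse everything up front; all the SeqRecord objects end up in
# an ordinary Python dictionary in memory:
handle = open("example.fasta")
records = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
record = records["alpha"]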

Using a BioSQL database would also allow random access to any
SeqRecord (and not only by looking it up via the identifier), but
I doubt it would scale well to 10s of millions of short read
sequences. It is also non-trivial to install the database server
itself, the schema, and the Python bindings.
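
For reference, once everything is installed, BioSQL access looks
roughly like this (the connection details and database name are
of course made up):

from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver="MySQLdb",
                                      user="me", passwd="secret",
                                      host="localhost",
                                      db="bioseqdb")
db = server["reads"]
# Lookups are not limited to one key type, e.g. by display id:
record = db.lookup(display_id="alpha")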

The new Bio.SeqIO.indexed_dict() code offers a read-only
dictionary interface which does work for millions of reads. As
implemented, there is still a memory bound, since all the keys
and their associated file offsets are held in memory. For
example, a 7 million record FASTQ file taking 1.3GB on disk
seems to need almost 700MB in memory (just a very crude
measurement). Although this is clearly much more capable than
the naive full-dictionary-in-memory approach (which is out of
the question here), it too could become a real bottleneck
before long.
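
To make that concrete, for a FASTA file the index amounts to
something like this (a simplified sketch, not the actual
implementation - the file name and key are made up):

from Bio import SeqIO

offsets = {}
handle = open("example.fasta")
while True:
    offset = handle.tell()
    line = handle.readline()
    if not line:
        break  # end of file
    if line.startswith(">"):
        # Default key: the first word of the ">" title line
        offsets[line[1:].split(None, 1)[0]] = offset

# Random access: seek back and parse one record on demand
handle.seek(offsets["alpha"])
record = next(SeqIO.parse(handle, "fasta"))

The offsets dictionary is all that has to stay in memory, which
is why the keys and offsets dominate the memory footprint.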

Biopython's old Martel/Mindy code used to build an on-disk
index, which avoided this memory constraint. However, we've
removed that (due to mxTextTools breakage etc.). In any case, it
was also much, much slower:
http://lists.open-bio.org/pipermail/biopython/2009-June/005309.html

Using a Bio.SeqIO.indexed_dict() style API, we could of course
build an index file on disk to avoid this potential memory
problem. As Cedar suggested, this index file could be handled
transparently (created and deleted automatically), or indeed
could be explicitly persisted/reloaded to avoid re-indexing
unnecessarily:
http://lists.open-bio.org/pipermail/biopython/2009-June/005265.html
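
Continuing the sketch above, persisting that key to offset table
could be as simple as this, using the standard library's shelve
module (just one possible back end, and the file names are again
made up):

import shelve
from Bio import SeqIO

# Save the offsets dictionary built earlier to disk:
index = shelve.open("example.fasta.idx")
index.update(offsets)
index.close()

# On a later run, reload it and skip the re-indexing step:
index = shelve.open("example.fasta.idx")
handle = open("example.fasta")
handle.seek(index["alpha"])
record = next(SeqIO.parse(handle, "fasta"))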

Sticking to the narrow use case of (read-only) random access to
a sequence file, all we really need to store is the lookup table
of keys (or their Python hashes) and offsets into the original
file. If fast enough, we might even be able to reuse the old
Martel/Mindy index file format... or the OBDA specification, if
that is still in use:
http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html

Another option (like the shelve idea we talked about last month)
is to parse the sequence file with SeqIO, and serialise all the
SeqRecord objects to disk, e.g. with pickle or some key/value
database. This is potentially very complex (e.g. arbitrary Python
objects in the annotation), and could lead to a very large "index"
file on disk. On the other hand, some possible back ends would
allow editing the database... which could be very useful.
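
A sketch of that idea, again with shelve (which pickles the
values behind the scenes - note there is no guarantee that every
annotation object will pickle cleanly, and the file name is
made up):

import shelve
from Bio import SeqIO

db = shelve.open("example.fasta.db")
for record in SeqIO.parse(open("example.fasta"), "fasta"):
    db[record.id] = record
# Unlike the offset index, the records themselves can be
# modified and stored back:
record = db["alpha"]
record.description = "edited"
db["alpha"] = record
db.close()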

Brad - do you have any thoughts? I know you did some work
with key/value indexers:
http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/

Peter


