[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Brad Chapman chapmanb at 50mail.com
Tue Sep 1 09:06:39 EDT 2009


Hi Peter;

[indexed dict usage]
> What file formats where you working on, and how many records?

It was a 100Mb fasta file with about 41,000 records. Nothing too
heavy but it worked great. The only change I made was to generalize
the record building line:
                
self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)

to allow an arbitrary function to be passed to define the
identifier, instead of defaulting to the first part of the line.
This is helpful for those fun NCBI ids
(gi|83029091|ref|XM_357633.3|) where other parts of the program only
have the accession number.

> True. Have got any bright ideas for a better name? While the
> index is in memory, the SeqRecord objects are not (unlike the
> original Bio.SeqIO.to_dict() function).
> 
> Or we have one function Bio.SeqIO.indexed_dict() which can
> either use an in memory index, OR an on disk index, offering
> the same functionality.

That's a nice idea -- provide some reasonable defaults based on file
size and type, and allow them to be over-ridden with function
params.

> >> Another option (like the shelve idea we talked about last month)
> >> is to parse the sequence file with SeqIO, and serialise all the
> >> SeqRecord objects to disk, e.g. with pickle or some key/value
> >> database. This is potentially very complex (e.g. arbitrary Python
> >> objects in the annotation), and could lead to a very large "index"
> >> file on disk. On the other hand, some possible back ends would
> >> allow editing the database... which could be very useful.
> >
> > My thought here was to use BioSQL and the SQLite mappings for
> > serializing. We build off a tested and existing serialization, and
> > also guide people into using BioSQL for larger projects.
> > Essentially, we would build an API on top of existing BioSQL
> > functionality that creates the index by loading the SQL and then
> > pushes the parsed records into it.
> 
> Using BioSQL in this way is a much more general tool than
> simply "indexing a sequence file". It feels like a sledgehammer
> to crack a nut. Also, do you expect it to scale well for 10 million
> plus short reads? It may do, but on the other hand it may not.

Agreed that it would introduce extra overhead for something like
short reads. If you are talking about serializing SeqRecords, it
would make sense to re-use what we have in BioSQL. If you are
talking about storing just file offsets, then a lightweight solution
makes more sense.

For me, the initial parse time to prepare an index is not as much of an
issue since it happens once while queries on it will happen multiple
times.

> Also while the current BioSQL mappings are "tried and tested",
> they don't cover everything, in particular per-letter-annotation
> such as a set of quality scores (something that needs addressing
> anyway, probably with JSON or XML serialisation).

Agreed, but the advantage is that improvements can feed back into
BioSQL, instead of work in parallel.

> All the above make me lean towards a less ambitious target
> (read only dictionary access to a sequence file), which just
> requires having an (on disk) index of file offsets (which could
> be done with SQLite or anything else suitable). This choice
> could even be done on the fly at run time (e.g. we look at the
> size of the file to decide if we should use an in memory index
> or on disk - or start out in memory and if the number of records
> gets too big, switch to on disk).

That makes sense. SQLite has in-memory caching which could help with
some of the decision making as it would handle writing and holding
in memory without having to reimplement that bit. Another file based
indexing scheme is the one in bx-python:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/interval_index_file.py

This is a bit more specific as it also handles queries based on
genomic intervals in addition to retrieving by file position. It may
be useful for looking at the underlying storage details.

Brad


More information about the Biopython-dev mailing list