[Biopython-dev] Indexing (large) sequence files with Bio.SeqIO

Peter biopython at maubp.freeserve.co.uk
Thu Aug 20 11:28:46 UTC 2009


Hi all,

You may recall a thread back in June with Cedar Mckay (cc'd - not
sure if he follows the dev list or not) about indexing large sequence
files - specifically FASTA files but any sequential file format. I posted
some rough code which did this building on Bio.SeqIO:
http://lists.open-bio.org/pipermail/biopython/2009-June/005281.html

I have since generalised this, and have something which I think
would be ready for merging into the main trunk for wider testing.
The code is on github on my "index" branch at the moment,
http://github.com/peterjc/biopython/commits/index

This would add a new function to Bio.SeqIO, provisionally called
indexed_dict, which takes two arguments: filename and format
name (e.g. "fasta"), plus an optional alphabet. This will return a
dictionary like object, using SeqRecord identifiers as keys, and
SeqRecord objects as values. There is (deliberately) no way to
allow the user to choose a different keying mechanism (although
I can see how to do this at a severe performance cost).
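For anyone who really does need alternate keys, one workaround
would be a thin read-only wrapper over the returned dictionary,
built at the cost of one extra pass over every key - which is
exactly the performance cost mentioned above. A hypothetical
sketch (KeyedView is my own name, not anything in Biopython):

```python
# Hypothetical sketch: re-key an existing dict-like object, e.g.
# to look up FASTA records by accession without the version suffix.

class KeyedView(object):
    """Read-only view of a mapping under a user-chosen key function."""

    def __init__(self, mapping, key_function):
        self._mapping = mapping
        # Building this table touches every key up front - that is
        # the performance cost of supporting alternate keys.
        self._key_map = dict((key_function(k), k) for k in mapping)

    def __getitem__(self, key):
        return self._mapping[self._key_map[key]]

    def __len__(self):
        return len(self._key_map)

    def __contains__(self, key):
        return key in self._key_map
```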

As with the Bio.SeqIO.convert() function, the new
Bio.SeqIO.indexed_dict() will be the only public API change.
Everything else is deliberately private, allowing us the freedom
to change details if required in future.

The essential idea is the same as in my post in June. Nothing about
the existing SeqIO framework is changed (so this won't break
anything). We scan through each file looking for new records,
noting the file offset of each and extracting the identifier string.
These are stored
in a normal (private) Python dictionary. On requesting a record, we
seek to the appropriate offset and parse the data into a SeqRecord.
For simple file formats we can do this by calling Bio.SeqIO.parse().
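To make the scan-and-seek idea concrete, here is a minimal
standalone sketch for FASTA (function names are mine for
illustration; the real code builds on the SeqIO framework and
returns SeqRecord objects rather than tuples):

```python
# Illustrative sketch of offset-based indexing for FASTA files.

def fasta_index(filename):
    """Scan the file once, mapping each record id to its byte offset."""
    offsets = {}
    with open(filename) as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                # The first word after ">" is the identifier.
                offsets[line[1:].split(None, 1)[0]] = offset
    return offsets

def get_record(filename, offsets, key):
    """Seek to the stored offset and parse just that one record."""
    with open(filename) as handle:
        handle.seek(offsets[key])
        title = handle.readline()[1:].rstrip()
        seq_lines = []
        for line in handle:
            if line.startswith(">"):
                break
            seq_lines.append(line.strip())
    return title, "".join(seq_lines)
```

Only the small offsets dictionary lives in memory; the sequences
stay on disk until requested.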

For complex file formats (such as SFF files, or anything else with
important information in a header), the implementation is a little
more complicated - but we can provide the same API to the user.

Note that the indexing step does not fully parse the file, and
thus may ignore corrupt/invalid records. Only when (if) they are
accessed will this trigger a parser error. This is a shame, but
means the indexing can (in general) be done very fast.
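A toy illustration of that trade-off (my own throwaway functions,
working on an in-memory string rather than a file): the indexing
pass only looks for record headers, so a corrupt record is only
noticed when parsed on access.

```python
# Illustrative sketch: indexing skips validation, parsing enforces it.

def index_offsets(text):
    """Record the start offset of each '>' line; no validation here."""
    offsets = {}
    pos = 0
    for line in text.splitlines(True):
        if line.startswith(">"):
            offsets[line[1:].split()[0]] = pos
        pos += len(line)
    return offsets

def parse_at(text, offset):
    """Parse one record strictly, rejecting non-nucleotide characters."""
    lines = text[offset:].split("\n")
    seq = ""
    for line in lines[1:]:
        if line.startswith(">"):
            break
        if not set(line) <= set("ACGTN"):
            raise ValueError("corrupt sequence line: %r" % line)
        seq += line
    return seq
```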

I am proposing to merge all of this (except the SFF file support),
but would welcome feedback (even after the merge). I already
have basic unit tests, covering the following SeqIO file formats:
"ace", "embl", "fasta", "fastq" (all three variants), "genbank"/"gb",
"ig", "phd", "pir", and "swiss" (plus "sff" but I don't think that
parser is ready to be checked in yet).

As an example using the new code, indexing this 238MB GenBank
file takes just a few seconds, and record access is almost
instant:

>>> from Bio import SeqIO
>>> gb_dict = SeqIO.indexed_dict("gbpln1.seq", "gb")
>>> len(gb_dict)
59918
>>> gb_dict.keys()[:5]
['AB246540.1', 'AB038764.1', 'AB197776.1', 'AB036027.1', 'AB161026.1']
>>> record = gb_dict["AB433451.1"]
>>> print record.id, len(record), len(record.features)
AB433451.1 590 2

And using a 1.3GB FASTQ file, indexing takes about a minute, and
again, record access is almost instant:

>>> from Bio import SeqIO
>>> fq_dict = SeqIO.indexed_dict("SRR001666_1.fastq", "fastq")
>>> len(fq_dict)
7047668
>>> fq_dict.keys()[:4]
['SRR001666.2320093', 'SRR001666.2320092', 'SRR001666.1250635',
'SRR001666.2354360']
>>> record = fq_dict["SRR001666.2765432"]
>>> print record.id, record.seq
SRR001666.2765432 CTGGCGGCGGTGCTGGAAGGACTGACCCGCGGCATC

Peter
