[Biopython-dev] [BioPython] poor man's databases for large sequence files

Peter biopython-dev at maubp.freeserve.co.uk
Tue Sep 25 14:48:37 EDT 2007


I wrote:
>> What I had in mind was say indexing all of UniProt which is currently
>> 1.1 GB in the SwissProt flat file format, but each record is pretty small.

I have written some experimental code to store SeqRecord objects using 
pickle (optionally compressed with zlib), and tried it on the 283454 
UniProt records available here (in both FASTA and SwissProt flat file 
formats):

ftp://ftp.uniprot.org/pub/databases/uniprot_datafiles_by_format/fasta/uniprot_sprot.fasta.gz
ftp://ftp.uniprot.org/pub/databases/uniprot_datafiles_by_format/flatfile/uniprot_sprot.dat.gz

FASTA file, "uniprot_sprot.fasta", 125 MB:
* my pickled SeqRecord database needs about 230 MB (two files),
   takes about 30s to build the index, 1s to load it
* my zlib-pickled SeqRecord database needs about 147 MB (two files),
   takes about 75s to build the index, 2s to load it
* existing Bio.Fasta index using Mindy needs 73 MB (four files),
   takes about 90s to build the index, 2s to load it

SwissProt file, "uniprot_sprot.dat", 1.1 GB:
* my pickled SeqRecord database needs about 550 MB (two files),
   takes about 7min to build the index, 1s to load it
* my zlib-pickled SeqRecord database needs about 295 MB (two files),
   takes about 8min to build the index, 3s to load it
* existing Bio.SwissProt.SProt index needs only 35 MB (one file),
   takes about 7.5min to build the index, 16s to load it

Note that just parsing the big SwissProt format file takes about 6min, 
so indexing it adds only a comparatively modest overhead.

In all cases, once the index has been built and loaded, accessing 
records by key is almost instantaneous.

In terms of lookup time, my experimental (zlib) pickled read-only 
dictionary is comparable to the existing Biopython functionality: both 
return a record in under a second.

However, is the overhead of the bigger index files too much?  Going by 
the figures above, my databases need between about twice and sixteen 
times the space of the old format-specific indexes (two to eight times 
with zlib).  Comments?
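
For anyone who wants to check the ratios, here is a quick calculation 
using only the index sizes reported above.  The variable names are just 
illustrative.

```python
# Index sizes in MB, copied from the figures reported in this message.
sizes_mb = {
    "fasta": {"pickle": 230, "zlib-pickle": 147, "existing": 73},
    "swissprot": {"pickle": 550, "zlib-pickle": 295, "existing": 35},
}

# Size of each pickled database relative to the existing
# format-specific index for the same input file.
ratios = {
    (name, kind): round(sizes[kind] / sizes["existing"], 1)
    for name, sizes in sizes_mb.items()
    for kind in ("pickle", "zlib-pickle")
}

for (name, kind), ratio in ratios.items():
    print(f"{name:10s} {kind:12s} {ratio:4.1f}x the existing index")
```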

The reason my indexes are big is that I am storing complete records, 
not just their positions within the original file.  The motivation is 
that this will work with any file format (regardless of the parser), or 
even with any collection of records.
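
To make the two-file idea concrete, here is a minimal sketch.  It is not 
the experimental code itself: build_store and RecordStore are 
hypothetical names, and the layout (one data file of pickled records 
plus one pickled key-to-offset index) is an assumption about how such a 
store could be arranged.  Any picklable object, such as a SeqRecord, 
could be stored this way.

```python
import pickle
import zlib


def build_store(records, data_path, index_path, compress=True):
    """Write (key, record) pairs to a data file plus a separate index file.

    Hypothetical helper: each record is pickled (and optionally
    zlib-compressed) in full, so no parser is needed at lookup time.
    """
    index = {}
    with open(data_path, "wb") as data:
        for key, record in records:
            blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
            if compress:
                blob = zlib.compress(blob)
            # Remember where this record's bytes start and how long they are.
            index[key] = (data.tell(), len(blob))
            data.write(blob)
    with open(index_path, "wb") as handle:
        pickle.dump((compress, index), handle, protocol=pickle.HIGHEST_PROTOCOL)


class RecordStore:
    """Read-only, dictionary-like access to a store made by build_store."""

    def __init__(self, data_path, index_path):
        # Loading the index is the only up-front cost; records stay on disk.
        with open(index_path, "rb") as handle:
            self._compressed, self._index = pickle.load(handle)
        self._data = open(data_path, "rb")

    def __getitem__(self, key):
        offset, length = self._index[key]  # raises KeyError if key is absent
        self._data.seek(offset)
        blob = self._data.read(length)
        if self._compressed:
            blob = zlib.decompress(blob)
        return pickle.loads(blob)

    def __contains__(self, key):
        return key in self._index

    def __len__(self):
        return len(self._index)

    def keys(self):
        return self._index.keys()

    def close(self):
        self._data.close()
```

With Biopython installed, the records argument could be something like 
((rec.id, rec) for rec in SeqIO.parse(handle, "fasta")).  Because the 
whole record is stored, the data file is necessarily bigger than an 
index that only holds file offsets into the original.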

Peter


