[Biopython-dev] [BioPython] poor man's databases for large sequence files
Peter
biopython-dev at maubp.freeserve.co.uk
Tue Sep 25 18:48:37 UTC 2007
I wrote:
>> What I had in mind was say indexing all of UniProt which is currently
>> 1.1 GB in the SwissProt flat file format, but each record is pretty small.
I have written some experimental code to store SeqRecord objects using
pickle (and zlib), and tried this on the 283454 UniProt records from
here (both fasta and swiss-prot flat file format):
ftp://ftp.uniprot.org/pub/databases/uniprot_datafiles_by_format/fasta/uniprot_sprot.fasta.gz
ftp://ftp.uniprot.org/pub/databases/uniprot_datafiles_by_format/flatfile/uniprot_sprot.dat.gz
Fasta file, "uniprot_sprot.fasta", 125 MB
* my pickled SeqRecord database needs about 230 MB (two files),
  takes about 30s to build the index, 1s to load it
* my zlib-pickled SeqRecord database needs about 147 MB (two files),
  takes about 75s to build the index, 2s to load it
* existing Bio.Fasta index using Mindy needs 73 MB (four files),
  takes about 90s to build the index, 2s to load it

SwissProt file, "uniprot_sprot.dat", 1.1 GB
* my pickled SeqRecord database needs about 550 MB (two files),
  takes about 7min to build the index, 1s to load it
* my zlib-pickled SeqRecord database needs about 295 MB (two files),
  takes about 8min to build the index, 3s to load it
* existing Bio.SwissProt.SProt index needs only 35 MB (one file),
  takes about 7.5min to build the index, 16s to load it
Note that just parsing the big SwissProt format file takes about 6min,
so indexing it adds only a comparatively modest overhead.
In all cases, once the index has been built and loaded, accessing
records by key is almost instantaneous.
In terms of run time, my experimental (zlib) pickled read-only
dictionary is comparable to the existing Biopython functionality: record
lookups are sub-second in both.
However, is the overhead of the bigger index files too much? We appear
to be talking about between twice and ten times the size required by the
old format specific indexing. Comments?
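As a quick check on those figures, the size ratios can be worked out
directly from the numbers quoted above. This is only arithmetic on the
sizes in this message; the dictionary layout and variable names are
just for illustration.

```python
# Index sizes in MB, copied from the measurements listed above.
sizes_mb = {
    "fasta": {"pickle": 230, "zlib-pickle": 147, "existing": 73},
    "swissprot": {"pickle": 550, "zlib-pickle": 295, "existing": 35},
}

# Ratio of each pickled database to the existing format-specific index.
ratios = {}
for fmt, sizes in sizes_mb.items():
    for variant in ("pickle", "zlib-pickle"):
        ratios[(fmt, variant)] = sizes[variant] / sizes["existing"]

for (fmt, variant), ratio in sorted(ratios.items()):
    print("%-10s %-12s %.1fx the existing index" % (fmt, variant, ratio))
```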
The reason my indexes are big is that I am storing complete records, not
just their positions within the original file. The motivation is that
this will work with any file format (regardless of the parser), or even
with any collection of records.
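To make the idea concrete, here is a minimal sketch of a pickled
(optionally zlib-compressed) read-only dictionary. It is not the
experimental code itself: the two-file layout (one data file of
concatenated pickles plus one pickled offset table) and the names
build_db and PickleDB are assumptions for illustration. Plain dicts
stand in for SeqRecord objects so the example runs without Biopython;
any picklable record would do.

```python
import os
import pickle
import tempfile
import zlib


def build_db(records, data_path, index_path, compress=True):
    """Store (key, record) pairs in a data file plus an offset index.

    Hypothetical helper: each record is pickled (and optionally
    zlib-compressed) and appended to the data file; the index file
    holds a pickled dict mapping key -> (offset, length).
    """
    offsets = {}
    with open(data_path, "wb") as data:
        for key, record in records:
            blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
            if compress:
                blob = zlib.compress(blob)
            offsets[key] = (data.tell(), len(blob))
            data.write(blob)
    with open(index_path, "wb") as index:
        pickle.dump({"compress": compress, "offsets": offsets}, index)


class PickleDB:
    """Read-only, dictionary-like access to a database from build_db."""

    def __init__(self, data_path, index_path):
        # Loading the index means unpickling one dict of offsets.
        with open(index_path, "rb") as index:
            meta = pickle.load(index)
        self._compress = meta["compress"]
        self._offsets = meta["offsets"]
        self._data = open(data_path, "rb")

    def __getitem__(self, key):
        # One seek and one read per lookup, then unpickle the record.
        offset, length = self._offsets[key]
        self._data.seek(offset)
        blob = self._data.read(length)
        if self._compress:
            blob = zlib.decompress(blob)
        return pickle.loads(blob)

    def __len__(self):
        return len(self._offsets)

    def keys(self):
        return list(self._offsets)

    def close(self):
        self._data.close()


# Made-up stand-in records; with Biopython these would be SeqRecord
# objects coming from whichever parser handles the file format.
demo_records = {
    "demo1": {"id": "demo1", "seq": "MKVLAAGIVG", "description": "example one"},
    "demo2": {"id": "demo2", "seq": "MSTNPKPQRK", "description": "example two"},
}

tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "records.dat")
index_path = os.path.join(tmpdir, "records.idx")

build_db(demo_records.items(), data_path, index_path, compress=True)
db = PickleDB(data_path, index_path)
print(db["demo2"]["seq"])
```

Because whole records are stored, nothing here depends on the original
file or its parser, which is also why the files come out larger than an
index of file positions. As with any pickle, such a database should only
be loaded from a trusted source.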
Peter