[BioPython] poor man's databases for large sequence files
Peter
biopython at maubp.freeserve.co.uk
Mon Sep 24 21:47:13 UTC 2007
I've been thinking about extending Bio.SeqIO to support a (read-only)
dictionary-like interface for large sequence files (WITHOUT having
everything in memory).
Some of the older Biopython sequence format specific modules have an
index_file function and matching Dictionary class to do this (based
internally on either Martel/Mindy or a DIY Biopython indexer based on
pickle).
When thinking about a format-agnostic SeqRecord dictionary, the "Shelf"
object from Python's built-in "shelve" library looks like a good choice.
I could add a Bio.SeqIO.to_shelf() function similar to the existing
Bio.SeqIO.to_dict() function.
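To make the idea concrete, here is a rough sketch of what such a function
might look like. Note that to_shelf() does not exist in Bio.SeqIO; the name,
signature and key_function argument are assumptions for illustration only.
To keep the example runnable without Biopython, the demo uses plain
(id, sequence) tuples as stand-ins for SeqRecord objects; in real use the
records would presumably come from Bio.SeqIO.parse() and be keyed on
record.id.

```python
import os
import shelve
import tempfile


def to_shelf(records, filename, key_function=lambda rec: rec.id):
    """Hypothetical helper (not part of Bio.SeqIO).

    Writes each record from an iterable into a new shelf file, one at a
    time, so the whole collection never has to sit in memory at once.
    Returns the number of records stored.
    """
    count = 0
    # flag="n" always creates a new, empty shelf
    with shelve.open(filename, flag="n") as shelf:
        for record in records:
            key = key_function(record)
            if key in shelf:
                raise ValueError("Duplicate key %r" % key)
            shelf[key] = record  # pickled and stored on disk
            count += 1
    return count


# Stand-in data: (id, sequence) tuples instead of real SeqRecord objects
demo_records = [("seq1", "ACGT"), ("seq2", "GGCCTTAA")]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_shelf")
demo_count = to_shelf(iter(demo_records), demo_path,
                      key_function=lambda rec: rec[0])

# Reopen read-only and fetch a single record by key
with shelve.open(demo_path, flag="r") as shelf:
    demo_hit = shelf["seq2"]
    demo_keys = sorted(shelf.keys())
```

Only the record being requested is unpickled on lookup, which is the
low-memory behaviour I am after. Shelf keys must be strings, so the key
function would need to return a string.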
The only downside I've thought of so far is updating a shelf database,
something supported by shelve but with a few gotchas when dealing with
mutable values (like dictionaries): a change made in place to a value
fetched from the shelf is not saved automatically. The need I am thinking
about addressing is a little less flexible: read-only, low-memory access
to a large collection of SeqRecords (typically from a large sequence file).
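For anyone who has not run into the shelve update gotcha, here is a small
illustration using a plain dictionary as the stored value (a stand-in for a
mutable record; the file name is arbitrary). With the default
writeback=False, each lookup returns a fresh unpickled copy, so editing
that copy in place does not touch the file.

```python
import os
import shelve
import tempfile

gotcha_path = os.path.join(tempfile.mkdtemp(), "gotcha_shelf")

with shelve.open(gotcha_path, flag="n") as shelf:
    shelf["seq1"] = {"description": "original"}

    # In-place edit of a fetched copy: silently lost, the shelf is unchanged
    shelf["seq1"]["description"] = "edited"
    after_inplace = shelf["seq1"]["description"]

    # Fetch, modify, then assign back: this is what actually updates the shelf
    entry = shelf["seq1"]
    entry["description"] = "edited"
    shelf["seq1"] = entry
    after_reassign = shelf["seq1"]["description"]
```

Opening the shelf with writeback=True avoids this, but it caches every
accessed entry in memory until the shelf is synced or closed, which rather
defeats the purpose for large files. A read-only interface sidesteps the
whole issue.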
Does anyone already use Python's shelve library with sequence data?
Peter