[BioPython] poor man's databases for large sequence files
Sean Davis
sdavis2 at mail.nih.gov
Tue Sep 25 01:40:21 UTC 2007
Peter wrote:
> I've been thinking about extending Bio.SeqIO to support a (read-only)
> dictionary-like interface for large sequence files (WITHOUT having
> everything in memory).
>
> Some of the older Biopython sequence format specific modules have an
> index_file function and matching Dictionary class to do this (based
> internally on either Martel/Mindy or a DIY Biopython indexer based on
> pickle).
>
> When thinking about a format-agnostic SeqRecord dictionary, the "Shelf"
> object from Python's built-in "shelve" library looks like a good
> choice. I could add a Bio.SeqIO.to_shelf() function similar to the
> existing Bio.SeqIO.to_dict() function.
>
> The only downside I've thought of so far is updating a shelf database,
> something supported by shelve but with a few gotchas when dealing with
> non-trivial datatypes (like dictionaries). The need I am thinking about
> addressing is a little less flexible: read-only, low-memory access to a
> large collection of SeqRecords (typically from a large sequence file).
>
> Does anyone already use Python's shelve library with sequence data?
>
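For anyone following along, here is a rough sketch of what the proposed idea might look like. It is only an illustration under some assumptions: to_shelf() here is a hypothetical stand-alone helper (not an existing Bio.SeqIO function), iter_fasta() is a toy FASTA reader standing in for Bio.SeqIO.parse(), and plain strings stand in for SeqRecord objects.

```python
import io
import os
import shelve
import tempfile


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word after ">" as the key; tolerate a bare ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_shelf(handle, path):
    """Hypothetical helper: write each record to a new shelf, keyed by id."""
    with shelve.open(path, flag="n") as db:
        for name, seq in iter_fasta(handle):
            if name in db:
                raise ValueError("duplicate key: %s" % name)
            db[name] = seq


def demo():
    """Build a small shelf, reopen it read-only, and return its contents."""
    fasta = io.StringIO(">seq1 first\nACGT\nACGT\n>seq2\nGGCC\n")
    path = os.path.join(tempfile.mkdtemp(), "seqs")
    to_shelf(fasta, path)
    # Reopening with flag="r" gives read-only access; values are
    # unpickled one at a time, on lookup, rather than all at once.
    with shelve.open(path, flag="r") as db:
        return {key: db[key] for key in db}
```

Only one record is parsed and held in memory at a time while the shelf is being built, which is the point of the exercise. A real version would presumably store pickled SeqRecord objects instead of strings.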
Just out of curiosity, Peter: would this extension also handle small
collections of large sequences (finished genomes, for example)?
Sean
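
The update gotcha mentioned in the quoted message can be shown with the standard library alone. This is a minimal sketch, assuming a throwaway shelf in a temporary directory; mutate_in_place() and the "features" key are made up for the illustration.

```python
import os
import shelve
import tempfile


def mutate_in_place(writeback):
    """Append to a list stored in a shelf and return what is read back."""
    path = os.path.join(tempfile.mkdtemp(), "demo")
    with shelve.open(path, writeback=writeback) as db:
        db["features"] = ["gene"]
        # Without writeback=True this appends to a temporary unpickled
        # copy, so the change never reaches the file on disk.
        db["features"].append("exon")
    with shelve.open(path, flag="r") as db:
        return db["features"]
```

With writeback=False the appended item is silently lost; with writeback=True the shelf caches every accessed entry and writes it back on close, which costs memory. Neither issue arises for the read-only use case described above.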