[Bioperl-l] storing/retrieving a large hash on file system?

George Hartzell hartzell at alerce.com
Wed May 19 04:17:24 UTC 2010


Ben Bimber writes:
 > This question is more of a general Perl one than a BioPerl-specific
 > one, so I hope it is appropriate for this list:
 > 
 > I am writing code that has two steps.  The first generates a large,
 > complex hash describing mutations.  It takes a fair amount of time to
 > run this step.  The second step uses this data to perform downstream
 > calculations.  For the purposes of writing/debugging this downstream
 > code, it would save me a lot of time if I could run the first step
 > once, then store this hash in something like the file system.  That
 > way I could quickly load it when debugging the downstream code,
 > without waiting for the hash to be recreated.
 > 
 > Is there a 'best practice' way to do something like this?  I could
 > save a tab-delimited file, which is human-readable but does not
 > represent the structure of the hash, so I would need code to
 > re-parse it.  I assume I could probably do something along the lines
 > of dumping a JSON string, then reading/decoding it.  That is easy,
 > but not so human-readable.  Is there another option I'm not thinking
 > of?  What do others do in this sort of situation?
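
For what it's worth, the JSON round trip you describe is only a few
lines.  A rough sketch, assuming the JSON module from CPAN; $hash_ref
and the file name are placeholders:

  use JSON;

  # dump the structure out once, after the slow first step
  open my $out, '>', 'mutations.json' or die $!;
  print {$out} encode_json($hash_ref);
  close $out;

  # ...and slurp/decode it back in while debugging the downstream code
  open my $in, '<', 'mutations.json' or die $!;
  my $data = decode_json(do { local $/; <$in> });
  close $in;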

Someone early on in the thread said not to invent another format, and
I concur with that wholeheartedly.

Your choice of words, "large, complex hash," makes me worry that you
have something more than a large single-level hash with sensible keys.
Hashes of references to hashes of references to lists, etc., give me
hives.

If you'd like to add a nice general-purpose tool to your kit, think
about putting the data into a simple SQLite database.

Put it into an SQLite db and talk to it via DBI and you get some
really cool tricks:

  - you can store complex stuff,
  - get back just the part you need: a column, several columns, or
    the result of a join among multiple tables,
  - add indexes to make it Go Fast.
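
A rough sketch of that basic workflow, assuming DBI and DBD::SQLite
from CPAN and a made-up, flattened table layout for the mutation data
(the table, column, and hash names here are just placeholders):

  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=mutations.db', '', '',
                         { RaiseError => 1, AutoCommit => 1 });

  # a made-up schema; flatten the hash into rows
  $dbh->do(q{
      CREATE TABLE IF NOT EXISTS mutations (
          id       TEXT PRIMARY KEY,
          position INTEGER,
          ref_aa   TEXT,
          alt_aa   TEXT
      )
  });

  my $sth = $dbh->prepare(
      'INSERT INTO mutations (id, position, ref_aa, alt_aa)
       VALUES (?, ?, ?, ?)');
  for my $id (keys %mutations) {
      my $m = $mutations{$id};
      $sth->execute($id, $m->{position}, $m->{ref_aa}, $m->{alt_aa});
  }

  # later, pull back just the columns you need
  my $rows = $dbh->selectall_arrayref(
      'SELECT id, position FROM mutations WHERE position > ?',
      undef, 1000);

  # and an index to make it Go Fast
  $dbh->do('CREATE INDEX IF NOT EXISTS mut_pos_idx ON mutations (position)');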

and in the cool tricks category:

  - you can use SQLite's backup interface to build the database in
    memory (nice and fast) and then quickly stream it out to a
    disk-based file for persistence.
  - same trick in reverse: if you know you're going to do a reasonably
    large number of complex queries, you can stream a database into
    memory and then run your queries quickly.
  - R-tree indexes are cool.
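
A minimal sketch of the memory-then-disk trick, assuming the
sqlite_backup_to_file / sqlite_backup_from_file methods that recent
DBD::SQLite releases expose (file names are placeholders):

  use DBI;

  # build the database in memory, nice and fast
  my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                         { RaiseError => 1 });
  # ... create tables and load the mutation data here ...

  # then quickly stream it out to a disk-based file for persistence
  $dbh->sqlite_backup_to_file('mutations.db');

  # same trick in reverse: pull an on-disk database into memory
  # before running a reasonably large number of complex queries
  my $mem = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                         { RaiseError => 1 });
  $mem->sqlite_backup_from_file('mutations.db');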

Going forward, you can scale things up to big databases (Pg, Oracle),
and you can provide safe multi-user access, transactions, etc.
(NFS notwithstanding).

g.


