[Bioperl-l] storing/retrieving a large hash on file system?

George Hartzell hartzell at alerce.com
Wed May 19 04:17:24 UTC 2010


Ben Bimber writes:
 > This question is more of a general Perl one than a BioPerl-specific
 > one, so I hope it is appropriate for this list:
 > 
 > I am writing code that has two steps.  The first generates a large,
 > complex hash describing mutations.  It takes a fair amount of time to
 > run this step.  The second step uses this data to perform downstream
 > calculations.  For the purposes of writing/debugging this downstream
 > code, it would save me a lot of time if I could run the first step
 > once, then store this hash in something like the file system.  That
 > way I could quickly load it when debugging the downstream code,
 > without waiting for the hash to be recreated.
 > 
 > Is there a 'best practice' way to do something like this?  I could
 > save a tab-delimited file, which is human-readable but does not
 > represent the structure of the hash, so I would need code to
 > re-parse it.  I assume I could probably do something along the lines
 > of dumping a JSON string, then reading/decoding it.  That is easy,
 > but not so human-readable.  Is there another option I'm not thinking
 > of?  What do others do in this sort of situation?
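
For what it's worth, the JSON round trip you describe is only a few
lines.  A rough sketch, assuming the JSON module from CPAN; $hash_ref
and the file name are placeholders:

  use JSON;

  # dump the structure out once, after the slow first step
  open my $out, '>', 'mutations.json' or die $!;
  print {$out} encode_json($hash_ref);
  close $out;

  # ...and slurp/decode it back in while debugging the downstream code
  open my $in, '<', 'mutations.json' or die $!;
  my $data = decode_json(do { local $/; <$in> });
  close $in;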

Someone early on in the thread said not to invent another format, and
I concur with that wholeheartedly.

Your choice of words, "large, complex hash," makes me worry that you
have something more than a large single-level hash with sensible keys.
Hashes of references to hashes of references to lists, etc., give me
hives.

If you'd like to add a nice general-purpose tool to your kit, think
about putting the data into a simple SQLite database.

Put it into an SQLite db and talk to it via DBI and you get some
really cool tricks:

  - you can store complex stuff,
  - get back just the part you need: a column, several columns, or
    the result of a join among multiple tables,
  - add indexes to make it Go Fast.
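
A rough sketch of that basic workflow, assuming DBI and DBD::SQLite
from CPAN and a made-up, flattened table layout for the mutation data
(the table, column, and hash names here are just placeholders):

  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=mutations.db', '', '',
                         { RaiseError => 1, AutoCommit => 1 });

  # a made-up schema; flatten the hash into rows
  $dbh->do(q{
      CREATE TABLE IF NOT EXISTS mutations (
          id       TEXT PRIMARY KEY,
          position INTEGER,
          ref_aa   TEXT,
          alt_aa   TEXT
      )
  });

  my $sth = $dbh->prepare(
      'INSERT INTO mutations (id, position, ref_aa, alt_aa)
       VALUES (?, ?, ?, ?)');
  for my $id (keys %mutations) {
      my $m = $mutations{$id};
      $sth->execute($id, $m->{position}, $m->{ref_aa}, $m->{alt_aa});
  }

  # later, pull back just the columns you need
  my $rows = $dbh->selectall_arrayref(
      'SELECT id, position FROM mutations WHERE position > ?',
      undef, 1000);

  # and an index to make it Go Fast
  $dbh->do('CREATE INDEX IF NOT EXISTS mut_pos_idx ON mutations (position)');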

and in the cool tricks category:

  - you can use SQLite's backup interface to build the database in
    memory (nice and fast) and then quickly stream it out to a
    disk-based file for persistence.
  - same trick in reverse: if you know you're going to do a reasonably
    large number of complex queries, you can stream a database into
    memory and then run your queries quickly.
  - R-tree indexes are cool.
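
A minimal sketch of the memory-then-disk trick, assuming the
sqlite_backup_to_file / sqlite_backup_from_file methods that recent
DBD::SQLite releases expose (file names are placeholders):

  use DBI;

  # build the database in memory, nice and fast
  my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                         { RaiseError => 1 });
  # ... create tables and load the mutation data here ...

  # then quickly stream it out to a disk-based file for persistence
  $dbh->sqlite_backup_to_file('mutations.db');

  # same trick in reverse: pull an on-disk database into memory
  # before running a reasonably large number of complex queries
  my $mem = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                         { RaiseError => 1 });
  $mem->sqlite_backup_from_file('mutations.db');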

Going forward, you can scale things up to big databases (Pg, Oracle),
and you can provide safe multi-user access, transactions, etc.
(NFS notwithstanding).

g.


