[BioPython] [Fwd: Advice on optimum data structure for billion-long list?]
Mark blobby Robinson
m.1.robinson@herts.ac.uk
Mon, 14 May 2001 16:30:10 +0100
Hi guys,
I posted this to the Python newsgroup and just wanted to run it past
you guys as well. What I failed to mention before was that I am
currently trying to run this on a 700MHz machine with 730MB of RAM,
but my boss will be buying in 8-9 dual-CPU 1GHz, 1GB RAM machines to
allow me to parallelise the program somewhat.
So if anyone has any ideas please feel free to throw them my way ;)
blobby
-------- Original Message --------
Subject: Advice on optimum data structure for billion-long list?
Date: Sat, 12 May 2001 17:12:22 +0100
From: Mark blobby Robinson <m.1.robinson@herts.ac.uk>
Organization: University of Hertfordshire
Newsgroups: comp.lang.python
Hey guys,
I'd just like to pick the best of the world's Python brains if that's ok.
I am building a program that is pattern searching in DNA sequences
and generating a list of combinations of 3 patterns that meet certain
criteria. My problem is that this list could potentially get as large
as ~1.4 billion entries.

Originally I was using a dictionary with a 15-character string as the
key (the three patterns catted together) and the value simply a count
of the number of hits for that pattern combination. Obviously that
hit memory problems pretty soon.
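Stripped down, the dictionary version did something like this
(pattern_combinations below is just a stand-in for my actual search
code, not a real function):

    counts = {}
    for combo in pattern_combinations(seq):
        # combo is the three patterns catted into one 15-char string
        counts[combo] = counts.get(combo, 0) + 1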
So I started storing the information in shelves instead, which worked
fine except that as the size of the shelf increased it started taking
forever to update the shelf file (each update involved a check for
the key and then either updating the count or creating a new entry).
This slowed things down to the extent that it would have taken
literally weeks to run.
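The shelve version boiled down to roughly this (again a simplified
sketch, with pattern_combinations standing in for the real search):

    import shelve

    db = shelve.open('combos')          # gdbm underneath in my case
    for combo in pattern_combinations(seq):
        if combo in db:                 # check for the key entry...
            db[combo] = db[combo] + 1   # ...and update it
        else:
            db[combo] = 1               # ...or create a new entry
    db.close()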
So next I wrote an extension in C which basically creates a huge C
array on disk, with the offset position for each pattern combination
generated by hashing the unique string.
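The idea, expressed in Python rather than the actual C (and ignoring
hash collisions, which this sketch just lets share a counter), is
something like:

    import struct

    RECORD = struct.Struct('<I')   # one 4-byte counter per slot
    TABLE_SLOTS = 2 ** 31          # purely illustrative table size

    def bump(f, combo):
        # hash the 15-char key to a slot and bump the counter stored there
        offset = (hash(combo) % TABLE_SLOTS) * RECORD.size
        f.seek(offset)
        raw = f.read(RECORD.size)
        count = RECORD.unpack(raw)[0] if len(raw) == RECORD.size else 0
        f.seek(offset)
        f.write(RECORD.pack(count + 1))

    with open('combo_counts.bin', 'w+b') as f:
        for combo in pattern_combinations(seq):
            bump(f, combo)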
This is still way too slow, and I am just wondering if I am ignorant
of some far easier or more effective way of tackling this problem.
Maybe I should be using some kind of database, but I thought that was
basically what shelve was already doing - I was using it with gdbm.
Any ideas?
Blobby