[BioPython] [Fwd: Advice on optimum data structure for billion-long list?]

Mark blobby Robinson m.1.robinson@herts.ac.uk
Mon, 14 May 2001 16:30:10 +0100


Hi guys,

I posted this to the Python newsgroup and just wanted to run it past
you guys as well. What I failed to mention before is that I am
currently trying to run it on a 700 MHz machine with 730 MB of RAM,
but my boss will be buying in 8-9 dual-CPU 1 GHz, 1 GB RAM machines
to allow me to parallelise the program somewhat.

So if anyone has any ideas, please feel free to throw them my way ;)

blobby

-------- Original Message --------
Subject: Advice on optimum data structure for billion-long list?
Date: Sat, 12 May 2001 17:12:22 +0100
From: Mark blobby Robinson <m.1.robinson@herts.ac.uk>
Organization: University of Hertfordshire
Newsgroups: comp.lang.python



Hey guys,

I'd just like to pick the best of the world's Python brains, if
that's OK. I am building a program that searches DNA sequences for
patterns and generates a list of combinations of 3 patterns that meet
certain criteria. My problem is that this list could potentially get
as large as ~1.4 billion entries.

Originally I was using a dictionary with a 15-character string as the
key (the three patterns concatenated) and, as the value, simply a
count of the number of hits for that pattern combination. Obviously
that hit memory problems pretty soon.
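
Schematically, the dictionary version was something like this (a
minimal sketch: the pattern search itself is elided, and
find_pattern_combinations just stands in for whatever generates the
combinations):

    # In-memory counting: one dict entry per 15-character key
    # (the three patterns concatenated).
    counts = {}
    for combo in find_pattern_combinations(sequence):  # stand-in name
        counts[combo] = counts.get(combo, 0) + 1

At ~1.4 billion distinct keys, even a few tens of bytes per entry
puts the table far beyond 730 MB of RAM.
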
So I started storing the information in shelves instead, which worked
fine except that, as the shelf file grew, updating it started taking
forever (each update involved a check for the key, followed by either
an update of the existing entry or the creation of a new one). This
was slowing things down to the extent that the run would have taken
literally weeks.
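
The shelve version was essentially the same loop against an on-disk
shelf (again a sketch; the filename is made up, and
find_pattern_combinations stands in for the search code as before):

    import shelve

    db = shelve.open('combos.shelf')  # gdbm-backed on this system
    for combo in find_pattern_combinations(sequence):
        try:
            db[combo] = db[combo] + 1  # update the existing entry...
        except KeyError:
            db[combo] = 1              # ...or create a new one
    db.close()

Each update is a read and a write through gdbm, and it was this
per-entry disk traffic that blew up as the shelf grew.
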
So next I wrote an extension in C which basically creates a huge C
array on disk, with the offset position for each pattern combination
being a hash generated from the unique string. This is still way too
slow, and I am just wondering whether I am ignorant of some far
easier or more effective way of tackling this problem. Maybe I should
be using some kind of database, but I thought that was basically what
the shelve operation did... I was using it with gdbm.
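
In Python terms, the idea behind the C extension is roughly the
following (a sketch only, assuming keys over the plain ACGT alphabet,
so the 15-mer can be read as a base-4 number to give a collision-free
slot index, with one native int count per slot; the real extension
does this in C and differs in details):

    import struct

    BASE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # assumes plain ACGT keys
    SLOT = struct.calcsize('l')              # one native int per slot

    def slot_offset(combo):
        # Read the 15-character key as a base-4 number: a
        # collision-free index in 0 .. 4**15 - 1 (~1.07e9 slots).
        n = 0
        for c in combo:
            n = n * 4 + BASE[c]
        return n * SLOT

    def bump(f, combo):
        # Increment the on-disk count for one pattern combination.
        offset = slot_offset(combo)
        f.seek(offset)
        data = f.read(SLOT)
        if len(data) == SLOT:
            count = struct.unpack('l', data)[0]
        else:
            count = 0                        # slot never written yet
        f.seek(offset)
        f.write(struct.pack('l', count + 1))

    # made-up filename; the file fills in sparsely as slots are hit
    f = open('combo_counts.bin', 'w+b')

Even with a collision-free hash, every increment is a seek plus a
small read and write at an effectively random offset, so the run is
bound by disk seeks rather than CPU, which may be why it is still so
slow.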

Any ideas?

Blobby