[BioPython] [Fwd: Advice on optimium data structure for billion long list?]

Brad Chapman chapmanb@arches.uga.edu
Tue, 15 May 2001 21:34:50 -0400


Hey Blobby;

> I am building a program that is pattern searching in DNA sequences and 
> generating a list of combinations of 3 patterns that meet certain 
> criteria. My problem is that this list could potentially get as large as 
> ~1.4 billion entries. Now originally I was using a dictionary with the 
> key as a 15 length string (the patterns catted) and the value simply a 
> count of the number of hots for that pattern combination. 

Just a random idea that popped in my head, but is it possible that
most of the combination of the 3 patterns are never actually found?
I'm not sure if this would be the case for your particular problem
without knowning anything about it, but if it is a potential solution
that is presented in "The Quick Python Book" by Harms and McDonald is
Sparse Matrices.

If you think of the 3 patterns as making up a three dimensional
matrix, you could encode this matrix in a python dicitionary using
tuples for keys, like:

pattern_dict[("pattern 1", "pattern 2", "pattern 3")] = hit_count

You would only add a pattern to the dictionary if it ever matches, and 
has a hit_count bigger than zero. If most elements are zero, then this
might reduce the size of the dictionary you have to deal with to
something smaller and more manageable.

Hope this might help some.
Brad