[BioPython] [Fwd: Advice on optimium data structure for billion long list?]

Mark blobby Robinson m.1.robinson@herts.ac.uk
Wed, 16 May 2001 11:23:46 +0100


Hey Brad,

Thanks for taking the time to think about my problem. As it turns out, 
my current implementation is pretty much exactly as you suggested. I 
have been finding somewhere in the region of 260 million different 
combinations, and the only reason I haven't found more is because I can't 
handle more than that yet.  I am starting to get the feeling I am 
attempting too much, and am going to have to compromise and filter some 
candidate patterns out at an earlier stage and hope I don't lose any I 
am interested in.

Thanks again
Blobby

Brad Chapman wrote:

> Hey Blobby;
> 
>> I am building a program that is pattern searching in DNA sequences and 
>> generating a list of combinations of 3 patterns that meet certain 
>> criteria. My problem is that this list could potentially get as large as 
>> ~1.4 billion entries. Now originally I was using a dictionary with the 
>> key as a 15-character string (the patterns concatenated) and the value simply a 
>> count of the number of hits for that pattern combination. 
> 
> 
> Just a random idea that popped into my head, but is it possible that
> most of the combinations of the 3 patterns are never actually found?
> I'm not sure if this would be the case for your particular problem
> without knowing anything about it, but if it is, a potential solution
> presented in "The Quick Python Book" by Harms and McDonald is
> sparse matrices.
> 
> If you think of the 3 patterns as making up a three-dimensional
> matrix, you could encode this matrix in a Python dictionary using
> tuples for keys, like:
> 
> pattern_dict[("pattern 1", "pattern 2", "pattern 3")] = hit_count
> 
> You would only add a pattern combination to the dictionary if it
> ever matches, that is, if its hit_count is bigger than zero. If most
> elements are zero, then this might reduce the size of the dictionary
> you have to deal with to something smaller and more manageable.
> 
> Hope this might help some.
> Brad
> 
> 
>
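
For reference, here is a minimal sketch of the sparse, tuple-keyed dictionary
Brad describes above. It is only an illustration under assumptions: the
function name count_pattern_combinations, the example_hits data, and the
5-character pattern strings are all made up here and are not from either
poster's actual program. The point is that a combination only gets an entry
once it has been seen, and anything never seen is read back as zero.

```python
from collections import defaultdict


def count_pattern_combinations(hits):
    """Count occurrences of each (pattern 1, pattern 2, pattern 3) combination.

    `hits` is any iterable of 3-item sequences of pattern strings
    (a hypothetical input format). Only combinations that actually
    occur are stored, so unseen combinations use no memory.
    """
    pattern_dict = defaultdict(int)
    for combo in hits:
        # Tuples are hashable, so they can be used directly as dictionary keys.
        pattern_dict[tuple(combo)] += 1
    # Return a plain dict so later lookups cannot create entries by accident.
    return dict(pattern_dict)


# Hypothetical hits, invented purely for illustration.
example_hits = [
    ("ACGTA", "TTGCA", "GGATC"),
    ("ACGTA", "TTGCA", "GGATC"),
    ("CCGTA", "TTGCA", "GGATC"),
]

pattern_dict = count_pattern_combinations(example_hits)

# A combination that never matched is not stored; .get() reports it as zero.
missing = pattern_dict.get(("AAAAA", "AAAAA", "AAAAA"), 0)
```

Note that this only saves memory when most combinations never occur. As the
reply above says, with roughly 260 million distinct combinations actually
found, the dictionary is still very large, which is why filtering candidates
earlier may be needed regardless.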