[BioPython] Re: Advice on optimium data structure for billion long list?
Mark blobby Robinson
m.1.robinson@herts.ac.uk
Mon, 14 May 2001 17:04:39 +0100
Thanks for all the input, really appreciated.
Have you considered parallel or distributed algorithms?
Use a dictionary until it reaches some huge size.
Then use some method to prune its size, storing the pruned values to
disk, and loop.
From a recent posting I recall that dictionaries won't give memory back
to the system. In your case that could actually be a speed-up once the
dictionary peaks, since the space just gets reused.
Converting the code to run in parallel is definitely in the pipeline,
but as most of the current running time is disk access this will still
leave some significant problems (I think). At the moment I do exactly
as you suggest: I use a dictionary until it reaches something like a
million entries and then write that to disk. I have been simply using
del dictionary to try to reclaim the memory, but that doesn't seem to
be too effective, so maybe using popitem() or something similar to
'prune' the dict would be better.
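In other words, the flush-and-prune loop would look something like this
(a rough sketch; the flush threshold and output file name are just
placeholders):

    def flush_counts(counts, out):
        # dump the current batch of counts, then empty the dict in
        # place with popitem() so the same dict object gets reused
        for key, n in sorted(counts.items()):
            out.write("%s\t%d\n" % (key, n))
        while counts:
            counts.popitem()

    def count_patterns(patterns, path="partial_counts.txt",
                       flush_at=1000000):
        # 'patterns' is any iterable of the identifier strings
        counts = {}
        with open(path, "w") as out:
            for pat in patterns:
                counts[pat] = counts.get(pat, 0) + 1
                if len(counts) >= flush_at:
                    flush_counts(counts, out)
            flush_counts(counts, out)

Since the same key can turn up in more than one flushed batch, the
partial counts on disk still need a merge pass (e.g. sort and sum the
duplicates) at the end.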
That's a lot of data but from your description there are too many
unknowns. Where is the source data coming from? What is an example of
the 15 character string? Can it be compressed at the source? What does
the data structure look like?
Did you get a faster machine, more RAM, multiple CPUs, etc.?
Costas
The source data is DNA sequence, which consists of strings constrained
to contain the four letters A, T, C and G. So an example string would
be something like 'AACGTATAGCCTGA'. As Andrew Dalke has suggested, the
storage of the strings can be reduced significantly by converting to
something like binary notation. I am currently running on a 700 MHz
single-processor, 730MB RAM machine running Red Hat 7, and looking to
get some higher-spec machines in the near future (I hope).
* real world DBMSs (such as PostgreSQL, DB2, Oracle...) are made to
handle tables on the order of 10e7 rows. These are considered
'big' databases. Dealing with billions of rows is a really, really big
database, for which you generally need special support from the OS and
the DB vendor, not to mention the hardware (RAID, anyone?), in order to
get decent performance. Oracle and IBM will be happy to sell you
support to help you create and tune the DB.
* there is absolutely no way you'll be able to use gdbm on an average
workstation to store 1.4 billion rows. In my personal experience, gdbm
is OK up to about 10000 rows. I'm currently dealing with 300000 rows
for an application and I use 26 gdbm files to pre-hash the data into
reasonably sized files. If you want to go in this direction, be aware
that it means about 150000 files. Using Postgres will ease things, but
you'll have to deal with DBMS-specific problems (index updates,
datafile size...)
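To sketch the pre-hashing idea in Python (purely an illustration using
the standard dbm module, with 16 buckets split on the first two bases;
the filenames and bucket choice are made up):

    import dbm

    BASES = "ACGT"

    def open_buckets(prefix="kmer_bucket"):
        # one dbm file per 2-base prefix (16 files here, just to show
        # the idea), so no single file grows unmanageably large
        return {a + b: dbm.open("%s_%s%s" % (prefix, a, b), "c")
                for a in BASES for b in BASES}

    def add_count(buckets, kmer, n=1):
        # route the key to its bucket by prefix, keep a running count
        db = buckets[kmer[:2]]
        db[kmer] = str(int(db.get(kmer, b"0")) + n)

    def close_buckets(buckets):
        for db in buckets.values():
            db.close()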
Ok, that all sounds like sound advice to me. Not necessarily good
news, but it rings true dammit.
You say there are ~1.4 billion possible entries, which is
4**15 + 4**14 + 4**13 + 4**12 + ... + 4**1 + 4**0
and implies you also have strings of length less than 15. This
means you also need to encode the length, since 2 bits per base
covers only the four letters and leaves no room for a termination
character. A simple length encoding needs at least 4 bits
(2**4 = 16), which puts each key at 34 bits.
HOWEVER, suppose you encode the first few bits to mean
  00    - 15 bases encoded (plus 30 for the bases = 32 bits total)
  01    - 14 bases encoded (plus 28 for the bases = 30 bits total)
  11101 - 13 bases encoded (plus 26 for the bases = 31 bits total)
  11100 - 12 bases encoded (plus 24 for the bases = 29 bits total)
  11011 - 11 bases encoded (plus 22 for the bases = 27 bits total)
  11010 - ...
so this is one way (among many) to fit all of your fragments
in a 32-bit word. This encoding can be done with a simple C
extension. The 32-bit value works nicely as the hash for the entry.
You haven't said how you store the values for each entry. Most
likely it will be a list of record identifiers.
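A rough Python sketch of the simpler 34-bit layout (a 4-bit length
field on top of 2 bits per base, rather than the exact prefix code
above) would be something like:

    BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def encode(frag):
        # pack up to 15 bases at 2 bits each, length in the top 4 bits
        packed = 0
        for base in frag:
            packed = (packed << 2) | BASE_CODE[base]
        return (len(frag) << 30) | packed

    def decode(key):
        n = key >> 30
        packed = key & ((1 << 30) - 1)
        return "".join("ACGT"[(packed >> (2 * (n - 1 - i))) & 3]
                       for i in range(n))

    # e.g. decode(encode("AACGTATAGCCTGA")) == "AACGTATAGCCTGA"

A small C extension could do the same packing faster, but the idea is
the same.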
Ok, I am actually only looking at strings of length 15, the reason
being that I am looking at combinations of three strings of length 5.
The order of the patterns is irrelevant, so I sort the strings and
concatenate them to produce a unique identifier for that combination.
At this stage all I am interested in is a frequency count for each
combination. I have been thinking that I should encode the strings
more efficiently, so your advice there is definitely helpful :). I
haven't been storing them as records so far simply because, although
positions will be needed at some point, because of size restrictions I
have decided to re-search for the positions of a (much) smaller subset
at a later point, after a significant number have been filtered out.
The unfortunate thing with searching for combinations of length-5
strings is that in the sequences of 100K or so that this will be
running on, I am likely to find most if not all of the different
combinations.
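To make the key construction concrete, the counting currently amounts
to something like this (a sketch; the example triples are made up and
the code that finds the triples in the sequence is omitted):

    def combo_key(patterns):
        # order of the three length-5 patterns is irrelevant, so sort
        # before concatenating to get one canonical 15-character key
        return "".join(sorted(patterns))

    counts = {}
    for triple in [("GATTC", "AAAAA", "CGCGC"),
                   ("AAAAA", "CGCGC", "GATTC")]:  # same combination twice
        key = combo_key(triple)
        counts[key] = counts.get(key, 0) + 1
    # counts == {"AAAAACGCGCGATTC": 2}

Since every key here has length 15, that 15-character string could
then be packed into a 30-bit integer with an encoder like the one
sketched above.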
thanks again for your input guys, it is invaluable
blobby