[BioPython] Re: Advice on optimium data structure for billion long list?
Mark blobby Robinson
m.1.robinson@herts.ac.uk
Mon, 14 May 2001 17:04:39 +0100
Thanks for all the input, really appreciated.
Have you considered parallel or distributed algorithms?
Use a dictionary until it reaches some huge size.
Then use some method to prune its size, storing the pruned values to
disk, and loop.
From a recent posting I recall that dictionaries won't give memory back
to the system. In your case that could actually be a speed-up once the
dictionary peaks, since the space just gets reused.
Converting the code to run in parallel is definitely in the pipeline,
but as most of the current running time is disk access this will still
leave some significant problems (I think). At the moment I do exactly
as you suggest: I use a dictionary until it reaches something like a
million entries and then write that to disk. I have been simply using
del dictionary to try to reclaim the memory, but that doesn't seem to
be too effective, so maybe using popitem() or something similar to
'prune' the dict would be better.
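In other words, the flush-and-prune loop would look something like this
(a rough sketch; the flush threshold and output file name are just
placeholders):

    def flush_counts(counts, out):
        # dump the current batch of counts, then empty the dict in
        # place with popitem() so the same dict object gets reused
        for key, n in sorted(counts.items()):
            out.write("%s\t%d\n" % (key, n))
        while counts:
            counts.popitem()

    def count_patterns(patterns, path="partial_counts.txt",
                       flush_at=1000000):
        # 'patterns' is any iterable of the identifier strings
        counts = {}
        with open(path, "w") as out:
            for pat in patterns:
                counts[pat] = counts.get(pat, 0) + 1
                if len(counts) >= flush_at:
                    flush_counts(counts, out)
            flush_counts(counts, out)

Since the same key can turn up in more than one flushed batch, the
partial counts on disk still need a merge pass (e.g. sort and sum the
duplicates) at the end.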
That's a lot of data but from your description there are too many
unknowns. Where is the source data coming from? What is an example of
the 15 character string? Can it be compressed at the source? What does
the data structure look like?
Did you get a faster machine, more RAM, multiple CPUs, etc.?
Costas
The source data is DNA sequence, which consists of strings constrained
to contain the four letters A, T, C and G. So an example string would
be something like 'AACGTATAGCCTGA'. As Andrew Dalke has suggested, the
storage of the strings can be reduced significantly by converting to
something like binary notation. I am currently running on a 700 MHz
single-processor, 730MB RAM machine running Red Hat 7, and looking to
get some higher-spec machines in the near future (I hope).
* real world DBMSs (such as PostgreSQL, DB2, Oracle...) are made to
handle tables on the order of 10e7 rows. These are considered
'big' databases. Dealing with billions of rows is a really, really big
database, for which you generally need special support from the OS and
the DB vendor, not to mention the hardware (RAID, anyone?), in order to
get decent performance. Oracle and IBM will be happy to sell you
support to help you create and tune the DB.
* there is absolutely no way you'll be able to use gdbm on an average
workstation to store 1.4 billion rows. In my personal experience, gdbm
is OK up to about 10000 rows. I'm currently dealing with 300000 rows
for an application and I use 26 gdbm files to pre-hash the data into
reasonably sized files. If you want to go in this direction, be aware
that it means about 150000 files. Using Postgres will ease things, but
you'll have to deal with DBMS-specific problems (index updates,
datafile size...)
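To sketch the pre-hashing idea in Python (purely an illustration using
the standard dbm module, with 16 buckets split on the first two bases;
the filenames and bucket choice are made up):

    import dbm

    BASES = "ACGT"

    def open_buckets(prefix="kmer_bucket"):
        # one dbm file per 2-base prefix (16 files here, just to show
        # the idea), so no single file grows unmanageably large
        return {a + b: dbm.open("%s_%s%s" % (prefix, a, b), "c")
                for a in BASES for b in BASES}

    def add_count(buckets, kmer, n=1):
        # route the key to its bucket by prefix, keep a running count
        db = buckets[kmer[:2]]
        db[kmer] = str(int(db.get(kmer, b"0")) + n)

    def close_buckets(buckets):
        for db in buckets.values():
            db.close()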
Ok, that all sounds like sound advice to me. Not necessarily good
news, but it rings true dammit.
You say there are ~1.4 billion possible entries, which is
4**15 + 4**14 + 4**13 + 4**12 + ... + 4**1 + 4**0
and implies you also have strings of length less than 15. This
means you also need to encode the length, since 2 bits per base
covers only the four letters and leaves no room for a termination
character. A simple length encoding needs at least 4 bits
(2**4 = 16), which puts each key at 34 bits.
HOWEVER, suppose you encode the first few bits to mean
  00    - 15 bases encoded (plus 30 for the bases = 32 bits total)
  01    - 14 bases encoded (plus 28 for the bases = 30 bits total)
  11101 - 13 bases encoded (plus 26 for the bases = 31 bits total)
  11100 - 12 bases encoded (plus 24 for the bases = 29 bits total)
  11011 - 11 bases encoded (plus 22 for the bases = 27 bits total)
  11010 - ...
so this is one way (among many) to fit all of your fragments
in a 32-bit word. This encoding can be done with a simple C
extension. The 32-bit value works nicely as the hash for the entry.
You haven't said how you store the values for each entry. Most
likely it will be a list of record identifiers.
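A rough Python sketch of the simpler 34-bit layout (a 4-bit length
field on top of 2 bits per base, rather than the exact prefix code
above) would be something like:

    BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def encode(frag):
        # pack up to 15 bases at 2 bits each, length in the top 4 bits
        packed = 0
        for base in frag:
            packed = (packed << 2) | BASE_CODE[base]
        return (len(frag) << 30) | packed

    def decode(key):
        n = key >> 30
        packed = key & ((1 << 30) - 1)
        return "".join("ACGT"[(packed >> (2 * (n - 1 - i))) & 3]
                       for i in range(n))

    # e.g. decode(encode("AACGTATAGCCTGA")) == "AACGTATAGCCTGA"

A small C extension could do the same packing faster, but the idea is
the same.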
Ok, I am actually only looking at strings of length 15, the reason
being that I am looking at combinations of three strings of length 5.
The order of the patterns is irrelevant, so I sort the strings and
concatenate them to produce a unique identifier for that combination.
At this stage all I am interested in is a frequency count for each
combination. I have been thinking that I should encode the strings
more efficiently, so your advice there is definitely helpful :). I
haven't been storing them as records so far simply because, although
positions will be needed at some point, because of size restrictions I
have decided to re-search for the positions of a (much) smaller subset
at a later point, after a significant number have been filtered out.
The unfortunate thing with searching for combinations of length-5
strings is that in the sequences of 100K or so that this will be
running on, I am likely to find most if not all of the different
combinations.
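To make the key construction concrete, the counting currently amounts
to something like this (a sketch; the example triples are made up and
the code that finds the triples in the sequence is omitted):

    def combo_key(patterns):
        # order of the three length-5 patterns is irrelevant, so sort
        # before concatenating to get one canonical 15-character key
        return "".join(sorted(patterns))

    counts = {}
    for triple in [("GATTC", "AAAAA", "CGCGC"),
                   ("AAAAA", "CGCGC", "GATTC")]:  # same combination twice
        key = combo_key(triple)
        counts[key] = counts.get(key, 0) + 1
    # counts == {"AAAAACGCGCGATTC": 2}

Since every key here has length 15, that 15-character string could
then be packed into a 30-bit integer with an encoder like the one
sketched above.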
thanks again for your input guys, it is invaluable
blobby