[Biopython] SeqIO.index for csfasta files memory issues

Thu Jan 21 11:31:29 UTC 2010

On Thu, Jan 21, 2010 at 9:10 AM, Kevin Lam <aboulia at gmail.com> wrote:
>
> Yups python is 64bit
>>>> platform.architecture()
> ('64bit', 'ELF')

Hmm - I was hoping that wouldn't be the case.

> the sample 1 file has
> 48412673 reads
> here's the top 20 reads
>
> head -n 20 Sample2.csfasta
>>427_22_20_F3
> T33133100313302011000100000000000000000000010000000
>>427_22_29_F3
> T30101002122001001000300000200030000000002121003000
>>427_22_44_F3
> T12223211021010030202120002130211102100003002010303
>>427_22_52_F3
> T32031331333133301101223023301013011032103032122123
>>427_22_58_F3
> T23010130111130001000202232101031001010000000000000
>>427_22_66_F3
> T10303202110222020010200311000110011001001111000110
>>427_22_72_F3
> T23332102212232122131103321303322213023003233100320
>>427_22_87_F3
> T20112313302013303131123323002203111122211310000010
>>427_22_113_F3
> T32021321020200032003222000221030102023012000003013
>>427_22_169_F3
> T22012322202220000000100000100000000000000010100020

Thanks Kevin,

I wrote a trivial script to generate a big fake SOLiD CSFASTA file like this:

import random
total = 48412673 # 48 million
count = 0
handle = open("big_fake_solid.csfasta", "w")
for i in range(1000):
    for j in range(100):
        for k in range(1000):
            for h in range(256):
                nuc = random.choice("ACGT")
                #I could make the color sequence random, but
                #there is no real point for testing indexing:
                color_changes = "33133100313302011000100000000000000000000010000000"
                handle.write(">%03i_%02i_%02i_%02X\n%s%s\n" \
                             % (i,j,k,h, nuc, color_changes))
                count += 1
                if count >= total : break
            if count >= total : break
        #print "Done %i so far" % count
        if count >= total : break
    if count >= total : break
handle.close()

I then tried indexing with Bio.SeqIO.index("big_fake_solid.csfasta","fasta")
using Biopython 1.53+ (latest code from git) on Mac OS X 10.5 Leopard
with 12GB of RAM, using the Apple provided Python 2.5 installation.
I watched the process in system monitor and it failed when memory
consumption reached 4GB, with a repeated message:

Python(608) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

and traceback:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/SeqIO/_index.py", line 262, in __init__
    self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
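
For readers unfamiliar with this style of indexing, here is a minimal
sketch of the general idea behind the line in the traceback: scan the file
once and remember, for each record identifier, the byte offset of its ">"
line. This is not Biopython's actual code; the function names
build_offset_index and fetch_raw are hypothetical and used only for
illustration. The point is that every record costs one dictionary entry,
so memory grows with the number of reads rather than with the file size.

```python
def build_offset_index(path, marker=b">"):
    """Map each record ID to the byte offset of its header line.

    Hypothetical helper for illustration, not part of Biopython.
    """
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(marker):
                # Key is the first whitespace-delimited word after the marker
                key = line[len(marker):].strip().split(None, 1)[0]
                index[key.decode("ascii")] = offset
            # Binary mode, so len(line) is the exact number of bytes read
            offset += len(line)
    return index


def fetch_raw(path, index, key, marker=b">"):
    """Return the raw text of one record by seeking to its stored offset."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(index[key])
        lines.append(handle.readline())  # the header line itself
        for line in iter(handle.readline, b""):
            if line.startswith(marker):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

With 48 million reads, the dictionary built this way holds 48 million keys
and 48 million offsets, which is where the memory goes.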

It turns out that my copy of Python (the default Apple provided one
on Leopard) seems to be 32bit only:

$ python
Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.architecture()
('32bit', '')
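
As a cross-check, the pointer size of the running interpreter can be
queried directly. The Python documentation notes that platform.architecture()
can be unreliable for universal binaries on Mac OS X and suggests testing
sys.maxsize instead (available from Python 2.6). A small sketch, where
python_bits is just an illustrative helper name:

```python
import struct
import sys


def python_bits():
    """Return the pointer width in bits of the running interpreter."""
    # "P" is the struct format code for a C void pointer
    return struct.calcsize("P") * 8


# sys.maxsize (Python 2.6+) exceeds 2**32 only on a 64-bit build
is_64bit = sys.maxsize > 2**32
print("%i-bit Python, sys.maxsize > 2**32: %s" % (python_bits(), is_64bit))
```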

So *if* your system was running 32bit python, I would expect it to
fail like this. I'd like to try a 64bit python locally - either I could
install this manually, or look for a big memory Linux box to try.
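
To get a feel for why a 32bit process (which can address at most 4GB)
would run out of room, here is a rough sketch that builds a small
key-to-offset dictionary and extrapolates its size to the full read count.
The key pattern and the 66 byte record spacing are assumptions loosely
based on the sample reads above, sys.getsizeof needs Python 2.6 or later,
and the result ignores allocator overhead, so treat the number as an order
of magnitude only.

```python
import sys


def estimate_index_bytes(n_records, sample_size=100000):
    """Extrapolate the memory of a key -> offset dict from a small sample.

    Illustrative helper only; the key format and offsets are made up.
    """
    sample = {}
    for i in range(sample_size):
        # Assumed key shape, similar to the "427_22_20_F3" style names
        sample["427_22_%i_F3" % i] = i * 66
    total = sys.getsizeof(sample)  # the hash table itself
    for key, value in sample.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total / float(sample_size) * n_records


estimate = estimate_index_bytes(48412673)
print("Estimated index size: %.1f GB" % (estimate / 2.0**30))
```

The exact figure depends on the Python version and build, since strings
and integers are larger on 64bit builds.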

Or, if I updated my OS, it looks like Mac OS X 10.6 Snow Leopard
includes 64bit Python 2.6, plus a Python 2.5 which is only 32bit:
http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/python.1.html

Peter