[Biopython] SeqIO.index for csfasta files memory issues

Kevin aboulia at gmail.com
Tue Jan 19 23:43:11 UTC 2010


Hi Peter,

It's a 64-bit CentOS shared cluster, and I assumed the rest of the
stack (Python and so on) is 64-bit as well, but I may be wrong. The
Biopython version is 1.53, I believe.

I wanted random access because I need half the reads separated out
this way, and I think it is faster. I guess I'll have to do it the
old way.
Thanks
Kev
Sent from my iPod

On 19-Jan-2010, at 5:38 PM, Peter <biopython at maubp.freeserve.co.uk>  
wrote:

> On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam <aboulia at gmail.com> wrote:
>> What are the memory limitations for SeqIO.index?
>> I am trying to create an index for a 4.5 GB csfasta file
>> (~60 million reads), but the script crashes at 5 GB of RAM
>> usage. The machine has 31 GB of RAM.
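>>
>> Roughly what I am running (a minimal sketch; the file name is
>> a placeholder, and the csfasta file is read with the plain
>> "fasta" parser):
>>
>>     from Bio import SeqIO
>>
>>     # Build the random-access index of ~60 million reads; this
>>     # is the step that dies at about 5 GB of resident memory.
>>     reads = SeqIO.index("reads.csfasta", "fasta")
>>     print("Indexed %i records" % len(reads))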
>
> What OS are you using (and is it 64-bit)?
> What Python are you using (and is it 64-bit)?
> What version of Biopython are you using?
>
> I've never tried a file with quite that many reads, but
> crashing at about 5 GB is odd. I wonder if there is a 4 GB
> limit somewhere in your system (e.g. running 32-bit
> Python). By adding some debug statements we could
> see where it falls over (i.e. how many reads had
> been indexed).
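>
> One way to see this without editing Biopython itself is to
> mimic what the indexer keeps (record IDs and file offsets) in
> a plain dict, with a progress printout (a minimal sketch; the
> file name is a placeholder):
>
>     # Mimic SeqIO.index's bookkeeping: record ID -> file offset.
>     # If the crash happens at a consistent count (e.g. near a
>     # 4 GB offset), the last progress line should show it.
>     offsets = {}
>     with open("reads.csfasta") as handle:
>         while True:
>             offset = handle.tell()
>             line = handle.readline()
>             if not line:
>                 break
>             if line.startswith(">"):
>                 offsets[line[1:].split(None, 1)[0]] = offset
>                 if len(offsets) % 1000000 == 0:
>                     print("%i reads indexed so far" % len(offsets))
>     print("Done: %i reads" % len(offsets))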
>
> Long term, really big indexes will be too big
> to hold in memory as a Python dict (record IDs and
> file offsets). Therefore we have done a little work
> looking at disk-based indexes, including sqlite3,
> although this does make building the index much
> slower.
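>
> Just to illustrate the idea (this is a sketch, not Biopython's
> actual implementation; the file names and the read ID are made
> up): the same ID -> offset pairs can go into an on-disk SQLite
> table instead of a dict, so memory use stays flat:
>
>     import sqlite3
>     from Bio import SeqIO
>
>     # Scan the file once, storing each record's offset on disk.
>     con = sqlite3.connect("reads.idx")  # hypothetical index file
>     con.execute("CREATE TABLE offsets "
>                 "(id TEXT PRIMARY KEY, offset INTEGER)")
>     with open("reads.csfasta") as handle:
>         while True:
>             offset = handle.tell()
>             line = handle.readline()
>             if not line:
>                 break
>             if line.startswith(">"):
>                 rec_id = line[1:].split(None, 1)[0]
>                 con.execute("INSERT INTO offsets VALUES (?, ?)",
>                             (rec_id, offset))
>     con.commit()
>
>     # Random access later: seek to the stored offset and parse
>     # just that one record ("some_read_id" is a made-up ID).
>     (offset,) = con.execute("SELECT offset FROM offsets WHERE id=?",
>                             ("some_read_id",)).fetchone()
>     with open("reads.csfasta") as handle:
>         handle.seek(offset)
>         record = next(SeqIO.parse(handle, "fasta"))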
>
> For your immediate task, try a simple iteration
> through the records, selecting the records of
> interest with Bio.SeqIO.parse and writing them out
> with Bio.SeqIO.write, as per my other email. This
> way you only have to keep one record in memory at
> a time, plus a list/set of the wanted IDs:
> http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html
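>
> For example (a minimal sketch; the file names and the
> one-ID-per-line format of "wanted.txt" are assumptions):
>
>     from Bio import SeqIO
>
>     # Load the wanted IDs once; this set and the current record
>     # are the only things kept in memory.
>     with open("wanted.txt") as handle:
>         wanted = set(line.strip() for line in handle)
>
>     # Stream the reads and write out only the ones we want.
>     records = (rec for rec in SeqIO.parse("reads.csfasta", "fasta")
>                if rec.id in wanted)
>     count = SeqIO.write(records, "wanted.csfasta", "fasta")
>     print("Saved %i records" % count)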
>
> Peter
