[Biopython] SeqIO.index for csfasta files memory issues

Peter biopython at maubp.freeserve.co.uk
Tue Jan 19 09:38:43 UTC 2010


On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam <aboulia at gmail.com> wrote:
> What are the memory limitations for SeqIO.index?
> I am trying to create an index for a 4.5 GB csfasta file
> (~60 million reads), but the script crashes at 5 GB RAM
> usage. The machine has 31 GB RAM.

What OS are you using (and is it 64-bit)?
What Python are you using (and is it 64-bit)?
What version of Biopython are you using?

I've never tried a file with quite that many reads, but
crashing at about 5 GB is odd. I wonder if you are hitting
a 4 GB limit somewhere in your system (e.g. running 32-bit
Python). Adding some debug statements would show when it
falls over (i.e. how many reads had been indexed).
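
Untested, but a rough sketch like this should show how far
through the file you get before memory runs out. It mimics the
in-memory part of the index (one dict entry per read) rather
than calling SeqIO.index itself; the filename is a placeholder,
and I'm assuming you parse the csfasta file as plain "fasta":

    from Bio import SeqIO

    # One dict entry per read, roughly what SeqIO.index keeps in
    # memory, with progress printed every million reads.
    positions = {}
    for i, record in enumerate(SeqIO.parse("reads.csfasta", "fasta")):
        positions[record.id] = i  # stand-in for the real file offset
        if i % 1000000 == 0:
            print("Indexed %i reads..." % i)
    print("Done, %i reads" % len(positions))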

Long term, really big indexes will be too big to hold in
memory as a Python dict (record IDs and file offsets), so
we have done a little work looking at disk-based indexes,
including sqlite3. This does make building the index much
slower though.
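
Just to illustrate the idea (this is only a sketch, not the
code we've been experimenting with in Biopython; the filenames,
table name, and helper function are all made up), the record
IDs and file offsets can go into an SQLite table instead of a
dict:

    import sqlite3

    def fasta_offsets(filename):
        """Yield (record ID, byte offset) for each '>' header line."""
        with open(filename, "rb") as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    yield line[1:].split(None, 1)[0].decode(), offset

    con = sqlite3.connect("reads.idx")  # placeholder filename
    con.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, offset INTEGER)")
    con.executemany("INSERT INTO offsets VALUES (?, ?)",
                    fasta_offsets("reads.csfasta"))
    con.commit()

    # Later you can look up a single read's offset without a big
    # in-memory dict, then seek there and parse just that record.
    row = con.execute("SELECT offset FROM offsets WHERE id=?",
                      ("some_read_id",)).fetchone()

The lookups stay fast, but all those INSERTs are why building
the index this way is so much slower than filling a dict.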

For your immediate task, try a simple iteration through
the records with Bio.SeqIO.parse, selecting the records
of interest and writing them out as per my other email.
That way you only keep one record in memory at a time,
plus a list/set of the wanted IDs:
http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html
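
In outline (untested; the filenames are placeholders, and I'm
assuming your wanted IDs are in a text file, one per line):

    from Bio import SeqIO

    # One record in memory at a time, plus a set of wanted IDs.
    wanted = set(line.strip() for line in open("wanted_ids.txt"))
    records = (rec for rec in SeqIO.parse("reads.csfasta", "fasta")
               if rec.id in wanted)
    count = SeqIO.write(records, "wanted.csfasta", "fasta")
    print("Saved %i reads" % count)

Using a generator expression means SeqIO.write never sees more
than one record at a time.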

Peter


