[Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue May 25 13:52:26 UTC 2010
Hi,
> My main concern with the current tools is the memory issue. For instance when
> I try to create a distribution of sequence lengths or qualities using NGS
> data I end up with millions of numbers. That is too much for any reasonable
> computer.
Several million numbers aren't all that much, though, right?
To simulate your example, I created a 100,000,000-element vector (which,
depending on what type of NGS data you have, should be considered a
large number of reads) of faux read lengths, and it's only taking up
~382 MB [1]; gathering basic statistics on it (mean, variance,
histograms, etc.) isn't painful at all.
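For what it's worth, here's roughly what that experiment looks like as
a NumPy sketch (the numbers above came from R, see [1]; the 36-150 bp
range below is just made up for illustration):

import numpy as np

n_reads = 10**8
# faux read lengths, uniformly drawn between 36 and 150 bp
lengths = np.random.randint(36, 151, size=n_reads).astype(np.int32)

print(lengths.nbytes / 2.0**20)       # ~381 MB for the raw 32-bit array
print(lengths.mean(), lengths.var())  # basic statistics are quick
counts, edges = np.histogram(lengths, bins=range(36, 152))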
Once you start adding more metadata to the 100,000,000 elements, I can
see where you start running into problems, though.
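Back-of-the-envelope: 100,000,000 reads at 4 bytes apiece is ~0.4 GB,
but tack on even, say, 40 bytes of per-read metadata (IDs, positions,
quality summaries) and you're suddenly looking at ~4 GB.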
> I've solved the problem by using disk caches that work as
> iterators. I'm sure that this is not the most performant solution. It's just
> a hack and I would like to use better tools for sure.
Have you tried looking at something like PyTables? Might be something
to consider ...
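To make that concrete, here's a minimal sketch of the kind of thing I
mean (assuming PyTables and NumPy are installed; the file and array
names are made up, and the random chunks stand in for whatever your NGS
parser yields):

import numpy as np
import tables

# write: append read lengths to an on-disk, extendable array in chunks
with tables.open_file("read_lengths.h5", mode="w") as h5:
    lengths = h5.create_earray(h5.root, "lengths", tables.Int32Atom(),
                               shape=(0,), expectedrows=10**8)
    for _ in range(100):
        chunk = np.random.randint(36, 151, size=10**6).astype(np.int32)
        lengths.append(chunk)

# read: compute statistics chunk-wise, never holding it all in RAM
with tables.open_file("read_lengths.h5", mode="r") as h5:
    arr = h5.root.lengths
    n, total = arr.nrows, 0.0
    for start in range(0, n, 10**6):
        total += arr[start:start + 10**6].sum()
    print(total / n)  # mean read length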
Just a thought,
-steve
[1] I'm using R, which only uses 32-bit integers, but the language
itself isn't really the point since we're all going to be running
into a wall with respect to NGS-sized datasets.
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact