[BioPython] Martel Question
Peter Wilkinson
pwilkinson at videotron.ca
Fri Mar 26 00:04:23 EST 2004
I have just built a parser for QuantArray output, though not with Martel. I
did this as a warm-up for a pile of code that I need to write for a
microarray project (~3000 arrays). It has been some time since I have
written any code, and I need to "get into it".
The parser is built on the typical scanner/consumer model, with a state
machine driving it through the QuantArray output files. It was not designed
to load anything into memory; it was meant to transform the original file
into a new format written to disk as fast as possible, since I was
processing many files of about 5 MB each, times 3000. Each line was read
from the input stream, processed, and discarded.
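Roughly, the skeleton looks like this (the section markers and class names
here are invented for illustration; the real QuantArray sections differ):

    import sys

    class TabWriter:
        """Consumer: reformats each data line and writes it straight to disk."""
        def __init__(self, out):
            self.out = out
        def data_line(self, line):
            # transform, write, discard; nothing accumulates in memory
            self.out.write("\t".join(line.split(",")) + "\n")

    class QuantArrayScanner:
        """Scanner: a small state machine feeding lines to a consumer."""
        def __init__(self, consumer):
            self.consumer = consumer
            self.state = "header"
        def feed(self, stream):
            for line in stream:
                line = line.rstrip("\r\n")
                if self.state == "header":
                    if line.startswith("Begin Data"):    # invented marker
                        self.state = "data"
                elif self.state == "data":
                    if line.startswith("End"):           # invented marker
                        self.state = "header"
                    else:
                        self.consumer.data_line(line)

    scanner = QuantArrayScanner(TabWriter(sys.stdout))
    scanner.feed(open("array0001.txt"))   # one 5 MB file at a time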
Eventually I will be building matrices of 19,000 genes by 3,000 samples
from the QuantArray files I will be reading, so I need something that can
load a file into a QuantArray Record object. However, I am a little worried
about the record sizes. There is only one record per file, which might be
the saving grace. When I was parsing large (many-megabyte) GenBank genomic
files, with one genomic record per file, the GenBank parser slowed to a
crawl and required piles of memory.
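For the record loading, what I have in mind is something like this (all
field names are invented; the real QuantArray record has many more per-spot
columns):

    class QuantArrayRecord:
        """One record per file: header fields plus ~19000 per-spot rows."""
        def __init__(self):
            self.header = {}   # e.g. barcode, protocol, units
            self.spots = []    # one (gene_id, ch1, ch2) tuple per spot

        def ratios(self):
            # derived view, used to fill one column of the eventual
            # 19000-gene x 3000-sample matrix
            result = {}
            for gene_id, ch1, ch2 in self.spots:
                if ch2:
                    result[gene_id] = float(ch1) / ch2
            return result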
I would like to know whether Martel scales to processing 5 MB records at a
time, if the entire file ends up in memory. Has Martel been improved in
this regard over the last few months? I may have a need to parse large
GenBank NT records again.
Dalke's Martel paper says: "Similarly, it should be possible to read data
from the input stream only when required, so that overall memory footprint
stays low." Is that still to be done?
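From the Martel documentation, I gather the usage would be something like
the following, with a SAX ContentHandler that writes each field out as it
arrives instead of accumulating anything (the format expression is a toy,
and I have not verified this against the current Martel API, so treat the
calls as assumptions):

    import sys
    import Martel
    from xml.sax import handler

    # Toy grammar: lines of "name<TAB>value"; the real QuantArray
    # grammar would be much larger.
    line = Martel.Group("field",
                        Martel.Group("name", Martel.Re(r"[^\t]+")) +
                        Martel.Str("\t") +
                        Martel.Group("value", Martel.Re(r"[^\n]+")) +
                        Martel.AnyEol())
    format = Martel.Rep(line)

    class FieldWriter(handler.ContentHandler):
        """Streams each field straight to output; keeps nothing."""
        def __init__(self):
            handler.ContentHandler.__init__(self)
            self.current = None
            self.text = ""
        def startElement(self, tag, attrs):
            if tag in ("name", "value"):
                self.current = tag
                self.text = ""
        def characters(self, content):
            if self.current:
                self.text += content
        def endElement(self, tag):
            if tag == self.current:
                sys.stdout.write("%s = %s\n" % (tag, self.text))
                self.current = None

    parser = format.make_parser()
    parser.setContentHandler(FieldWriter())
    parser.parseFile(open("array0001.txt"))

If Martel still builds the whole match tree for a 5 MB record before the
callbacks fire, though, this would not help my memory problem, which is
really what I am asking.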
Peter