[BioPython] Martel Question
Peter Wilkinson
pwilkinson at videotron.ca
Fri Mar 26 00:04:23 EST 2004
I have just built a parser for QuantArray output, though not with Martel. I
did this as a warm-up for a pile of code that I need to write for a
microarray project (~3000 arrays). It has been some time since I have
written any code, and I need to "get into it".
The parser is built on the typical scanner/consumer model, with a state
machine driving it through the QuantArray output files. It was not designed
to load anything into memory; it was meant to transform the original file
into a new format written to disk as fast as possible, since I was
processing many files of about 5 MB each, times 3000. Each line was read
from the input stream, processed, and discarded.
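Roughly, the skeleton looks like this (the section markers and class names
here are invented for illustration; the real QuantArray sections differ):

    import sys

    class TabWriter:
        """Consumer: reformats each data line and writes it straight to disk."""
        def __init__(self, out):
            self.out = out
        def data_line(self, line):
            # transform, write, discard; nothing accumulates in memory
            self.out.write("\t".join(line.split(",")) + "\n")

    class QuantArrayScanner:
        """Scanner: a small state machine feeding lines to a consumer."""
        def __init__(self, consumer):
            self.consumer = consumer
            self.state = "header"
        def feed(self, stream):
            for line in stream:
                line = line.rstrip("\r\n")
                if self.state == "header":
                    if line.startswith("Begin Data"):    # invented marker
                        self.state = "data"
                elif self.state == "data":
                    if line.startswith("End"):           # invented marker
                        self.state = "header"
                    else:
                        self.consumer.data_line(line)

    scanner = QuantArrayScanner(TabWriter(sys.stdout))
    scanner.feed(open("array0001.txt"))   # one 5 MB file at a time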
Eventually I will be building matrices of 19,000 genes by 3,000 samples
from the QuantArray files I will be reading, so I need something that can
load a file into a QuantArray Record object. However, I am a little worried
about the record sizes. There is only one record per file, which might be
the saving grace. When I was parsing large (many-megabyte) GenBank genomic
files, with one genomic record per file, the GenBank parser slowed to a
crawl and required piles of memory.
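For the record loading, what I have in mind is something like this (all
field names are invented; the real QuantArray record has many more per-spot
columns):

    class QuantArrayRecord:
        """One record per file: header fields plus ~19000 per-spot rows."""
        def __init__(self):
            self.header = {}   # e.g. barcode, protocol, units
            self.spots = []    # one (gene_id, ch1, ch2) tuple per spot

        def ratios(self):
            # derived view, used to fill one column of the eventual
            # 19000-gene x 3000-sample matrix
            result = {}
            for gene_id, ch1, ch2 in self.spots:
                if ch2:
                    result[gene_id] = float(ch1) / ch2
            return result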
I would like to know whether Martel scales to processing 5 MB records at a
time, if the entire file ends up in memory. Has Martel been improved in
this regard over the last few months? I may have a need to parse large
GenBank NT records again.
Dalke's Martel paper says: "Similarly, it should be possible to read data
from the input stream only when required, so that overall memory footprint
stays low." Is that still to be done?
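From the Martel documentation, I gather the usage would be something like
the following, with a SAX ContentHandler that writes each field out as it
arrives instead of accumulating anything (the format expression is a toy,
and I have not verified this against the current Martel API, so treat the
calls as assumptions):

    import sys
    import Martel
    from xml.sax import handler

    # Toy grammar: lines of "name<TAB>value"; the real QuantArray
    # grammar would be much larger.
    line = Martel.Group("field",
                        Martel.Group("name", Martel.Re(r"[^\t]+")) +
                        Martel.Str("\t") +
                        Martel.Group("value", Martel.Re(r"[^\n]+")) +
                        Martel.AnyEol())
    format = Martel.Rep(line)

    class FieldWriter(handler.ContentHandler):
        """Streams each field straight to output; keeps nothing."""
        def __init__(self):
            handler.ContentHandler.__init__(self)
            self.current = None
            self.text = ""
        def startElement(self, tag, attrs):
            if tag in ("name", "value"):
                self.current = tag
                self.text = ""
        def characters(self, content):
            if self.current:
                self.text += content
        def endElement(self, tag):
            if tag == self.current:
                sys.stdout.write("%s = %s\n" % (tag, self.text))
                self.current = None

    parser = format.make_parser()
    parser.setContentHandler(FieldWriter())
    parser.parseFile(open("array0001.txt"))

If Martel still builds the whole match tree for a 5 MB record before the
callbacks fire, though, this would not help my memory problem, which is
really what I am asking.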
Peter