[BioPython] Martel Question

Tue Mar 30 19:24:44 EST 2004

Hi Peter;

> I have just built a parser for Quantarray ... but not with Martel. 
> This parser is built with the typical scanner consumer model philosophy, 
> built on a state machine that will handle the quantarray output files. This 
> parser was not built to load anything into the memory. 
[...]
> I will be reading and I need something that can 
> load into a Quantarray Record object, however I was a little worried about 
> the Record sizes. There is only 1 record per file, which might be the 
> saving grace. When I was parsing Genbank genomic files (many megs), the 
> Genbank parser was slowing to a crawl (and required piles of memory); 1 
> genomic record per file.

I know there were some speedups done on the GenBank parser over
recent releases. Nothing related specifically to Martel, but rather to
some of my Python code which utilizes it. Have you tried it lately
on your files and machines and found it to be especially slow? But
yes, we haven't done any work on keeping big records from being
stored entirely in memory.

> I would like to know if Martel scales to processing 5mb Records at a time, 
> if the entire file is in memory? Has Martel been improved over the last 
> few months in this regard ... I may have a need to parse large Genbank NT 
> records again.

There haven't been any specific changes to Martel -- if you are
basing the memory problems solely on the GenBank parser, I know a
number of parts of that were written badly (by me), and I have
since tried to fix them up.

> In Dalke's Martel paper it reads "Similarly, it should be possible to read 
> data from the input stream only when required, so that overall memory 
> footprint stays low. " is that still to be done ?

Andrew hasn't made any drastic changes to Martel in the last few
months, so I assume so. He could probably give a better answer.
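
For what it's worth, here is a rough sketch of what "read data from the
input stream only when required" looks like in plain Python, in the
scanner/consumer style you describe. This is not Martel code and not the
Biopython API; the names scan and LengthConsumer are made up for
illustration, and the input is assumed to be a simple FASTA-like text
stream. The point is only that the scanner hands one line at a time to
the consumer, so memory use depends on what the consumer keeps rather
than on the file size.

```python
import io


class LengthConsumer:
    """Hypothetical consumer: keeps only the title and a running length."""

    def __init__(self):
        self.title = None
        self.length = 0

    def start_record(self, title):
        self.title = title

    def sequence_line(self, line):
        # Count the residues but do not store them.
        self.length += len(line)


def scan(handle, consumer):
    """Feed the handle to the consumer one line at a time."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            consumer.start_record(line[1:])
        else:
            consumer.sequence_line(line)


# A tiny in-memory stream stands in for a large file on disk.
consumer = LengthConsumer()
scan(io.StringIO(">chr_example\nACGT\nACGTAC\n"), consumer)
print(consumer.title, consumer.length)
```

With a real file you would pass an open file handle instead of the
StringIO object; a consumer that builds a full Record would of course
still need memory for whatever it stores.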

So, sorry, but my answer boils down to this: I'm not sure how it will
behave on records that size, so I guess you'll have to try it and see.
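
If it helps with the trying, here is a minimal sketch of how one could
time a parse and record its peak memory using only the standard library
(time and tracemalloc). The measure helper and the load_whole_record
stand-in are hypothetical names for illustration; you would swap in
whatever parser call you actually want to test. Note that tracemalloc
only sees allocations made through Python's allocator and slows the run
down, so treat the numbers as rough.

```python
import time
import tracemalloc


def measure(parse, *args):
    """Return (result, seconds, peak_bytes) for one call to parse(*args)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = parse(*args)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def load_whole_record(text):
    # Hypothetical stand-in for a parser that keeps every line in memory.
    return text.splitlines()


record, seconds, peak = measure(load_whole_record, "ACGT\n" * 100000)
print(len(record), "lines,", round(seconds, 3), "s,", peak, "bytes peak")
```

Running the same helper on a small file and on one of your 5mb records
should at least show whether time and memory grow roughly in proportion
to the input or much faster.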

From my own experience using the new Fasta parser (which uses
Martel) -- it works quite well on large chromosome-sized FASTA
sequences on my machine (nothing fancy, just a standard desktop).

Hope this answer helps some, sorry I can't be more specific.
Brad