[BioPython] blast parse

Wed Jan 30 09:15:49 UTC 2008

Hi:
I'm new to the list and to Biopython. I come from Perl and I'm liking Python a 
lot.
I'm trying to read a big BLAST file and it takes a lot of time and memory. I'm 
not sure if I'm taking the most efficient path. Basically I'm doing:

from Bio.Blast import NCBIXML

blasth = open('blast.xml', 'r')
p = NCBIXML.BlastParser()
blast_parse = p.parse(blasth)
for blast_result in blast_parse:
    pass  # do whatever

I was expecting to read the records one by one, but the call to 
p.parse(blasth) takes a lot of time and memory. I'm not sure what this 
function returns: a list or an iterator. I've looked at the NCBIXML.py file 
and it seems to contain two parse functions (am I wrong?), one a method of 
the BlastParser class and one at module level:

    def parse(self, handler):
        """Parses the XML data

        handler -- file handler or StringIO

        This method returns a list of Blast record objects.
        """

def parse(handle, debug=0):
    """Returns an iterator a Blast record for each query.

    handle - file handle to and XML file to parse
    debug - integer, amount of debug information to print

    This is a generator function that returns multiple Blast records
    objects - one for each query sequence given to blast.  The file
    is read incrementally, returning complete records as they are read
    in.

I guess that the first function reads the complete file before returning 
anything, while the second reads and returns the records one by one. I 
don't know if this guess is correct.
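
To make the guess concrete, here is a toy sketch in plain Python (no 
Biopython involved) of the difference I have in mind. The names 
CountingLines, read_all and read_lazily are made up for illustration and are 
not part of Bio.Blast; the real parsers may well behave differently.

```python
class CountingLines:
    """Hypothetical stand-in for a file handle; counts the lines handed out."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.consumed = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)  # raises StopIteration at the end
        self.consumed += 1
        return line


def read_all(handle):
    """Like a parse() that returns a list: consumes the whole input first."""
    return [line.upper() for line in handle]


def read_lazily(handle):
    """Like a generator parse(): produces one record per step."""
    for line in handle:
        yield line.upper()


# The list version has read everything by the time it returns.
eager_handle = CountingLines(["a", "b", "c"])
records = read_all(eager_handle)
eager_consumed = eager_handle.consumed

# The generator version reads nothing until a record is requested.
lazy_handle = CountingLines(["a", "b", "c"])
lazy_records = read_lazily(lazy_handle)
consumed_before = lazy_handle.consumed
first = next(lazy_records)
consumed_after_first = lazy_handle.consumed
```

If the two docstrings above are accurate, the list-returning method would 
correspond to read_all and the generator function to read_lazily.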
Is there another way to read these huge BLAST files without using so much 
memory?
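
One alternative I can think of, independent of Biopython, is to stream the 
XML with the standard library's xml.etree.ElementTree.iterparse. The sketch 
below is only an assumption about how that could look: SAMPLE is a tiny 
made-up stand-in for a real BLAST XML file (real output has many more 
fields), and iter_queries is a hypothetical helper, not a Bio.Blast function.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of a BLAST XML report, for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>hitA</Hit_id></Hit>
        <Hit><Hit_id>hitB</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_queries(handle):
    """Yield (query definition, list of hit ids), one query at a time."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            query = elem.findtext("Iteration_query-def")
            hit_ids = [hit.findtext("Hit_id") for hit in elem.iter("Hit")]
            yield query, hit_ids
            # Empty the finished element so its children can be freed.
            elem.clear()


# list() is used here only to show the result; on a huge file one would
# loop over iter_queries(handle) instead of collecting everything.
results = list(iter_queries(io.BytesIO(SAMPLE)))
```

This gives up the Blast record objects and only pulls out the fields named 
in the code, so it would only help when a few fields per query are needed.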
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)