[BioPython] blast parse

Michiel de Hoon mjldehoon at yahoo.com
Wed Jan 30 09:56:56 UTC 2008


Dear Jose,

To get the records one-by-one, use

from Bio.Blast import NCBIXML
blast_parse = NCBIXML.parse(blasth)
for blast_result in blast_parse:
    pass  # do whatever with blast_result

This avoids having to read the complete XML file all at once.
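
To illustrate what reading incrementally means, here is a minimal sketch using only the standard library. It is not how NCBIXML.parse is implemented; the helper name iter_query_names and the tiny sample document are made up for illustration, and only the Iteration and Iteration_query-def tags are taken from the BLAST XML format.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for BLAST XML output; real files carry many more fields.
SAMPLE_XML = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_2</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def iter_query_names(handle):
    """Yield one query name per <Iteration> element (hypothetical helper)."""
    # iterparse reports each element as soon as its closing tag is seen,
    # so no list of all records is ever built.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration":
            yield elem.findtext("Iteration_query-def")
            # Drop the finished record's children to keep memory use low.
            elem.clear()


records = iter_query_names(io.BytesIO(SAMPLE_XML))
first = next(records)  # only the first record has been handed to the caller
rest = list(records)   # the remaining records are produced on demand
```

The caller holds one record at a time, which is the same pattern as the for loop over NCBIXML.parse(blasth) above.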

To the developers:
We should probably think about removing NCBIXML.BlastParser.parse, and perhaps adding an NCBIXML.read function to read exactly one record from the XML file.
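
As a rough sketch of what "read exactly one record" could mean, assuming it is built on top of an iterator such as the one NCBIXML.parse returns: the name read_single and the error messages below are hypothetical, and this is not an existing Biopython function.

```python
def read_single(records):
    """Return the only item of an iterable; raise ValueError otherwise.

    Hypothetical helper: 'records' can be any iterable of parsed records.
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        # Exactly one record was available, which is the expected case.
        return first
    raise ValueError("More than one record found")
```

It would be used as read_single(NCBIXML.parse(blasth)) for a file that is expected to hold a single query. Failing loudly on zero or several records is a design choice made for this sketch.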

--Michiel.

Jose Blanca <jblanca at btc.upv.es> wrote:

Hi:
I'm new to the list and to Biopython. I come from Perl and I'm liking Python a
lot.
I'm trying to read a big BLAST file and it takes a lot of time and memory. I'm
not sure if I'm taking the most efficient path. Basically I'm doing:

blasth = open('blast.xml', 'r')
from Bio.Blast import NCBIXML
p = NCBIXML.BlastParser()
blast_parse = p.parse(blasth)
for blast_result in blast_parse:
    pass  # do whatever

I was expecting to read the records one by one, but the call to 
p.parse(blasth) takes a lot of time and memory. I'm not sure about what this 
function returns, a list or an iterator. I've looked at the NCBIXML.py file 
and the BlastParser class has two parse methods (am I wrong?).

    def parse(self, handler):
        """Parses the XML data

        handler -- file handler or StringIO

        This method returns a list of Blast record objects.
        """

def parse(handle, debug=0):
    """Returns an iterator a Blast record for each query.

    handle - file handle to and XML file to parse
    debug - integer, amount of debug information to print

    This is a generator function that returns multiple Blast records
    objects - one for each query sequence given to blast.  The file
    is read incrementally, returning complete records as they are read
    in.

I guess that the first function would read the complete file before returning
anything, but the second should read and return the records one by one. I
don't know if this guess is correct.
Is there another way to read these huge BLAST files without using so much
memory?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)