[BioPython] Better blasting with XML

Peter Maxwell maxwell@biolateral.com.au
Wed, 21 Aug 2002 16:35:29 +1000


Hi all,

Biopython's blast parser keeps breaking.  This is normal behaviour for a blast 
parser.  To quote the blast2 release notes:

  "The BLAST report is not intended to be a 
  parseable document. It is subject to change 
  with little or no notice. "

Blast can produce parser-friendly XML or tabular output, so there is no need to 
battle with the traditional blast report format.  I attempted to use 
Bio.Blast.NCBIWWW.blast(..., format_type='XML') to fetch XML-formatted 
output from NCBI, but that doesn't work.  I think the function makes 
assumptions about what will appear on the returned web pages that aren't true 
when format_type isn't 'html'.

There is another function in there, blasturl(), but the 'stable URL' it uses 
is based on the old email blast interface and so predates format_type and 
other recent blast features.

So I wrote something more or less equivalent to NCBIWWW.py using the 'new 
stable URL' http://www.ncbi.nlm.nih.gov/blast/Blast.cgi, documentation for 
which lives at http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html.  This is 
the same interface (also known as QBlast) that Bio.Blast.NCBIWWW.blast() 
uses, but since my version doesn't try to parse web forms it is a bit more 
flexible and reliable.  
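
For readers who have not used the URL API, here is a minimal sketch of the
request cycle it implies: submit a search with CMD=Put, read the request ID
(RID) out of the reply, then fetch results with CMD=Get and FORMAT_TYPE=XML.
This is not the code in NCBI.py.  The helper names (put_url, get_url,
parse_rid) are hypothetical, the default program and database are arbitrary
choices, and the "RID = ..." reply layout is an assumption about the server's
response, so check it against the URL API documentation before relying on it.
The sketch is written for modern Python 3.

```python
import re
from urllib.parse import urlencode

# Base URL as given in the message above.
BLAST_URL = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def put_url(query, program="blastn", database="nr"):
    """Build the URL that submits a new search (CMD=Put)."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    return BLAST_URL + "?" + urlencode(params)


def get_url(rid, format_type="XML"):
    """Build the URL that retrieves results for a request ID (CMD=Get)."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


def parse_rid(put_response):
    """Extract the request ID from the text returned by a Put request.

    Assumes the reply contains a line of the form 'RID = <id>'.
    Returns None if no such line is found.
    """
    match = re.search(r"^\s*RID = (\S+)", put_response, re.MULTILINE)
    return match.group(1) if match else None
```

A real client would open put_url(...) with urllib.request.urlopen, pass the
reply to parse_rid, and then poll get_url(rid) until the search has finished.
No web form needs to be parsed at any point, which is the source of the
flexibility mentioned above.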

I also wrote the XML blast output parser I needed.  It doesn't make an object 
with the same interface as the current biopython blast parser because that 
turned out to be too hard, the interface being very much influenced by the 
details of the traditional blast report.  The XML schema is simpler: it is 
directly based on the ASN.1 schema, which in turn is very close to the C data 
structures in the blast code itself.
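
To illustrate how directly the XML maps onto simple data, here is a minimal
sketch that walks the Hit and Hsp elements of a blast XML report using the
standard library.  It is not the parser in NCBIXML.py.  The function name
iter_hsps and the dictionary keys are hypothetical, and only a handful of
fields from the schema are read.

```python
import xml.etree.ElementTree as ET


def iter_hsps(xml_text):
    """Yield one dict per HSP found in a blast XML report.

    Reads only Hit_id, Hit_def, Hsp_evalue and Hsp_bit-score; every
    other element in the report is ignored.
    """
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        hit_def = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            yield {
                "hit_id": hit_id,
                "hit_def": hit_def,
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
            }
```

Each HSP comes out as a flat record taken straight from named elements, with
none of the line-by-line pattern matching the traditional report requires.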

Available at:
  http://www.biolateral.com.au/download/NCBI.py
  http://www.biolateral.com.au/download/NCBIXML.py

The code is GPL'ed for general distribution, but I (and BioLateral) would be 
happy to see any of it find its way into biopython, so it is also available to 
the biopython project for integration under biopython's own licence.

Cheers,
 -- Peter Maxwell