[Biopython-dev] Proposed addition to Standalone BLAST
Brad Chapman
chapmanb at arches.uga.edu
Tue Nov 7 04:45:07 EST 2000
Jeff:
> Sure. Having some code that would help to diagnose errors in BLAST
> reports would be a very nice feature. Certainly more user friendly than
> having SyntaxError this or SyntaxError that.
>
> We would have to build this on top of the current exceptions, though.
> It's still nice to have the SyntaxErrors under the hood, as an explanation
> on why the parser is complaining in the first place.
Okay, I went ahead and tried to implement something to do what we are
talking about. The code is attached as a diff to the current
NCBIStandalone module. Basically, what I did was implement a class
BlastErrorParser that uses the regular BlastParser, but catches
SyntaxErrors and tries to figure out the problems with them. It will
also optionally save any BLAST reports that cause syntax errors to a
file (which I think is a useful feature if you want to look at the
records that are causing the errors in a big ol' file of BLAST
results).
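
To make the idea concrete without needing the attached patch applied, here is a minimal self-contained sketch of the same wrap-and-diagnose pattern. It is written for current Python rather than the Python the patch targets, and several pieces are assumptions for illustration only: toy_parse is a hypothetical stand-in for the real BlastParser, ErrorDiagnosingParser is a made-up name, and the sketch buffers the report text instead of using copy.deepcopy() on the handle.

```python
import io


class LowQualityBlastError(Exception):
    """BLAST could not run the search because the query was low quality."""


def toy_parse(handle):
    """Hypothetical stand-in for BlastParser.parse (not the real parser).

    Raises SyntaxError unless the report has a 'Searching...done' line.
    """
    report = handle.read()
    if not any(line.startswith("Searching.") for line in report.splitlines()):
        raise SyntaxError("unexpected line in BLAST report")
    return report


class ErrorDiagnosingParser:
    """Wrap a parse function; save and diagnose reports that fail to parse."""

    def __init__(self, parse_func, bad_report_file=None):
        self._parse_func = parse_func
        self._bad_report_file = bad_report_file
        if bad_report_file:
            # Create the file, or empty it if it is left over from an earlier run.
            open(bad_report_file, "w").close()

    def parse(self, handle):
        # Assumption: buffer the report text instead of deep-copying the handle.
        report = handle.read()
        try:
            return self._parse_func(io.StringIO(report))
        except SyntaxError:
            # Save the offending report first, if a file was requested.
            if self._bad_report_file:
                with open(self._bad_report_file, "a") as bad_reports:
                    bad_reports.write(report)
            # 'Searchingdone' seems to indicate a low quality query.
            for line in report.splitlines():
                if line.startswith("Searchingdone"):
                    raise LowQualityBlastError("Blast failure occurred on query")
            raise  # undiagnosed: pass along the original SyntaxError
```

Because the more specific exception is raised inside the except block, current Python keeps the original SyntaxError attached to it as __context__, so the underlying parser complaint is still available under the hood.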
I use copy.deepcopy() to copy the handle, and since I was curious
about how this would affect the parsing time, I did a little timing
test. This wasn't anything scientific, just a big BLAST report with
errors in it that I had to parse. The results are:
using BlastErrorParser -> 1 hour and 31 minutes
Starting parsing at: Mon Nov 6 22:38:32 2000
Stopped parsing at: Tue Nov 7 00:09:04 2000
using BlastParser -> 1 hour and 30 minutes
Starting parsing at: Tue Nov 7 00:37:56 2000
Stopped parsing at: Tue Nov 7 02:07:57 2000
So I guess the overhead is minimal, and this makes me happy -- if
anyone else knows more about timings and wants to do tests, I would be
happy to hear about them.
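
For anyone who wants to repeat this kind of comparison, a rough sketch of a timing harness follows. It is not the script used for the numbers above: timed_parse and count_lines are hypothetical names, count_lines only stands in for a real parse method, and the sketch assumes current Python with the standard time module.

```python
import io
import time


def timed_parse(label, parse_func, handle):
    """Run parse_func(handle), printing start/stop stamps and elapsed time."""
    print("Starting parsing at:", time.ctime())
    start = time.perf_counter()
    result = parse_func(handle)
    elapsed = time.perf_counter() - start
    print("Stopped parsing at:", time.ctime())
    print("using %s -> %.3f seconds" % (label, elapsed))
    return result, elapsed


def count_lines(handle):
    """Hypothetical stand-in for a parser's parse method."""
    return sum(1 for _ in handle)


# Illustrative input only; a real test would open a large BLAST report.
report = "Searching......done\n" * 1000
result, elapsed = timed_parse("count_lines", count_lines, io.StringIO(report))
```

Calling timed_parse once per parser on the same input gives directly comparable elapsed times, and repeating each run a few times helps separate parser overhead from noise on the machine.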
Anyways, this does everything I was originally writing about wanting
to happen, and I like it, but I'd like to hear people's opinions and
comments on it. If people are for including it, then I can check it in
and also add a test that uses it to the regression tests.
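
As a rough idea of what such a test could check, here is a small sketch using the standard unittest module. It does not reflect how the existing regression tests are organized: diagnose_error is a simplified, hypothetical stand-in for the _diagnose_error method in the patch, and the report text is made up for illustration.

```python
import io
import unittest


class LowQualityBlastError(Exception):
    """Stand-in for the exception class proposed in the patch."""


def diagnose_error(handle, query):
    """Raise LowQualityBlastError if the report has a 'Searchingdone' line."""
    for line in handle:
        if line.startswith("Searchingdone"):
            raise LowQualityBlastError("Blast failure occurred on query: ", query)


class DiagnoseErrorTest(unittest.TestCase):
    def test_low_quality_report_is_diagnosed(self):
        handle = io.StringIO("BLASTN\nSearchingdone\n")
        with self.assertRaises(LowQualityBlastError):
            diagnose_error(handle, "example query")

    def test_normal_report_is_left_alone(self):
        handle = io.StringIO("BLASTN\nSearching......done\n")
        self.assertIsNone(diagnose_error(handle, "example query"))
```

Saved to a file, this can be run with python -m unittest and the file name.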
Thanks for all the input on this so far!
Brad
-------------- next part --------------
*** NCBIStandalone.py.orig Thu Oct 12 13:32:21 2000
--- NCBIStandalone.py Mon Nov 6 22:28:16 2000
***************
*** 36,41 ****
--- 36,42 ----
import re
import popen2
from types import *
+ import copy
from Bio import File
from Bio.ParserSupport import *
***************
*** 471,476 ****
--- 472,563 ----
consumer.end_parameters()
+ class LowQualityBlastError(Exception):
+     """Error caused by running a low quality sequence through BLAST.
+
+     When low quality sequences (like GenBank entries containing only
+     stretches of a single nucleotide) are BLASTed, they will result in
+     BLAST generating an error and not being able to perform the BLAST
+     search. This error should be raised for the BLAST reports produced
+     in this case.
+     """
+     pass
+
+ class BlastErrorParser:
+     """Attempt to catch and diagnose BLAST errors while parsing.
+
+     This utilizes the BlastParser class but adds an additional layer
+     of complexity on top of it by attempting to diagnose SyntaxErrors
+     that may actually indicate problems during BLAST parsing.
+
+     Current BLAST problems this detects are:
+     o LowQualityBlastError - When BLASTing really low quality sequences
+     (i.e. some GenBank entries which are just short stretches of a single
+     nucleotide), BLAST will report an error with the sequence and be
+     unable to search with this. This will lead to a badly formatted
+     BLAST report that the parsers choke on. The parser will convert the
+     SyntaxError to a LowQualityBlastError and attempt to provide useful
+     information.
+     """
+     def __init__(self, bad_report_file = None):
+         """Initialize a parser that tries to catch BlastErrors.
+
+         Arguments:
+         o bad_report_file - An optional argument specifying a file to
+         write any reports that raise errors to. If not specified, these
+         reports will not be saved.
+         """
+         self._bad_report_file = bad_report_file
+         # if the report file exists, we want to clear the info in it
+         if self._bad_report_file and os.path.exists(self._bad_report_file):
+             tmp = open(self._bad_report_file, 'w')
+             tmp.close()
+
+         self._b_parser = BlastParser()
+
+     def parse(self, handle):
+         """Parse a handle, attempting to diagnose errors.
+         """
+         # copy the handle so we have it if we find an error
+         copy_handle = copy.deepcopy(handle)
+
+         try:
+             return self._b_parser.parse(handle)
+         except SyntaxError, msg:
+             # if we have a bad_report_file, save the info to it first
+             if self._bad_report_file:
+                 # copy the handle so we can write it
+                 error_handle = copy.deepcopy(copy_handle)
+                 # append the info to the file
+                 error_file = open(self._bad_report_file, 'a')
+                 error_file.write(error_handle.read())
+                 error_file.close()
+
+             # now we want to try and diagnose the error
+             self._diagnose_error(copy_handle, self._b_parser._consumer.data)
+
+             # if we got here we can't figure out the problem
+             # so we should pass along the syntax error we got
+             raise SyntaxError, msg
+
+     def _diagnose_error(self, handle, data_record):
+         """Attempt to diagnose an error in the passed handle.
+
+         Arguments:
+         o handle - The handle potentially containing the error
+         o data_record - The data record partially created by the consumer.
+         """
+         line = handle.readline()
+
+         while line:
+             # 'Searchingdone' instead of 'Searching......done' seems
+             # to indicate a failure to perform the BLAST due to
+             # low quality sequence
+             if line[:13] == 'Searchingdone':
+                 raise LowQualityBlastError("Blast failure occurred on query: ",
+                                            data_record.query)
+             line = handle.readline()
+
class BlastParser:
"""Parses BLAST data into a Record.Blast object.