[BioPython] Blast parser error -- problem averted
Sagar Damle
sagar@caltech.edu
Thu, 7 Feb 2002 18:05:49 -0800
Problems solved. Here are the updates I made to the NCBIWWW parser to restore functionality:
NCBIWWW.py
1------------------------------------------------
~ line 167
    def _scan_database_info(self, uhandle, consumer):
        attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
        read_and_call(uhandle, consumer.database_info, contains='Database')
        # read_and_call(uhandle, consumer.database_info, contains='sequences')
        # below is a horrible hack. must deal with 2-line database names!!!
-->     read_and_call(uhandle, consumer.noevent, blank=0)
        read_and_call(uhandle, consumer.database_info, contains='sequences')
        read_and_call(uhandle, consumer.noevent, blank=1)
        read_and_call(uhandle, consumer.noevent,
This isn't a great fix, because it doesn't handle database descriptions that span more than
two lines.
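One way to cope with names of any length would be to keep collecting lines until the sequence-count line shows up. Below is a minimal standalone sketch of that idea, not a patch to NCBIWWW.py: the function name split_database_info and the COUNT_LINE pattern are my own, and the pattern assumes the count line looks like "82,258 sequences; 29,652,561 total letters".

```python
import re

# Assumed shape of the sequence-count line, e.g.
# "     82,258 sequences; 29,652,561 total letters"
COUNT_LINE = re.compile(r'^\s*[\d,]+\s+sequences;')


def split_database_info(lines):
    """Return (database_name, count_line) from the lines of a database block.

    The first line must contain 'Database'. Every following non-blank line
    is treated as part of the name until the sequence-count line is reached.
    """
    if not lines or 'Database' not in lines[0]:
        raise ValueError("expected a line containing 'Database'")
    # Strip any HTML tags, then drop the 'Database:' label itself.
    first = re.sub(r'<[^>]*>', '', lines[0])
    name_parts = [first.split('Database:', 1)[-1].strip()]
    for line in lines[1:]:
        if COUNT_LINE.match(line):
            name = ' '.join(part for part in name_parts if part)
            return name, line.strip()
        if line.strip():
            name_parts.append(line.strip())
    raise ValueError("no sequence-count line found")
```

Matching the count line by its leading number, rather than by the word 'sequences' alone, avoids stopping early on a name such as "Non-redundant SwissProt sequences".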
2------------------------------------------------
~ line 433
    def _scan_database_report(self, uhandle, consumer):
        # <PRE>
        # Database: Non-redundant SwissProt sequences
        # Posted date: Dec 18, 1999 8:26 PM
        # Number of letters in database: 29,652,561
        # Number of sequences in database: 82,258
        #
        # Lambda K H
        # 0.317 0.133 0.395
        #
        # Gapped
        # Lambda K H
        # 0.270 0.0470 0.230
        #
        consumer.start_database_report()
        read_and_call(uhandle, consumer.noevent, start='<PRE>')
        read_and_call(uhandle, consumer.database, start=' Database')
        # looks like database gets two lines sometimes - this line is going to
        # break the code. must deal with 2-line database names!!!
-->     attempt_read_and_call(uhandle, consumer.noevent, blank=0)
        read_and_call(uhandle, consumer.posted_date, start=' Posted')
        read_and_call(uhandle, consumer.num_letters_in_database,
                      start=' Number of letters')
        read_and_call(uhandle, consumer.num_sequences_in_database,
                      start=' Number of sequences')
Again, same story as in 1 (above).
3--------------------------------------------------
    def _scan_descriptions(self, uhandle, consumer):
        consumer.start_descriptions()
        # Three things can happen here:
        # 1. line contains 'Score E'
        # 2. line contains "No significant similarity"
        # 3. no descriptions
        if not attempt_read_and_call(
-->         uhandle, consumer.description_header, contains='Score    E'):
            # Either case 2 or 3. Look for "No hits found".
            attempt_read_and_call(uhandle, consumer.no_hits,
                                  contains='No significant similarity')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            consumer.end_descriptions()
            # Stop processing.
            return
There were originally 5 spaces between "Score" and "E". The new BLAST output seems to want
only 4. I don't think hard-coding any particular number of spaces is robust. How about a regexp
between "Score" and "E" (like \s+ in Perl) to allow any amount of whitespace between them?
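For what it's worth, the \s+ idea looks like this with Python's re module. This is only a standalone sketch: the helper name is_description_header is made up, and how such a test would be wired into read_and_call is not shown here.

```python
import re

# One or more whitespace characters between "Score" and "E",
# so 4 spaces, 5 spaces or a tab are all accepted.
DESCRIPTION_HEADER = re.compile(r'Score\s+E')


def is_description_header(line):
    """Return True if the line looks like the 'Score ... E' header line."""
    return DESCRIPTION_HEADER.search(line) is not None
```

With this, the check no longer depends on how many spaces a particular BLAST release puts between the two column titles.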
4--------------------------------------------------
    def _scan_one_pairwise_alignment(self, uhandle, consumer):
        # Alignment format:
        # <CENTER><b><FONT color="green">Alignments</FONT></b></CENTER>
        # (BLAST 2.0.14)
        # <PRE>
        # alignment_header
        # hsp_header
        # hsp_alignment
        # [...]
        # The hsp_header and hsp_alignment blocks can be repeated.
        consumer.start_alignment()
        attempt_read_and_call(uhandle, consumer.noevent, contains='Alignments')
        read_and_call(uhandle, consumer.noevent, start='<PRE>')
        self._scan_alignment_header(uhandle, consumer)
        # Scan a bunch of score/alignment's.
        while 1:
            # An HSP header starts with ' Score'.
            # However, if the HSP header is not the first one in the
            # alignment, there will be a '<PRE>' line first. Therefore,
            # I will need to check either of the first two lines to
            # see if I'm at an HSP header.
            line1 = safe_readline(uhandle)
            line2 = safe_readline(uhandle)
-->         line3 = safe_readline(uhandle)
            uhandle.saveline(line3)
            uhandle.saveline(line2)
-->         uhandle.saveline(line1)
-->         if line1[:6] != ' Score' and line2[:6] != ' Score' and line3[:6] != ' Score':
                break
            self._scan_hsp(uhandle, consumer)
        consumer.end_alignment()
The first two lines marked above had to be added; the third is just a modification. As it turns
out, there is an extra blank line (line2 in the code above), so line3 is actually the one that
should contain " Score" in the event of multiple alignments for a single hsp_header. Again,
I doubt that my solution is the best one, because it doesn't generalize.
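A more general version of the lookahead might skip any run of blank or '<PRE>' lines instead of checking a fixed number of lines. The sketch below is standalone and only illustrates the idea: PushbackReader is a hypothetical stand-in for a handle with readline() and saveline(), and next_is_hsp_header and max_lookahead are names I made up.

```python
class PushbackReader:
    """Hypothetical stand-in for a handle that supports saveline()."""

    def __init__(self, lines):
        self._lines = list(lines)
        self._saved = []  # stack of pushed-back lines

    def readline(self):
        if self._saved:
            return self._saved.pop()
        if self._lines:
            return self._lines.pop(0)
        return ''  # end of input

    def saveline(self, line):
        if line:
            self._saved.append(line)


def next_is_hsp_header(reader, max_lookahead=5):
    """Peek past blank and '<PRE>' lines; report whether ' Score' comes next.

    Every line read is pushed back, so the reader is left unchanged.
    """
    seen = []
    found = False
    for _ in range(max_lookahead):
        line = reader.readline()
        if not line:
            break  # end of input
        seen.append(line)
        stripped = line.strip()
        if stripped == '' or stripped.upper() == '<PRE>':
            continue  # filler line, keep looking
        found = line.startswith(' Score')
        break
    # Push back in reverse so the lines come out in their original order.
    for line in reversed(seen):
        reader.saveline(line)
    return found
```

The while loop in _scan_one_pairwise_alignment could then break whenever such a check returns false, whether zero, one or two filler lines precede the next HSP header.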
But... now I can happily parse current NCBIWWW BLAST results!!!
-sagar