[BioPython] Blast parser error -- problem averted

Sagar Damle sagar@caltech.edu
Thu, 7 Feb 2002 18:05:49 -0800


Problems solved.  Here are the updates I made to the NCBIWWW parser to restore functionality:

NCBIWWW.py


1------------------------------------------------
~ line 167
    def _scan_database_info(self, uhandle, consumer):
        attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
        read_and_call(uhandle, consumer.database_info, contains='Database')
        # read_and_call(uhandle, consumer.database_info, contains='sequences')
        # below is a horrible hack.  must deal with 2-line database names!!!
-->     read_and_call(uhandle, consumer.noevent, blank=0)
        read_and_call(uhandle, consumer.database_info, contains='sequences')
        read_and_call(uhandle, consumer.noevent, blank=1)
        read_and_call(uhandle, consumer.noevent,


This isn't a great fix, because it skips exactly one extra line and so doesn't deal with
  Database descriptions that run over more than two lines.

2------------------------------------------------
~ line 433
    def _scan_database_report(self, uhandle, consumer):
        # <PRE>
        #   Database: Non-redundant SwissProt sequences
        #     Posted date:  Dec 18, 1999  8:26 PM
        #   Number of letters in database: 29,652,561
        #   Number of sequences in database:  82,258
        #   
        # Lambda     K      H
        #    0.317    0.133    0.395 
        # 
        # Gapped
        # Lambda     K      H
        #    0.270   0.0470    0.230 
        # 

        consumer.start_database_report()

        read_and_call(uhandle, consumer.noevent, start='<PRE>')
        read_and_call(uhandle, consumer.database, start='  Database')
        # looks like database gets two lines sometimes - this line is going to
        # break the code.  must deal with 2-line database names!!!
-->     attempt_read_and_call(uhandle, consumer.noevent, blank=0)
        read_and_call(uhandle, consumer.posted_date, start='    Posted')
        read_and_call(uhandle, consumer.num_letters_in_database,
                      start='  Number of letters')
        read_and_call(uhandle, consumer.num_sequences_in_database,
                      start='  Number of sequences')

Again, same story as 1) above.
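
A rough sketch of one way to handle database names of any length, for the report layout shown
in the comment of 2): keep collecting lines after '  Database' until the '    Posted' line turns
up.  This is a standalone illustration on a plain list of lines, not a patch to NCBIWWW.py.  The
function name read_database_name is made up, and the wrapped sample name is just the SwissProt
example from above split by hand.

```python
def read_database_name(lines):
    """Return (database_name, index_of_posted_line).

    lines[0] must be the '  Database: ...' line.  Every non-blank line
    between it and the '    Posted' line is treated as a continuation
    of the database name.
    """
    first = lines[0]
    if not first.startswith('  Database'):
        raise ValueError("expected a line starting with '  Database'")
    # Text after the first colon is the start of the name.
    parts = [first.partition(':')[2].strip()]
    for index in range(1, len(lines)):
        line = lines[index]
        if line.startswith('    Posted'):
            return ' '.join(parts), index
        if line.strip():
            parts.append(line.strip())
    raise ValueError("no '    Posted' line found after the database name")


# Hypothetical input: the sample name wrapped onto a second line.
sample = [
    '  Database: Non-redundant SwissProt',
    '  sequences',
    '    Posted date:  Dec 18, 1999  8:26 PM',
]
name, posted_index = read_database_name(sample)
```

In the real scanner the same idea would have to be expressed with the uhandle/consumer calls
instead of a list; that part is not worked out here.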

3--------------------------------------------------

    def _scan_descriptions(self, uhandle, consumer):
        consumer.start_descriptions()

        # Three things can happen here:
        # 1.  line contains 'Score     E'
        # 2.  line contains "No significant similarity"
        # 3.  no descriptions
        if not attempt_read_and_call(
-->         uhandle, consumer.description_header, contains='Score    E'):
            # Either case 2 or 3.  Look for "No hits found".
            attempt_read_and_call(uhandle, consumer.no_hits,
                                  contains='No significant similarity')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            consumer.end_descriptions()
            # Stop processing.
            return

There were originally 5 spaces between "Score" and "E".  The new BLAST output seems to use
  only 4.  I think hard-coding any particular number of spaces is not robust.  How about a regexp
  between "Score" and "E" (like \s+ in Perl) to allow any amount of whitespace between the two?
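
A minimal sketch of that regexp idea in Python, using the standard re module.  The names
DESCRIPTION_HEADER and is_description_header are made up for illustration, and how best to hook
such a test into attempt_read_and_call is left open.

```python
import re

# "Score", then one or more whitespace characters, then "E" as a whole word.
DESCRIPTION_HEADER = re.compile(r'Score\s+E\b')


def is_description_header(line):
    """True if the line looks like the 'Score ... E' description header."""
    return DESCRIPTION_HEADER.search(line) is not None


# Both the old (5 spaces) and the new (4 spaces) spacing are accepted.
old_style = is_description_header('        Score     E')
new_style = is_description_header('        Score    E')
```

The \b keeps the pattern from matching a longer word that merely starts with "E".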

4--------------------------------------------------

    def _scan_one_pairwise_alignment(self, uhandle, consumer):
        # Alignment format:
        # <CENTER><b><FONT color="green">Alignments</FONT></b></CENTER>
        #       (BLAST 2.0.14)
        # <PRE>
        # alignment_header
        #   hsp_header
        #   hsp_alignment
        #   [...]
        # The hsp_header and hsp_alignment blocks can be repeated.

        consumer.start_alignment()
        attempt_read_and_call(uhandle, consumer.noevent, contains='Alignments')
        read_and_call(uhandle, consumer.noevent, start='<PRE>')
        self._scan_alignment_header(uhandle, consumer)

        # Scan a bunch of score/alignment's.
        while 1:
            # An HSP header starts with ' Score'.
            # However, if the HSP header is not the first one in the
            # alignment, there will be a '<PRE>' line first.  Therefore,
            # I will need to check either of the first two lines to
            # see if I'm at an HSP header.
            line1 = safe_readline(uhandle)
            line2 = safe_readline(uhandle)
-->         line3 = safe_readline(uhandle)
            uhandle.saveline(line3)
            uhandle.saveline(line2)
-->         uhandle.saveline(line1)
-->         if line1[:6] != ' Score' and line2[:6] != ' Score' and line3[:6] != ' Score':
                break
            self._scan_hsp(uhandle, consumer)
                
        consumer.end_alignment()


The first two lines marked with arrows had to be added; the third is just a modification.  As it
  turns out, there is an extra blank line (line2 in the code above), so line3 is the one that
  actually contains " Score" when a single alignment header is followed by more than one HSP.
  Again, I doubt that my solution is the best one, because it doesn't generalize.
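
One possible way to generalize the lookahead, sketched on a plain list of lines rather than on
the real uhandle: skip any number of blank or '<PRE>' lines and see whether the next real line
starts with ' Score'.  The function name hsp_header_ahead and the sample lines are hypothetical,
and an actual fix would still need to push the peeked lines back with saveline as in the code
above.

```python
def hsp_header_ahead(lines, pos=0):
    """True if the next meaningful line at or after pos is an HSP header.

    Blank lines and lines starting with '<PRE>' are skipped, however
    many there are.  Any other line that does not start with ' Score'
    means no further HSP follows.
    """
    for line in lines[pos:]:
        if line.startswith(' Score'):
            return True
        if line.strip() and not line.startswith('<PRE>'):
            return False
    return False


# One '<PRE>' line and one blank line before the header, as described above.
found = hsp_header_ahead(['<PRE>', '', ' Score = ...'])
```

This no longer depends on whether the header sits on the first, second, or third line.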


 but... now I can happily parse current NCBIWWW BLAST results!!!


-sagar