[BioPython] blast parser ideas

Jeffrey Chang jchang@SMI.Stanford.EDU
Tue, 9 Nov 1999 19:14:12 -0800 (PST)


Hi everybody,

I have been giving some thought to writing BLAST parsers.  Some of the
problems that I think a BLAST parser should address are:
1. Multiple flavors of BLAST (blastp, blastn, tblastn, etc.)
2. Multiple versions of BLAST (blast1, blast2, psi-blast, phi-blast,
wu-blast)
3. Frequent format changes
4. BLAST reports can be large.
5. Much of the output may not relevant.

I'm not happy with the default design where a parser takes some BLAST
information and fills out some data structure.  In my experience,
problems 2 and 3 make this design hard to manage, and you end up with
code that's no longer backwards compatible or lots of bits of parser
code.

Instead, I have been thinking of using an event-oriented parser.  This
style of parser has been discussed in bionet.software and is used for
the *ML parsers in the standard Python distribution.  I believe Andrew
Dalke has played with this in various projects.

The way this works, is that the client feeds data into a Parser
object.  The parser recognizes information in the data stream and
calls a function in a Consumer to handle the information.  The
Consumer is supplied by the client, and can do application-specific
things with the data.  Typically, it would capture the information in
a data structure suitable for the application.


For example:

class Parser:
    def parse(self, handle, consumer):
        # Read data from handle.
        # Call functions in the consumer, as appropriate.


class Consumer:
    #def start_XXX(self):
    #    # Called at the beginning of a section named XXX.

    #def end_XXX(self):
    #    # Called at the end of a section named XXX.

    #def XXX(self, data):
    #    # Called when data of type XXX is encountered.

    def unhandled_section(self):
        pass

    def unhandled(self, data):
        pass

    def __getattr__(self, attr):
        method = getattr(self, attr)
        if method is None:
            if attr[:6] == 'start_' or attr[:4] == 'end_':
                method = unhandled_section
            else:
                method = unhandled
        return method



Here's a sample consumer that would just print out alignment
information from some BLAST output:

class MyConsumer(Consumer):
    def start_alignment(self):
        print "Alignments start"

    def end_alignment(self):
        print "Alignments end"

    def query(self, data):
        print "Query line:", data,

    def align(self, data):
        print "Align line:", data,
 
    def sbjct(self, data):
        print "Sbjct line:", data,


A sample session might be:

>>> parser = Parser()
>>> consumer = MyConsumer()
>>> parser.parse(open('blast_data.txt'), consumer)
Alignments start
Query line: Query: 8   DRLNTTFRQMEQELAIFAAHLE 29
Align line:            +RL   +R++E+ L+   AH+E
Sbjct line: Sbjct: 135 ERLLWLYREVERPLSAVLAHME 156
Alignments end
>>> 


This event-oriented design decouples the parsing from the handler, so
you can use the same consumer for multiple versions and flavors of
BLAST.  Plus, you can ignore data that you're not interested in by not
implementing handler methods in your consumer.


By doing things this way, the Parser and Consumer need to agree on an
interface through which to pass data.  Thus, I went through the latest
BLAST code and named the lines in the output.  These will be the names
of the methods in the Consumer class.

I've attached the list of names as well as some sample BLAST output.
Please let me know what you think about the parser ideas as well as
the proposed names.

Thanks,
Jeff





SECTION
    DATATYPE                      WHEN AVAILABLE


header
    version
    reference
    query
    database_information

descriptions
    description

score
    title
    length
    score
    identities
    frame                         frame
    strand                        strand

alignment
    query
    align
    sbjct

database_report
    database                      not subset
    posted_date                   not subset
    num_letters_in_database       not subset
    num_sequences_in_database     not subset
    num_letters_searched          subset
    num_sequences_searched        subset


parameters
    matrix
    gap_penalties                 gapped mode
    second_pass_hits              not two pass method
    second_pass_sequences         not two pass method
    second_pass_extends           not two pass method
    second_pass_good_extends      not two pass method
    num_hits                      two pass method
    num_sequences                 two pass method
    num_extends                   two pass method
    num_good_extends              two pass method
    num_seqs_better_e             gapped and not blastn
    hsps_no_gap                   gapped and not blastn
    hsps_prelim_gapped            gapped and not blastn
    hsps_prelim_gap_attempted     gapped and not blastn
    hsps_gapped                   gapped and not blastn
    query_length
    database_length
    effective_hsp_length
    effective_query_length
    effective_database_length
    effective_search_space
    effective_search_space_used
    frameshift_decay              blastx or tblastn or tblastx
    threshold second
    window_size
    dropoff_1st_pass
    gap_x_dropoff
    gap_x_dropoff_final           not blastn and gapped calculation (?)
    gap_trigger
    blast_cutoff
    

    
    



BLASTP 2.0.10 [Aug-26-1999]




Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= test
         (140 letters)

Database: sdqib40-1.35.seg.fa
           1323 sequences; 223,339 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus stearothermo...    23  2.5
d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reducta...    23  2.5
d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo sap...    22  5.6
d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase...    21  9.7
d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain [Aero...    21  9.7

>d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus
          stearothermophilus]
          Length = 81
          
 Score = 22.7 bits (47), Expect = 2.5
 Identities = 10/36 (27%), Positives = 18/36 (49%)

Query: 16 QMEQELAIFAAHLEQHKLLVARVFSLPEVKKEDEHN 51
          +M++ + +     ++H L   RV    + K  DEHN
Sbjct: 13 KMDKTITVLVETYKKHPLYGKRVKYSKKYKAHDEHN 48


>d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reductase,
           N-terminal domain [Escherichia coli]
           Length = 212
           
 Score = 22.7 bits (47), Expect = 2.5
 Identities = 11/36 (30%), Positives = 20/36 (55%)

Query: 104 NLSQAALVSHIQHINKLKTTFEHIVTVESELPTAAR 139
           ++SQ  L SHIQ  + +KT+  H   +++     +R
Sbjct: 28  SISQVELRSHIQFYDGIKTSDIHETIIKAAADLISR 63


>d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo
           sapiens)]
           Length = 183
           
 Score = 21.6 bits (44), Expect = 5.6
 Identities = 10/26 (38%), Positives = 16/26 (61%)

Query: 109 ALVSHIQHINKLKTTFEHIVTVESEL 134
           AL+ H++H+  L  TF  I  V +E+
Sbjct: 67  ALLKHVKHMLLLTNTFGAINYVATEV 92


>d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase KF
           [Thermus aquaticus]
           Length = 161
           
 Score = 20.8 bits (42), Expect = 9.7
 Identities = 8/22 (36%), Positives = 15/22 (67%)

Query: 8   DRLNTTFRQMEQELAIFAAHLE 29
           +RL   +R++E+ L+   AH+E
Sbjct: 135 ERLLWLYREVERPLSAVLAHME 156


>d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain
          [Aeromonas hydrophila]
          Length = 83
          
 Score = 20.8 bits (42), Expect = 9.7
 Identities = 10/28 (35%), Positives = 16/28 (56%)

Query: 37 RVFSLPEVKKEDEHNPLNRIEVKQHLGN 64
          R+FSL +    D++ P+NR E +    N
Sbjct: 9  RLFSLGQGVCGDKYRPVNREEAQSVKSN 36


  Database: sdqib40-1.35.seg.fa
    Posted date:  Nov 1, 1999  4:25 PM
  Number of letters in database: 223,339
  Number of sequences in database:  1323
  
Lambda     K      H
   0.322    0.133    0.369 

Lambda     K      H
   0.270   0.0470    0.230 


                                                    # SECTION: PARAMETERS
Matrix: BLOSUM62                                    # MATRIX
Gap Penalties: Existence: 11, Extension: 1          # GAP_PENALTIES (only if gapped)
Number of Hits to DB: 50604                         # SECOND_PASS_HITS
Number of Sequences: 1323
Number of extensions: 1526
Number of successful extensions: 6
Number of sequences better than 10.0: 5
Number of HSP's better than 10.0 without gapping: 5
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 1
Number of HSP's gapped (non-prelim): 5
length of query: 140
length of database: 223,339
effective HSP length: 39
effective length of query: 101
effective length of database: 171,742
effective search space: 17345942
effective search space used: 17345942
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.9 bits)
S2: 42 (20.8 bits)