[BioPython] blast parser ideas

Jeffrey Chang jchang@SMI.Stanford.EDU
Wed, 10 Nov 1999 17:11:00 -0800 (PST)


[Jeff talks about an event-oriented model for blast parsers]

[Arne]
> Very good idea I think but I'm worried about point 2 of the above list.
> If I understood the principle of the system (parser feeds consumer) the
> input stream is parsed by the parser object which recognizes information
> (like 'Sbject' ...) and then calls
> 
> consumer.start_Sbject()
> consumer.Sbject()
> consumer.end_Sbject()

Oops, I should have explained myself more clearly.  The start_ and end_
methods are called at section boundaries and provide the Consumer with
contextual clues.  For example, the parser may call:

consumer.start_alignment()
consumer.query(data)
consumer.align(data)
consumer.sbjct(data)
consumer.end_alignment()
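
As a rough sketch of the receiving side, a Consumer for these calls might
look something like the following.  The class name AlignmentConsumer and
the dict layout are made up for illustration, and the sample strings are
invented; only the five method names come from the calls above.

```python
class AlignmentConsumer:
    """Hypothetical consumer that groups lines by alignment section."""

    def __init__(self):
        self.alignments = []   # one dict per completed alignment
        self._current = None   # alignment being built, None outside a section

    def start_alignment(self):
        # Contextual clue: everything until end_alignment belongs together.
        self._current = {"query": [], "align": [], "sbjct": []}

    def query(self, data):
        self._current["query"].append(data)

    def align(self, data):
        self._current["align"].append(data)

    def sbjct(self, data):
        self._current["sbjct"].append(data)

    def end_alignment(self):
        self.alignments.append(self._current)
        self._current = None


# Simulate the calls a parser would make (sample data is invented).
consumer = AlignmentConsumer()
consumer.start_alignment()
consumer.query("Query: 1 MKV 3")
consumer.align("         MK+")
consumer.sbjct("Sbjct: 5 MKI 7")
consumer.end_alignment()
```

The start_/end_ pair is what lets the Consumer know which query, align
and sbjct lines belong to the same alignment.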


For PSI-BLAST, this can help the Consumer figure out where each round
begins and ends. 

start_round
    score and alignment info
end_round
start_round
    score and alignment info
end_round
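
A minimal sketch of a Consumer that uses these events to keep rounds
apart.  RoundConsumer is a hypothetical name, and the description()
method is only an assumed stand-in for whatever "score and alignment
info" events the parser ends up sending.

```python
class RoundConsumer:
    """Hypothetical consumer that files incoming lines under their round."""

    def __init__(self):
        self.rounds = []       # one list of lines per PSI-BLAST round

    def start_round(self):
        # A new round begins: open a fresh bucket for its lines.
        self.rounds.append([])

    def description(self, line):
        # Assumed event name; stores the raw line in the current round.
        self.rounds[-1].append(line)

    def end_round(self):
        # Nothing to clean up in this sketch.
        pass


# Simulate a parser walking through two rounds (lines are invented).
consumer = RoundConsumer()
consumer.start_round()
consumer.description("hit A")
consumer.description("hit B")
consumer.end_round()
consumer.start_round()
consumer.description("hit A")
consumer.end_round()
```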



> The consumer can handle 'Sbject' lines of the blast output by defining
> the above 3 functions. But in the end it's up to the parser to recognize
> certain keywords like 'Sbjct', isn't it? That means the parser has to
> recognize all keywords of all different blast programs etc ... (e.g.
> 'Results from Round' for PSI-Blast) and is not independent anymore. 

What I'm hoping to do is to separate the easier task of recognizing where
information exists from the more detailed one of actually extracting the
information.  I believe it's going to be unmanageable if we try to write
code to parse information from various versions and flavors of BLAST.
Thus, I'm happy if we can develop Parsers that will just point the
Consumer to where the information exists, and let the user develop quick
and dirty throwaway Consumers that can do the highly specific task of
extracting wanted information.

But back to your point, that the Parser will need code that allows it to
recognize lines from various flavors of BLAST.  This is certainly true,
and we will need different versions of the Parser to handle things like
PSI-BLAST.  However, I'm hoping that this design will keep the Parser
relatively lightweight.  It should contain the minimum amount of specific
code so that it is minimally sensitive to format changes.  Instead, the
specific stuff is thrown to the Consumer.
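
To make the division of labor concrete, here is a deliberately naive
sketch of such a lightweight Parser.  The function name scan, the
EventLogger class and the rule "a line starting with '>' opens an
alignment" are all assumptions for illustration; real BLAST output needs
more care than this.

```python
import io


def scan(handle, consumer):
    """Hypothetical scanner: locates lines, leaves extraction to the consumer."""
    in_alignment = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumed rule: a '>' line opens a new alignment section.
            if in_alignment:
                consumer.end_alignment()
            consumer.start_alignment()
            in_alignment = True
        elif in_alignment and line.startswith("Query:"):
            consumer.query(line)
        elif in_alignment and line.startswith("Sbjct:"):
            consumer.sbjct(line)
    if in_alignment:
        consumer.end_alignment()


class EventLogger:
    """Throwaway consumer that only records which events fired."""

    def __init__(self):
        self.events = []

    def start_alignment(self):
        self.events.append("start_alignment")

    def query(self, data):
        self.events.append("query")

    def sbjct(self, data):
        self.events.append("sbjct")

    def end_alignment(self):
        self.events.append("end_alignment")


# Invented three-line input, just enough to trigger each event once.
sample = io.StringIO(">d1rip__ example\nQuery: 1 MKV 3\nSbjct: 5 MKI 7\n")
logger = EventLogger()
scan(sample, logger)
```

Note that scan() never looks inside a line beyond its first few
characters; anything format-specific past that point is the Consumer's
problem.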


> > Sequences producing significant alignments:                        (bits)  Value
> > 
> > d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus stearothermo...    23  2.5
> > d1rlr_1 1.56.1.1.1 (1-212) R1 subunit of ribonucleotide reducta...    23  2.5
> > d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo sap...    22  5.6
> > d1ktq_1 3.38.3.4.2 (1-161) Exonuclease domain of DNA polymerase...    21  9.7
> > d1prea1 4.88.1.2.2 (1-83) Proaerolysin, N-terminal domain [Aero...    21  9.7
> 
> How can the parser handle this summary block? 

The parser would recognize this as a description block, and call:
consumer.start_descriptions()
consumer.description('d1rip__ 2.24.7.1.1 [...]')
consumer.description('d1rlr_1 1.56.1.1.1 [...]')
[...]
consumer.end_descriptions()
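
A quick and dirty Consumer for this block could then pull out whatever
it wants.  The sketch below assumes the last two whitespace-separated
columns of each line are the score in bits and the E value, as in the
sample above; DescriptionConsumer and the tuple layout are made up for
illustration.

```python
class DescriptionConsumer:
    """Hypothetical consumer extracting (id, score, e_value) from each line."""

    def __init__(self):
        self.hits = []

    def start_descriptions(self):
        self.hits = []

    def description(self, line):
        # Assumption: the two rightmost columns are score (bits) and E value.
        title, score, e_value = line.rsplit(None, 2)
        seq_id = title.split(None, 1)[0]
        self.hits.append((seq_id, int(score), float(e_value)))

    def end_descriptions(self):
        pass


consumer = DescriptionConsumer()
consumer.start_descriptions()
consumer.description(
    "d1rip__ 2.24.7.1.1 Ribosomal S17 protein [Bacillus stearothermo...    23  2.5"
)
consumer.description(
    "d1lfaa_ 3.42.1.1.1 Integrin CD11a/CD18 (LFA-1) [Human (Homo sap...    22  5.6"
)
consumer.end_descriptions()
```

If a later BLAST version changes the column layout, only this small
Consumer needs to be rewritten, not the Parser.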


> My suggestion or extension to the event oriented model:
> 
> The consumer class has to define a list with the keywords (i.e.
> information) the Parser object recognizes and there's the triplet of
> start_KEYWORD, KEYWORD, end_KEYWORD functions defined for each of the
> keywords in the list. These keywords could also be regular expressions.
> The parser is then a very simple class that's completely independent
> from the blast format.

Ah.  This is getting closer to the idea of building parser generators
(well, technically, scanner generators) that Andrew and I have been
kicking around.
http://www.biopython.org/pipermail/biopython/1999-October/000100.html

The idea is that most bioinformatics formats can be specified as a regular
language, and that specification can then be used to drive a parser.  And,
yep, this is certainly possible!
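
As a toy illustration only (this is not the design from the thread
linked above), a table of regular expressions can drive a completely
generic scanner.  The names scan, PATTERNS and CountingConsumer are
hypothetical, and the patterns are guesses at what the BLAST lines look
like.

```python
import re

# Each entry maps an event name to the pattern that triggers it.
PATTERNS = [
    ("round", re.compile(r"Results from [Rr]ound")),
    ("query", re.compile(r"Query:")),
    ("sbjct", re.compile(r"Sbjct:")),
]


def scan(lines, consumer, patterns):
    """Generic scanner: knows nothing about BLAST beyond the pattern table."""
    for line in lines:
        for name, pattern in patterns:
            if pattern.match(line):
                # Call the consumer method of the same name, if it has one.
                handler = getattr(consumer, name, None)
                if handler is not None:
                    handler(line)
                break


class CountingConsumer:
    """Throwaway consumer that counts rounds and subject lines."""

    def __init__(self):
        self.n_rounds = 0
        self.n_sbjct = 0

    def round(self, line):
        self.n_rounds += 1

    def sbjct(self, line):
        self.n_sbjct += 1


# Invented input lines; 'Query:' lines are matched but silently ignored
# because CountingConsumer defines no query() method.
lines = [
    "Results from round 1",
    "Query: 1 MKV 3",
    "Sbjct: 5 MKI 7",
    "Results from round 2",
    "Sbjct: 9 MKL 11",
]
counter = CountingConsumer()
scan(lines, counter, PATTERNS)
```

Supporting another flavor of BLAST then means swapping the pattern
table rather than touching scan() itself.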

Jeff