[BioPython] blast parser ideas

Arne Mueller a.mueller@icrf.icnet.uk
Thu, 11 Nov 1999 15:04:59 +0000


Jeffrey Chang wrote:
> 
> [Jeff talks about an event-oriented model for blast parsers]
> 
> [Arne]
> > Very good idea I think but I'm worried about point 2 of the above list.
> > If I understood the principle of the system (parser feeds consumer) the
> > input stream is parsed by the parser object which recognizes information
> > (like 'Sbject' ...) and then calls
> >
> > consumer.start_Sbject()
> > consumer.Sbject()
> > consumer.end_Sbject()
> 
> Oops, I should have explained myself more clearly.  The start_ and end_
> methods are called for sections and provide the Consumer with contextual
> clues.  For example, the parser may call:
> 
> consumer.start_alignment()
> consumer.query(data)
> consumer.align(data)
> consumer.sbjct(data)
> consumer.end_alignment()

Ok, I got it! 
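
As a rough sketch of that calling sequence: the AlignmentCollector class
and the sample lines below are made up for illustration and are not part
of any existing parser. The consumer is driven by hand here, in the order
the parser would call it.

```python
# Hypothetical sketch: AlignmentCollector and the sample lines are
# illustrative only; only the method names come from the discussion above.
class AlignmentCollector:
    """Consumer that groups query/align/sbjct lines per alignment section."""

    def __init__(self):
        self.alignments = []   # one dict per finished alignment
        self._current = None   # the alignment being filled in, if any

    def start_alignment(self):
        # Contextual clue: a new alignment section begins.
        self._current = {"query": [], "align": [], "sbjct": []}

    def end_alignment(self):
        # Contextual clue: the section is complete, so store it.
        self.alignments.append(self._current)
        self._current = None

    def query(self, data):
        self._current["query"].append(data)

    def align(self, data):
        self._current["align"].append(data)

    def sbjct(self, data):
        self._current["sbjct"].append(data)


# Drive the consumer by hand, in the order the parser would call it.
consumer = AlignmentCollector()
consumer.start_alignment()
consumer.query("Query: 1 MKV 3")
consumer.align("         MK+")
consumer.sbjct("Sbjct: 5 MKI 7")
consumer.end_alignment()
print(len(consumer.alignments))
```

The start_/end_ calls are what let the consumer know which alignment a
given query or sbjct line belongs to.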

> For PSI-BLAST, this can help the Consumer figure out where each round
> begins and ends.
> 
> start_round
>     score and alignment info
> end_round
> start_round
>     score and alignment info
> end_round
> 
> > The consumer can handle 'Sbject' lines of the blast output by defining
> > the above 3 functions. But in the end it's up to the parser to recognize
> > certain keywords like 'Sbjct', isn't it? That means the parser has to
> > recognize all keywords of all different blast programs etc ... (e.g.
> > 'Results from Round' for PSI-Blast) and is not independent anymore.
> 
> What I'm hoping to do is to separate the easier task of recognizing where
> information exists from the more detailed one of actually extracting the
> information.  I believe it's going to be unmanageable if we try to write
> code to parse information from various versions and flavors of BLAST.
> Thus, I'm happy if we can develop Parsers that will just point the
> Consumer to where the information exists, and let the user develop quick
> and dirty throwaway Consumers that can do the highly specific task of
> extracting wanted information.
> 
> But back to your point, that the Parser will need code that allows it to
> recognize lines from various flavors of BLAST.  This is certainly true,
> and we will need different versions of the Parser to handle things like
> PSI-BLAST.  However, I'm hoping that this design will keep the Parser
> relatively lightweight.  It should contain the minimum amount of specific
> code so that it is minimally sensitive to format changes.  Instead, the
> specific stuff is thrown to the Consumer.
> 
> [...]

It may be helpful to introduce a layer between the Consumer class and
the Parser class. I can think of two ways:

1. There's a specialized Parser like StandardPSIblastParser which
inherits from Parser. The specialized class (defined by a user
application) defines the PSI-Blast specific sections.

class Parser:
    # basic parser stuff (e.g. read lines from stream ...)
    pass

class StandardPSIblastParser(Parser):
    def __init__(self, ...):
        list_of_sections_to_be_recognized = []
        Parser.__init__(self, ...)

class StandardPSIblastConsumer(Consumer):
    # special
    pass

parser = StandardPSIblastParser()
consumer = StandardPSIblastConsumer()
parser.parse(open('blast_data.txt'), consumer)
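
A minimal sketch of this first option, under several assumptions of mine:
sections are (name, keyword) pairs, a section ends when the next one
starts or the input ends, and every other line inside a section is handed
to a hypothetical consumer.line() method. The keyword 'Results from round'
and the RoundCounter class are illustrative only and not checked against
real PSI-BLAST output.

```python
import io


class Parser:
    """Base parser: reads lines and reports section boundaries."""

    def __init__(self, sections):
        self.sections = sections  # list of (name, keyword) pairs

    def parse(self, handle, consumer):
        open_section = None
        for line in handle:
            line = line.rstrip("\n")
            for name, keyword in self.sections:
                if line.startswith(keyword):
                    # A new section implicitly closes the previous one.
                    if open_section is not None:
                        getattr(consumer, "end_" + open_section)()
                    getattr(consumer, "start_" + name)()
                    open_section = name
                    break
            else:
                # No section keyword matched: pass the line on as data.
                if open_section is not None:
                    consumer.line(line)
        if open_section is not None:
            getattr(consumer, "end_" + open_section)()


class StandardPSIblastParser(Parser):
    def __init__(self):
        # Assumed keyword; real output may differ between versions.
        list_of_sections_to_be_recognized = [("round", "Results from round")]
        super().__init__(list_of_sections_to_be_recognized)


class RoundCounter:
    """Throwaway consumer: counts rounds and keeps the lines inside them."""

    def __init__(self):
        self.rounds = 0
        self.lines = []

    def start_round(self):
        self.rounds += 1

    def end_round(self):
        pass

    def line(self, data):
        self.lines.append(data)


sample = io.StringIO(
    "header\n"
    "Results from round 1\n"
    "hit A\n"
    "Results from round 2\n"
    "hit B\n"
    "hit C\n"
)
counter = RoundCounter()
StandardPSIblastParser().parse(sample, counter)
print(counter.rounds, counter.lines)
```

Here the format knowledge lives in the Parser subclass, so each blast
flavor needs its own subclass.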

2. The Parser requires a Consumer object that keeps a list of the sections
and datatypes to be recognized by the parser.

class Parser:
    def __init__(self):
        # init ...
        pass

    def parse(self, file_handle, consumer):
        # use consumer's sections and datatypes to
        # control the parsing
        pass

    # other methods ... ?

class Consumer:
    # base class consumer
    pass

# this is defined in an application
class StandardPsiBlastConsumer(Consumer):
    def __init__(self):
        self.sections = [('alignment', regex_or_keyword_to_recognize_alignment),
                         other_section_definitions ...]
        self.datatypes = [('query', regex_or_keyword_to_recognize_query),
                          ('align', regex_or_keyword_to_recognize_align),
                          ('sbjct', regex_or_keyword_to_recognize_sbjct)]

    def start_alignment(self):
        # ...
        pass

    def end_alignment(self):
        # ...
        pass

    def query(self, data):
        # ...
        pass

    def align(self, data):
        # ...
        pass

    def sbjct(self, data):
        # ...
        pass

consumer = StandardPsiBlastConsumer()
parser = Parser()
parser.parse(open('blast_data.txt'), consumer)

This is what Jeff suggested (?), but in this implementation of the
StandardPsiBlastConsumer class and the Parser, the Parser reads the
section definitions and datatypes to be recognized from the consumer
object. This has two advantages:

1. There's only one Parser class, which is independent of the context
it works on.
2. The actual consumer tells the Parser which sections and datatypes
to recognize. This is probably something like the lazy parser
mentioned by Thomas Sicheritz.
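
A minimal sketch of this second option, again under assumptions of mine:
the consumer's sections and datatypes are (name, regex) pairs, a section
ends when the next one starts or the input ends, and the three regular
expressions are illustrative guesses, not checked against real BLAST
output. The Consumer base class is left out for brevity.

```python
import io
import re


class Parser:
    """Generic parser: all format knowledge comes from the consumer."""

    def parse(self, file_handle, consumer):
        sections = [(n, re.compile(rx)) for n, rx in consumer.sections]
        datatypes = [(n, re.compile(rx)) for n, rx in consumer.datatypes]
        open_section = None
        for line in file_handle:
            line = line.rstrip("\n")
            started = next((n for n, rx in sections if rx.search(line)), None)
            if started is not None:
                # A new section implicitly closes the previous one.
                if open_section is not None:
                    getattr(consumer, "end_" + open_section)()
                getattr(consumer, "start_" + started)()
                open_section = started
                continue
            if open_section is None:
                continue  # ignore data lines outside any section
            for name, rx in datatypes:
                if rx.search(line):
                    getattr(consumer, name)(line)
                    break
        if open_section is not None:
            getattr(consumer, "end_" + open_section)()


class StandardPsiBlastConsumer:
    def __init__(self):
        # Assumed patterns, for illustration only.
        self.sections = [("alignment", r"^>")]
        self.datatypes = [("query", r"^Query:"),
                          ("align", r"^\s+\S"),
                          ("sbjct", r"^Sbjct:")]
        self.events = []  # record of the calls the parser made

    def start_alignment(self):
        self.events.append("start_alignment")

    def end_alignment(self):
        self.events.append("end_alignment")

    def query(self, data):
        self.events.append("query")

    def align(self, data):
        self.events.append("align")

    def sbjct(self, data):
        self.events.append("sbjct")


sample = io.StringIO(
    ">seq1\n"
    "Query: 1 MKV 3\n"
    "         MK+\n"
    "Sbjct: 5 MKI 7\n"
    ">seq2\n"
    "Query: 1 MKV 3\n"
)
consumer = StandardPsiBlastConsumer()
Parser().parse(sample, consumer)
print(consumer.events)
```

The Parser contains no BLAST-specific strings at all; a consumer that
lists fewer datatypes simply gets fewer calls.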

It's also possible to initialize the Parser object with the consumer
instead of passing it to the parse method of the Parser object:

consumer = StandardPsiBlastConsumer()
parser = Parser(open('blast_data.txt'), consumer)
parser.parse()

Anyway, this is a detail.

Once the parser is written and people start writing applications with
it, we can add the more general Consumer classes to the parser
package (a sort of database for the different blast programs). I'm looking
forward to writing my first StandardPsiBlastConsumer ;-)

	Arne

-- 
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)171 2693405      | fax :+44-(0)171-269-3534
email : a.mueller@icrf.icnet.uk | http://www.icnet.uk/bmm/