[BioPython] BLAST parser updates

Jeffrey Chang jchang@SMI.Stanford.EDU
Tue, 21 Dec 1999 12:54:59 -0800 (PST)


Hello Everybody,

I've been working on an event-oriented parser for BLAST.  We have
discussed it on this list.  I have written some documentation describing
some ideas behind these types of parsers and have attached it to this
email.  I'm hoping to use this design for other Biopython parsers as well.
Please let me know what you think!

The main change I've made is in terminology.  Now, a Scanner reads from
the text data and generates events for the Consumer.  Parsing, then, means
using a Scanner and a Consumer to construct a data structure in the
programming language (e.g. an object) from the original text.  I think
this is more consistent with what people expect parsing to do.  Sorry for
any confusion this may cause.

For Biopython, what needs to be done is to build parsers that generate
objects around the scanner/consumer interface.  I expect most people will
use the default parser.  However, "power users" will still be able to
build their own custom parsers around a particular scanner or consumer
that would serve their purposes more efficiently.

I've finished writing a scanner that will scan the standalone version of
various flavors of NCBI blast (blastp, blastn, blastx, tblastn, tblastx,
psi-blast).  I'm working on one that will handle the web version.

I will make those available via CVS and tarball when the anonymous CVS
server is set up!

Jeff



Parser.txt
==========

Design documentation for Biopython parsers.



Design Overview
---------------

Parsers are built around an event-oriented design that includes
Scanner and Consumer objects.

A Scanner takes input from a data source and analyzes it line by line,
sending off an event whenever it recognizes some information in the
data.  For example, if the data includes information about an organism
name, the scanner may generate an "organism_name" event whenever it
encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
"organism_name" event and then processes it in whatever manner is
necessary for the current application.
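
As a minimal sketch of this flow in present-day Python: the class
names, the "ORGANISM" line prefix, and the sample input below are all
assumptions made up for illustration, not part of any real data format
or of the Biopython code itself.

```python
import io


class OrganismScanner:
    """Hypothetical scanner: emits 'organism_name' for ORGANISM lines."""

    def feed(self, handle, consumer):
        for line in handle:
            # The "ORGANISM" tag is an assumed marker for this sketch.
            if line.startswith("ORGANISM"):
                consumer.organism_name(line)


class OrganismConsumer:
    """Hypothetical consumer: remembers every organism name it receives."""

    def __init__(self):
        self.names = []

    def organism_name(self, line):
        # Strip the assumed tag and the surrounding whitespace.
        self.names.append(line[len("ORGANISM"):].strip())


handle = io.StringIO("LOCUS X\nORGANISM  Bacillus subtilis\nSOURCE y\n")
consumer = OrganismConsumer()
OrganismScanner().feed(handle, consumer)
print(consumer.names)
```

The scanner knows only how to recognize lines; what happens to a
recognized line is left entirely to the consumer.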


Events
------

There are two types of events: info events that tag the location of
information within a data stream, and section events that mark
sections within a stream.  Info events are associated with specific
lines within the data, while section events are not.

Section event names must be in the format begin_EVENTNAME and
end_EVENTNAME where EVENTNAME is the name of the event.

For example, a FASTA-formatted sequence scanner may generate the
following events:

EVENT NAME      ORIGINAL INPUT
begin_sequence  
title           >gi|132871|sp|P19947|RL30_BACSU 50S RIBOSOMAL PROTEIN L30 (BL27
sequence        MAKLEITLKRSVIGRPEDQRVTVRTLGLKKTNQTVVHEDNAAIRGMINKVSHLVSVKEQ
end_sequence
begin_sequence
title           >gi|132679|sp|P19946|RL15_BACSU 50S RIBOSOMAL PROTEIN L15
sequence        MKLHELKPSEGSRKTRNRVGRGIGSGNGKTAGKGHKGQNARSGGGVRPGFEGGQMPLFQRLPK
sequence        RKEYAVVNLDKLNGFAEGTEVTPELLLETGVISKLNAGVKILGNGKLEKKLTVKANKFSASAK
sequence        GTAEVI
end_sequence
[...]

(I cut the lines shorter so they'd look nicer in my editor).

The FASTA scanner generated the following events: 'title', 'sequence',
'begin_sequence', and 'end_sequence'.  Note that the 'begin_sequence'
and 'end_sequence' events are not associated with any line in the
original input.  They are used to delineate separate sequences within
the file.

The events a scanner can send must be specifically defined for each
data format.
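
A scanner producing the FASTA event stream shown above might be
sketched as follows.  FastaScanner and RecordingConsumer are
hypothetical names, the input is a shortened made-up example, and
blank lines are simply skipped to keep the sketch short.

```python
import io


class RecordingConsumer:
    """Hypothetical consumer that records every event in arrival order."""

    def __init__(self):
        self.events = []

    def begin_sequence(self):
        self.events.append(("begin_sequence", None))

    def end_sequence(self):
        self.events.append(("end_sequence", None))

    def title(self, line):
        self.events.append(("title", line.rstrip("\n")))

    def sequence(self, line):
        self.events.append(("sequence", line.rstrip("\n")))


class FastaScanner:
    """Sketch of a scanner for FASTA-formatted text."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            if line.startswith(">"):
                # A title line closes any open record and opens a new one.
                if in_record:
                    consumer.end_sequence()
                consumer.begin_sequence()
                in_record = True
                consumer.title(line)
            elif in_record and line.strip():
                consumer.sequence(line)
        # Close the last record once the input is exhausted.
        if in_record:
            consumer.end_sequence()


text = ">seq1 first\nMAKLEI\n>seq2 second\nMKLHEL\nGTAEVI\n"
consumer = RecordingConsumer()
FastaScanner().feed(io.StringIO(text), consumer)
for name, line in consumer.events:
    print(name, line or "")
```

The section events are generated from the scanner's own state
(in_record) rather than from any single input line, which is why they
carry no line.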



'noevent' EVENT
---------------

A data file can contain lines that have no meaningful information,
such as blank lines.  By convention, a scanner should generate the
"noevent" event for these lines.

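The convention could look like this in practice.  In this sketch the
'data' event name and the scan function are invented for illustration,
and passing the line to the noevent handler is an assumption, on the
grounds that the event is tied to a specific line.

```python
import io


class LineCountingConsumer:
    """Hypothetical consumer that tallies informative and empty lines."""

    def __init__(self):
        self.data_lines = 0
        self.ignored_lines = 0

    def data(self, line):
        self.data_lines += 1

    def noevent(self, line):
        self.ignored_lines += 1


def scan(handle, consumer):
    # 'data' is an invented event name; only 'noevent' comes from the
    # convention described above.
    for line in handle:
        if line.strip():
            consumer.data(line)
        else:
            consumer.noevent(line)


consumer = LineCountingConsumer()
scan(io.StringIO("first\n\nsecond\n   \nthird\n"), consumer)
print(consumer.data_lines, consumer.ignored_lines)
```

With this convention every input line produces exactly one event, so a
consumer can still keep track of line positions if it needs to.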


Scanners
--------

class Scanner:
    def feed(self, handle, consumer):
        # Read data from handle and send events to consumer.
        raise NotImplementedError


Scanners should implement a method named 'feed' that takes a file
handle and a consumer.  The scanner should read data from the file
handle and generate appropriate events for the consumer.



Consumers
---------

class Consumer:
    # Event handlers go here, one method per event.
    pass


Consumers contain methods that handle events.  Each method is named
after the event it handles.  Handlers for info events are passed the
line of data containing the information, and handlers for section
events are passed nothing.

You are free to ignore events that are not interesting for your
application.  Simply do not implement methods for those events.

All consumers should be derived from the base Consumer class.
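
One way a base class could make unimplemented events harmless is to
hand back a do-nothing handler for any unknown event name.  This is a
sketch under that assumption, not necessarily how the real base class
is written; the helper names _unhandled_section and _unhandled_info
are hypothetical.

```python
class Consumer:
    """Sketch of a base class whose unknown events fall through to no-ops."""

    def _unhandled_section(self):
        pass

    def _unhandled_info(self, line):
        pass

    def __getattr__(self, name):
        # __getattr__ is consulted only when normal attribute lookup
        # fails, i.e. when the subclass defines no handler for the event.
        if name.startswith("_"):
            # Never treat private or special names as events.
            raise AttributeError(name)
        if name.startswith("begin_") or name.startswith("end_"):
            return self._unhandled_section
        return self._unhandled_info


class TitleOnlyConsumer(Consumer):
    """Hypothetical consumer that cares only about 'title' events."""

    def __init__(self):
        self.titles = []

    def title(self, line):
        self.titles.append(line.strip())


consumer = TitleOnlyConsumer()
consumer.begin_sequence()          # ignored
consumer.title(">seq1 example\n")  # handled
consumer.sequence("MAKLEI\n")      # ignored
consumer.end_sequence()            # ignored
print(consumer.titles)
```

The scanner can then call every event unconditionally, without
checking which handlers a given consumer provides.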

An example:

class FASTAConsumer(Consumer):
    def title(self, line):
        # do something with the title
        pass
    def sequence(self, line):
        # do something with the sequence
        pass
    def begin_sequence(self):
        # a new sequence starts
        pass
    def end_sequence(self):
        # a sequence ends
        pass
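
To make the handlers concrete, here is a sketch of a consumer that
assembles (title, sequence) pairs from these four events.
FastaRecordConsumer and its attributes are hypothetical, and it does
not derive from the base Consumer class only so that the block stays
self-contained.  The events are fired by hand in place of a scanner.

```python
class FastaRecordConsumer:
    """Hypothetical consumer that assembles (title, sequence) pairs."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def begin_sequence(self):
        # Reset the per-record state when a new sequence starts.
        self._title = None
        self._chunks = []

    def title(self, line):
        # Drop the leading '>' and the trailing newline.
        self._title = line[1:].strip()

    def sequence(self, line):
        self._chunks.append(line.strip())

    def end_sequence(self):
        # Join the collected lines into one sequence string.
        self.records.append((self._title, "".join(self._chunks)))


consumer = FastaRecordConsumer()
consumer.begin_sequence()
consumer.title(">example protein\n")
consumer.sequence("MKLHEL\n")
consumer.sequence("GTAEVI\n")
consumer.end_sequence()
print(consumer.records)
```

The section events give the consumer well-defined points at which to
reset its state and to emit a finished record.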