[BioPython] BLAST parser updates
Jeffrey Chang
jchang@SMI.Stanford.EDU
Tue, 21 Dec 1999 12:54:59 -0800 (PST)
Hello Everybody,
I've been working on an event-oriented parser for BLAST. We have
discussed it on this list. I have written some documentation describing
some ideas behind these types of parsers and have attached it to this
email. I'm hoping to use this design for other Biopython parsers as well.
Please let me know what you think!
The main change I've made is in terminology. Now, a Scanner reads from
the text data and generates events for the Consumer. Parsing, then, means
using a Scanner and Consumer to construct a language data structure (e.g.
an object) from the original text. I think this is more consistent with
what people expect parsing to do. Sorry for any confusion this may cause.
For Biopython, what needs to be done is to build parsers that generate
objects around the scanner/consumer interface. I expect most people will
use the default parser. However, "power users" will still be able to
build their own custom parsers around a particular scanner or consumer
that would serve their purposes more efficiently.
I've finished writing a scanner that will scan the standalone version of
various flavors of NCBI blast (blastp, blastn, blastx, tblastn, tblastx,
psi-blast). I'm working on one that will handle the web version.
I will make those available via CVS and tarball when the anonymous CVS
server is set up!
Jeff
Parser.txt
==========
Design documentation for Biopython parsers.
Design Overview
---------------
Parsers are built around an event-oriented design that includes
Scanner and Consumer objects.
A Scanner takes input from a data source and analyzes it line by line,
sending off an event whenever it recognizes some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an "organism_name" event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
"organism_name" event, and then processes it in whatever manner is
necessary for the current application.
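As a rough illustration of the event flow described above, here is a
minimal sketch. The ORGANISM line prefix, the class names, and the sample
input are assumptions made up for this example; only the "organism_name"
event comes from the text.

```python
import io


class OrganismScanner:
    """Hypothetical scanner: emits 'organism_name' for ORGANISM lines."""

    def feed(self, handle, consumer):
        for line in handle:
            # Assumed format: the name follows an 'ORGANISM' keyword.
            if line.startswith("ORGANISM"):
                consumer.organism_name(line)


class OrganismConsumer:
    """Hypothetical consumer: collects every organism name it receives."""

    def __init__(self):
        self.names = []

    def organism_name(self, line):
        # Drop the assumed keyword and the surrounding whitespace.
        self.names.append(line[len("ORGANISM"):].strip())


data = io.StringIO("LOCUS       X\nORGANISM  Bacillus subtilis\n")
consumer = OrganismConsumer()
OrganismScanner().feed(data, consumer)
print(consumer.names)
```

The scanner knows nothing about what happens to the name; a different
consumer could count organisms or write them to a database without any
change to the scanner.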
Events
------
There are two types of events: info events that tag the location of
information within a data stream, and section events that mark
sections within a stream. Info events are associated with specific
lines within the data, while section events are not.
Section event names must be in the format begin_EVENTNAME and
end_EVENTNAME where EVENTNAME is the name of the event.
For example, a FASTA-formatted sequence scanner may generate the
following events:
EVENT NAME      ORIGINAL INPUT
begin_sequence
title           >gi|132871|sp|P19947|RL30_BACSU 50S RIBOSOMAL PROTEIN L30 (BL27
sequence        MAKLEITLKRSVIGRPEDQRVTVRTLGLKKTNQTVVHEDNAAIRGMINKVSHLVSVKEQ
end_sequence
begin_sequence
title           >gi|132679|sp|P19946|RL15_BACSU 50S RIBOSOMAL PROTEIN L15
sequence        MKLHELKPSEGSRKTRNRVGRGIGSGNGKTAGKGHKGQNARSGGGVRPGFEGGQMPLFQRLPK
sequence        RKEYAVVNLDKLNGFAEGTEVTPELLLETGVISKLNAGVKILGNGKLEKKLTVKANKFSASAK
sequence        GTAEVI
end_sequence
[...]
(I cut the lines shorter so they'd look nicer in my editor).
The FASTA scanner generated the following events: 'title', 'sequence',
'begin_sequence', and 'end_sequence'. Note that the 'begin_sequence'
and 'end_sequence' events are not associated with any line in the
original input. They are used to delineate separate sequences within
the file.
The events a scanner can send must be specifically defined for each
data format.
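To make the event sequence above concrete, here is a small sketch that
turns FASTA-style lines into (event name, line) pairs. The function name
fasta_events and the use of None for section events are assumptions for
this illustration, and blank-line handling is left out.

```python
def fasta_events(lines):
    """Yield (event_name, line) pairs; line is None for section events."""
    in_sequence = False
    for line in lines:
        if line.startswith(">"):
            # A new title closes the previous sequence, if there was one.
            if in_sequence:
                yield ("end_sequence", None)
            yield ("begin_sequence", None)
            yield ("title", line)
            in_sequence = True
        else:
            yield ("sequence", line)
    # Close the last sequence once the input is exhausted.
    if in_sequence:
        yield ("end_sequence", None)


records = [">seq1 first", "MAKLEITLKR", ">seq2 second", "MKLHELKPSE", "GTAEVI"]
events = [name for name, _ in fasta_events(records)]
print(events)
```

Note that the section events are produced by the scanner's own
bookkeeping, not by any particular line of input.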
'noevent' EVENT
---------------
A data file can contain lines that have no meaningful information,
such as blank lines. By convention, a scanner should generate the
"noevent" event for these lines.
Scanners
--------
class Scanner:
    def feed(self, handle, consumer):
        # Implementation
        pass
Scanners should implement a method named 'feed' that takes a file
handle and a consumer. The scanner should read data from the file
handle and generate appropriate events for the consumer.
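One way a 'feed' method might look for FASTA data is sketched below. The
class names and the recording consumer are hypothetical, and passing the
blank line to the "noevent" handler is an assumption of this sketch.

```python
import io


class FASTAScanner:
    """Hypothetical scanner for FASTA-formatted text."""

    def feed(self, handle, consumer):
        in_sequence = False
        for line in handle:
            if line.startswith(">"):
                if in_sequence:
                    consumer.end_sequence()
                consumer.begin_sequence()
                consumer.title(line)
                in_sequence = True
            elif line.strip():
                consumer.sequence(line)
            else:
                # Blank line: no meaningful information.
                consumer.noevent(line)
        if in_sequence:
            consumer.end_sequence()


class RecordingConsumer:
    """Records the name of each event it receives, in order."""

    def __init__(self):
        self.log = []

    def begin_sequence(self):
        self.log.append("begin_sequence")

    def end_sequence(self):
        self.log.append("end_sequence")

    def title(self, line):
        self.log.append("title")

    def sequence(self, line):
        self.log.append("sequence")

    def noevent(self, line):
        self.log.append("noevent")


handle = io.StringIO(">a\nMK\n\n>b\nLL\n")
recorder = RecordingConsumer()
FASTAScanner().feed(handle, recorder)
print(recorder.log)
```

Because the scanner only needs something it can iterate over line by line,
the same code works for an open file or an in-memory buffer.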
Consumers
---------
class Consumer:
    # event handlers
    pass
Consumers contain methods that handle events. Each method is named
after the event it handles. Handlers for info events are passed the
line of data containing the information; handlers for section events
are passed no arguments.
You are free to ignore events that are not interesting for your
application: simply do not implement methods for those events.
All consumers should be derived from the base Consumer class.
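For unimplemented events to be ignored safely, the base class has to
supply some fallback. The sketch below shows one possible approach using
__getattr__; it is an assumption about how such a base class could work,
not a description of the actual Biopython code, and the names _unhandled
and TitleOnlyConsumer are made up.

```python
class Consumer:
    """Hypothetical base class: events without a handler are ignored."""

    def _unhandled(self, *args):
        # Accept any arguments and do nothing.
        pass

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. when the
        # subclass has not defined a method for this event.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._unhandled


class TitleOnlyConsumer(Consumer):
    """Cares about titles and nothing else."""

    def __init__(self):
        self.titles = []

    def title(self, line):
        self.titles.append(line.rstrip())


c = TitleOnlyConsumer()
c.begin_sequence()            # ignored
c.title(">seq1 example\n")    # handled
c.sequence("MAKLEITLKR\n")    # ignored
c.end_sequence()              # ignored
print(c.titles)
```

With this kind of fallback, a scanner can call every event
unconditionally and never needs to check which handlers exist.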
An example:
class FASTAConsumer(Consumer):
    def title(self, line):
        # do something with the title
        pass
    def sequence(self, line):
        # do something with the sequence
        pass
    def begin_sequence(self):
        # a new sequence starts
        pass
    def end_sequence(self):
        # a sequence ends
        pass
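As a sketch of how these handlers could be filled in to build objects,
the consumer below collects (title, sequence) pairs. The class name
FASTARecordConsumer, the tuple representation, and the stand-in base
class are assumptions for illustration; the events are fired by hand here
in the order a scanner would send them.

```python
class Consumer:
    """Stand-in for the base Consumer class."""


class FASTARecordConsumer(Consumer):
    """Hypothetical consumer that builds (title, sequence) tuples."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def begin_sequence(self):
        # Reset the per-record state.
        self._title = None
        self._chunks = []

    def title(self, line):
        # Drop the leading '>' and the trailing newline.
        self._title = line[1:].strip()

    def sequence(self, line):
        self._chunks.append(line.strip())

    def end_sequence(self):
        # Join the sequence lines into one string and store the record.
        self.records.append((self._title, "".join(self._chunks)))


builder = FASTARecordConsumer()
builder.begin_sequence()
builder.title(">seq1 example\n")
builder.sequence("MAKLEITLKR\n")
builder.sequence("GTAEVI\n")
builder.end_sequence()
print(builder.records)
```

A default parser could wrap a scanner and a consumer like this one and
simply return the collected records to the caller.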