[Bioperl-l] RFC: Bio::SearchIO and Bio::Search::* objects

Ewan Birney birney@ebi.ac.uk
Sun, 21 Oct 2001 11:36:06 +0100 (BST)


On Sat, 20 Oct 2001, Jason Eric Stajich wrote:

> Amendments:
> 
> In order to handle multiple reports in a single file I would introduce a
> Bio::Search::ReportI object which would have the methods
> 
> next_subject
> database_name
> database_size
> query_name
> query_size
> 
> and Bio::SearchIO would simply have the methods
> next_report()
> 
> unless there are other ideas on how to approach this?
> 
> I would like to have the parsing managed within the SearchIO modules
> rather than in the child objects (as we have done with BPlite::Sbjct and
> BPlite::HSP) so either we prefetch all the Subjects and all the HSPs when
> getting a report or we have callback functions (_next_subject and
> _next_hsp) as part of the SearchIO::XX parsing module.  Thoughts about
> efficiency and/or memory usage here?  There are tradeoffs either way.  Am
> I trying to decouple things too much here?  Is the overhead of many
> objects going to present difficulties?

I think there will be a real benefit in having an underlying SAX-like
system to the parsers - like BioJava. I would imagine it like this:


  Bio::Search::SearchIO --> Top level wrapper
 
             ::SearchResultEventBuilder --> attaches to an EventGeneratorI and
                                      builds a SearchI result
     
             ::EventGeneratorI --> interface each parsing code must comply
                                   to. Generates structured events

             ::SearchIO::blast.pm --> implements EventGeneratorI


The "dumb" users use the top level SearchIO system, which makes a format
event generator and attaches the SearchResultEventBuilder to give back an
in-memory SearchResult object.

However, power users can use the event generating system to build much
quicker parsers which ignore a lot of the information or map the results to
new objects.
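
To make the split concrete, here is a minimal sketch in Perl. Everything in
it is hypothetical: the package names (ToyEventGenerator, ToyResultBuilder,
ToyBestScore), the event names (start_result, hit, end_result) and the toy
tab-separated input are invented for illustration and are not existing
Bioperl code or real BLAST output. It only shows one generator driving two
different handlers.

```perl
use strict;
use warnings;

# Hypothetical event generator: stands in for a format parser such as
# SearchIO::blast. Input is toy "name<TAB>score" lines, not a real report.
package ToyEventGenerator;

sub new {
    my ($class, %args) = @_;
    return bless { lines => $args{lines}, handler => undef }, $class;
}

sub attach_handler {
    my ($self, $handler) = @_;
    $self->{handler} = $handler;
    return $self;
}

# Walk the input and fire one event per structural unit.
sub parse {
    my ($self) = @_;
    my $h = $self->{handler} or die "no handler attached";
    $h->start_result();
    for my $line (@{ $self->{lines} }) {
        my ($name, $score) = split /\t/, $line;
        $h->hit($name, $score);
    }
    return $h->end_result();
}

# Handler 1 (assumed name): builds a complete in-memory result.
package ToyResultBuilder;

sub new { my ($class) = @_; return bless { hits => [] }, $class; }
sub start_result { my ($self) = @_; $self->{hits} = []; return; }
sub hit {
    my ($self, $name, $score) = @_;
    push @{ $self->{hits} }, { name => $name, score => $score };
    return;
}
sub end_result { my ($self) = @_; return { hits => $self->{hits} }; }

# Handler 2 (assumed name): keeps only the best-scoring hit and
# discards everything else as the events go by.
package ToyBestScore;

sub new { my ($class) = @_; return bless { best => undef }, $class; }
sub start_result { my ($self) = @_; $self->{best} = undef; return; }
sub hit {
    my ($self, $name, $score) = @_;
    if (!defined $self->{best} || $score > $self->{best}{score}) {
        $self->{best} = { name => $name, score => $score };
    }
    return;
}
sub end_result { my ($self) = @_; return $self->{best}; }

package main;

# Convenience wrapper: same generator, caller chooses the handler.
sub run_with {
    my ($handler, @lines) = @_;
    return ToyEventGenerator->new(lines => \@lines)
                            ->attach_handler($handler)
                            ->parse;
}

my @report = ("hitA\t50", "hitB\t120", "hitC\t75");
my $full = run_with(ToyResultBuilder->new, @report);
my $best = run_with(ToyBestScore->new,     @report);
printf "%d hits in memory; best hit %s\n",
    scalar @{ $full->{hits} }, $best->{name};
```

The first handler corresponds to what the top level wrapper would attach by
default; the second is the kind of thing a power user might write to skip
building objects they do not need.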


I wish we had structured the SeqIO system this way - easy use for first
time users with extensibility for more serious users.


More modules to write, but should fit well in the XML parsing end of the
world.



I've cc'd Thomas Down to get his opinion on this - Thomas - what do you
suggest?




> 
> My plan is to write the xml parser first since that is really what I need
> and map the BPlite objects into this later on.
> 
> -jason
> 
> On Thu, 18 Oct 2001, Jason Eric Stajich wrote:
> 
> > RFC: Bio::SearchIO and Bio::Search::* Objects
> >
> > Justification:
> >
> > We have some disjointed ways of doing blast parsing that were specific to
> > the original implementations.  It would also be helpful if all sequence
> > searching report parsing followed the same framework.  I would like to
> > propose a cleaner object model for parsing the results of database
> > sequence searching programs.  I feel that these are just a remapping of
> > our existing objects and would like to see the API included in 1.0 if
> > possible.
> >
> > Modules and Interfaces:
> >
> > Bio::SearchIO - Driver & Interface module.  This module would contain code
> > that would instantiate the appropriate parser based on parameter input on
> > initialization.  This is the same model as AlignIO and SeqIO systems
> > currently in place.  May have some autodetection for format type as well
> > if we feel up to that.
> >
> > The interface methods that would be required are
> > next_subject - return the next subject (hit) as a Bio::Search::Subject.
> >
> > database_name() - name of database
> > database_size()	-  database size (if available)
> > query_name()    - query name
> > query_size()    - query size (if available)
> >
> >
> > Bio::SearchIO::blast - parse blast text based reports (-m 0 parameter) and
> > would be a reimplementation/copy&paste of current BPlite code.
> >
> > Bio::SearchIO::bl2seq - same as above, maps to BPbl2seq
> > Bio::SearchIO::psiblast - same as above, maps to BPpsilite and can handle
> > 			  psi and phi blast searches (?)
> > Bio::SearchIO::xmlblast - parse NCBI xml and/or others depending on how
> > 			  this is implemented
> > Bio::SearchIO::fasta - someone can write a FASTA/SSEARCH/etc parser and
> > 	 	       plug it in.
> >   hmmer may also fit in here as well.
> >
> >   Other searching program output parsers can easily be written and plugged
> >   into the system.
> >
> > Bio::Search::SubjectI - Handle the concept of a query hit in the search.
> > methods:
> > name()      - subject name
> > length()    - subject length
> >
> > next_hsp()  - retrieve the next HSP from the stream
> >
> > Bio::Search::HSPI -
> >
> > percent_identical() 	- percentage id
> > P()			- P (significance)
> > frac_identical()	-
> > hsp_length()		- length of HSP
> > query_seq()		- get the query sequence (string or Seq?)
> > subject_seq()		- get the subject sequence (string or Seq?)
> > homology_seq()		- get the homology seq
> > gaps()			- number of gaps
> > positive()		- number of positive matches
> > query_start()		- start of the HSP on the query
> > query_end()		- end of the HSP on the query
> > subject_start()		- start of the HSP on the subject
> > subject_end()		- end of the HSP on the subject
> >
> > [this is where polymorphism could come in?]
> > frame()			- translation frame (if tblastn/blastx/tblastx)
> > report_type()		- type of report - program used
> >                           FASTY/FASTX/TBLASTN, etc
> >
> > Potentially have polymorphism with different types of HSPs depending on
> > tblastx/tblastn or just handle all of this inside one object as we
> > currently do.
> >
> > -----------------
> >
> > Please provide any feedback or comments if you think this is a worthy
> > undertaking and approached correctly.
> >
> > Thanks.
> > -jason
> >
> >
> 
>
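
For reference, the iterator-style interface in the quoted RFC might look
roughly like this from the caller's side. This is only a sketch under
assumptions: the method names next_subject, next_hsp, name, length,
query_start and query_end come from the RFC, but the Toy* classes, their
constructors and their internal data layout are invented stand-ins, not
Bioperl modules.

```perl
use strict;
use warnings;

# Stand-in for an object implementing part of Bio::Search::HSPI.
package ToyHSP;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
sub query_start { my ($self) = @_; return $self->{query_start}; }
sub query_end   { my ($self) = @_; return $self->{query_end}; }

# Stand-in for an object implementing Bio::Search::SubjectI.
package ToySubject;

sub new {
    my ($class, %args) = @_;
    my $self = {
        name   => $args{name},
        length => $args{length},
        hsps   => [ @{ $args{hsps} } ],   # copy so iteration is private
    };
    return bless $self, $class;
}
sub name   { my ($self) = @_; return $self->{name}; }
sub length { my ($self) = @_; return $self->{length}; }

# Returns the next HSP, or undef once the stream is exhausted.
sub next_hsp { my ($self) = @_; return shift @{ $self->{hsps} }; }

# Stand-in for the Bio::SearchIO driver; a real one would parse a file.
package ToySearchIO;

sub new {
    my ($class, @subjects) = @_;
    return bless { subjects => [@subjects] }, $class;
}

# Returns the next subject (hit), or undef when there are no more.
sub next_subject { my ($self) = @_; return shift @{ $self->{subjects} }; }

package main;

# Typical nested loop: every HSP of every subject, one row each.
sub summarize {
    my ($searchio) = @_;
    my @rows;
    while ( my $subject = $searchio->next_subject ) {
        while ( my $hsp = $subject->next_hsp ) {
            push @rows, join ' ', $subject->name, $subject->length,
                                  $hsp->query_start, $hsp->query_end;
        }
    }
    return @rows;
}

my $io = ToySearchIO->new(
    ToySubject->new(
        name   => 'subj1',
        length => 300,
        hsps   => [ ToyHSP->new(query_start => 1,  query_end => 50),
                    ToyHSP->new(query_start => 80, query_end => 120) ],
    ),
);
print "$_\n" for summarize($io);
```

Whether the subjects and HSPs are prefetched or produced lazily by callbacks
is hidden behind next_subject and next_hsp, which is the tradeoff discussed
at the top of this message.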