[Bioperl-l] RFC: Bio::SearchIO and Bio::Search::* objects

Jason Eric Stajich jason@cgt.mc.duke.edu
Sat, 20 Oct 2001 18:33:53 -0400 (EDT)


Amendments:

In order to handle multiple reports in a single file I would introduce a
Bio::Search::ReportI object which would have the methods

next_subject
database_name
database_size
query_name
query_size

and Bio::SearchIO would simply have the method
next_report()

unless there are other ideas on how to approach this?
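
A minimal sketch of how the caller's side might look under this amendment. The Sketch::* packages are hypothetical in-memory stand-ins, not existing BioPerl modules, and subjects are plain strings here rather than Bio::Search::SubjectI objects; only the method names come from the proposal above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::Search::ReportI implementation.
package Sketch::Report;

sub new {
    my ($class, %args) = @_;
    my $self = {
        query_name    => $args{query_name},
        query_size    => $args{query_size},
        database_name => $args{database_name},
        database_size => $args{database_size},
        subjects      => [ @{ $args{subjects} || [] } ],
    };
    return bless $self, $class;
}

sub query_name    { $_[0]{query_name} }
sub query_size    { $_[0]{query_size} }
sub database_name { $_[0]{database_name} }
sub database_size { $_[0]{database_size} }

# Returns the next subject (a plain name string in this sketch),
# or undef once the report is exhausted.
sub next_subject { shift @{ $_[0]{subjects} } }

# Hypothetical stand-in for a Bio::SearchIO stream holding several reports.
package Sketch::SearchIO;

sub new {
    my ($class, @reports) = @_;
    return bless { reports => [@reports] }, $class;
}

# Returns the next report, or undef when no reports remain.
sub next_report { shift @{ $_[0]{reports} } }

package main;

my $in = Sketch::SearchIO->new(
    Sketch::Report->new(query_name => 'q1', database_name => 'nr',
                        subjects   => [ 's1', 's2' ]),
    Sketch::Report->new(query_name => 'q2', database_name => 'nr',
                        subjects   => ['s3']),
);

# One outer loop per report, one inner loop per subject.
my @seen;
while (my $report = $in->next_report) {
    while (defined(my $subject = $report->next_subject)) {
        push @seen, join(':', $report->query_name, $subject);
    }
}
print "$_\n" for @seen;
```

The point of the extra level is that per-query data (query_name, database_name and so on) lives on the report rather than on the stream, so a file with several concatenated reports needs no special handling.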

I would like to have the parsing managed within the SearchIO modules
rather than in the child objects (as we have done with BPlite::Sbjct and
BPlite::HSP), so either we prefetch all the Subjects and all the HSPs when
getting a report, or we have callback functions (_next_subject and
_next_hsp) as part of the SearchIO::XX parsing module.  Any thoughts about
efficiency and/or memory usage here?  There are tradeoffs either way.  Am
I trying to decouple things too much here?  Is the overhead of many
objects going to present difficulties?
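
To make the callback option concrete, here is a rough sketch, assuming a made-up line-oriented format (REPORT / SUBJECT / HSP lines) in place of real BLAST output. All Sketch::* names and the _readline/_unread helpers are hypothetical; only _next_subject and _next_hsp are taken from the text above. The parser owns the filehandle and all parsing state, and the child objects just delegate back to it.

```perl
use strict;
use warnings;

# Hypothetical parser: the only place that touches the input stream.
package Sketch::LazyParser;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, pushback => [] }, $class;
}

# Read one line, preferring any line that was pushed back earlier.
sub _readline {
    my ($self) = @_;
    return pop @{ $self->{pushback} } if @{ $self->{pushback} };
    my $line = readline $self->{fh};
    chomp $line if defined $line;
    return $line;
}

sub _unread { push @{ $_[0]{pushback} }, $_[1] }

sub next_report {
    my ($self) = @_;
    while (defined(my $line = $self->_readline)) {
        if (my ($name) = $line =~ /^REPORT\s+(\S+)/) {
            return Sketch::LazyReport->new($self, $name);
        }
    }
    return;
}

# Callback used by report objects; skips any unread HSP lines.
sub _next_subject {
    my ($self) = @_;
    while (defined(my $line = $self->_readline)) {
        if ($line =~ /^REPORT\b/) { $self->_unread($line); return; }
        if (my ($name) = $line =~ /^SUBJECT\s+(\S+)/) {
            return Sketch::LazySubject->new($self, $name);
        }
    }
    return;
}

# Callback used by subject objects; stops at the next non-HSP line.
sub _next_hsp {
    my ($self) = @_;
    my $line = $self->_readline;
    return unless defined $line;
    if (my ($start, $end) = $line =~ /^HSP\s+(\d+)\s+(\d+)/) {
        return { query_start => $start, query_end => $end };
    }
    $self->_unread($line);
    return;
}

package Sketch::LazyReport;
sub new {
    my ($class, $parser, $name) = @_;
    return bless { parser => $parser, query_name => $name }, $class;
}
sub query_name   { $_[0]{query_name} }
sub next_subject { $_[0]{parser}->_next_subject }

package Sketch::LazySubject;
sub new {
    my ($class, $parser, $name) = @_;
    return bless { parser => $parser, name => $name }, $class;
}
sub name     { $_[0]{name} }
sub next_hsp { $_[0]{parser}->_next_hsp }

package main;

my $text = join "\n",
    'REPORT q1', 'SUBJECT s1', 'HSP 10 50', 'HSP 60 90',
    'SUBJECT s2', 'HSP 5 20',
    'REPORT q2', 'SUBJECT s3', 'HSP 1 30', '';
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

my $parser = Sketch::LazyParser->new($fh);
my @rows;
while (my $report = $parser->next_report) {
    while (my $subject = $report->next_subject) {
        while (my $hsp = $subject->next_hsp) {
            push @rows, join ' ', $report->query_name, $subject->name,
                                  $hsp->{query_start}, $hsp->{query_end};
        }
    }
}
print "$_\n" for @rows;
```

This keeps memory flat, since only one line of lookahead is held, but it shows the cost of the approach too: a subject object kept around after the caller has moved on will read HSPs that belong to a later subject, because all objects share the parser's position in the stream. Prefetching avoids that at the price of building every Subject and HSP up front.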

My plan is to write the xml parser first since that is really what I need
and map the BPlite objects into this later on.

-jason

On Thu, 18 Oct 2001, Jason Eric Stajich wrote:

> RFC: Bio::SearchIO and Bio::Search::* Objects
>
> Justification:
>
> We have some disjointed ways of doing blast parsing that were specific to
> the original implementations.  It would also be helpful if all sequence
> searching report parsing followed the same framework.  I would like to
> propose a cleaner object model for parsing the results of database
> sequence searching programs.  I feel that these are just a remapping of
> our existing objects and would like to see the API included in 1.0 if
> possible.
>
> Modules and Interfaces:
>
> Bio::SearchIO - Driver & Interface module.  This module would contain code
> that would instantiate the appropriate parser based on parameter input on
> initialization.  This is the same model as AlignIO and SeqIO systems
> currently in place.  May have some autodetection for format type as well
> if we feel up to that.
>
> The interface methods that would be required are
> next_subject()  - return the next subject (hit) as a Bio::Search::Subject
> database_name() - name of database
> database_size() - database size (if available)
> query_name()    - query name
> query_size()    - query size (if available)
>
>
> Bio::SearchIO::blast - parse blast text based reports (-m 0 parameter);
> would be a reimplementation/copy & paste of the current BPlite code.
>
> Bio::SearchIO::bl2seq - same as above, maps to BPbl2seq
> Bio::SearchIO::psiblast - same as above, maps to BPpsilite and can handle
> 			  psi and phi blast searches (?)
> Bio::SearchIO::xmlblast - parse NCBI xml and/or others depending on how
> 			  this is implemented
> Bio::SearchIO::fasta - someone can write a FASTA/SSEARCH/etc parser and
> 	 	       plug it in.
>   hmmer may fit in here as well.
>
>   Parsers for the output of other search programs can easily be written
>   and plugged into the system.
>
> Bio::Search::SubjectI - Handle the concept of a query hit in the search.
> methods:
> name()      - subject name
> length()    - subject length
>
> next_hsp()  - retrieve the next HSP from the stream
>
> Bio::Search::HSPI -
>
> percent_identical() 	- percentage id
> P()			- P (significance)
> frac_identical()	-
> hsp_length()		- length of HSP
> query_seq()		- get the query sequence (string or Seq?)
> subject_seq()		- get the subject sequence (string or Seq?)
> homology_seq()		- get the homology seq
> gaps()			- number of gaps
> positive()		- number of positive matches
> query_start()		- start of the HSP on the query
> query_end()		- end of the HSP on the query
> subject_start()		- start of the HSP on the subject
> subject_end()		- end of the HSP on the subject
>
> [this is where polymorphism could come in?]
> frame()			- translation frame (if tblastn/blastx/tblastx)
> report_type()		- type of report - program used
>                           FASTY/FASTX/TBLASTN, etc
>
> Potentially have polymorphism with different types of HSPs depending on
> tblastx/tblastn or just handle all of this inside one object as we
> currently do.
>
> -----------------
>
> Please provide any feedback or comments on whether you think this is a
> worthy undertaking and is approached correctly.
>
> Thanks.
> -jason
>
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu