[Bioperl-l] RFC: Bio::SearchIO and Bio::Search::* objects

Thu, 18 Oct 2001 17:53:11 -0400 (EDT)

RFC: Bio::SearchIO and Bio::Search::* Objects

Justification:

We have some disjointed ways of doing blast parsing that are specific to
the original implementations.  It would also be helpful if all parsing of
sequence search reports followed the same framework.  I would like to
propose a cleaner object model for parsing the results of database
sequence searching programs.  I feel that these are just a remapping of
our existing objects and would like to see the API included in 1.0 if
possible.

Modules and Interfaces:

Bio::SearchIO - Driver & Interface module.  This module would contain code
that would instantiate the appropriate parser based on parameter input on
initialization.  This is the same model as the AlignIO and SeqIO systems
currently in place.  May have some autodetection for format type as well
if we feel up to that.
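
As a minimal sketch of the driver idea described above (not an
implementation): the My::* package names, the -format and -file argument
names, and the default format are all assumptions made for illustration.
The real driver would presumably load the parser module on demand, which
this sketch does not attempt.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the proposed format-specific parsers.
package My::SearchIO::blast;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }

package My::SearchIO::fasta;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }

# Hypothetical driver: picks the parser class from the -format argument.
package My::SearchIO;
use Carp qw(croak);

sub new {
    my ($class, %args) = @_;
    my $format = lc( $args{-format} // 'blast' );   # default is an assumption
    croak "Bad format name '$format'" unless $format =~ /^\w+$/;

    my $parser_class = "My::SearchIO::$format";
    croak "No parser available for format '$format'"
        unless $parser_class->can('new');

    # Hand back the format-specific parser, not the driver itself.
    return $parser_class->new(%args);
}

package main;

my $in = My::SearchIO->new( -format => 'blast', -file => 'report.bls' );
print ref($in), "\n";    # prints My::SearchIO::blast
```

The caller only ever names the driver and a format, so a new parser can be
added without touching calling code.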

The interface methods that would be required are:
next_subject()  - return the next subject (hit) as a Bio::Search::Subject.

database_name() - name of database
database_size() - database size (if available)
query_name()    - query name
query_size()    - query size (if available)

Bio::SearchIO::blast - parse blast text-based reports (-m 0 parameter);
would be a reimplementation (copy & paste) of the current BPlite code.

Bio::SearchIO::bl2seq - same as above, maps to BPbl2seq
Bio::SearchIO::psiblast - same as above, maps to BPpsilite and can handle
			  psi and phi blast searches (?)
Bio::SearchIO::xmlblast - parse NCBI xml and/or others depending on how
			  this is implemented
Bio::SearchIO::fasta - someone can write a FASTA/SSEARCH/etc parser and
	 	       plug it in.
  hmmer may fit in here as well.

  Parsers for the output of other search programs can easily be written
  and plugged into the system.

Bio::Search::SubjectI - Handles the concept of a hit to the query in the search.
methods:
name()      - subject name
length()    - subject length

next_hsp()  - retrieve the next HSP from the stream
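
To illustrate how the stream-style methods might be used together, here is
a small self-contained sketch.  It uses in-memory mock objects rather than
a real parser; the My::* packages and the sample data are hypothetical, and
the two HSP accessors are borrowed from the Bio::Search::HSPI list in this
proposal.

```perl
use strict;
use warnings;

# Hypothetical in-memory HSP exposing two of the proposed HSPI accessors.
package My::Search::HSP;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub query_start { return $_[0]->{query_start} }
sub query_end   { return $_[0]->{query_end} }

# Hypothetical subject (hit) that hands out its HSPs one at a time.
package My::Search::Subject;
sub new { my ($class, %args) = @_; return bless { %args, _pos => 0 }, $class; }
sub name   { return $_[0]->{name} }
sub length { return $_[0]->{length} }
sub next_hsp {
    my ($self) = @_;
    return $self->{hsps}[ $self->{_pos}++ ];    # undef once exhausted
}

# Hypothetical report object standing in for a format-specific parser.
package My::SearchIO::memory;
sub new { my ($class, %args) = @_; return bless { %args, _pos => 0 }, $class; }
sub query_name { return $_[0]->{query_name} }
sub next_subject {
    my ($self) = @_;
    return $self->{subjects}[ $self->{_pos}++ ];    # undef once exhausted
}

package main;

my $report = My::SearchIO::memory->new(
    query_name => 'query1',
    subjects   => [
        My::Search::Subject->new(
            name => 'subjA', length => 300,
            hsps => [
                My::Search::HSP->new( query_start => 1,  query_end => 50 ),
                My::Search::HSP->new( query_start => 80, query_end => 120 ),
            ],
        ),
        My::Search::Subject->new( name => 'subjB', length => 150, hsps => [] ),
    ],
);

# Outer loop walks the subjects, inner loop walks each subject's HSPs.
my @lines;
while ( my $subject = $report->next_subject ) {
    my $n = 0;
    while ( my $hsp = $subject->next_hsp ) {
        $n++;
        push @lines, sprintf( "%s hsp %d: query %d..%d",
            $subject->name, $n, $hsp->query_start, $hsp->query_end );
    }
    push @lines, sprintf( "%s (length %d): %d HSP(s)",
        $subject->name, $subject->length, $n );
}
print "$_\n" for @lines;
```

Both iterators return undef when nothing is left, which is what lets the
plain while loops terminate.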

Bio::Search::HSPI -

percent_identical() 	- percentage id
P()			- P (significance)
frac_identical()	- fraction of identical positions
hsp_length()		- length of HSP
query_seq()		- get the query sequence (string or Seq?)
subject_seq()		- get the subject sequence (string or Seq?)
homology_seq()		- get the homology seq
gaps()			- number of gaps
positive()		- number of positive matches
query_start()		- start of the HSP on the query
query_end()		- end of the HSP on the query
subject_start()		- start of the HSP on the subject
subject_end()		- end of the HSP on the subject

[this is where polymorphism could come in?]
frame()			- translation frame (if tblastn/blastx/tblastx)
report_type()		- type of report - program used
                          FASTY/FASTX/TBLASTN, etc

Potentially have polymorphism with different types of HSPs depending on
tblastx/tblastn or just handle all of this inside one object as we
currently do.
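
For comparison, the polymorphic option could look roughly like the sketch
below.  This is only one possible shape: the My::* packages, the factory
helper, and the choice of which report types count as translated are
assumptions for illustration, not part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical base HSP: no translation, so no frame is reported.
package My::Search::HSP;
sub new { my ($class, %args) = @_; return bless { %args }, $class; }
sub report_type { return $_[0]->{report_type} }
sub frame       { return undef }    # explicit undef: "not applicable"

# Hypothetical subclass for translated searches; only frame() differs.
package My::Search::TranslatedHSP;
our @ISA = ('My::Search::HSP');
sub frame { return $_[0]->{frame} }

# Hypothetical helper that chooses the HSP class from the report type.
package My::Search::HSPFactory;
my %TRANSLATED = map { $_ => 1 } qw(TBLASTN BLASTX TBLASTX);

sub create {
    my ($class, %args) = @_;
    my $type      = uc( $args{report_type} // '' );
    my $hsp_class = $TRANSLATED{$type}
        ? 'My::Search::TranslatedHSP'
        : 'My::Search::HSP';
    return $hsp_class->new( %args, report_type => $type );
}

package main;

my $plain = My::Search::HSPFactory->create( report_type => 'BLASTN' );
my $trans = My::Search::HSPFactory->create( report_type => 'tblastn', frame => 2 );

for my $hsp ( $plain, $trans ) {
    printf "%s frame=%s\n", ref($hsp), $hsp->frame() // 'n/a';
}
```

Calling code uses frame() the same way on either class, so the choice
between one object and several is mostly about where the special cases
live.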

-----------------

Please provide any feedback or comments on whether you think this is a
worthy undertaking and whether it is approached correctly.

Thanks.
-jason

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu