[Bioperl-l] Naming consistency and Bioperl future search result parsing

Andrew Dalke <dalke@dalkescientific.com>
Tue, 1 Jan 2002 05:13:16 -0700


Jason:
>In an effort to make bioperl approachable to new developers and users
>alike we are trying to establish some consistency for naming of things.

I'm working on this for the Biopython code.  I would like to use
the same names as bioperl wherever possible.  (Although I'm currently
looking at the various sequence record classes.)  Do you have these
names listed anywhere besides the code?

The goal for me is to be able to do something like:

from Bio import SearchIO

result = SearchIO.parse(open("blastout.xml"))
print "Generated by", result.programname
print len(result.subjects), "subjects"
for subject in result.subjects:
  print subject.name

but I want to use the same names as you all do.

I scanned through the SearchIO code to try to figure them
out.  Here's what I came up with.

Report contains:
  db_name  -- string; example "plantPept"
  db_size  -- integer number of sequences; example "6920"
        # what about "total letters"?
  query_name -- string; example "GI_747847 prion protein"
  query_size -- integer number of characters in query; example "245"
  program_name -- string; example "BLASTP"
  program_version -- string; example "2.0a19MP-WashU"
     (or is that "2.0a19MP-WashU [05-Feb-1998]
      [Build decunix3.2 01:53:21 05-Feb-1998]"?)
  parameters -- a Parameters object
  statistics -- a Statistics object
  subjects -- list of Subject objects

Subject contains:
  report_type -- string?  example?
  name -- string; example?
  length -- integer; example "471"
  accession -- string; example?
  desc -- string; example?
  hsps -- list of HSP objects  # Biopython also calls them hsps, I think

The "example?"s are because, when given something like the following,
from blastp.2b.gz,

>PLANTPEPT:GI_1922937 Arabidopsis Similar to Glycine SRC2
            (gb|AB000130). ESTs  gb|H76869,gb|T21700,gb|ATTS5089
            come from this gene.

I don't know which parts go to name, accession, and desc.
I also thought there could be multiple database references, as in

>gi|3318709|pdb|1A91|  Subunit C Of The F1fo Atp Synthase Of
            Escherichia Coli; Nmr

Biopython just lumps them all together into "title".
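
One possible way to split such a definition line is sketched below.
This is only a guess at a rule, not what the bioperl code does: it
assumes the first whitespace-delimited token holds the identifier(s)
and that '|' separates alternating database/identifier fields.  The
function name split_defline is made up for illustration.

```python
def split_defline(defline):
    """Split a FASTA-style definition line into (ids, description).

    Assumption: the first token holds the identifier(s), and '|'
    separates alternating database/identifier fields.  Real headers
    do not always follow this pattern.
    """
    # Drop the leading '>' and join wrapped continuation lines.
    text = " ".join(defline.lstrip(">").split())
    token, _, description = text.partition(" ")
    fields = [f for f in token.split("|") if f]
    if len(fields) >= 2 and len(fields) % 2 == 0:
        # Pair database tags with identifiers: gi|3318709|pdb|1A91|
        ids = list(zip(fields[0::2], fields[1::2]))
    else:
        # No recognizable pairs; keep the whole token as one identifier.
        ids = [(None, token)]
    return ids, description


ids, desc = split_defline(
    ">gi|3318709|pdb|1A91|  Subunit C Of The F1fo Atp Synthase Of\n"
    "            Escherichia Coli; Nmr"
)
print(ids)
print(desc)
```

Under this rule the PLANTPEPT header above comes back as a single
identifier with no database tag, which shows why the mapping to name,
accession, and desc still needs to be spelled out.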

HSP:
  report_type -- string?  example?
    (Is this used because the HSP details change depending on
      the report type, and you don't want to derive a new class?)
  score -- integer; example "124"
  bits -- float; example "43.7"
  match -- integer = the number of identical matches; example "35"
         # Why isn't this "matches"?  Or "bit" instead of "bits"?
         # Or "num_matches"?  Biopython calls this "identities".
  hsp_length -- integer; example "95"
  positive -- integer; example "42"  # Why isn't this "positives"?
  gaps -- integer; example?  (I don't see a value for this in
         blastp.2b.gz.)  # Why is this "gaps" instead of "gap"?
  evalue -- float (log-odds) expectation value; example "3.0e-08"
       # Biopython calls this "expect"
  query:
    name -- string; is this identical to Report.query_name ?
    begin -- integer; example "29"
            # Does this start at 1 like the rest of Bioperl?
            # Why isn't this named 'start' like a Bio::Location?
            # It's also easy for XML people to remember, since SAX
            #   uses startElement/endElement :)
    end -- integer, starting at 1(?); example "88"
            # Is the character at position 'end' in the sequence?
    seq -- string; example "GGGGWG-QPH"
    length -- integer; computed from abs(end-begin)+1
  subject:
    name  -- string; is this identical to Subject.name ?
    begin  -- same as for query
    end    -- same as for query
    seq    -- same as for query
    length -- same as for query
  # Biopython also has "frame" and "strand"
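
A minimal Python sketch of the Report / Subject / HSP layout listed
above, to make the shape concrete.  The field names are the ones from
the lists; the class name HSPSequence and the 1-based, inclusive
coordinates are assumptions, and only a subset of the fields is shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HSPSequence:
    """The 'query' or 'subject' half of an HSP (class name is made up)."""
    name: str
    begin: int  # assumed to be 1-based
    end: int    # assumed to be inclusive
    seq: str

    @property
    def length(self):
        # Computed as in the listing: abs(end - begin) + 1
        return abs(self.end - self.begin) + 1


@dataclass
class HSP:
    score: int
    bits: float
    evalue: float
    match: int
    positive: int
    hsp_length: int
    query: HSPSequence
    subject: HSPSequence
    gaps: int = 0


@dataclass
class Subject:
    name: str
    length: int
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Report:
    program_name: str
    query_name: str
    subjects: List[Subject] = field(default_factory=list)


# Coordinates taken from the query example above (begin 29, end 88).
query_part = HSPSequence("GI_747847 prion protein", 29, 88, "GGGGWG-QPH")
print(query_part.length)
```

Taking abs() in length keeps the value positive even if a hit on the
reverse strand is reported with begin greater than end.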

Parameters:
  depends on the specific alignment search used.  For example:

FastaParameters:  (BTW, where is an example FASTA alignment output file?)
  matrix -- string
  ktup -- integer
  expect -- float
  include -- ?
  match -- ("sc-match")
  mismatch -- ("sc-mismatch")
  gapopen ...
  gapext
  wordsize
  ktup
  filter
  

Statistics:
  depends on the specific alignment search used.  For example:

FastaStatistics:
  dbnum
  dblength
  hsplength
  effectivespace
  kappa
  lambda
  entropy

# Biopython uses a "Parameters" object to store both Parameters
# and Statistics.

So did I read the code correctly?  If so, it makes sense to me
except I think you also need the middle line of the graphical
display:

Query:    98 KTNMKHMAGAAAAGAVVGGLGGYMLG 123
              +      G+   GA VGG GGY  G        <<-- this line
Sbjct:    88 GSGYGSGQGSGY-GAGVGGAGGYGSG 112

(I don't know what even to call it!)

It's needed because that's the only place which says which
residues are considered similar by the matrix used.  That is,
I assume it's figured out from the matrix.  If all the programs
use the same built-in definitions of "similar" then that
table can be built into bioperl/biopython/etc. as well.

On the other hand, Jeff Chang, who wrote the code for Biopython,
agrees with you all and not with me, so you can consider me
outvoted.  I needed it when I wanted to colorize the alignments
so that matches were one color, similars another, and
dissimilars a third.
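
As a sketch of that colorizing use case: given the three lines of an
alignment block, each column can be labeled from the middle line
alone.  This assumes the BLAST convention visible in the example above
(a letter for an identity, '+' for a similar pair, a space otherwise);
the function name classify_columns is made up.

```python
from collections import Counter


def classify_columns(query, middle, subject):
    """Label each alignment column using the BLAST-style middle line."""
    if not (len(query) == len(middle) == len(subject)):
        raise ValueError("all three alignment lines must have equal length")
    labels = []
    for q, m, s in zip(query, middle, subject):
        if q == "-" or s == "-":
            labels.append("gap")
        elif m == "+":
            labels.append("similar")
        elif m != " ":
            labels.append("identical")
        else:
            labels.append("dissimilar")
    return labels


# The alignment block quoted above; the middle line is built piecewise
# so that its runs of spaces stay exact.
query = "KTNMKHMAGAAAAGAVVGGLGGYMLG"
middle = " +" + " " * 6 + "G+" + " " * 3 + "GA VGG GGY" + " " * 2 + "G"
subject = "GSGYGSGQGSGY-GAGVGGAGGYGSG"

counts = Counter(classify_columns(query, middle, subject))
print(counts["identical"], counts["similar"], counts["dissimilar"], counts["gap"])
```

Without the middle line, the "similar" label could only be recovered
by consulting the scoring matrix the search program used.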

>Result - a database or pairwise alignment search run (formerly 'report').
>        Report can be used to refer to a
>

That text seems to stop abruptly.

>The Bio::Search and Bio::SearchIO classes and directories will be
>reorganized to only contain Query, Hit, HSP, & Result in the API.

What about Statistics and Parameters?

>Bioperl 1.0 should contain a robust event based parsing framework for
>search results.  We will focus on providing simple access to report data
>in the SearchIO system in a standard API for multiple search result
>formats.

BTW, have you been following the work I've been doing in Biopython
with Martel?  I'm using an event based parsing framework for
nearly all the parsing we do nowadays.

>Additionally groundwork has been laid by Steve C to provide lazy parsing
>for those with specific performance and flexibility needs.

Martel doesn't have lazy parsing, but it does have a way to rewrite the
parser to generate events for only selected tags.  That gives a
huge performance boost, but then again, I generate a lot more
events than what I've seen so far in the Search code.
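
To illustrate the idea of reacting only to selected tags, here is a
small sketch using Python's standard xml.sax module.  It is not
Martel: Martel rewrites the parser so unwanted events are never
generated, whereas this handler merely ignores them after the fact.
The XML snippet and its tag names are invented for the example.

```python
import xml.sax


class SelectedTagHandler(xml.sax.ContentHandler):
    """Collect (tag, text) pairs for a chosen set of tags only."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.events = []
        self._current = None  # tag currently being collected, if any
        self._chunks = []

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._current = name
            self._chunks = []

    def characters(self, content):
        # Text outside a wanted tag is dropped immediately.
        if self._current is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._current:
            self.events.append((name, "".join(self._chunks).strip()))
            self._current = None


# Hypothetical search result; not a real bioperl or Biopython format.
SAMPLE = b"""<result>
  <program_name>BLASTP</program_name>
  <program_version>2.0a19MP-WashU</program_version>
  <hsp><score>124</score><evalue>3.0e-08</evalue></hsp>
</result>"""

handler = SelectedTagHandler(["program_name", "evalue"])
xml.sax.parseString(SAMPLE, handler)
print(handler.events)
```

This sketch does not handle wanted tags nested inside other wanted
tags; it is only meant to show the event-filtering shape.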

Oh, and the latest cool thing with it is automatic file typing.  :)

                    Andrew
                    dalke@dalkescientific.com