[Bioperl-l] Naming consistency and Bioperl future search result parsing

Jason Stajich jason@cgt.mc.duke.edu
Tue, 1 Jan 2002 10:53:33 -0500 (EST)


On Tue, 1 Jan 2002, Andrew Dalke wrote:

> Jason:
> >In an effort to make bioperl approachable to new developers and users
> >alike we are trying to establish some consistency for naming of things.
>
> I'm working on this for the Biopython code.  I would like to use
> the same names as bioperl wherever possible.  (Although I'm currently
> looking at the various sequence record classes.)  Do you have these
> names listed anywhere besides the code?
>

Not yet - this is still very much a work in progress.

> The goal for me is to be able to do something like:
>
> from Bio import SearchIO
>
> result = SearchIO.parse(open("blastout.xml"))
> print "Generated by", result.programname
> print len(record.subjects), "subjects"
> for subject in subject:
>   print subject.name
>
> but I want to use identical names as you all do.
>
> I scanned through the SearchIO to try to figure them
> out.  Here's what I came up with.
>
These names were influenced greatly by the NCBI xml tags as I was building
my internal "SAX-like events" in mind and having to keep track of only 3
states inside the parser rather than the event handler - not a good
delegation of tasks I realize but it made sense to me at the time.
Trying to rework these after the objects get stabilized.

The XML I speak of is ncbi blast -m 7 (NCBI XML) -- see blastxml.pm for
the DTD (which I don't actually use to validate anything).

> Report contains:
>   db_name  -- string; example "plantPept"
>   db_size  -- integer number of sequences; example "6920"
>         # what about "total letters"?

Steve has spun this off into a database object - I'm debating whether or
not it is necessary.

>   query_name -- string; example "GI_747847 prion protein"
>   query_size -- integer number of characters in query; example "245"
>   program_name -- string; example "BLASTP"
>   program_version -- string; example "2.0a19MP-WashU"
>      (or is that
> "2.0a19MP-WashU [05-Feb-1998] [Build decunix3.2 01:53:21 05-Feb-1998]"?)

You get to decide when you write a parser - typically I only care about
the 2.0a19MP-WashU part

>   parameters -- a Parameters object
(a hash - key-value pairs)
>   statistics -- a Statistics object
(a hash - key-value pairs)
>   subjects = list of Subject objects
Don't stick with this - am transitioning subject --> hit
>
> Subject contains: (Hit)
Names here are in flux here!  I don't want to get stuck to this API as
this is not part of a stable release so I plan to rip things up
whereever it seems appropriate.

>   report_type -- string?  example?
 should be algorithm (BLASTX, BLASTP, TBLASTX etc)
>   name -- string; example?
>   length -- integer; example "471"
Length of the actual hit - not the alignment
>   accession -- string; example?
>   desc -- string; example?
>   hsps = list of HSP objects  # Biopython also calls them hsps, I think
>
> (the "example?"s are because when given something like the following,
> from blastp.2b.gz,
>
> >PLANTPEPT:GI_1922937 Arabidopsis Similar to Glycine SRC2
>             (gb|AB000130). ESTs  gb|H76869,gb|T21700,gb|ATTS5089
>             come from this gene.
>
> I don't know which parts go to name, accession, and desc.

In NCBI formatted dbs the following is what is assigned in the NCBI blast
xml output:
>gi|1234|ref|NP_1234.1|NP_1234 some desc here
name is 'gi|1234|ref|NP_1234.1|NP_1234'
accession should be 'NP_1234'
desc is 'some desc here' (everything else)

You could be smart and pick up multiple accessions or multiple references
but the DTD did not account for this as far as I can tell.

> I also though there could be multiple database references, as in
>
> >gi|3318709|pdb|1A91|  Subunit C Of The F1fo Atp Synthase Of
>             Escherichia Coli; Nmr
>
> Biopython just lumps them all together into "title"
>
> HSP:

Again, standardizing my names with what Steve has put together many of
these will change.  I just reused the names from the BPlite::HSP object.

>   report_type -- string?  example?
again - algorithm is the new name
>     (Is this used because the HSP details change depending on
>       the report type, and you don't want to derive a new class?)
yep.
>   score -- integer; example "124"
>   bits -- float; example "43.7"
>   match -- integer = the number of identical matches; example "35"
>          # Why isn't this "matches"?  Or "bit" instead of "bits"?
>          # Or "num_matches"?  Biopython calls this "identities".
>   hsp_length -- integer; example "95"
This can get confusing because $hsp->length acually means
$hsp->query->length() if you look at the documentation for SimilarityPair
hence a more specific name.

>   positive -- integer; example "42"  # Why isn't this "positives"?

Just propigating existing nomeclature - will see about this.

>   gaps -- integer; example? (I don't see a value for this in
>          blastp.2b.gz.)  # Why is this "gaps" instead of "gap"?
Some alignments don't have gaps.
Will look into standardizing plurality where possible.

>   evalue -- float (log-odds) expectation value; example "3.0e-08"
>        # Biopython calls this "expect"

We will likely support e, expect, and evalue as aliases for the same
thing.  I was initially trying to propigate a similar API to the
existing bioperl HSP objects, but want to instead choose the best names.

HSPs are Bio::SeqFeature::SimilarityPair objects so see that class for
more info.
>   query: -  This is a Bio::SeqFeature::Similarity
>     name -- string; is this identical to Report.query_name ?
>     begin -- integer; example "29"
I believe it is start - comes from Bio::SeqFeatureI/Bio::RangeI
>             # Does this start at 1 like the rest of Bioperl?
>             # Why isn't this named 'start' like a Bio::Location?
>             # It's also easy for XML people to remember, since SAX
>             #   uses startElement/endElement :)
>     end -- integer, starting at 1(?); example "88"
>             # Is the character at position 'end' in the sequence?
>     seq -- string; example "GGGGWG-QPH"
>     length -- integer; computed from abs(end-begin)+1
>   subject: renamed to hit(), is-a  Bio::SeqFeature::Similarity
>     name  -- string; is this identical to Subject.name ?
>     begin  -- same as for query
      start() not begin
>     end    -- same as for query
>     seq    -- same as for query
>     length -- same as for query
>   # Biopython also has "frame" and "strand"
We also have frame and strand as part of Bio::SeqFeature::Similarity
>




> Parameters:
>   depends on the specific alignment search used.
(hence my desire to leave it as a tag/value hash rather than its own
 object)

> For example:
>
> FastaParameters:  (BTW, where is an example FASTA alignment output file?)
t/data/cysprot1.FASTA, t/data/HUMBETGLOA.FASTA
>   matrix -- string
>   ktup -- integer
>   expect -- float
>   include -- ?
>   match -- ("sc-match")
>   mismatch -- ("sc-mismatch")
>   gapopen ...
>   gapext
>   wordsize
>   ktup
>   filter
>
>
> Statistics:
>   depends on the specific alignment search used.  For example:
>
> FastaStatistics:
>   dbnum
>   dblength
>   hsplength
>   effectivespace
>   kappa
>   lambda
>   entropy
>
> # Biopython uses a "Parameters" object to store both Parameters
> # and Statistics.
>
> So did I read the code correctly?  If so, it makes sense to me
> except I think you also need the middle line of the graphical
> display:
>

It's there - It's been called the homology sequence. stored in
HSP::homology_seq.

> Query:    98 KTNMKHMAGAAAAGAVVGGLGGYMLG 123
>               +      G+   GA VGG GGY  G        <<-- this line
> Sbjct:    88 GSGYGSGQGSGY-GAGVGGAGGYGSG 112
>
> (I don't know what even to call it!)

>
> It's needed because that's the only place which says which
> residues are considered similar by the matrix used.  That is,
> I assume it's figured out from the matrix.  If all the programs
> use the same built-in definitions of "similar" then that
> table can be built into bioperl/biopython/etc. as well.
>

I hard coded this when building the match line in MSA based on the
groups of weakly and strongly similar groups as provided in the CLUSTALW
documentation - you can see this in the Bio::SimpleAlign code for
match_line.  Not sure whether the groups are derived from the matricies

> On the other hand, Jeff Chang, who wrote the code for Biopython,
> agrees with you all and not with me, so you can consider me
> outvoted - I needed it when I wanted to colorize the alignments
> and so that matches where one color, similars another, and
> dissimilar a third.
>
> >Result - a database or pairwise alignment search run (formerly 'report').
> >        Report can be used to refer to a
> >
>
> That text seems abrubtly stopped.
oops - well you get the idea.
>
> >The Bio::Search and Bio::SearchIO classes and directories will be
> >reorganized to only contain Query, Hit, HSP, & Result in the API.
>
> What about Statistics and Parameters?
>
Don't think we need objects for them at this point.  Tag-value should be
enough.

> >Bioperl 1.0 should contain a robust event based parsing framework for
> >search results.  We will focus on providing simple access to report data
> >in the SearchIO system in a standard API for multiple search result
> >formats.
>
> BTW, have you been following the work I've been doing in Biopython
> with Martel?  I'm using an event based parsing framework for
> nearly all the parsing we do nowadays.
>

Not been following it as I am not really python literate - not really
clear how it would be ported.  Open to suggestions, improvements on
building the event-based parser - I like your conversion to SAX events,
but I'm not sure I see a big win for us with what I percieve as added
confusion and complexity to new developers.  Happy to be dissuaded.

> >Additionally groundwork has been laid by Steve C to provide lazy parsing
> >for those with specific performance and flexibilty needs.
>
> It doesn't have lazy parsing, but it does have a way to rewrite the
> parser to generate events for only selected tags.  That gives a
> huge performance boost, but then again, I generate a lot more
> events than what I've seen so far in the Search code.
>
> Oh, and the latest cool thing with it is automatic file typing.  :)
>

Would love to take advantages of your work - this is one of the
frustrating things about doing all this work in different languages.
Perhaps we can map some strategies about how to best share code between
languages at the hackathon?

>                     Andrew
>                     dalke@dalkescientific.com
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu