[BioPython] Better blasting with XML

Andrew Dalke dalke@dalkescientific.com
Wed, 21 Aug 2002 02:58:38 -0600


Peter Maxwell:
> Blast can produce parser friendly XML or tabular output so there is no need to 
> battle with the traditional blast report format.

I'll still argue that there is, for two reasons.  First, that's the format
that people expect to see.  Suppose you are making an intranet web system
which includes BLAST functionality.  How should the results be returned?
The "human" version? XML?  Tabular?  Most people expect the first of those three,
so there should be some way to work with the output in that form.

(BTW, another option is to dump to XML then have your code convert from
XML to a more human readable form, perhaps one which mimics the BLAST
report.  I haven't been too happy with the systems I've seen which do that.)

Second, suppose someone has a BLAST output, say from some run done by hand.
It would be useful to parse that file rather than try to rerun everything.


> I also wrote the XML blast output parser I needed.  It doesn't make an object 
> with the same interface as the current biopython blast parser because that 
> turned out to be too hard, the interface being very much influenced by the 
> details of the traditional blast report.  The XML schema is simpler, it is 
> directly based on the ASN.1 schema which in turn is very close to the C data 
> structures in the blast code itself.

I'm "working" (haven't touched it in 7 months (!)) on a rewrite of the BLAST
code which uses Martel and creates a new similarity search data structure that
can be shared among the different algorithms.  The idea is to be similar to
the bioperl SearchIO interface and to work better with the XML output from
BLAST.  I want to be able to say

  from Bio import SearchIO
  results = SearchIO.parse("filename")

and have it work with *anything*.
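
A minimal sketch of how such a format-agnostic entry point might dispatch,
assuming each supported format can be recognized from the first few characters
of the file.  Everything here (register, parse, the two stub parsers and the
dictionaries they return) is hypothetical and only illustrates the idea; it is
not the Martel-based code or any existing Biopython API, and it takes an open
handle rather than a filename to stay self-contained.

```python
import io

# Hypothetical registry of (sniffer, parser) pairs; these names are
# illustrative assumptions, not part of Biopython.
_FORMATS = []

def register(sniffer, parser):
    """Add a format: sniffer(head) -> bool, parser(text) -> result."""
    _FORMATS.append((sniffer, parser))

def parse(handle):
    """Guess the format from the start of the input, then dispatch."""
    text = handle.read()
    head = text[:200]
    for sniffer, parser in _FORMATS:
        if sniffer(head):
            return parser(text)
    raise ValueError("unrecognized search output format")

# Stub parsers: a real one would build the shared search-result structure.
def _parse_blast_xml(text):
    return {"format": "blast-xml", "size": len(text)}

def _parse_blast_text(text):
    return {"format": "blast-text", "size": len(text)}

# Assumption: XML output starts with an XML declaration and the
# traditional report starts with the program name, e.g. "BLASTP ...".
register(lambda head: head.lstrip().startswith("<?xml"), _parse_blast_xml)
register(lambda head: head.lstrip().startswith("BLAST"), _parse_blast_text)

result = parse(io.StringIO("BLASTP 2.2.1 [Apr-13-2001]\n"))
```

Supporting another program (FASTA, say) would then mean registering one more
sniffer/parser pair without touching the callers of parse().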

The biggest delay is purely my fault.  There isn't enough documentation for
anyone else to figure out what I've been doing with that code and how it's
supposed to go, and I haven't had the time (err, more like haven't had the
funding) to work on it.


> The code is GPL'ed for general distribution but I (and BioLateral) would be 
> happy to see any of it find its way into biopython so it is also available to 
> the biopython project for integration into biopython under biopython's 
> licence.

Interesting.  As I recall, Entigen liked Jeff's BLAST parser and used it in
lieu of their own.  Now I see that BioLateral was founded by Tim Littlejohn. 
Any interest in funding me to work on this?  No?  Well, had to ask.  Anyone
else on the list?  :)

Seriously, I looked at your code.  There are a few suggestions I have, which
I sent by personal email.

The biggest problem I have with it is the tight match it has with the BLAST
data structure.  I would like it to also support FASTA generically.  I just
don't know what FASTA needs, or which algorithms are similar enough to be
considered part of a 'SearchIO'.  Is BLAT one?  Any others?

I've also found that using [] for arbitrary properties is easier to comprehend
than using attribute lookups like you do.  For example, you change "eff-space"
to "eff_space" so people can say

  blah.parameters.eff_space

when I would rather have

  blah.parameters["eff-space"]

Otherwise people have to remember an extra renaming rule to get from the names
in the documentation to the names used in the code.

Combining those two points, I see "." as something which should be used for
properties which are common to many types of search results while []
should be used for things which are highly variable and specific to just
one algorithm.
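
A rough sketch of that split, assuming a hypothetical SearchResult class (the
class, its attribute names and the sample values are made up for illustration
and are not an existing Biopython interface):

```python
class SearchResult:
    """Hypothetical container illustrating the '.' versus [] convention."""

    def __init__(self, program, query, parameters=None):
        # Properties common to many kinds of search results: attributes.
        self.program = program
        self.query = query
        # Algorithm-specific values: a plain dict keyed by the exact names
        # the program itself uses, so no renaming rule is needed.
        self.parameters = dict(parameters or {})

# Made-up values, for illustration only.
blah = SearchResult("blastp", "my_query", {"eff-space": 1234567})

program = blah.program                     # common property, via "."
eff_space = blah.parameters["eff-space"]   # BLAST-specific, via []
```

The dictionary keeps "eff-space" spelled exactly as in the BLAST output, which
is not possible with attribute access since "-" is not legal in a Python
identifier.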

Could you look into bioperl's SearchIO system and see what they do, then
come up with a generic data structure for the various search programs which
is also more Pythonic than the perl API?  Then you can have the XML code
build that data structure and be part of biopython, while also having
something I can target the Martel-based system to use.

					Andrew
					dalke@dalkescientific.com