[Biopython-dev] [Biopython] Google Summer of Code Project: SearchIO in Biopython

Sun Apr 29 12:42:14 EDT 2012

On Sun, Apr 29, 2012 at 13:00, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi Bow,
>
> Thanks for updating the list. I'm replying just on the dev list
> as I'm focusing on implementation discussion in this reply.
>
> On Sat, Apr 28, 2012 at 1:08 PM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > 1. My main biopython branch for development:
> > https://github.com/bow/biopython/tree/searchio. Since I will be building
> > on
> > top of Peter's SearchIO branch (
> > https://github.com/peterjc/biopython/tree/search-io-test), right now it
> > only contains Peter's branch rebased against the latest master.
>
> Just to be clear - you don't have to start from that branch ;)
> http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html

OK :). I wasn't sure how much code from your previous branch I would
end up using, so I decided to rebase everything and see later how much
of it can be reused. But it's also easier to start clean :).

> As I said before, that may not be the best approach. The idea
> behind that code was to focus on the HSPs (in BLAST terms),
> and for the low level parsers to iterate over each HSP. Higher
> level wrappers can then batch these up by query/subject, or
> into the larger grouping of all the results for one query -
> which was the exposed high level Bio.SearchIO.parse
> function.
>
> That branch introduced a SearchResult object which was
> essentially something like a list or dict (like an OrderedDict
> in some ways), with some (unnecessary?) error checking for
> consistent contents (all from the same query). It also introduced
> a TopMatches object which was essentially a list (again,
> with some error checking for consistent contents).
>
> The advantage of using simple objects (OrderedDict
> and list) is simplicity and hopefully performance. But
> specific classes have the advantage of allowing more
> user friendly str/repr etc.
>
> The idea on this branch of focusing on iteration over the
> HSPs at the low level was that it allowed a lot of flexibility, and
> the low level parser could be used in conjunction with
> indexing to seek to a particular HSP and parse it, or go to
> the results for a particular query+match and parse its
> HSPs (not implemented on my old branch, but that was
> the plan).
>
> However, while this makes perfect sense for say the BLAST
> tabular output, it isn't quite such a good match for all the
> possible datatypes.
>
> For instance, BLAST plain text/html includes an e-value for
> a query/subject combination which is calculated from all the
> HSPs for that query/subject (taking into account order etc -
> I'd have to check the O'Reilly BLAST book for the details).
> This isn't in the tabular output, but the point is that it isn't a
> property of the individual HSPs, but of the match (group of
> HSPs).
>
> I think we need to consider the other main formats, and if
> all their important information lies at the HSP level or not.
> Perhaps iteration at the query+match level (groups of
> HSPs) would be best overall?
>
> Bow - If some of that doesn't make sense, I can try to clarify
> by email on the list, and/or we can talk about it at our next
> video chat. Also see if you can get the BLAST book from
> your library - it will probably be quite useful in this project
> even though it describes the 'legacy' BLAST suite:
>
> "BLAST" by Ian Korf, Mark Yandell, Joseph Bedell
> Publisher: O'Reilly Media, Released: July 2003
>
> Regards,
>
> Peter

I think I got the gist of it (please correct me if I'm wrong). Some
information about the search, such as the sequence-wide e-value, may
not be present at the HSP level. Ignoring it could let us focus on a
perhaps simpler and more flexible implementation with better
performance, but at the cost of the usefulness of the data itself,
since we would be throwing away information.

What I have in mind now is actually closer to iteration at the
query+subject level. To be clear, the object hierarchy that I propose
is this:

* Search object, to represent the entire search session.
* Result object, to represent a search with one query against the
database. Depending on the number of queries, we could have one to
several Result objects contained in a Search.
* Hit object, to represent a sequence hit. Depending on the search, we
could also have multiple Hits in one Result object.
* and finally, HSP object, to represent individual alignments.
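
To make the nesting concrete, here is a rough sketch of the hierarchy
above in plain Python. Only the four class names come from the
proposal; every attribute name (evalue, bitscore, query_id, program,
and so on) is a placeholder I made up for illustration, not a settled
API:

```python
class HSP(object):
    """One individual alignment. Fields are placeholders."""

    def __init__(self, evalue, bitscore):
        self.evalue = evalue
        self.bitscore = bitscore


class Hit(object):
    """One database sequence matched by the query, holding its HSPs."""

    def __init__(self, id, hsps=None, evalue=None):
        self.id = id
        self.hsps = list(hsps or [])
        # Hit-wide value derived from all HSPs together; some formats
        # do not report it, hence the None default (an assumption).
        self.evalue = evalue


class Result(object):
    """All hits for a single query against the database."""

    def __init__(self, query_id, hits=None):
        self.query_id = query_id
        self.hits = list(hits or [])


class Search(object):
    """The entire search session: one Result per query."""

    def __init__(self, program, results=None):
        self.program = program
        self.results = list(results or [])

    def __iter__(self):
        # Iterating over a Search yields Result objects.
        return iter(self.results)
```

The point of giving Hit its own optional evalue is that a value
computed over the whole group of HSPs has somewhere to live, instead
of being dropped or copied onto every HSP.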

Iteration is done at the Result level, so information is parsed per
search query rather than per HSP (I also wrote a very short
description of what I'm planning the objects to be here:
http://bit.ly/searchio-terms). I suppose if we aim for maximum
information parsing over performance and simplicity of the
format-specific parsers, this is the way to go. There are other
formats, too, that contain sequence-level search information not
present in the alignment (e.g. HMMER text output). What do you think
about this?
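
As a rough illustration of per-query iteration, here is a minimal
sketch of how a flat stream of HSP-level rows (roughly what a tabular
parser would produce) could be batched by query and then by subject.
The function name group_hsps and the (query_id, hit_id, evalue) tuple
layout are invented for this example and are not part of any existing
branch; it also assumes the rows arrive already ordered by query and
then by hit, as they normally are in the file:

```python
from itertools import groupby
from operator import itemgetter


def group_hsps(rows):
    """Yield (query_id, [(hit_id, [row, ...]), ...]) once per query.

    rows is an iterable of (query_id, hit_id, evalue) tuples, assumed
    to be sorted by query and then by hit (hypothetical layout).
    """
    for query_id, query_rows in groupby(rows, key=itemgetter(0)):
        hits = []
        # Second level of batching: group this query's rows by hit.
        for hit_id, hit_rows in groupby(query_rows, key=itemgetter(1)):
            hits.append((hit_id, list(hit_rows)))
        yield query_id, hits


# Made-up example rows: two queries, the first with two hits.
rows = [
    ("q1", "s1", 1e-30),
    ("q1", "s1", 1e-8),
    ("q1", "s2", 2e-4),
    ("q2", "s1", 5e-12),
]

for query_id, hits in group_hsps(rows):
    for hit_id, hsp_rows in hits:
        print(query_id, hit_id, len(hsp_rows))
```

This only covers formats where everything lives on the HSP rows. For
formats that carry hit-level or query-level values, the parser would
need to collect those separately while building each group.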

Thanks for the BLAST book suggestion. I'll see if I can find it in my
library in the meantime.

regards,
Bow