[Biopython-dev] SearchIO HSP indexing

Colin Archer colin.aibn at gmail.com
Sun Feb 10 07:28:36 UTC 2013


On Sun, Feb 10, 2013 at 2:56 AM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:

> Hi everyone,
>
> Colin, thanks for the feedback! Peter has explained the rationale
> behind the decision, so I would like to add that there is indeed
> an explanation of this behavior in the tutorial
> (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc106) and
> the code (
> https://github.com/biopython/biopython/blob/master/Bio/SearchIO/__init__.py#L100
> ).
> I do admit that the explanation in the code could be made clearer with
> some comments in hsp.py ~ which I can add :).
>
> As for your point about the alignment code:
>
> > I was wondering if there was any code in SearchIO to align high-scoring
> > segment pairs against the same hit? I see the fragmentation code but that
> > seems specific to BLAT results and when I look at the HSPFragments in the
> > QueryResult object it does not seem to combine multiple HSPs against the
> > same hit even if they are not overlapping.
>
> SearchIO relies on BLAST to do this ~ which has already grouped each
> HSP aligning to the same database sequence in one group (all of which
> is accessible through the Hit object). I've always assumed that if two
> HSPs came from the same database entry (Hit), they are grouped into
> one Hit by BLAST, regardless of whether they overlap or not. Have you
> seen any results from BLAST that show otherwise?
>
>
I have a couple of examples where BLAST doesn't combine the HSPs as you
would expect. This seems to occur mainly when the HSP alignments overlap,
so that combining them would mean introducing more gaps into each HSP.
Examples are *ftsK* in *E. coli* (ftsK.blast) and *aceF* in *E. coli*
(aceF.blast), both attached. In the second case, the first HSP spans the
entire query and two additional HSPs are overlapped by it.

I know that BioPerl tries to align/tile the HSPs (in
Bio::Search::BlastUtils) when required, but some people are hesitant to
use that method in certain situations (e.g., with tblastn results, where
it overestimates some of the metrics). BioPerl also implements additional
functionality so that the user can do a complete Smith-Waterman alignment
if they want to.
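
To make the tiling idea concrete, here is a rough sketch of merging the
query ranges of several HSPs against one hit, so that overlapping regions
are counted once. It is only an illustration and not BioPerl's actual
method. The function names (tile_ranges, covered_length) and the example
coordinates are made up; with SearchIO the ranges could presumably be
taken from each HSP's query_range.

```python
def tile_ranges(ranges):
    """Merge overlapping or touching half-open (start, end) ranges."""
    tiled = []
    for start, end in sorted(ranges):
        if tiled and start <= tiled[-1][1]:
            # Overlaps or touches the previous tile, so extend that tile.
            tiled[-1] = (tiled[-1][0], max(tiled[-1][1], end))
        else:
            # Starts after the previous tile ends, so open a new tile.
            tiled.append((start, end))
    return tiled


def covered_length(ranges):
    """Number of query positions covered by at least one range."""
    return sum(end - start for start, end in tile_ranges(ranges))


# Hypothetical query ranges for three HSPs against the same hit: the
# first spans the whole query and the other two are overlapped by it.
hsp_ranges = [(0, 630), (110, 200), (210, 305)]
print(tile_ranges(hsp_ranges))
print(covered_length(hsp_ranges))
```

Summing the raw HSP lengths here would count the overlapped regions more
than once, which is the kind of overestimate that tiling is meant to
avoid.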

Thanks
Colin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: aceF.blast
Type: application/octet-stream
Size: 12124 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20130210/34bb5bfb/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ftsK.blast
Type: application/octet-stream
Size: 18537 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20130210/34bb5bfb/attachment-0005.obj>

