[BioPython] blast

Wed Jul 23 13:53:50 UTC 2008

On Wed, Jul 23, 2008 at 3:51 PM, Julien Cigar <jcigar at ulb.ac.be> wrote:
> Hello list,
>
> I'm working on a project (on African rats) which has a "blast"
> functionnality. In our (PostgreSQL) database sequences are linked to a
> specimen. The sequence table contains the sequence itself (which is just
> a text field) and additional things like the accession number (if
> exists), etc.
>
> The idea behind this is that a user can enter a part of DNA and see what
> kind of specimen (taxonomy) it could be.
>
> So I created a FASTA file with all the sequences, and then I used
> formatdb to be able to BLAST on it.
>
> It works quite well, you can see it in action on
> http://projects.biodiversity.be/africanrodentia/search/dna
> In the output I use the score attribute to show how close the entered
> sequence is (is it OK ?). The problem is that the users would prefer a
> kind of percentage. This is a copy of a mail I received :

It is pretty, but doesn't look like the BLAST output many biologists
will be used to, so I think I understand why the want some more
information to judge the similarity.

> "Yet I have a question regarding the BLAST function: It works well but
> the output is unclear. Could one not indicate the % similarity  of the
> ranked sequences versus the pasted sequence instead of te rather
> enigmatic scores that we see now?  Such an output would make the
> interpretation clearer,  particularly in cases where we do not have the
> sequence of the investigated species. In such a case the species with
> the most similar sequence will be given as a potential species name, in
> itself not a problem as long as one knows how different they are (e.g.
> in rodents wthin species variability approx less <3%, between species
> differences typically >4%). This provides a direct indication whether
> the output is likely to be yield a correct species name."

I think they are asking for the percentage identify.  Another
possiblity is the bit-score which should be in the BLAST output as a
floating point number (hsp.score).

> I wondered how this "score" is calculated, and if there's already a
> function to convert it to a percentage ? Otherwise, is it OK to take the
> hsp.sbjct field as 100% and then compare it to hsp.query ?
>
> For example if I have :
>
> CCTATTCCTAGCTATACACTACACATCGGACACAACAAC
> |||||||||||| ||||||||||||||||||||||||||
> CCTATTCCTAGCCATACACTACACATCGGACACAACAAC
>
> The score is shown as "35", is it OK if I do something like :
>
> percentage = (100.0 / len(hsp.sbjct)) * hsp.match.count('|')
>
> so with my example it will be: (100 / 39) * 38

To get the percentage identify over the alignment, yes.  Watch out for
integer division, but because you multiplied by 100.0 it should be
fine.

I would also go directly from the relevant integer fields (which
should be the same).  Something like this (assuming you are using the
XML parser):

assert hsp.identities == hsp.match.count('|') #Only works on nucleotide blasts!
assert hsp.align_length == len(hsp.sbjct)
assert hsp.align_length == len(hsp.match)
assert hsp.align_length == len(hsp.query)
percentage_identity = (100.0 * hsp.identities) / hsp.align_length

Have you considered actually showing the blast alignment too?

Peter