[BioPython] blast

Wed Jul 23 14:51:21 UTC 2008

Hello list,

I'm working on a project (on African rats) which has a "blast"
functionnality. In our (PostgreSQL) database sequences are linked to a
specimen. The sequence table contains the sequence itself (which is just
a text field) and additional things like the accession number (if
exists), etc.

The idea behind this is that a user can enter a part of DNA and see what
kind of specimen (taxonomy) it could be.

So I created a FASTA file with all the sequences, and then I used
formatdb to be able to BLAST on it.

It works quite well, you can see it in action on
http://projects.biodiversity.be/africanrodentia/search/dna 
In the output I use the score attribute to show how close the entered
sequence is (is it OK ?). The problem is that the users would prefer a
kind of percentage. This is a copy of a mail I received :

"Yet I have a question regarding the BLAST function: It works well but
the output is unclear. Could one not indicate the % similarity  of the
ranked sequences versus the pasted sequence instead of te rather
enigmatic scores that we see now?  Such an output would make the
interpretation clearer,  particularly in cases where we do not have the
sequence of the investigated species. In such a case the species with
the most similar sequence will be given as a potential species name, in
itself not a problem as long as one knows how different they are (e.g.
in rodents wthin species variability approx less <3%, between species
differences typically >4%). This provides a direct indication whether
the output is likely to be yield a correct species name."

I wondered how this "score" is calculated, and if there's already a
function to convert it to a percentage ? Otherwise, is it OK to take the
hsp.sbjct field as 100% and then compare it to hsp.query ?

For example if I have :

CCTATTCCTAGCTATACACTACACATCGGACACAACAAC
|||||||||||| ||||||||||||||||||||||||||
CCTATTCCTAGCCATACACTACACATCGGACACAACAAC

The score is shown as "35", is it OK if I do something like :

percentage = (100.0 / len(hsp.sbjct)) * hsp.match.count('|')

so with my example it will be: (100 / 39) * 38

Many thanks,
Julien

-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar at ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52