[BioPython] blast

Wed Jul 23 16:20:21 UTC 2008

On Wed, 2008-07-23 at 14:53 +0100, Peter Cock wrote:
> On Wed, Jul 23, 2008 at 3:51 PM, Julien Cigar <jcigar at ulb.ac.be> wrote:
> > Hello list,
> >
> > I'm working on a project (on African rats) which has a "blast"
> > functionnality. In our (PostgreSQL) database sequences are linked to a
> > specimen. The sequence table contains the sequence itself (which is just
> > a text field) and additional things like the accession number (if
> > exists), etc.
> >
> > The idea behind this is that a user can enter a part of DNA and see what
> > kind of specimen (taxonomy) it could be.
> >
> > So I created a FASTA file with all the sequences, and then I used
> > formatdb to be able to BLAST on it.
> >
> > It works quite well, you can see it in action on
> > http://projects.biodiversity.be/africanrodentia/search/dna
> > In the output I use the score attribute to show how close the entered
> > sequence is (is it OK ?). The problem is that the users would prefer a
> > kind of percentage. This is a copy of a mail I received :
> 
> It is pretty, but doesn't look like the BLAST output many biologists
> will be used to, so I think I understand why the want some more
> information to judge the similarity.
> 

Yes, I'm not a biologist, I'm learning on the job .. so that explains
it :) (but it's interresting to learn)

> > "Yet I have a question regarding the BLAST function: It works well but
> > the output is unclear. Could one not indicate the % similarity  of the
> > ranked sequences versus the pasted sequence instead of te rather
> > enigmatic scores that we see now?  Such an output would make the
> > interpretation clearer,  particularly in cases where we do not have the
> > sequence of the investigated species. In such a case the species with
> > the most similar sequence will be given as a potential species name, in
> > itself not a problem as long as one knows how different they are (e.g.
> > in rodents wthin species variability approx less <3%, between species
> > differences typically >4%). This provides a direct indication whether
> > the output is likely to be yield a correct species name."
> 
> I think they are asking for the percentage identify.  Another
> possiblity is the bit-score which should be in the BLAST output as a
> floating point number (hsp.score).
> 

Yes, it's should be that, I forgot to take the alignement length into
account. I already showed the bit-score, but apparently it's not too
usefull for them.

> > I wondered how this "score" is calculated, and if there's already a
> > function to convert it to a percentage ? Otherwise, is it OK to take the
> > hsp.sbjct field as 100% and then compare it to hsp.query ?
> >
> > For example if I have :
> >
> > CCTATTCCTAGCTATACACTACACATCGGACACAACAAC
> > |||||||||||| ||||||||||||||||||||||||||
> > CCTATTCCTAGCCATACACTACACATCGGACACAACAAC
> >
> > The score is shown as "35", is it OK if I do something like :
> >
> > percentage = (100.0 / len(hsp.sbjct)) * hsp.match.count('|')
> >
> > so with my example it will be: (100 / 39) * 38
> 
> To get the percentage identify over the alignment, yes.  Watch out for
> integer division, but because you multiplied by 100.0 it should be
> fine.
> 
> I would also go directly from the relevant integer fields (which
> should be the same).  Something like this (assuming you are using the
> XML parser):
> 
> assert hsp.identities == hsp.match.count('|') #Only works on nucleotide blasts!
> assert hsp.align_length == len(hsp.sbjct)
> assert hsp.align_length == len(hsp.match)
> assert hsp.align_length == len(hsp.query)
> percentage_identity = (100.0 * hsp.identities) / hsp.align_length
> 

Thanks, I got another formula which seems to work well too (and which
seems cleaner than counting the number of '|') :

float(100) * float(hsp.score) / float(alignment.length)

> Have you considered actually showing the blast alignment too?
> 

Yes, it's in the output, I just replaced the '|' with a red background
and the ' ' with a green background, so they see directly where it
doesn't match

> Peter

Thanks,
Julien

-- 
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar at ulb.ac.be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52