[Bioperl-l] BLAST parameters

Peter Kos <kos@rite.or.jp>
Sat, 10 Aug 2002 20:39:34 +0900


Hi,

I do not know the throughput of the NCBI server or of the network.
Since Brian and Jason did not comment on this point, it may be
completely all right, but personally I would consider it a kind of
abuse. Not to mention that I would worry about whether a 10-day run
can finish without problems such as malfunctions or loss of data.
Would it not be possible (or simply more comfortable) to download the
database, set it up, and use it locally? That would surely finish
faster than 10 days, although of course it requires a reasonable (not
extreme) processor and some storage capacity. In return, you would be
done in about a day altogether, whereas for the 15,000 RemoteBlast
submissions you may need luck and a lot of patience.
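
As a quick sanity check of the 10-day figure from the quoted message
(15,000 sequences, 60 seconds between submissions), here is a small
Python calculation. It counts only the pauses between submissions, so
it is a lower bound; the time the searches themselves take is not
included, and I do not know what that would be. The function name is
just made up for this note.

```python
# Back-of-the-envelope estimate of the waiting time for the whole run.
# Both numbers come from the quoted message; nothing here talks to NCBI.
N_SEQUENCES = 15_000     # sequences to submit
DELAY_SECONDS = 60       # pause between two submissions
SECONDS_PER_DAY = 86_400


def total_days(n_sequences: int, delay_seconds: float) -> float:
    """Return the time spent on pauses alone, in days."""
    return n_sequences * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{total_days(N_SEQUENCES, DELAY_SECONDS):.1f} days")
```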

I really love the idea of RemoteBlast, as it saves lots of people
from tremendous agony, but a job of this magnitude really seems to me
a challenge.
Or is this exactly what RemoteBlast was created for, rather than for
a few (or a few dozen, or a few hundred) sequences? I really do not
know; I am just thinking out loud.

Anyway, have fun,
Peter

> I haven't used bioperl before, so some of these questions might be a
> little dumb, so flame away where needed.  Let me first give the goal,
> in case I'm missing something conceptual here:
>
> Goal:
> I have a long list of sequences (15,000) that I would like to
> identify.  In particular, I want to find out what (rat) cluster they
> most likely represent.
>
> Approach:
> - submit genes one by one to remote BLAST (it's a lot of BLASTing, so
>   I'm waiting 60 seconds between submissions; I do realize this will
>   take 10 days, btw, and I don't have access to a local BLAST)
> - retrieve the BLAST results and parse out the top ten hits by
>   e-value or bit-score (undecided if there is a reason to prefer
>   expectation values to the normalized bit-scores?)
> - for each of the top 10 hits, parse out the genbank accession
> - use this accession to determine the corresponding cluster (I expect
>   I will have to download the unigene .dat file to do this)
> - if I can assign a conclusive identity to the sequence, great; if
>   not, store the results for future analysis
>
> I hope to be able to automatically identify 70-80% of the sequences
> using selection criteria like:
> 2 top hits for same cluster
> 3 of the top 5 hits for same cluster
> 6 of the top 10 hits for same cluster
> or something similar.  The assignations don't have to be perfect,
> just reasonably close.
>
> Now, my (first) two problems involve submitting the BLAST to NCBI.
> I'm doing a test case with a 3-sequence FASTA file, btw.  What I
> would like is to restrict my BLAST searches to "Rattus norvegicus" as
> you can on the NCBI web-site under advanced options.
>
> In addition, I would like to be able to submit customized nucleotide
> substitution matrices to use with the BLAST.
>
> That latter point isn't as critical, but I really would like to
> avoid having to get back a pile of BLAST hits and have to filter
> through non-rat hits if possible.
>
> The RemoteBlast module accepts an @params array to its ->new()
> method, but I don't know what to call these parameters that I would
> like to use.
>
> Any comments, suggestions, ideas are very much welcome.
> Thanks in advance!
> Tats
>
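
The selection criteria in the quoted message (2 top hits, 3 of the top
5, or 6 of the top 10 for the same cluster) could be sketched roughly
as below. This is only an illustration in Python, not BioPerl code: it
assumes the hits have already been parsed and mapped to cluster IDs,
best hit first, with None where an accession has no cluster. The
function name, the rule order (first matching rule wins), and the
"Rn." style IDs in the example are my assumptions, not part of the
original question.

```python
from collections import Counter
from typing import Optional, Sequence

# (look at the top N hits, hits among them that must share one cluster),
# taken from the criteria in the quoted message.
RULES = ((2, 2), (5, 3), (10, 6))


def assign_cluster(hit_clusters: Sequence[Optional[str]]) -> Optional[str]:
    """Return a cluster ID if any rule is satisfied, otherwise None.

    hit_clusters holds the cluster ID of each BLAST hit, best hit first;
    None marks a hit whose accession could not be mapped to a cluster.
    """
    for top_n, needed in RULES:
        # Hits without a cluster still occupy a rank but cannot vote.
        window = [c for c in hit_clusters[:top_n] if c is not None]
        if not window:
            continue
        cluster, count = Counter(window).most_common(1)[0]
        if count >= needed:
            return cluster
    return None  # inconclusive: keep the hits for later analysis


if __name__ == "__main__":
    # Hypothetical cluster IDs, for illustration only.
    print(assign_cluster(["Rn.1", "Rn.2", "Rn.2", "Rn.3", "Rn.2"]))
```

Each threshold is more than half of its window, so at most one cluster
can satisfy a given rule. Different rules can still disagree with each
other (for example, two top hits in one cluster followed by three hits
in another), which is why the order of the rules matters here.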



............................................................................
Peter B. Kos, Ph.D.
Molecular Microbiology and Genetics Lab.
Research Institute of Innovative Technology for the Earth (RITE)
9-2 Kizugawadai, Kizu-cho, Soraku-gun,
Kyoto 619-0292 JAPAN
Phone: +81-774-75-2308
Fax: +81-774-75-2321
E-mail: kos@rite.or.jp