[Bioperl-l] different results with remote-blast skript

Jonas Schaer Brotelzwieb at gmx.de
Mon Jul 6 12:14:18 UTC 2009


Hi guys, thanks for your answers so far.
@jason: integer overflow in blast.... sorry, but what do you mean by that? 
how can I fix it...?

Since I never really changed any parameters I thought them all to be 
default. whatever, I tried to get "better" results with my prog by changing 
these:
$Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
$Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
$Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
$Bio::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'} = '1';
with no effect...I guess these were default values anyway.

So please maybe you can tell me all the other parameters I can change with 
my perl-skript AND how to do that?
Unfortunately both, perl and the blast-algorithm are pretty much new to me, 
maybe thats why I just cannot find out how to do that on my own... :/

Here is the output I get with my remote-blast skript:
#################################################################################################################
 Query Name: 
MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSL
L
        hit name is ref|XP_001702807.1|
                score is 442
BLASTP 2.2.21+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs", Nucleic
Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A.
Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin,
John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen
F. Altschul (2001), "Improving the accuracy of PSI-BLAST
protein database searches with composition-based statistics
and other refinements", Nucleic Acids Res. 29:2994-3005.


RID: 53STX5G2013


Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
from WGS projects
           9,252,587 sequences; 3,169,972,781 total letters
Query= 
MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSLL
DVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAM
ATGPDPDDEYE
Length=150


                                                                   Score 
E
Sequences producing significant alignments:                       (Bits) 
Value

ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...   174 
2e-42


ALIGNMENTS
>ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhardtii]
 gb|EDP06586.1| ClpS-like protein [Chlamydomonas reinhardtii]
Length=303

 Score =  174 bits (442),  Expect = 2e-42, Method: Composition-based stats.
 Identities = 150/150 (100%), Positives = 150/150 (100%), Gaps = 0/150 (0%)

Query  1    MGSSSVGTYHLLLVLMgaggeqqavqagaevaSTEQVDGSGMAANSRGSTSGSEQPPrds  60
            MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
Sbjct  154  MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS 
213

Query  61   dlgllrslldVAGVDRTalevkllalaeagaeMPPAQDSQATAAGVVATLTSVYRQQVAR 
120
            DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
Sbjct  214  DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR 
273

Query  121  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  150
            AWHERDDNAFRQAHQNTAMATGPDPDDEYE
Sbjct  274  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  303



  Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
excluding environmental samples from WGS projects
    Posted date:  Jul 5, 2009  4:41 AM
  Number of letters in database: -1,124,994,511
  Number of sequences in database:  9,252,587

Lambda     K      H
   0.309    0.122    0.345
Gapped
Lambda     K      H
   0.267   0.0410    0.140
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Sequences: 9252587
Number of Hits to DB: 60273703
Number of extensions: 1448367
Number of successful extensions: 2103
Number of sequences better than 10: 0
Number of HSP's better than 10 without gapping: 0
Number of HSP's gapped: 2113
Number of HSP's successfully gapped: 0
Length of query: 150
Length of database: 3169972781
Length adjustment: 113
Effective length of query: 37
Effective length of database: 2124430450
Effective search space: 78603926650
Effective search space used: 78603926650
T: 11
A: 40
X1: 16 (7.1 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 42 (20.8 bits)
S2: 74 (33.1 bits)

#################################################################################################################
and here are the hits (?) of the blast-algorithm on the ncbi-homepage with 
the same query of course:
ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...   300 
3e-80
ref|XP_001942719.1|  PREDICTED: similar to GA16705-PA [Acyrtho...  36.2 
1.1
ref|ZP_03781446.1|  hypothetical protein RUMHYD_00880 [Blautia...  35.4 
1.8
ref|XP_001563232.1|  leucyl-tRNA synthetase [Leishmania brazil...  34.3 
4.2
ref|XP_680841.1|  hypothetical protein AN7572.2 [Aspergillus n...  33.5 
6.0
ref|YP_001768110.1|  hypothetical protein M446_1150 [Methyloba...  33.5 
7.0
#################################################################################################################at 
least the first hit is the same, but even there there is a different score 
and e-value.

thanks so much for any help :)
regards, jonas


----- Original Message ----- 
From: "Chris Fields" <cjfields at illinois.edu>
To: "Jason Stajich" <jason at bioperl.org>
Cc: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz>; "'BioPerl 
List'" <bioperl-l at lists.open-bio.org>; "'Jonas Schaer'" 
<Jonas_Schaer at gmx.de>
Sent: Monday, July 06, 2009 12:51 AM
Subject: Re: [Bioperl-l] different results with remote-blast skript


> That inspires confidence ;>
>
> chris
>
> On Jul 5, 2009, at 4:40 PM, Jason Stajich wrote:
>
>> integer overflow in blast....
>>
>> On Jul 5, 2009, at 2:00 PM, Smithies, Russell wrote:
>>
>>> I'd guess it's a difference in the parameters used.
>>> Interesting that both have the number of letters in the db as
>>> "-1,125,070,205", I assume that's a bug  :-)
>>>
>>> Stats from your remote_blast:
>>>
>>> 'stats' => {
>>>            'S1' => '42',
>>>            'S1_bits' => '20.8',
>>>            'lambda' => '0.309',
>>>            'entropy' => '0.345',
>>>            'kappa_gapped' => '0.0410',
>>>            'T' => '11',
>>>            'kappa' => '0.122',
>>>            'X3_bits' => '24.7',
>>>            'X1' => '16',
>>>            'lambda_gapped' => '0.267',
>>>            'X2' => '38',
>>>            'S2' => '74',
>>>            'seqs_better_than_cutoff' => '0',
>>>            'posted_date' => 'Jul 4, 2009  4:41 AM',
>>>            'Hits_to_DB' => '60102303',
>>>            'dbletters' => '-1125070205',
>>>            'A' => '40',
>>>            'num_successful_extensions' => '2004',
>>>            'num_extensions' => '1436892',
>>>            'X1_bits' => '7.1',
>>>            'X3' => '64',
>>>            'entropy_gapped' => '0.140',
>>>            'dbentries' => '9252258',
>>>            'X2_bits' => '14.6',
>>>            'S2_bits' => '33.1'
>>>          }
>>>
>>>
>>> Stats from a blast done on the NCBI webpage:
>>>
>>> Database: All non-redundant GenBank CDS translations+PDB+SwissProt
>>> +PIR+PRF
>>> excluding environmental samples from WGS projects
>>>   Posted date:  Jul 4, 2009  4:41 AM
>>> Number of letters in database: -1,125,070,205
>>> Number of sequences in database:  9,252,258
>>>
>>> Lambda     K      H
>>>  0.309    0.124    0.340
>>> Gapped
>>> Lambda     K      H
>>>  0.267   0.0410    0.140
>>> Matrix: BLOSUM62
>>> Gap Penalties: Existence: 11, Extension: 1
>>> Number of Sequences: 9252258
>>> Number of Hits to DB: 86493230
>>> Number of extensions: 3101413
>>> Number of successful extensions: 9001
>>> Number of sequences better than 100: 65
>>> Number of HSP's better than 100 without gapping: 0
>>> Number of HSP's gapped: 9000
>>> Number of HSP's successfully gapped: 66
>>> Length of query: 150
>>> Length of database: 3169897087
>>> Length adjustment: 113
>>> Effective length of query: 37
>>> Effective length of database: 2124391933
>>> Effective search space: 78602501521
>>> Effective search space used: 78602501521
>>> T: 11
>>> A: 40
>>> X1: 16 (7.1 bits)
>>> X2: 38 (14.6 bits)
>>> X3: 64 (24.7 bits)
>>> S1: 42 (20.8 bits)
>>> S2: 65 (29.6 bits)
>>>
>>>
>>>> -----Original Message-----
>>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>>>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
>>>> Sent: Sunday, 28 June 2009 10:15 p.m.
>>>> To: BioPerl List
>>>> Subject: [Bioperl-l] different results with remote-blast skript
>>>>
>>>> Hi again :)
>>>> please, I only have this little question:
>>>> why do I get different results with my remote::blast perl skript
>>>> then on the
>>>> ncbi blast homepage?
>>>> I am using blastp, the query is an amino-sequence (different
>>>> results with any
>>>> sequence, differences not only in number of hits but even in e-
>>>> values, scores
>>>> etc...), the database is 'nr'.
>>>> PLEASE help me,
>>>> thank you in advance,
>>>> Jonas
>>>>
>>>> ps: my skript:
>>>> ##############################################################################
>>>> ##
>>>> use Bio::Seq::SeqFactory;
>>>> use Bio::Tools::Run::RemoteBlast;
>>>> use strict;
>>>> my @blast_report;
>>>> my $prog = 'blastp';
>>>> my $db   = 'nr';
>>>> my $e_val= '1e-10';
>>>> #my $e_val= '10';
>>>> my @params = ( '-prog' => $prog,
>>>>        '-data' => $db,
>>>>        '-expect' => $e_val,
>>>>        '-readmethod' => 'SearchIO' );
>>>> my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
>>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
>>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
>>>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
>>>> $
>>>> Bio
>>>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'}
>>>> = '1';
>>>>
>>>> my
>>>> $
>>>> blast_seq
>>>> ='MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLR
>>>> SLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAMATGPD
>>>> PDDEYE';
>>>> #$v is just to turn on and off the messages
>>>> my $v = 1;
>>>> my $seqbuilder = Bio::Seq::SeqFactory->new('-type' =>
>>>> 'Bio::PrimarySeq');
>>>> my $seq = $seqbuilder->create(-seq =>$blast_seq, -display_id =>
>>>> "$blast_seq");
>>>> my $filename='temp2.out';
>>>> my $r = $factory->submit_blast($seq);
>>>> print STDERR "waiting..." if( $v > 0 );
>>>>   while ( my @rids = $factory->each_rid )
>>>>   {
>>>>       foreach my $rid ( @rids )
>>>>       {
>>>>           my $rc = $factory->retrieve_blast($rid);
>>>>           if( !ref($rc) )
>>>>           {
>>>>               if( $rc < 0 )
>>>>               {
>>>>                   $factory->remove_rid($rid);
>>>>               }
>>>>               print STDERR "." if ( $v > 0 );
>>>>           }
>>>>               else
>>>>               {
>>>>                   my $result = $rc->next_result();
>>>>                   $factory->save_output($filename);
>>>>                   $factory->remove_rid($rid);
>>>>                   print "\nQuery Name: ", $result->query_name(),
>>>> "\n";
>>>>                   while ( my $hit = $result->next_hit )
>>>>                   {
>>>>                       next unless ( $v > 0);
>>>>                       print "\thit name is ", $hit->name, "\n";
>>>>                       while( my $hsp = $hit->next_hsp )
>>>>                       {
>>>>                           print "\t\tscore is ", $hsp->score, "\n";
>>>>                       }
>>>>                   }
>>>>               }
>>>>       }
>>>>
>>>>
>>>>   }
>>>> @blast_report = get_file_data ($filename);
>>>> return @blast_report;
>>>> ##############################################################################
>>>> ####
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> =
>>> =
>>> =====================================================================
>>> Attention: The information contained in this message and/or
>>> attachments
>>> from AgResearch Limited is intended only for the persons or entities
>>> to which it is addressed and may contain confidential and/or
>>> privileged
>>> material. Any review, retransmission, dissemination or other use
>>> of, or
>>> taking of any action in reliance upon, this information by persons or
>>> entities other than the intended recipients is prohibited by
>>> AgResearch
>>> Limited. If you have received this message in error, please notify
>>> the
>>> sender immediately.
>>> =
>>> =
>>> =====================================================================
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> Jason Stajich
>> jason at bioperl.org
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l


--------------------------------------------------------------------------------



No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.5.375 / Virus Database: 270.13.5/2219 - Release Date: 07/05/09 
05:53:00
-------------- next part --------------
A non-text attachment was scrubbed...
Name: U.gif
Type: image/gif
Size: 231 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20090706/47916715/attachment-0008.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: G.gif
Type: image/gif
Size: 358 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20090706/47916715/attachment-0009.gif>


More information about the Bioperl-l mailing list