[Bioperl-l] different results with remote-blast skript

Smithies, Russell Russell.Smithies at agresearch.co.nz
Mon Jul 6 16:56:41 EDT 2009


Hi Jonas,
You can't just play with the BLAST parameters and hope for a "better" result.
I'd suggest that if you aren't sure what they do, you should leave them alone as small changes can make huge differences in the output - it's quite possible to miss finding what you're looking for by using the wrong parameters.
If all else fails, read the blast manual: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_all.html
http://www.ncbi.nlm.nih.gov/blast/tutorial/
Or Read Ian Korfs' excellent book: http://books.google.com/books?id=xvcnhDG9fNUC&lpg=PR17&ots=WJpfuHF6Hn&dq=ian%20korf%20%20blast%20book&pg=PA3

Don't worry about the integer overflow bug as there's nothing you can do about it. If you're interested, Google and Wikipedia are your friends: http://en.wikipedia.org/wiki/Integer_overflow


Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
> Sent: Tuesday, 7 July 2009 12:14 a.m.
> To: BioPerl List; Chris Fields
> Subject: Re: [Bioperl-l] different results with remote-blast skript
>
> Hi guys, thanks for your answers so far.
> @jason: integer overflow in blast.... sorry, but what do you mean by that?
> how can I fix it...?
>
> Since I never really changed any parameters I thought them all to be default.
> whatever, I tried to get "better" results with my prog by changing
> these:
> $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
> $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
> $Bio::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'} = '1';
> with no effect...I guess these were default values anyway.
>
> So please maybe you can tell me all the other parameters I can change with my
> perl-skript AND how to do that?
> Unfortunately both, perl and the blast-algorithm are pretty much new to me,
> maybe thats why I just cannot find out how to do that on my own... :/
>
> Here is the output I get with my remote-blast skript:
> ##############################################################################
> ###################################
>  Query Name:
> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSL
> L
>         hit name is ref|XP_001702807.1|
>                 score is 442
> BLASTP 2.2.21+
> Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped
> BLAST and PSI-BLAST: a new generation of protein database search programs",
> Nucleic Acids Res. 25:3389-3402.
>
>
> Reference for composition-based statistics: Alejandro A.
> Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
> I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the
> accuracy of PSI-BLAST protein database searches with composition-based
> statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
>
>
> RID: 53STX5G2013
>
>
> Database: All non-redundant GenBank CDS
> translations+PDB+SwissProt+PIR+PRF excluding environmental samples
> from WGS projects
>            9,252,587 sequences; 3,169,972,781 total letters Query=
> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSLL
> DVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAM
> ATGPDPDDEYE
> Length=150
>
>
>                                                                    Score
> E
> Sequences producing significant alignments:                       (Bits)
> Value
>
> ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...   174
> 2e-42
>
>
> ALIGNMENTS
> >ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhardtii]
>  gb|EDP06586.1| ClpS-like protein [Chlamydomonas reinhardtii]
> Length=303
>
>  Score =  174 bits (442),  Expect = 2e-42, Method: Composition-based stats.
>  Identities = 150/150 (100%), Positives = 150/150 (100%), Gaps = 0/150 (0%)
>
> Query  1    MGSSSVGTYHLLLVLMgaggeqqavqagaevaSTEQVDGSGMAANSRGSTSGSEQPPrds  60
>             MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
> Sbjct  154  MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
> 213
>
> Query  61   dlgllrslldVAGVDRTalevkllalaeagaeMPPAQDSQATAAGVVATLTSVYRQQVAR
> 120
>             DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
> Sbjct  214  DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
> 273
>
> Query  121  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  150
>             AWHERDDNAFRQAHQNTAMATGPDPDDEYE
> Sbjct  274  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  303
>
>
>
>   Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
> excluding environmental samples from WGS projects
>     Posted date:  Jul 5, 2009  4:41 AM
>   Number of letters in database: -1,124,994,511
>   Number of sequences in database:  9,252,587
>
> Lambda     K      H
>    0.309    0.122    0.345
> Gapped
> Lambda     K      H
>    0.267   0.0410    0.140
> Matrix: BLOSUM62
> Gap Penalties: Existence: 11, Extension: 1
> Number of Sequences: 9252587
> Number of Hits to DB: 60273703
> Number of extensions: 1448367
> Number of successful extensions: 2103
> Number of sequences better than 10: 0
> Number of HSP's better than 10 without gapping: 0
> Number of HSP's gapped: 2113
> Number of HSP's successfully gapped: 0
> Length of query: 150
> Length of database: 3169972781
> Length adjustment: 113
> Effective length of query: 37
> Effective length of database: 2124430450
> Effective search space: 78603926650
> Effective search space used: 78603926650
> T: 11
> A: 40
> X1: 16 (7.1 bits)
> X2: 38 (14.6 bits)
> X3: 64 (24.7 bits)
> S1: 42 (20.8 bits)
> S2: 74 (33.1 bits)
>
> ##############################################################################
> ###################################
> and here are the hits (?) of the blast-algorithm on the ncbi-homepage with
> the same query of course:
> ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...   300
> 3e-80
> ref|XP_001942719.1|  PREDICTED: similar to GA16705-PA [Acyrtho...  36.2
> 1.1
> ref|ZP_03781446.1|  hypothetical protein RUMHYD_00880 [Blautia...  35.4
> 1.8
> ref|XP_001563232.1|  leucyl-tRNA synthetase [Leishmania brazil...  34.3
> 4.2
> ref|XP_680841.1|  hypothetical protein AN7572.2 [Aspergillus n...  33.5
> 6.0
> ref|YP_001768110.1|  hypothetical protein M446_1150 [Methyloba...  33.5
> 7.0
> ##############################################################################
> ###################################at
> least the first hit is the same, but even there there is a different score
> and e-value.
>
> thanks so much for any help :)
> regards, jonas
>
>
> ----- Original Message -----
> From: "Chris Fields" <cjfields at illinois.edu>
> To: "Jason Stajich" <jason at bioperl.org>
> Cc: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz>; "'BioPerl
> List'" <bioperl-l at lists.open-bio.org>; "'Jonas Schaer'"
> <Jonas_Schaer at gmx.de>
> Sent: Monday, July 06, 2009 12:51 AM
> Subject: Re: [Bioperl-l] different results with remote-blast skript
>
>
> > That inspires confidence ;>
> >
> > chris
> >
> > On Jul 5, 2009, at 4:40 PM, Jason Stajich wrote:
> >
> >> integer overflow in blast....
> >>
> >> On Jul 5, 2009, at 2:00 PM, Smithies, Russell wrote:
> >>
> >>> I'd guess it's a difference in the parameters used.
> >>> Interesting that both have the number of letters in the db as
> >>> "-1,125,070,205", I assume that's a bug  :-)
> >>>
> >>> Stats from your remote_blast:
> >>>
> >>> 'stats' => {
> >>>            'S1' => '42',
> >>>            'S1_bits' => '20.8',
> >>>            'lambda' => '0.309',
> >>>            'entropy' => '0.345',
> >>>            'kappa_gapped' => '0.0410',
> >>>            'T' => '11',
> >>>            'kappa' => '0.122',
> >>>            'X3_bits' => '24.7',
> >>>            'X1' => '16',
> >>>            'lambda_gapped' => '0.267',
> >>>            'X2' => '38',
> >>>            'S2' => '74',
> >>>            'seqs_better_than_cutoff' => '0',
> >>>            'posted_date' => 'Jul 4, 2009  4:41 AM',
> >>>            'Hits_to_DB' => '60102303',
> >>>            'dbletters' => '-1125070205',
> >>>            'A' => '40',
> >>>            'num_successful_extensions' => '2004',
> >>>            'num_extensions' => '1436892',
> >>>            'X1_bits' => '7.1',
> >>>            'X3' => '64',
> >>>            'entropy_gapped' => '0.140',
> >>>            'dbentries' => '9252258',
> >>>            'X2_bits' => '14.6',
> >>>            'S2_bits' => '33.1'
> >>>          }
> >>>
> >>>
> >>> Stats from a blast done on the NCBI webpage:
> >>>
> >>> Database: All non-redundant GenBank CDS translations+PDB+SwissProt
> >>> +PIR+PRF
> >>> excluding environmental samples from WGS projects
> >>>   Posted date:  Jul 4, 2009  4:41 AM
> >>> Number of letters in database: -1,125,070,205
> >>> Number of sequences in database:  9,252,258
> >>>
> >>> Lambda     K      H
> >>>  0.309    0.124    0.340
> >>> Gapped
> >>> Lambda     K      H
> >>>  0.267   0.0410    0.140
> >>> Matrix: BLOSUM62
> >>> Gap Penalties: Existence: 11, Extension: 1
> >>> Number of Sequences: 9252258
> >>> Number of Hits to DB: 86493230
> >>> Number of extensions: 3101413
> >>> Number of successful extensions: 9001
> >>> Number of sequences better than 100: 65
> >>> Number of HSP's better than 100 without gapping: 0
> >>> Number of HSP's gapped: 9000
> >>> Number of HSP's successfully gapped: 66
> >>> Length of query: 150
> >>> Length of database: 3169897087
> >>> Length adjustment: 113
> >>> Effective length of query: 37
> >>> Effective length of database: 2124391933
> >>> Effective search space: 78602501521
> >>> Effective search space used: 78602501521
> >>> T: 11
> >>> A: 40
> >>> X1: 16 (7.1 bits)
> >>> X2: 38 (14.6 bits)
> >>> X3: 64 (24.7 bits)
> >>> S1: 42 (20.8 bits)
> >>> S2: 65 (29.6 bits)
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >>>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
> >>>> Sent: Sunday, 28 June 2009 10:15 p.m.
> >>>> To: BioPerl List
> >>>> Subject: [Bioperl-l] different results with remote-blast skript
> >>>>
> >>>> Hi again :)
> >>>> please, I only have this little question:
> >>>> why do I get different results with my remote::blast perl skript
> >>>> then on the
> >>>> ncbi blast homepage?
> >>>> I am using blastp, the query is an amino-sequence (different
> >>>> results with any
> >>>> sequence, differences not only in number of hits but even in e-
> >>>> values, scores
> >>>> etc...), the database is 'nr'.
> >>>> PLEASE help me,
> >>>> thank you in advance,
> >>>> Jonas
> >>>>
> >>>> ps: my skript:
> >>>>
> ##############################################################################
> >>>> ##
> >>>> use Bio::Seq::SeqFactory;
> >>>> use Bio::Tools::Run::RemoteBlast;
> >>>> use strict;
> >>>> my @blast_report;
> >>>> my $prog = 'blastp';
> >>>> my $db   = 'nr';
> >>>> my $e_val= '1e-10';
> >>>> #my $e_val= '10';
> >>>> my @params = ( '-prog' => $prog,
> >>>>        '-data' => $db,
> >>>>        '-expect' => $e_val,
> >>>>        '-readmethod' => 'SearchIO' );
> >>>> my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
> >>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
> >>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
> >>>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
> >>>> $
> >>>> Bio
> >>>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'}
> >>>> = '1';
> >>>>
> >>>> my
> >>>> $
> >>>> blast_seq
> >>>> ='MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLR
> >>>>
> SLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAMATGPD
> >>>> PDDEYE';
> >>>> #$v is just to turn on and off the messages
> >>>> my $v = 1;
> >>>> my $seqbuilder = Bio::Seq::SeqFactory->new('-type' =>
> >>>> 'Bio::PrimarySeq');
> >>>> my $seq = $seqbuilder->create(-seq =>$blast_seq, -display_id =>
> >>>> "$blast_seq");
> >>>> my $filename='temp2.out';
> >>>> my $r = $factory->submit_blast($seq);
> >>>> print STDERR "waiting..." if( $v > 0 );
> >>>>   while ( my @rids = $factory->each_rid )
> >>>>   {
> >>>>       foreach my $rid ( @rids )
> >>>>       {
> >>>>           my $rc = $factory->retrieve_blast($rid);
> >>>>           if( !ref($rc) )
> >>>>           {
> >>>>               if( $rc < 0 )
> >>>>               {
> >>>>                   $factory->remove_rid($rid);
> >>>>               }
> >>>>               print STDERR "." if ( $v > 0 );
> >>>>           }
> >>>>               else
> >>>>               {
> >>>>                   my $result = $rc->next_result();
> >>>>                   $factory->save_output($filename);
> >>>>                   $factory->remove_rid($rid);
> >>>>                   print "\nQuery Name: ", $result->query_name(),
> >>>> "\n";
> >>>>                   while ( my $hit = $result->next_hit )
> >>>>                   {
> >>>>                       next unless ( $v > 0);
> >>>>                       print "\thit name is ", $hit->name, "\n";
> >>>>                       while( my $hsp = $hit->next_hsp )
> >>>>                       {
> >>>>                           print "\t\tscore is ", $hsp->score, "\n";
> >>>>                       }
> >>>>                   }
> >>>>               }
> >>>>       }
> >>>>
> >>>>
> >>>>   }
> >>>> @blast_report = get_file_data ($filename);
> >>>> return @blast_report;
> >>>>
> ##############################################################################
> >>>> ####
> >>>> _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>> =
> >>> =
> >>> =====================================================================
> >>> Attention: The information contained in this message and/or
> >>> attachments
> >>> from AgResearch Limited is intended only for the persons or entities
> >>> to which it is addressed and may contain confidential and/or
> >>> privileged
> >>> material. Any review, retransmission, dissemination or other use
> >>> of, or
> >>> taking of any action in reliance upon, this information by persons or
> >>> entities other than the intended recipients is prohibited by
> >>> AgResearch
> >>> Limited. If you have received this message in error, please notify
> >>> the
> >>> sender immediately.
> >>> =
> >>> =
> >>> =====================================================================
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> --
> >> Jason Stajich
> >> jason at bioperl.org
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> ------------------------------------------------------------------------------
> --
>
>
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 8.5.375 / Virus Database: 270.13.5/2219 - Release Date: 07/05/09
> 05:53:00



More information about the Bioperl-l mailing list