[Bioperl-l] cdd-search with remoteblast?

Chris Fields cjfields at illinois.edu
Thu Jul 9 15:08:53 UTC 2009


I'm not sure, but I think adding this in will take a little work  
(we'll need to catch the RID returned, which I'm fairly sure will  
require some modifications to checking the returned output).  I would  
also have to look at the RemoteBlast API to see how this would fit in  
(I'm assuming we could either lump it in with other returned RIDs or  
create a new method for that).

You are more than welcome to add this in as an enhancement request to  
bugzilla for BioPerl:

http://bugzilla.open-bio.org/

chris

On Jul 9, 2009, at 5:16 AM, Jonas Schaer wrote:

> Hi guys,
> Thank you all so much for your help and patience :). Of course you  
> were right and I finaly found the right put-parameter to get exactly  
> the same hits as on the homepage.
> I do have an other question though :)...
> I now want to include a search for conserved domains, but when I try  
> to use the CDD_SEARCH-parameter (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node16.html#sub 
> :CDD_SEARCH) like the other put-parameters the way chris once told  
> me(works fine with the other params):
>
> my %put = (
>   WORD_SIZE => 3,
>   HITLIST_SIZE => 100,
>   THRESHOLD => 11,
>   FILTER => 'R',
>   GENETIC_CODE => 1,
>   CDD_SEARCH => 'on'                                      ###I tried  
> it with 'true' and '1', too.
>
> );
>
> for my $putName (keys %put) {
>   $factory->submit_parameter($putName,$put{$putName});
> }
>
>
> ...an exception is thrown:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: CDD_SEARCH is not a valid PUT parameter.
> STACK: Error::throw
> STACK: Bio::Root::Root::throw C:/Perl/site/lib/Bio/Root/Root.pm:359
> STACK: Bio::Tools::Run::RemoteBlast::submit_parameter C:/Perl/site/ 
> lib/Bio/Tools
> /Run/RemoteBlast.pm:325
> STACK: main::blast_a_sequence firsteval0.8.pm:383
> STACK: main::blast_it firsteval0.8.pm:288
> STACK: firsteval0.8.pm:35
> ----------------------------------------------------------- .
> I guess somehow this could be the solution to my problem:
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node78.html#sub:RID- 
> for-Simultaneous
> , but unfortunately I don't understand what to do.
> I'm so sorry to bother you with this but please help me once more...:)
>
> Best regards and thanks in advance,
> Jonas
>
> ----- Original Message ----- From: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz 
> >
> To: "'Jonas Schaer'" <Brotelzwieb at gmx.de>
> Cc: "'Chris Fields'" <cjfields at illinois.edu>; "'BioPerl List'" <bioperl-l at lists.open-bio.org 
> >
> Sent: Monday, July 06, 2009 10:56 PM
> Subject: RE: [Bioperl-l] different results with remote-blast skript
>
>
> Hi Jonas,
> You can't just play with the BLAST parameters and hope for a  
> "better" result.
> I'd suggest that if you aren't sure what they do, you should leave  
> them alone as small changes can make huge differences in the output  
> - it's quite possible to miss finding what you're looking for by  
> using the wrong parameters.
> If all else fails, read the blast manual: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_all.html
> http://www.ncbi.nlm.nih.gov/blast/tutorial/
> Or Read Ian Korfs' excellent book: http://books.google.com/books?id=xvcnhDG9fNUC&lpg=PR17&ots=WJpfuHF6Hn&dq=ian%20korf%20%20blast%20book&pg=PA3
>
> Don't worry about the integer overflow bug as there's nothing you  
> can do about it. If you're interested, Google and Wikipedia are your  
> friends: http://en.wikipedia.org/wiki/Integer_overflow
>
>
> Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
>> Sent: Tuesday, 7 July 2009 12:14 a.m.
>> To: BioPerl List; Chris Fields
>> Subject: Re: [Bioperl-l] different results with remote-blast skript
>>
>> Hi guys, thanks for your answers so far.
>> @jason: integer overflow in blast.... sorry, but what do you mean  
>> by that?
>> how can I fix it...?
>>
>> Since I never really changed any parameters I thought them all to  
>> be default.
>> whatever, I tried to get "better" results with my prog by changing
>> these:
>> $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
>> $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
>> $ 
>> Bio 
>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'} =  
>> '1';
>> with no effect...I guess these were default values anyway.
>>
>> So please maybe you can tell me all the other parameters I can  
>> change with my
>> perl-skript AND how to do that?
>> Unfortunately both, perl and the blast-algorithm are pretty much  
>> new to me,
>> maybe thats why I just cannot find out how to do that on my own... :/
>>
>> Here is the output I get with my remote-blast skript:
>> ##############################################################################
>> ###################################
>> Query Name:
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSL
>> L
>>        hit name is ref|XP_001702807.1|
>>                score is 442
>> BLASTP 2.2.21+
>> Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.  
>> Schaffer,
>> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman  
>> (1997), "Gapped
>> BLAST and PSI-BLAST: a new generation of protein database search  
>> programs",
>> Nucleic Acids Res. 25:3389-3402.
>>
>>
>> Reference for composition-based statistics: Alejandro A.
>> Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L.  
>> Spouge, Yuri
>> I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),  
>> "Improving the
>> accuracy of PSI-BLAST protein database searches with composition- 
>> based
>> statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
>>
>>
>> RID: 53STX5G2013
>>
>>
>> Database: All non-redundant GenBank CDS
>> translations+PDB+SwissProt+PIR+PRF excluding environmental samples
>> from WGS projects
>>           9,252,587 sequences; 3,169,972,781 total letters Query=
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSLL
>> DVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAM
>> ATGPDPDDEYE
>> Length=150
>>
>>
>>                                                                    
>> Score
>> E
>> Sequences producing significant alignments:                        
>> (Bits)
>> Value
>>
>> ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...    
>> 174
>> 2e-42
>>
>>
>> ALIGNMENTS
>> >ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhardtii]
>> gb|EDP06586.1| ClpS-like protein [Chlamydomonas reinhardtii]
>> Length=303
>>
>> Score =  174 bits (442),  Expect = 2e-42, Method: Composition-based  
>> stats.
>> Identities = 150/150 (100%), Positives = 150/150 (100%), Gaps =  
>> 0/150 (0%)
>>
>> Query  1     
>> MGSSSVGTYHLLLVLMgaggeqqavqagaevaSTEQVDGSGMAANSRGSTSGSEQPPrds 60
>>             
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
>> Sbjct  154   
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
>> 213
>>
>> Query  61    
>> dlgllrslldVAGVDRTalevkllalaeagaeMPPAQDSQATAAGVVATLTSVYRQQVAR
>> 120
>>             
>> DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
>> Sbjct  214   
>> DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
>> 273
>>
>> Query  121  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  150
>>            AWHERDDNAFRQAHQNTAMATGPDPDDEYE
>> Sbjct  274  AWHERDDNAFRQAHQNTAMATGPDPDDEYE  303
>>
>>
>>
>>  Database: All non-redundant GenBank CDS translations+PDB+SwissProt 
>> +PIR+PRF
>> excluding environmental samples from WGS projects
>>    Posted date:  Jul 5, 2009  4:41 AM
>>  Number of letters in database: -1,124,994,511
>>  Number of sequences in database:  9,252,587
>>
>> Lambda     K      H
>>   0.309    0.122    0.345
>> Gapped
>> Lambda     K      H
>>   0.267   0.0410    0.140
>> Matrix: BLOSUM62
>> Gap Penalties: Existence: 11, Extension: 1
>> Number of Sequences: 9252587
>> Number of Hits to DB: 60273703
>> Number of extensions: 1448367
>> Number of successful extensions: 2103
>> Number of sequences better than 10: 0
>> Number of HSP's better than 10 without gapping: 0
>> Number of HSP's gapped: 2113
>> Number of HSP's successfully gapped: 0
>> Length of query: 150
>> Length of database: 3169972781
>> Length adjustment: 113
>> Effective length of query: 37
>> Effective length of database: 2124430450
>> Effective search space: 78603926650
>> Effective search space used: 78603926650
>> T: 11
>> A: 40
>> X1: 16 (7.1 bits)
>> X2: 38 (14.6 bits)
>> X3: 64 (24.7 bits)
>> S1: 42 (20.8 bits)
>> S2: 74 (33.1 bits)
>>
>> ##############################################################################
>> ###################################
>> and here are the hits (?) of the blast-algorithm on the ncbi- 
>> homepage with
>> the same query of course:
>> ref|XP_001702807.1|  ClpS-like protein [Chlamydomonas reinhard...    
>> 300
>> 3e-80
>> ref|XP_001942719.1|  PREDICTED: similar to GA16705-PA [Acyrtho...   
>> 36.2
>> 1.1
>> ref|ZP_03781446.1|  hypothetical protein RUMHYD_00880 [Blautia...   
>> 35.4
>> 1.8
>> ref|XP_001563232.1|  leucyl-tRNA synthetase [Leishmania brazil...   
>> 34.3
>> 4.2
>> ref|XP_680841.1|  hypothetical protein AN7572.2 [Aspergillus n...   
>> 33.5
>> 6.0
>> ref|YP_001768110.1|  hypothetical protein M446_1150 [Methyloba...   
>> 33.5
>> 7.0
>> ##############################################################################
>> ###################################at
>> least the first hit is the same, but even there there is a  
>> different score
>> and e-value.
>>
>> thanks so much for any help :)
>> regards, jonas
>>
>>
>> ----- Original Message -----
>> From: "Chris Fields" <cjfields at illinois.edu>
>> To: "Jason Stajich" <jason at bioperl.org>
>> Cc: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz>;  
>> "'BioPerl
>> List'" <bioperl-l at lists.open-bio.org>; "'Jonas Schaer'"
>> <Jonas_Schaer at gmx.de>
>> Sent: Monday, July 06, 2009 12:51 AM
>> Subject: Re: [Bioperl-l] different results with remote-blast skript
>>
>>
>> > That inspires confidence ;>
>> >
>> > chris
>> >
>> > On Jul 5, 2009, at 4:40 PM, Jason Stajich wrote:
>> >
>> >> integer overflow in blast....
>> >>
>> >> On Jul 5, 2009, at 2:00 PM, Smithies, Russell wrote:
>> >>
>> >>> I'd guess it's a difference in the parameters used.
>> >>> Interesting that both have the number of letters in the db as
>> >>> "-1,125,070,205", I assume that's a bug  :-)
>> >>>
>> >>> Stats from your remote_blast:
>> >>>
>> >>> 'stats' => {
>> >>>            'S1' => '42',
>> >>>            'S1_bits' => '20.8',
>> >>>            'lambda' => '0.309',
>> >>>            'entropy' => '0.345',
>> >>>            'kappa_gapped' => '0.0410',
>> >>>            'T' => '11',
>> >>>            'kappa' => '0.122',
>> >>>            'X3_bits' => '24.7',
>> >>>            'X1' => '16',
>> >>>            'lambda_gapped' => '0.267',
>> >>>            'X2' => '38',
>> >>>            'S2' => '74',
>> >>>            'seqs_better_than_cutoff' => '0',
>> >>>            'posted_date' => 'Jul 4, 2009  4:41 AM',
>> >>>            'Hits_to_DB' => '60102303',
>> >>>            'dbletters' => '-1125070205',
>> >>>            'A' => '40',
>> >>>            'num_successful_extensions' => '2004',
>> >>>            'num_extensions' => '1436892',
>> >>>            'X1_bits' => '7.1',
>> >>>            'X3' => '64',
>> >>>            'entropy_gapped' => '0.140',
>> >>>            'dbentries' => '9252258',
>> >>>            'X2_bits' => '14.6',
>> >>>            'S2_bits' => '33.1'
>> >>>          }
>> >>>
>> >>>
>> >>> Stats from a blast done on the NCBI webpage:
>> >>>
>> >>> Database: All non-redundant GenBank CDS translations+PDB 
>> +SwissProt
>> >>> +PIR+PRF
>> >>> excluding environmental samples from WGS projects
>> >>>   Posted date:  Jul 4, 2009  4:41 AM
>> >>> Number of letters in database: -1,125,070,205
>> >>> Number of sequences in database:  9,252,258
>> >>>
>> >>> Lambda     K      H
>> >>>  0.309    0.124    0.340
>> >>> Gapped
>> >>> Lambda     K      H
>> >>>  0.267   0.0410    0.140
>> >>> Matrix: BLOSUM62
>> >>> Gap Penalties: Existence: 11, Extension: 1
>> >>> Number of Sequences: 9252258
>> >>> Number of Hits to DB: 86493230
>> >>> Number of extensions: 3101413
>> >>> Number of successful extensions: 9001
>> >>> Number of sequences better than 100: 65
>> >>> Number of HSP's better than 100 without gapping: 0
>> >>> Number of HSP's gapped: 9000
>> >>> Number of HSP's successfully gapped: 66
>> >>> Length of query: 150
>> >>> Length of database: 3169897087
>> >>> Length adjustment: 113
>> >>> Effective length of query: 37
>> >>> Effective length of database: 2124391933
>> >>> Effective search space: 78602501521
>> >>> Effective search space used: 78602501521
>> >>> T: 11
>> >>> A: 40
>> >>> X1: 16 (7.1 bits)
>> >>> X2: 38 (14.6 bits)
>> >>> X3: 64 (24.7 bits)
>> >>> S1: 42 (20.8 bits)
>> >>> S2: 65 (29.6 bits)
>> >>>
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> >>>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
>> >>>> Sent: Sunday, 28 June 2009 10:15 p.m.
>> >>>> To: BioPerl List
>> >>>> Subject: [Bioperl-l] different results with remote-blast skript
>> >>>>
>> >>>> Hi again :)
>> >>>> please, I only have this little question:
>> >>>> why do I get different results with my remote::blast perl skript
>> >>>> then on the
>> >>>> ncbi blast homepage?
>> >>>> I am using blastp, the query is an amino-sequence (different
>> >>>> results with any
>> >>>> sequence, differences not only in number of hits but even in e-
>> >>>> values, scores
>> >>>> etc...), the database is 'nr'.
>> >>>> PLEASE help me,
>> >>>> thank you in advance,
>> >>>> Jonas
>> >>>>
>> >>>> ps: my skript:
>> >>>>
>> ##############################################################################
>> >>>> ##
>> >>>> use Bio::Seq::SeqFactory;
>> >>>> use Bio::Tools::Run::RemoteBlast;
>> >>>> use strict;
>> >>>> my @blast_report;
>> >>>> my $prog = 'blastp';
>> >>>> my $db   = 'nr';
>> >>>> my $e_val= '1e-10';
>> >>>> #my $e_val= '10';
>> >>>> my @params = ( '-prog' => $prog,
>> >>>>        '-data' => $db,
>> >>>>        '-expect' => $e_val,
>> >>>>        '-readmethod' => 'SearchIO' );
>> >>>> my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
>> >>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
>> >>>>  $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
>> >>>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
>> >>>> $
>> >>>> Bio
>> > 
>> >>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'}
>> >>>> = '1';
>> >>>>
>> >>>> my
>> >>>> $
>> >>>> blast_seq
>> >>>>  
>> ='MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLR
>> >>>>
>> SLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAMATGPD
>> >>>> PDDEYE';
>> >>>> #$v is just to turn on and off the messages
>> >>>> my $v = 1;
>> >>>> my $seqbuilder = Bio::Seq::SeqFactory->new('-type' =>
>> >>>> 'Bio::PrimarySeq');
>> >>>> my $seq = $seqbuilder->create(-seq =>$blast_seq, -display_id =>
>> >>>> "$blast_seq");
>> >>>> my $filename='temp2.out';
>> >>>> my $r = $factory->submit_blast($seq);
>> >>>> print STDERR "waiting..." if( $v > 0 );
>> >>>>   while ( my @rids = $factory->each_rid )
>> >>>>   {
>> >>>>       foreach my $rid ( @rids )
>> >>>>       {
>> >>>>           my $rc = $factory->retrieve_blast($rid);
>> >>>>           if( !ref($rc) )
>> >>>>           {
>> >>>>               if( $rc < 0 )
>> >>>>               {
>> >>>>                   $factory->remove_rid($rid);
>> >>>>               }
>> >>>>               print STDERR "." if ( $v > 0 );
>> >>>>           }
>> >>>>               else
>> >>>>               {
>> >>>>                   my $result = $rc->next_result();
>> >>>>                   $factory->save_output($filename);
>> >>>>                   $factory->remove_rid($rid);
>> >>>>                   print "\nQuery Name: ", $result->query_name(),
>> >>>> "\n";
>> >>>>                   while ( my $hit = $result->next_hit )
>> >>>>                   {
>> >>>>                       next unless ( $v > 0);
>> >>>>                       print "\thit name is ", $hit->name, "\n";
>> >>>>                       while( my $hsp = $hit->next_hsp )
>> >>>>                       {
>> >>>>                           print "\t\tscore is ", $hsp->score,  
>> "\n";
>> >>>>                       }
>> >>>>                   }
>> >>>>               }
>> >>>>       }
>> >>>>
>> >>>>
>> >>>>   }
>> >>>> @blast_report = get_file_data ($filename);
>> >>>> return @blast_report;
>> >>>>
>> ##############################################################################
>> >>>> ####
>> >>>> _______________________________________________
>> >>>> Bioperl-l mailing list
>> >>>> Bioperl-l at lists.open-bio.org
>> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>> =
>> >>> =
>> >>>  
>> =====================================================================
>> >>> Attention: The information contained in this message and/or
>> >>> attachments
>> >>> from AgResearch Limited is intended only for the persons or  
>> entities
>> >>> to which it is addressed and may contain confidential and/or
>> >>> privileged
>> >>> material. Any review, retransmission, dissemination or other use
>> >>> of, or
>> >>> taking of any action in reliance upon, this information by  
>> persons or
>> >>> entities other than the intended recipients is prohibited by
>> >>> AgResearch
>> >>> Limited. If you have received this message in error, please  
>> notify
>> >>> the
>> >>> sender immediately.
>> >>> =
>> >>> =
>> >>>  
>> =====================================================================
>> >>>
>> >>> _______________________________________________
>> >>> Bioperl-l mailing list
>> >>> Bioperl-l at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>
>> >> --
>> >> Jason Stajich
>> >> jason at bioperl.org
>> >>
>> >> _______________________________________________
>> >> Bioperl-l mailing list
>> >> Bioperl-l at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>> ------------------------------------------------------------------------------
>> --
>>
>>
>>
>> No virus found in this incoming message.
>> Checked by AVG - www.avg.com
>> Version: 8.5.375 / Virus Database: 270.13.5/2219 - Release Date:  
>> 07/05/09
>> 05:53:00
>
>
> --------------------------------------------------------------------------------
>
>
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 8.5.375 / Virus Database: 270.13.5/2220 - Release Date:  
> 07/05/09 17:54:00
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list