[Bioperl-l] cdd-search with remoteblast?
Chris Fields
cjfields at illinois.edu
Thu Jul 9 15:08:53 UTC 2009
I'm not sure, but I think adding this in will take a little work
(we'll need to catch the RID returned, which I'm fairly sure will
require some modifications to checking the returned output). I would
also have to look at the RemoteBlast API to see how this would fit in
(I'm assuming we could either lump it in with other returned RIDs or
create a new method for that).
You are more than welcome to add this in as an enhancement request to
bugzilla for BioPerl:
http://bugzilla.open-bio.org/
chris
On Jul 9, 2009, at 5:16 AM, Jonas Schaer wrote:
> Hi guys,
> Thank you all so much for your help and patience :). Of course you
> were right and I finaly found the right put-parameter to get exactly
> the same hits as on the homepage.
> I do have an other question though :)...
> I now want to include a search for conserved domains, but when I try
> to use the CDD_SEARCH-parameter (http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node16.html#sub
> :CDD_SEARCH) like the other put-parameters the way chris once told
> me(works fine with the other params):
>
> my %put = (
> WORD_SIZE => 3,
> HITLIST_SIZE => 100,
> THRESHOLD => 11,
> FILTER => 'R',
> GENETIC_CODE => 1,
> CDD_SEARCH => 'on' ###I tried
> it with 'true' and '1', too.
>
> );
>
> for my $putName (keys %put) {
> $factory->submit_parameter($putName,$put{$putName});
> }
>
>
> ...an exception is thrown:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: CDD_SEARCH is not a valid PUT parameter.
> STACK: Error::throw
> STACK: Bio::Root::Root::throw C:/Perl/site/lib/Bio/Root/Root.pm:359
> STACK: Bio::Tools::Run::RemoteBlast::submit_parameter C:/Perl/site/
> lib/Bio/Tools
> /Run/RemoteBlast.pm:325
> STACK: main::blast_a_sequence firsteval0.8.pm:383
> STACK: main::blast_it firsteval0.8.pm:288
> STACK: firsteval0.8.pm:35
> ----------------------------------------------------------- .
> I guess somehow this could be the solution to my problem:
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node78.html#sub:RID-
> for-Simultaneous
> , but unfortunately I don't understand what to do.
> I'm so sorry to bother you with this but please help me once more...:)
>
> Best regards and thanks in advance,
> Jonas
>
> ----- Original Message ----- From: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz
> >
> To: "'Jonas Schaer'" <Brotelzwieb at gmx.de>
> Cc: "'Chris Fields'" <cjfields at illinois.edu>; "'BioPerl List'" <bioperl-l at lists.open-bio.org
> >
> Sent: Monday, July 06, 2009 10:56 PM
> Subject: RE: [Bioperl-l] different results with remote-blast skript
>
>
> Hi Jonas,
> You can't just play with the BLAST parameters and hope for a
> "better" result.
> I'd suggest that if you aren't sure what they do, you should leave
> them alone as small changes can make huge differences in the output
> - it's quite possible to miss finding what you're looking for by
> using the wrong parameters.
> If all else fails, read the blast manual: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_all.html
> http://www.ncbi.nlm.nih.gov/blast/tutorial/
> Or Read Ian Korfs' excellent book: http://books.google.com/books?id=xvcnhDG9fNUC&lpg=PR17&ots=WJpfuHF6Hn&dq=ian%20korf%20%20blast%20book&pg=PA3
>
> Don't worry about the integer overflow bug as there's nothing you
> can do about it. If you're interested, Google and Wikipedia are your
> friends: http://en.wikipedia.org/wiki/Integer_overflow
>
>
> Russell
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
>> Sent: Tuesday, 7 July 2009 12:14 a.m.
>> To: BioPerl List; Chris Fields
>> Subject: Re: [Bioperl-l] different results with remote-blast skript
>>
>> Hi guys, thanks for your answers so far.
>> @jason: integer overflow in blast.... sorry, but what do you mean
>> by that?
>> how can I fix it...?
>>
>> Since I never really changed any parameters I thought them all to
>> be default.
>> whatever, I tried to get "better" results with my prog by changing
>> these:
>> $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
>> $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
>> $
>> Bio
>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'} =
>> '1';
>> with no effect...I guess these were default values anyway.
>>
>> So please maybe you can tell me all the other parameters I can
>> change with my
>> perl-skript AND how to do that?
>> Unfortunately both, perl and the blast-algorithm are pretty much
>> new to me,
>> maybe thats why I just cannot find out how to do that on my own... :/
>>
>> Here is the output I get with my remote-blast skript:
>> ##############################################################################
>> ###################################
>> Query Name:
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSL
>> L
>> hit name is ref|XP_001702807.1|
>> score is 442
>> BLASTP 2.2.21+
>> Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
>> Schaffer,
>> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
>> (1997), "Gapped
>> BLAST and PSI-BLAST: a new generation of protein database search
>> programs",
>> Nucleic Acids Res. 25:3389-3402.
>>
>>
>> Reference for composition-based statistics: Alejandro A.
>> Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L.
>> Spouge, Yuri
>> I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
>> "Improving the
>> accuracy of PSI-BLAST protein database searches with composition-
>> based
>> statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
>>
>>
>> RID: 53STX5G2013
>>
>>
>> Database: All non-redundant GenBank CDS
>> translations+PDB+SwissProt+PIR+PRF excluding environmental samples
>> from WGS projects
>> 9,252,587 sequences; 3,169,972,781 total letters Query=
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLRSLL
>> DVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAM
>> ATGPDPDDEYE
>> Length=150
>>
>>
>>
>> Score
>> E
>> Sequences producing significant alignments:
>> (Bits)
>> Value
>>
>> ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhard...
>> 174
>> 2e-42
>>
>>
>> ALIGNMENTS
>> >ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhardtii]
>> gb|EDP06586.1| ClpS-like protein [Chlamydomonas reinhardtii]
>> Length=303
>>
>> Score = 174 bits (442), Expect = 2e-42, Method: Composition-based
>> stats.
>> Identities = 150/150 (100%), Positives = 150/150 (100%), Gaps =
>> 0/150 (0%)
>>
>> Query 1
>> MGSSSVGTYHLLLVLMgaggeqqavqagaevaSTEQVDGSGMAANSRGSTSGSEQPPrds 60
>>
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
>> Sbjct 154
>> MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDS
>> 213
>>
>> Query 61
>> dlgllrslldVAGVDRTalevkllalaeagaeMPPAQDSQATAAGVVATLTSVYRQQVAR
>> 120
>>
>> DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
>> Sbjct 214
>> DLGLLRSLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVAR
>> 273
>>
>> Query 121 AWHERDDNAFRQAHQNTAMATGPDPDDEYE 150
>> AWHERDDNAFRQAHQNTAMATGPDPDDEYE
>> Sbjct 274 AWHERDDNAFRQAHQNTAMATGPDPDDEYE 303
>>
>>
>>
>> Database: All non-redundant GenBank CDS translations+PDB+SwissProt
>> +PIR+PRF
>> excluding environmental samples from WGS projects
>> Posted date: Jul 5, 2009 4:41 AM
>> Number of letters in database: -1,124,994,511
>> Number of sequences in database: 9,252,587
>>
>> Lambda K H
>> 0.309 0.122 0.345
>> Gapped
>> Lambda K H
>> 0.267 0.0410 0.140
>> Matrix: BLOSUM62
>> Gap Penalties: Existence: 11, Extension: 1
>> Number of Sequences: 9252587
>> Number of Hits to DB: 60273703
>> Number of extensions: 1448367
>> Number of successful extensions: 2103
>> Number of sequences better than 10: 0
>> Number of HSP's better than 10 without gapping: 0
>> Number of HSP's gapped: 2113
>> Number of HSP's successfully gapped: 0
>> Length of query: 150
>> Length of database: 3169972781
>> Length adjustment: 113
>> Effective length of query: 37
>> Effective length of database: 2124430450
>> Effective search space: 78603926650
>> Effective search space used: 78603926650
>> T: 11
>> A: 40
>> X1: 16 (7.1 bits)
>> X2: 38 (14.6 bits)
>> X3: 64 (24.7 bits)
>> S1: 42 (20.8 bits)
>> S2: 74 (33.1 bits)
>>
>> ##############################################################################
>> ###################################
>> and here are the hits (?) of the blast-algorithm on the ncbi-
>> homepage with
>> the same query of course:
>> ref|XP_001702807.1| ClpS-like protein [Chlamydomonas reinhard...
>> 300
>> 3e-80
>> ref|XP_001942719.1| PREDICTED: similar to GA16705-PA [Acyrtho...
>> 36.2
>> 1.1
>> ref|ZP_03781446.1| hypothetical protein RUMHYD_00880 [Blautia...
>> 35.4
>> 1.8
>> ref|XP_001563232.1| leucyl-tRNA synthetase [Leishmania brazil...
>> 34.3
>> 4.2
>> ref|XP_680841.1| hypothetical protein AN7572.2 [Aspergillus n...
>> 33.5
>> 6.0
>> ref|YP_001768110.1| hypothetical protein M446_1150 [Methyloba...
>> 33.5
>> 7.0
>> ##############################################################################
>> ###################################at
>> least the first hit is the same, but even there there is a
>> different score
>> and e-value.
>>
>> thanks so much for any help :)
>> regards, jonas
>>
>>
>> ----- Original Message -----
>> From: "Chris Fields" <cjfields at illinois.edu>
>> To: "Jason Stajich" <jason at bioperl.org>
>> Cc: "Smithies, Russell" <Russell.Smithies at agresearch.co.nz>;
>> "'BioPerl
>> List'" <bioperl-l at lists.open-bio.org>; "'Jonas Schaer'"
>> <Jonas_Schaer at gmx.de>
>> Sent: Monday, July 06, 2009 12:51 AM
>> Subject: Re: [Bioperl-l] different results with remote-blast skript
>>
>>
>> > That inspires confidence ;>
>> >
>> > chris
>> >
>> > On Jul 5, 2009, at 4:40 PM, Jason Stajich wrote:
>> >
>> >> integer overflow in blast....
>> >>
>> >> On Jul 5, 2009, at 2:00 PM, Smithies, Russell wrote:
>> >>
>> >>> I'd guess it's a difference in the parameters used.
>> >>> Interesting that both have the number of letters in the db as
>> >>> "-1,125,070,205", I assume that's a bug :-)
>> >>>
>> >>> Stats from your remote_blast:
>> >>>
>> >>> 'stats' => {
>> >>> 'S1' => '42',
>> >>> 'S1_bits' => '20.8',
>> >>> 'lambda' => '0.309',
>> >>> 'entropy' => '0.345',
>> >>> 'kappa_gapped' => '0.0410',
>> >>> 'T' => '11',
>> >>> 'kappa' => '0.122',
>> >>> 'X3_bits' => '24.7',
>> >>> 'X1' => '16',
>> >>> 'lambda_gapped' => '0.267',
>> >>> 'X2' => '38',
>> >>> 'S2' => '74',
>> >>> 'seqs_better_than_cutoff' => '0',
>> >>> 'posted_date' => 'Jul 4, 2009 4:41 AM',
>> >>> 'Hits_to_DB' => '60102303',
>> >>> 'dbletters' => '-1125070205',
>> >>> 'A' => '40',
>> >>> 'num_successful_extensions' => '2004',
>> >>> 'num_extensions' => '1436892',
>> >>> 'X1_bits' => '7.1',
>> >>> 'X3' => '64',
>> >>> 'entropy_gapped' => '0.140',
>> >>> 'dbentries' => '9252258',
>> >>> 'X2_bits' => '14.6',
>> >>> 'S2_bits' => '33.1'
>> >>> }
>> >>>
>> >>>
>> >>> Stats from a blast done on the NCBI webpage:
>> >>>
>> >>> Database: All non-redundant GenBank CDS translations+PDB
>> +SwissProt
>> >>> +PIR+PRF
>> >>> excluding environmental samples from WGS projects
>> >>> Posted date: Jul 4, 2009 4:41 AM
>> >>> Number of letters in database: -1,125,070,205
>> >>> Number of sequences in database: 9,252,258
>> >>>
>> >>> Lambda K H
>> >>> 0.309 0.124 0.340
>> >>> Gapped
>> >>> Lambda K H
>> >>> 0.267 0.0410 0.140
>> >>> Matrix: BLOSUM62
>> >>> Gap Penalties: Existence: 11, Extension: 1
>> >>> Number of Sequences: 9252258
>> >>> Number of Hits to DB: 86493230
>> >>> Number of extensions: 3101413
>> >>> Number of successful extensions: 9001
>> >>> Number of sequences better than 100: 65
>> >>> Number of HSP's better than 100 without gapping: 0
>> >>> Number of HSP's gapped: 9000
>> >>> Number of HSP's successfully gapped: 66
>> >>> Length of query: 150
>> >>> Length of database: 3169897087
>> >>> Length adjustment: 113
>> >>> Effective length of query: 37
>> >>> Effective length of database: 2124391933
>> >>> Effective search space: 78602501521
>> >>> Effective search space used: 78602501521
>> >>> T: 11
>> >>> A: 40
>> >>> X1: 16 (7.1 bits)
>> >>> X2: 38 (14.6 bits)
>> >>> X3: 64 (24.7 bits)
>> >>> S1: 42 (20.8 bits)
>> >>> S2: 65 (29.6 bits)
>> >>>
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> >>>> bounces at lists.open-bio.org] On Behalf Of Jonas Schaer
>> >>>> Sent: Sunday, 28 June 2009 10:15 p.m.
>> >>>> To: BioPerl List
>> >>>> Subject: [Bioperl-l] different results with remote-blast skript
>> >>>>
>> >>>> Hi again :)
>> >>>> please, I only have this little question:
>> >>>> why do I get different results with my remote::blast perl skript
>> >>>> then on the
>> >>>> ncbi blast homepage?
>> >>>> I am using blastp, the query is an amino-sequence (different
>> >>>> results with any
>> >>>> sequence, differences not only in number of hits but even in e-
>> >>>> values, scores
>> >>>> etc...), the database is 'nr'.
>> >>>> PLEASE help me,
>> >>>> thank you in advance,
>> >>>> Jonas
>> >>>>
>> >>>> ps: my skript:
>> >>>>
>> ##############################################################################
>> >>>> ##
>> >>>> use Bio::Seq::SeqFactory;
>> >>>> use Bio::Tools::Run::RemoteBlast;
>> >>>> use strict;
>> >>>> my @blast_report;
>> >>>> my $prog = 'blastp';
>> >>>> my $db = 'nr';
>> >>>> my $e_val= '1e-10';
>> >>>> #my $e_val= '10';
>> >>>> my @params = ( '-prog' => $prog,
>> >>>> '-data' => $db,
>> >>>> '-expect' => $e_val,
>> >>>> '-readmethod' => 'SearchIO' );
>> >>>> my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
>> >>>> $Bio::Tools::Run::RemoteBlast::HEADER{'GAPCOSTS'} = '11 1';
>> >>>> $Bio::Tools::Run::RemoteBlast::HEADER{'MAX_NUM_SEQ'} = '100';
>> >>>> $Bio::Tools::Run::RemoteBlast::HEADER{'EXPECT'} = '10';
>> >>>> $
>> >>>> Bio
>> >
>> >>> ::Tools::Run::RemoteBlast::HEADER{'COMPOSITION_BASED_STATISTICS'}
>> >>>> = '1';
>> >>>>
>> >>>> my
>> >>>> $
>> >>>> blast_seq
>> >>>>
>> ='MGSSSVGTYHLLLVLMGAGGEQQAVQAGAEVASTEQVDGSGMAANSRGSTSGSEQPPRDSDLGLLR
>> >>>>
>> SLLDVAGVDRTALEVKLLALAEAGAEMPPAQDSQATAAGVVATLTSVYRQQVARAWHERDDNAFRQAHQNTAMATGPD
>> >>>> PDDEYE';
>> >>>> #$v is just to turn on and off the messages
>> >>>> my $v = 1;
>> >>>> my $seqbuilder = Bio::Seq::SeqFactory->new('-type' =>
>> >>>> 'Bio::PrimarySeq');
>> >>>> my $seq = $seqbuilder->create(-seq =>$blast_seq, -display_id =>
>> >>>> "$blast_seq");
>> >>>> my $filename='temp2.out';
>> >>>> my $r = $factory->submit_blast($seq);
>> >>>> print STDERR "waiting..." if( $v > 0 );
>> >>>> while ( my @rids = $factory->each_rid )
>> >>>> {
>> >>>> foreach my $rid ( @rids )
>> >>>> {
>> >>>> my $rc = $factory->retrieve_blast($rid);
>> >>>> if( !ref($rc) )
>> >>>> {
>> >>>> if( $rc < 0 )
>> >>>> {
>> >>>> $factory->remove_rid($rid);
>> >>>> }
>> >>>> print STDERR "." if ( $v > 0 );
>> >>>> }
>> >>>> else
>> >>>> {
>> >>>> my $result = $rc->next_result();
>> >>>> $factory->save_output($filename);
>> >>>> $factory->remove_rid($rid);
>> >>>> print "\nQuery Name: ", $result->query_name(),
>> >>>> "\n";
>> >>>> while ( my $hit = $result->next_hit )
>> >>>> {
>> >>>> next unless ( $v > 0);
>> >>>> print "\thit name is ", $hit->name, "\n";
>> >>>> while( my $hsp = $hit->next_hsp )
>> >>>> {
>> >>>> print "\t\tscore is ", $hsp->score,
>> "\n";
>> >>>> }
>> >>>> }
>> >>>> }
>> >>>> }
>> >>>>
>> >>>>
>> >>>> }
>> >>>> @blast_report = get_file_data ($filename);
>> >>>> return @blast_report;
>> >>>>
>> ##############################################################################
>> >>>> ####
>> >>>> _______________________________________________
>> >>>> Bioperl-l mailing list
>> >>>> Bioperl-l at lists.open-bio.org
>> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>> =
>> >>> =
>> >>>
>> =====================================================================
>> >>> Attention: The information contained in this message and/or
>> >>> attachments
>> >>> from AgResearch Limited is intended only for the persons or
>> entities
>> >>> to which it is addressed and may contain confidential and/or
>> >>> privileged
>> >>> material. Any review, retransmission, dissemination or other use
>> >>> of, or
>> >>> taking of any action in reliance upon, this information by
>> persons or
>> >>> entities other than the intended recipients is prohibited by
>> >>> AgResearch
>> >>> Limited. If you have received this message in error, please
>> notify
>> >>> the
>> >>> sender immediately.
>> >>> =
>> >>> =
>> >>>
>> =====================================================================
>> >>>
>> >>> _______________________________________________
>> >>> Bioperl-l mailing list
>> >>> Bioperl-l at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>
>> >> --
>> >> Jason Stajich
>> >> jason at bioperl.org
>> >>
>> >> _______________________________________________
>> >> Bioperl-l mailing list
>> >> Bioperl-l at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>> ------------------------------------------------------------------------------
>> --
>>
>>
>>
>> No virus found in this incoming message.
>> Checked by AVG - www.avg.com
>> Version: 8.5.375 / Virus Database: 270.13.5/2219 - Release Date:
>> 07/05/09
>> 05:53:00
>
>
> --------------------------------------------------------------------------------
>
>
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 8.5.375 / Virus Database: 270.13.5/2220 - Release Date:
> 07/05/09 17:54:00
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list