[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)

Chris Fields cjfields at illinois.edu
Tue May 12 20:46:39 EDT 2009


We should probably indicate this in the BLAST docs (and possibly  
deprecate using tile_hsps and its ilk in the long run).  Worries me  
seeing modification of the score where none is apparent, so it may be  
worth tracking that down.

chris

On May 12, 2009, at 6:07 PM, Jason Stajich wrote:

> I really don't think tile_hsps should be used on BLAST data folks,  
> it is a pretty blind approach.
> If you really want the right answer you need to do -links with WU- 
> BLAST  or FASTA.
> Been discussed a few times on the mailing list.
>
> Good to fix the code bug I guess to avoid the warnings, but unless  
> you are going to walk through all the HSPs and extract the  
> consistent paths wrt query I think you'll have loops, etc in there  
> which will make Hit->percent_id non-accurate.
>
> -jason
>
> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>
>> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on  
>> Dan's data, but something deeper may be going on.
>> Also get many of the following warnings, haven't looked at it  
>> closely:
>>
>> --------------------- WARNING ---------------------
>> MSG: Removing score value(s)
>> ---------------------------------------------------
>>
>> PATCH:
>>
>> Index: SearchUtils.pm
>> ===================================================================
>> --- SearchUtils.pm (revision 15674)
>> +++ SearchUtils.pm (working copy)
>> @@ -252,8 +252,8 @@
>>       }
>>
>>       $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- 
>> >{'stop'} - $_->{'start'} + 1;
>> -        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_- 
>> >{'iden'};
>> -        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_- 
>> >{'cons'};
>> +        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_- 
>> >{'iden'} || 0;
>> +        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_- 
>> >{'cons'} || 0;
>>       $qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
>>   }
>>
>> @@ -407,9 +407,12 @@
>>           };
>>           if($@) { warn "\a\n$@\n"; }
>>           else {
>> +  # make sure numerical
>> +  $_->{'iden'} ||= 0;
>> +  $_->{'cons'} ||= 0;
>>               $_->{'start'} = $start; # Assign a new start  
>> coordinate to the contig
>> -                $_->{'iden'} += $numID; # and add new data to  
>> #identical, #conserved.
>> -                $_->{'cons'} += $numCons;
>> +                $_->{'iden'} += ($numID||0); # and add new data to  
>> #identical, #conserved.
>> +                $_->{'cons'} += ($numCons||0);
>>               push(@{$_->{hsps}}, $hsp);
>>               $overlap     = 1;
>>           }
>> @@ -424,9 +427,13 @@
>>           };
>>           if($@) { warn "\a\n$@\n"; }
>>           else {
>> +  # make sure numerical
>> +  $_->{'iden'} ||= 0;
>> +  $_->{'cons'} ||= 0;
>> +
>>               $_->{'stop'}  = $stop; # Assign a new stop coordinate  
>> to the contig
>> -                $_->{'iden'} += $numID; # and add new data to  
>> #identical, #conserved.
>> -                $_->{'cons'} += $numCons;
>> +                $_->{'iden'} += ($numID||0); # and add new data to  
>> #identical, #conserved.
>> +                $_->{'cons'} += ($numCons||0);
>>               push(@{$_->{hsps}}, $hsp);
>>               $overlap    = 1;
>>           }
>> @@ -461,8 +468,8 @@
>>                       };
>>                       if($@) { warn "\a\n$@\n"; }
>>                       else {
>> -                            $ids  += $these_ids;
>> -                            $cons += $these_cons;
>> +                            $ids  += ($these_ids||0);
>> +                            $cons += ($these_cons||0);
>>                       }
>>
>>                       last if $hsp_start == $u_start;
>> @@ -490,8 +497,8 @@
>>                       };
>>                       if($@) { warn "\a\n$@\n"; }
>>                       else {
>> -                            $ids  += $these_ids;
>> -                            $cons += $these_cons;
>> +                            $ids  += ($these_ids||0);
>> +                            $cons += ($these_cons||0);
>>                       }
>>
>>                       last if $hsp_end == $u_stop;
>>
>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu 
>> >
>> To: "Mark A. Jensen" <maj at fortinbras.us>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com 
>> >
>> Sent: Tuesday, May 12, 2009 10:04 AM
>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting  
>> 'features'fromSearchIO?)
>>
>>
>>> More complicated than that, I'm afraid.  We should try to fix that  
>>> at  the source of the problem.
>>>
>>> This appears to stem from SearchUtils HSP tiling, which in turn   
>>> utilizes HSPI::matches(), which in turn checks num_identical/  
>>> num_conserved.  My guess is, since this is blasttable format, one  
>>> of  these isn't set and thus is returning the wrong value.  I'll  
>>> attempt  to track it down today, but it may take some time.
>>>
>>> chris
>>>
>>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>>
>>>> This sounds like a
>>>>
>>>> $sum = eval join( '+', @a);
>>>>
>>>> problem, which can be fixed with
>>>>
>>>> $sum = eval join('+', map { $_ || () } @a) ;
>>>>
>>>> MAJ
>>>> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com 
>>>> >
>>>> To: "Chris Fields" <cjfields at illinois.edu>
>>>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>>> Sent: Tuesday, May 12, 2009 9:17 AM
>>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'  
>>>> fromSearchIO?)
>>>>
>>>>
>>>> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>>>> Fixed that in svn. We're all still learning the ropes...
>>>>
>>>> In that case, I'm seeing multiple instances of...
>>>>
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 256
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 412
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 429
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 465
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 473
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 494
>>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>>> SearchUtils.pm line 502
>>>>
>>>>
>>>> Hmm... I was about to go on to complain about the weird GFF that  
>>>> I was
>>>> seeing, but suddenly it looks OK. My bioperl install must think  
>>>> your
>>>> standing over my shoulder and is therefore behaving itself!
>>>>
>>>>
>>>> Thanks again for all the help,
>>>> Dan.
>>>>
>>>>
>>>>
>>>>
>>>>> chris
>>>>>
>>>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>>>>
>>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>>
>>>>>>> Unfortunately bp_search2gff.pl is giving me errors:
>>>>>>>
>>>>>>> bp_search2gff.pl --version 3 -i BlastResults/ 
>>>>>>> blast_table_filtered  -f
>>>>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit
>>>>>>> --match --target --component
>>>>>>>
>>>>>>> --------------------- WARNING ---------------------
>>>>>>> MSG: Removing score value(s)
>>>>>>> ---------------------------------------------------
>>>>>>> Can't locate object method "remove_tags" via package
>>>>>>> "Bio::SeqFeature::Similarity" at
>>>>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ 
>>>>>>> Generic.pm  line
>>>>>>> 393, <GEN1> line 5.
>>>>>>
>>>>>>
>>>>>> I'm just learning the ropes...
>>>>>>
>>>>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>>>>>> 15:25:55.000000000 +0100
>>>>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>>>>> 11:52:41.000000000 +0100
>>>>>> @@ -390,7 +390,7 @@
>>>>>> }
>>>>>> if ($self->has_tag('score')) {
>>>>>> $self->warn("Removing score value(s)");
>>>>>> - $self->remove_tags('score');
>>>>>> + $self->remove_tag('score');
>>>>>> }
>>>>>> $self->add_tag_value('score',$value);
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Anyone seen this before?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Dan.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>>>
>>>>>>>> Thanks for the info guys, I think I was naively hoping that the
>>>>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF).
>>>>>>>>
>>>>>>>> I think I understand the problem better now, so I'll try to   
>>>>>>>> summarise:
>>>>>>>>
>>>>>>>> There is no standard way to encode a HSP as a feature (not  
>>>>>>>> least
>>>>>>>> because there are two choices about which sequence (query or  
>>>>>>>> the  hit)
>>>>>>>> it should be attached to). BioPerl will try, but the result  
>>>>>>>> will  not
>>>>>>>> be "well structured" SeqFeatures or "well formed" GFF.
>>>>>>>>
>>>>>>>>
>>>>>>>> From what I read I guess it should be possible to standardize  
>>>>>>>> this
>>>>>>>> mapping (based on something in one of the examples or the   
>>>>>>>> 'search2gff'
>>>>>>>> script), assuming you specify weather you want features put  
>>>>>>>> on the
>>>>>>>> query or on the hit.
>>>>>>>>
>>>>>>>> At some point last year I was trying out the  
>>>>>>>> bp_search2gff.pl  and my
>>>>>>>> own code to write a GFF file for loading and viewing by  
>>>>>>>> Gbrowse.  At
>>>>>>>> that time I gave up, as nothing seemed to be working. I was  
>>>>>>>> hoping
>>>>>>>> that doing this at a lower level (i.e. never writing any GFF   
>>>>>>>> myself)
>>>>>>>> it would stand a better chance of working.
>>>>>>>>
>>>>>>>> Also I was thinking that Gbrowse, if given a  
>>>>>>>> SeqFeature::Store,  could
>>>>>>>> autoconfigure its interface to some degree. I guess its back  
>>>>>>>> to  the
>>>>>>>> docs ;-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'll keep trying and see if I can get anywhere.
>>>>>>>>
>>>>>>>> Thanks again,
>>>>>>>> Dan.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> References for the above:
>>>>>>>>
>>>>>>>> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>>>>>>>>
>>>>>>>>> otherwise you need to be converting the HSPs into  
>>>>>>>>> seqfeatures  with the
>>>>>>>>> right associated information (i.e. the tag/value pairs that  
>>>>>>>>> are  in the 9th
>>>>>>>>> column) in order to have well structured data in the database.
>>>>>>>>
>>>>>>>>> You can get the individual features from the feature pair with
>>>>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF   
>>>>>>>>> writer (or call
>>>>>>>>> $hsp->hit->gff_string). Note that since the data storage is  
>>>>>>>>> not structured
>>>>>>>>> in a GFF3 like-way this won't immediately produce well  
>>>>>>>>> formed  GFF3 for the
>>>>>>>>> 9th column.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>>>>>>>
>>>>>>>>> The main problem is the mapping is subjective based on what  
>>>>>>>>> your
>>>>>>>>> reference sequence is within the BLAST run (e.g. whether it  
>>>>>>>>> is  the query or
>>>>>>>>> the hit), and is something that can't be automatically   
>>>>>>>>> discerned. I ended
>>>>>>>>> up rolling my own with SeqFeature::Store (just mapped the   
>>>>>>>>> relevant data to
>>>>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the   
>>>>>>>>> relevant scripts
>>>>>>>>> to integrate my changes in, just haven't had the time
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list