[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)

Jason Stajich jason at bioperl.org
Tue May 12 19:07:21 EDT 2009


I really don't think tile_hsps should be used on BLAST data folks, it  
is a pretty blind approach.
If you really want the right answer you need to do -links with WU- 
BLAST  or FASTA.
Been discussed a few times on the mailing list.

Good to fix the code bug I guess to avoid the warnings, but unless you  
are going to walk through all the HSPs and extract the consistent  
paths wrt query I think you'll have loops, etc in there which will  
make Hit->percent_id non-accurate.

-jason

On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:

> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on  
> Dan's data, but something deeper may be going on.
> Also get many of the following warnings, haven't looked at it closely:
>
> --------------------- WARNING ---------------------
> MSG: Removing score value(s)
> ---------------------------------------------------
>
> PATCH:
>
> Index: SearchUtils.pm
> ===================================================================
> --- SearchUtils.pm (revision 15674)
> +++ SearchUtils.pm (working copy)
> @@ -252,8 +252,8 @@
>        }
>
>        $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- 
> >{'stop'} - $_->{'start'} + 1;
> -        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_- 
> >{'iden'};
> -        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_- 
> >{'cons'};
> +        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_- 
> >{'iden'} || 0;
> +        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_- 
> >{'cons'} || 0;
>        $qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
>    }
>
> @@ -407,9 +407,12 @@
>            };
>            if($@) { warn "\a\n$@\n"; }
>            else {
> +  # make sure numerical
> +  $_->{'iden'} ||= 0;
> +  $_->{'cons'} ||= 0;
>                $_->{'start'} = $start; # Assign a new start  
> coordinate to the contig
> -                $_->{'iden'} += $numID; # and add new data to  
> #identical, #conserved.
> -                $_->{'cons'} += $numCons;
> +                $_->{'iden'} += ($numID||0); # and add new data to  
> #identical, #conserved.
> +                $_->{'cons'} += ($numCons||0);
>                push(@{$_->{hsps}}, $hsp);
>                $overlap     = 1;
>            }
> @@ -424,9 +427,13 @@
>            };
>            if($@) { warn "\a\n$@\n"; }
>            else {
> +  # make sure numerical
> +  $_->{'iden'} ||= 0;
> +  $_->{'cons'} ||= 0;
> +
>                $_->{'stop'}  = $stop; # Assign a new stop coordinate  
> to the contig
> -                $_->{'iden'} += $numID; # and add new data to  
> #identical, #conserved.
> -                $_->{'cons'} += $numCons;
> +                $_->{'iden'} += ($numID||0); # and add new data to  
> #identical, #conserved.
> +                $_->{'cons'} += ($numCons||0);
>                push(@{$_->{hsps}}, $hsp);
>                $overlap    = 1;
>            }
> @@ -461,8 +468,8 @@
>                        };
>                        if($@) { warn "\a\n$@\n"; }
>                        else {
> -                            $ids  += $these_ids;
> -                            $cons += $these_cons;
> +                            $ids  += ($these_ids||0);
> +                            $cons += ($these_cons||0);
>                        }
>
>                        last if $hsp_start == $u_start;
> @@ -490,8 +497,8 @@
>                        };
>                        if($@) { warn "\a\n$@\n"; }
>                        else {
> -                            $ids  += $these_ids;
> -                            $cons += $these_cons;
> +                            $ids  += ($these_ids||0);
> +                            $cons += ($these_cons||0);
>                        }
>
>                        last if $hsp_end == $u_stop;
>
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu 
> >
> To: "Mark A. Jensen" <maj at fortinbras.us>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com 
> >
> Sent: Tuesday, May 12, 2009 10:04 AM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting  
> 'features'fromSearchIO?)
>
>
>> More complicated than that, I'm afraid.  We should try to fix that  
>> at  the source of the problem.
>>
>> This appears to stem from SearchUtils HSP tiling, which in turn   
>> utilizes HSPI::matches(), which in turn checks num_identical/  
>> num_conserved.  My guess is, since this is blasttable format, one  
>> of  these isn't set and thus is returning the wrong value.  I'll  
>> attempt  to track it down today, but it may take some time.
>>
>> chris
>>
>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>
>>> This sounds like a
>>>
>>> $sum = eval join( '+', @a);
>>>
>>> problem, which can be fixed with
>>>
>>> $sum = eval join('+', map { $_ || () } @a) ;
>>>
>>> MAJ
>>> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com 
>>> >
>>> To: "Chris Fields" <cjfields at illinois.edu>
>>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>> Sent: Tuesday, May 12, 2009 9:17 AM
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'  
>>> fromSearchIO?)
>>>
>>>
>>> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>>> Fixed that in svn. We're all still learning the ropes...
>>>
>>> In that case, I'm seeing multiple instances of...
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 256
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 412
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 429
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 465
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 473
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 494
>>> Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 502
>>>
>>>
>>> Hmm... I was about to go on to complain about the weird GFF that I  
>>> was
>>> seeing, but suddenly it looks OK. My bioperl install must think your
>>> standing over my shoulder and is therefore behaving itself!
>>>
>>>
>>> Thanks again for all the help,
>>> Dan.
>>>
>>>
>>>
>>>
>>>> chris
>>>>
>>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>>>
>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>
>>>>>> Unfortunately bp_search2gff.pl is giving me errors:
>>>>>>
>>>>>> bp_search2gff.pl --version 3 -i BlastResults/ 
>>>>>> blast_table_filtered  -f
>>>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit
>>>>>> --match --target --component
>>>>>>
>>>>>> --------------------- WARNING ---------------------
>>>>>> MSG: Removing score value(s)
>>>>>> ---------------------------------------------------
>>>>>> Can't locate object method "remove_tags" via package
>>>>>> "Bio::SeqFeature::Similarity" at
>>>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ 
>>>>>> Generic.pm  line
>>>>>> 393, <GEN1> line 5.
>>>>>
>>>>>
>>>>> I'm just learning the ropes...
>>>>>
>>>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>>>>> 15:25:55.000000000 +0100
>>>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>>>> 11:52:41.000000000 +0100
>>>>> @@ -390,7 +390,7 @@
>>>>> }
>>>>> if ($self->has_tag('score')) {
>>>>> $self->warn("Removing score value(s)");
>>>>> - $self->remove_tags('score');
>>>>> + $self->remove_tag('score');
>>>>> }
>>>>> $self->add_tag_value('score',$value);
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Anyone seen this before?
>>>>>>
>>>>>> Cheers,
>>>>>> Dan.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>>
>>>>>>> Thanks for the info guys, I think I was naively hoping that the
>>>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF).
>>>>>>>
>>>>>>> I think I understand the problem better now, so I'll try to   
>>>>>>> summarise:
>>>>>>>
>>>>>>> There is no standard way to encode a HSP as a feature (not least
>>>>>>> because there are two choices about which sequence (query or  
>>>>>>> the  hit)
>>>>>>> it should be attached to). BioPerl will try, but the result  
>>>>>>> will  not
>>>>>>> be "well structured" SeqFeatures or "well formed" GFF.
>>>>>>>
>>>>>>>
>>>>>>> From what I read I guess it should be possible to standardize  
>>>>>>> this
>>>>>>> mapping (based on something in one of the examples or the   
>>>>>>> 'search2gff'
>>>>>>> script), assuming you specify weather you want features put on  
>>>>>>> the
>>>>>>> query or on the hit.
>>>>>>>
>>>>>>> At some point last year I was trying out the bp_search2gff.pl   
>>>>>>> and my
>>>>>>> own code to write a GFF file for loading and viewing by  
>>>>>>> Gbrowse.  At
>>>>>>> that time I gave up, as nothing seemed to be working. I was  
>>>>>>> hoping
>>>>>>> that doing this at a lower level (i.e. never writing any GFF   
>>>>>>> myself)
>>>>>>> it would stand a better chance of working.
>>>>>>>
>>>>>>> Also I was thinking that Gbrowse, if given a  
>>>>>>> SeqFeature::Store,  could
>>>>>>> autoconfigure its interface to some degree. I guess its back  
>>>>>>> to  the
>>>>>>> docs ;-)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'll keep trying and see if I can get anywhere.
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Dan.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> References for the above:
>>>>>>>
>>>>>>> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>>>>>>>
>>>>>>>> otherwise you need to be converting the HSPs into  
>>>>>>>> seqfeatures  with the
>>>>>>>> right associated information (i.e. the tag/value pairs that  
>>>>>>>> are  in the 9th
>>>>>>>> column) in order to have well structured data in the database.
>>>>>>>
>>>>>>>> You can get the individual features from the feature pair with
>>>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF   
>>>>>>>> writer (or call
>>>>>>>> $hsp->hit->gff_string). Note that since the data storage is  
>>>>>>>> not structured
>>>>>>>> in a GFF3 like-way this won't immediately produce well  
>>>>>>>> formed  GFF3 for the
>>>>>>>> 9th column.
>>>>>>>
>>>>>>>
>>>>>>> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>>>>>>
>>>>>>>> The main problem is the mapping is subjective based on what  
>>>>>>>> your
>>>>>>>> reference sequence is within the BLAST run (e.g. whether it  
>>>>>>>> is  the query or
>>>>>>>> the hit), and is something that can't be automatically   
>>>>>>>> discerned. I ended
>>>>>>>> up rolling my own with SeqFeature::Store (just mapped the   
>>>>>>>> relevant data to
>>>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the   
>>>>>>>> relevant scripts
>>>>>>>> to integrate my changes in, just haven't had the time
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org







More information about the Bioperl-l mailing list