[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
Jason Stajich
jason at bioperl.org
Tue May 12 19:07:21 EDT 2009
I really don't think tile_hsps should be used on BLAST data folks, it
is a pretty blind approach.
If you really want the right answer you need to do -links with WU-
BLAST or FASTA.
Been discussed a few times on the mailing list.
Good to fix the code bug I guess to avoid the warnings, but unless you
are going to walk through all the HSPs and extract the consistent
paths wrt query I think you'll have loops, etc in there which will
make Hit->percent_id non-accurate.
-jason
On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on
> Dan's data, but something deeper may be going on.
> Also get many of the following warnings, haven't looked at it closely:
>
> --------------------- WARNING ---------------------
> MSG: Removing score value(s)
> ---------------------------------------------------
>
> PATCH:
>
> Index: SearchUtils.pm
> ===================================================================
> --- SearchUtils.pm (revision 15674)
> +++ SearchUtils.pm (working copy)
> @@ -252,8 +252,8 @@
> }
>
> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
> >{'stop'} - $_->{'start'} + 1;
> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
> >{'iden'};
> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
> >{'cons'};
> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
> >{'iden'} || 0;
> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
> >{'cons'} || 0;
> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
> }
>
> @@ -407,9 +407,12 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> + # make sure numerical
> + $_->{'iden'} ||= 0;
> + $_->{'cons'} ||= 0;
> $_->{'start'} = $start; # Assign a new start
> coordinate to the contig
> - $_->{'iden'} += $numID; # and add new data to
> #identical, #conserved.
> - $_->{'cons'} += $numCons;
> + $_->{'iden'} += ($numID||0); # and add new data to
> #identical, #conserved.
> + $_->{'cons'} += ($numCons||0);
> push(@{$_->{hsps}}, $hsp);
> $overlap = 1;
> }
> @@ -424,9 +427,13 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> + # make sure numerical
> + $_->{'iden'} ||= 0;
> + $_->{'cons'} ||= 0;
> +
> $_->{'stop'} = $stop; # Assign a new stop coordinate
> to the contig
> - $_->{'iden'} += $numID; # and add new data to
> #identical, #conserved.
> - $_->{'cons'} += $numCons;
> + $_->{'iden'} += ($numID||0); # and add new data to
> #identical, #conserved.
> + $_->{'cons'} += ($numCons||0);
> push(@{$_->{hsps}}, $hsp);
> $overlap = 1;
> }
> @@ -461,8 +468,8 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> - $ids += $these_ids;
> - $cons += $these_cons;
> + $ids += ($these_ids||0);
> + $cons += ($these_cons||0);
> }
>
> last if $hsp_start == $u_start;
> @@ -490,8 +497,8 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> - $ids += $these_ids;
> - $cons += $these_cons;
> + $ids += ($these_ids||0);
> + $cons += ($these_cons||0);
> }
>
> last if $hsp_end == $u_stop;
>
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
> >
> To: "Mark A. Jensen" <maj at fortinbras.us>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com
> >
> Sent: Tuesday, May 12, 2009 10:04 AM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
> 'features'fromSearchIO?)
>
>
>> More complicated than that, I'm afraid. We should try to fix that
>> at the source of the problem.
>>
>> This appears to stem from SearchUtils HSP tiling, which in turn
>> utilizes HSPI::matches(), which in turn checks num_identical/
>> num_conserved. My guess is, since this is blasttable format, one
>> of these isn't set and thus is returning the wrong value. I'll
>> attempt to track it down today, but it may take some time.
>>
>> chris
>>
>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>
>>> This sounds like a
>>>
>>> $sum = eval join( '+', @a);
>>>
>>> problem, which can be fixed with
>>>
>>> $sum = eval join('+', map { $_ || () } @a) ;
>>>
>>> MAJ
>>> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
>>> >
>>> To: "Chris Fields" <cjfields at illinois.edu>
>>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>> Sent: Tuesday, May 12, 2009 9:17 AM
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'
>>> fromSearchIO?)
>>>
>>>
>>> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>>> Fixed that in svn. We're all still learning the ropes...
>>>
>>> In that case, I'm seeing multiple instances of...
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 256
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 412
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 429
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 465
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 473
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 494
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 502
>>>
>>>
>>> Hmm... I was about to go on to complain about the weird GFF that I
>>> was
>>> seeing, but suddenly it looks OK. My bioperl install must think your
>>> standing over my shoulder and is therefore behaving itself!
>>>
>>>
>>> Thanks again for all the help,
>>> Dan.
>>>
>>>
>>>
>>>
>>>> chris
>>>>
>>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>>>
>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>
>>>>>> Unfortunately bp_search2gff.pl is giving me errors:
>>>>>>
>>>>>> bp_search2gff.pl --version 3 -i BlastResults/
>>>>>> blast_table_filtered -f
>>>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit
>>>>>> --match --target --component
>>>>>>
>>>>>> --------------------- WARNING ---------------------
>>>>>> MSG: Removing score value(s)
>>>>>> ---------------------------------------------------
>>>>>> Can't locate object method "remove_tags" via package
>>>>>> "Bio::SeqFeature::Similarity" at
>>>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/
>>>>>> Generic.pm line
>>>>>> 393, <GEN1> line 5.
>>>>>
>>>>>
>>>>> I'm just learning the ropes...
>>>>>
>>>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>>>>> 15:25:55.000000000 +0100
>>>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>>>> 11:52:41.000000000 +0100
>>>>> @@ -390,7 +390,7 @@
>>>>> }
>>>>> if ($self->has_tag('score')) {
>>>>> $self->warn("Removing score value(s)");
>>>>> - $self->remove_tags('score');
>>>>> + $self->remove_tag('score');
>>>>> }
>>>>> $self->add_tag_value('score',$value);
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Anyone seen this before?
>>>>>>
>>>>>> Cheers,
>>>>>> Dan.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>>>>>
>>>>>>> Thanks for the info guys, I think I was naively hoping that the
>>>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF).
>>>>>>>
>>>>>>> I think I understand the problem better now, so I'll try to
>>>>>>> summarise:
>>>>>>>
>>>>>>> There is no standard way to encode a HSP as a feature (not least
>>>>>>> because there are two choices about which sequence (query or
>>>>>>> the hit)
>>>>>>> it should be attached to). BioPerl will try, but the result
>>>>>>> will not
>>>>>>> be "well structured" SeqFeatures or "well formed" GFF.
>>>>>>>
>>>>>>>
>>>>>>> From what I read I guess it should be possible to standardize
>>>>>>> this
>>>>>>> mapping (based on something in one of the examples or the
>>>>>>> 'search2gff'
>>>>>>> script), assuming you specify weather you want features put on
>>>>>>> the
>>>>>>> query or on the hit.
>>>>>>>
>>>>>>> At some point last year I was trying out the bp_search2gff.pl
>>>>>>> and my
>>>>>>> own code to write a GFF file for loading and viewing by
>>>>>>> Gbrowse. At
>>>>>>> that time I gave up, as nothing seemed to be working. I was
>>>>>>> hoping
>>>>>>> that doing this at a lower level (i.e. never writing any GFF
>>>>>>> myself)
>>>>>>> it would stand a better chance of working.
>>>>>>>
>>>>>>> Also I was thinking that Gbrowse, if given a
>>>>>>> SeqFeature::Store, could
>>>>>>> autoconfigure its interface to some degree. I guess its back
>>>>>>> to the
>>>>>>> docs ;-)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'll keep trying and see if I can get anywhere.
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Dan.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> References for the above:
>>>>>>>
>>>>>>> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>>>>>>>
>>>>>>>> otherwise you need to be converting the HSPs into
>>>>>>>> seqfeatures with the
>>>>>>>> right associated information (i.e. the tag/value pairs that
>>>>>>>> are in the 9th
>>>>>>>> column) in order to have well structured data in the database.
>>>>>>>
>>>>>>>> You can get the individual features from the feature pair with
>>>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF
>>>>>>>> writer (or call
>>>>>>>> $hsp->hit->gff_string). Note that since the data storage is
>>>>>>>> not structured
>>>>>>>> in a GFF3 like-way this won't immediately produce well
>>>>>>>> formed GFF3 for the
>>>>>>>> 9th column.
>>>>>>>
>>>>>>>
>>>>>>> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>>>>>>
>>>>>>>> The main problem is the mapping is subjective based on what
>>>>>>>> your
>>>>>>>> reference sequence is within the BLAST run (e.g. whether it
>>>>>>>> is the query or
>>>>>>>> the hit), and is something that can't be automatically
>>>>>>>> discerned. I ended
>>>>>>>> up rolling my own with SeqFeature::Store (just mapped the
>>>>>>>> relevant data to
>>>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the
>>>>>>>> relevant scripts
>>>>>>>> to integrate my changes in, just haven't had the time
>>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Jason Stajich
jason at bioperl.org
More information about the Bioperl-l
mailing list