[Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)

Chris Fields cjfields at illinois.edu
Thu May 14 10:14:54 EDT 2009


Go for it (as long as it passes regression tests).

chris

On May 14, 2009, at 9:07 AM, Mark A. Jensen wrote:

> I'll patch in the fix with a cluck that these are undefined, just  
> so's it doesn't slip
> through Bioperl's fingers (if no objections in the next 10min).
> ----- Original Message ----- From: "Jason Stajich" <jason at bioperl.org>
> To: "Mark A. Jensen" <maj at fortinbras.us>; "Dan Bolser" <dan.bolser at gmail.com 
> >
> Cc: "Chris Fields" <cjfields at illinois.edu>; "BioPerl List" <bioperl-l at lists.open-bio.org 
> >
> Sent: Wednesday, May 13, 2009 2:34 AM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was:  
> Getting'features'fromSearchIO?)
>
>
>> Not looking at the code so I can't be sure, but I think since there  
>> is  no sequence in the blasttable format and I think %cons or %id  
>> is  calculated from some of the numbers that are taken from  
>> underlying  sequence data in the alignment.  I don't know why that  
>> would trip up  tile_hsps necessarily but maybe that's not where the  
>> errors are coming  from.
>>
>> I've generally solved these conversions with simpler perl scripts   
>> rather than searchio since I usually want speed for this kind of  
>> stuff  -- so I convert all the data into blast m9 format   
>> (FASTA,SSEARCH,WUBLAST, BLAST) and then have a simple parser that   
>> converts the m9 (without searchio) into the appropriate GFF with  
>> or  without additional bells and whistles.
>>
>> Dan ---
>> I am slowly cleaning up and migrating more of the pipelines that  
>> we  use to generate polished GFF for loading into Gbrowse.  You  
>> should  just set a goal of making good GFF3 and many of the  
>> visualization and  query issues go away.
>>
>> For genomes I tend to have a separate scaffold file which is just  
>> the  1 line per chromosome/scaffold/contig listing.
>> Then a file for each analysis ie. Protein to genome BLAST, EST to   
>> genome, remapped Pfam domains to genome coordinates, gene  
>> predictions,  etc. Keep it simple at first you know what you've  
>> updated when  something stops working the way you expect.
>>
>> -jason
>> On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote:
>>
>>> I dislike my patch, because it doesn't get to the bottom of why
>>> data members associated with numbers of conserved sites return
>>> from eval's undefined; it seems clear from the code that this is
>>> unexpected. Please excuse my naivete--why would this happen
>>> only in the blasttable format, and why hasn't this thing clucked   
>>> before?
>>> ----- Original Message -----
>>> From: Jason Stajich
>>> To: Mark A. Jensen
>>> Cc: Chris Fields ; BioPerl List ; Dan Bolser
>>> Sent: Tuesday, May 12, 2009 7:07 PM
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting  
>>> 'features'fromSearchIO?)
>>>
>>>
>>> I really don't think tile_hsps should be used on BLAST data  
>>> folks,  it is a pretty blind approach.
>>> If you really want the right answer you need to do -links with WU-  
>>> BLAST  or FASTA.
>>> Been discussed a few times on the mailing list.
>>>
>>>
>>> Good to fix the code bug I guess to avoid the warnings, but  
>>> unless  you are going to walk through all the HSPs and extract  
>>> the  consistent paths wrt query I think you'll have loops, etc in  
>>> there  which will make Hit->percent_id non-accurate.
>>>
>>>
>>> -jason
>>>
>>>
>>> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>>>
>>>
>>>   Patch below (to SearchUtils.pm) fixes the non-numeric warnings  
>>> on  Dan's data, but something deeper may be going on.
>>>   Also get many of the following warnings, haven't looked at it   
>>> closely:
>>>
>>>   --------------------- WARNING ---------------------
>>>   MSG: Removing score value(s)
>>>   ---------------------------------------------------
>>>
>>>   PATCH:
>>>
>>>   Index: SearchUtils.pm
>>>    
>>> ===================================================================
>>>   --- SearchUtils.pm (revision 15674)
>>>   +++ SearchUtils.pm (working copy)
>>>   @@ -252,8 +252,8 @@
>>>          }
>>>
>>>          $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
>>> >{'stop'} - $_->{'start'} + 1;
>>>   -        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_-
>>> >{'iden'};
>>>   -        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_-
>>> >{'cons'};
>>>   +        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_-
>>> >{'iden'} || 0;
>>>   +        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_-
>>> >{'cons'} || 0;
>>>          $qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
>>>      }
>>>
>>>   @@ -407,9 +407,12 @@
>>>              };
>>>              if($@) { warn "\a\n$@\n"; }
>>>              else {
>>>   +  # make sure numerical
>>>   +  $_->{'iden'} ||= 0;
>>>   +  $_->{'cons'} ||= 0;
>>>                  $_->{'start'} = $start; # Assign a new start   
>>> coordinate to the contig
>>>   -                $_->{'iden'} += $numID; # and add new data to  
>>> #identical, #conserved.
>>>   -                $_->{'cons'} += $numCons;
>>>   +                $_->{'iden'} += ($numID||0); # and add new  
>>> data  to #identical, #conserved.
>>>   +                $_->{'cons'} += ($numCons||0);
>>>                  push(@{$_->{hsps}}, $hsp);
>>>                  $overlap     = 1;
>>>              }
>>>   @@ -424,9 +427,13 @@
>>>              };
>>>              if($@) { warn "\a\n$@\n"; }
>>>              else {
>>>   +  # make sure numerical
>>>   +  $_->{'iden'} ||= 0;
>>>   +  $_->{'cons'} ||= 0;
>>>   +
>>>                  $_->{'stop'}  = $stop; # Assign a new stop   
>>> coordinate to the contig
>>>   -                $_->{'iden'} += $numID; # and add new data to  
>>> #identical, #conserved.
>>>   -                $_->{'cons'} += $numCons;
>>>   +                $_->{'iden'} += ($numID||0); # and add new  
>>> data  to #identical, #conserved.
>>>   +                $_->{'cons'} += ($numCons||0);
>>>                  push(@{$_->{hsps}}, $hsp);
>>>                  $overlap    = 1;
>>>              }
>>>   @@ -461,8 +468,8 @@
>>>                          };
>>>                          if($@) { warn "\a\n$@\n"; }
>>>                          else {
>>>   -                            $ids  += $these_ids;
>>>   -                            $cons += $these_cons;
>>>   +                            $ids  += ($these_ids||0);
>>>   +                            $cons += ($these_cons||0);
>>>                          }
>>>
>>>                          last if $hsp_start == $u_start;
>>>   @@ -490,8 +497,8 @@
>>>                          };
>>>                          if($@) { warn "\a\n$@\n"; }
>>>                          else {
>>>   -                            $ids  += $these_ids;
>>>   -                            $cons += $these_cons;
>>>   +                            $ids  += ($these_ids||0);
>>>   +                            $cons += ($these_cons||0);
>>>                          }
>>>
>>>                          last if $hsp_end == $u_stop;
>>>
>>>   ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>>> >
>>>   To: "Mark A. Jensen" <maj at fortinbras.us>
>>>   Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com
>>> >
>>>   Sent: Tuesday, May 12, 2009 10:04 AM
>>>   Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting  
>>> 'features'fromSearchIO?)
>>>
>>>
>>>
>>>     More complicated than that, I'm afraid.  We should try to fix   
>>> that at the source of the problem.
>>>
>>>
>>>
>>>     This appears to stem from SearchUtils HSP tiling, which in   
>>> turn utilizes HSPI::matches(), which in turn checks  
>>> num_identical/  num_conserved. My guess is, since this is  
>>> blasttable format, one  of  these isn't set and thus is returning  
>>> the wrong value.  I'll  attempt  to track it down today, but it  
>>> may take some time.
>>>
>>>
>>>
>>>     chris
>>>
>>>
>>>
>>>     On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>>
>>>
>>>
>>>       This sounds like a
>>>
>>>
>>>
>>>       $sum = eval join( '+', @a);
>>>
>>>
>>>
>>>       problem, which can be fixed with
>>>
>>>
>>>
>>>       $sum = eval join('+', map { $_ || () } @a) ;
>>>
>>>
>>>
>>>       MAJ
>>>
>>>       ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
>>> >
>>>
>>>       To: "Chris Fields" <cjfields at illinois.edu>
>>>
>>>       Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>>
>>>       Sent: Tuesday, May 12, 2009 9:17 AM
>>>
>>>       Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting   
>>> 'features' fromSearchIO?)
>>>
>>>
>>>
>>>
>>>
>>>       2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>>
>>>         Fixed that in svn. We're all still learning the ropes...
>>>
>>>
>>>
>>>       In that case, I'm seeing multiple instances of...
>>>
>>>
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 256
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 412
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 429
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 465
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 473
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 494
>>>
>>>       Argument "" isn't numeric in addition (+) at Bio/Search/  
>>> SearchUtils.pm line 502
>>>
>>>
>>>
>>>
>>>
>>>       Hmm... I was about to go on to complain about the weird GFF   
>>> that I was
>>>
>>>       seeing, but suddenly it looks OK. My bioperl install must   
>>> think your
>>>
>>>       standing over my shoulder and is therefore behaving itself!
>>>
>>>
>>>
>>>
>>>
>>>       Thanks again for all the help,
>>>
>>>       Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>         chris
>>>
>>>
>>>
>>>         On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>>
>>>
>>>
>>>           2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>
>>>
>>>
>>>             Unfortunately bp_search2gff.pl is giving me errors:
>>>
>>>
>>>
>>>             bp_search2gff.pl --version 3 -i BlastResults/  
>>> blast_table_filtered  -f
>>>
>>>             blasttable -o BlastResults/blast_table_filtered.gff - 
>>> t  hit
>>>
>>>             --match --target --component
>>>
>>>
>>>
>>>             --------------------- WARNING ---------------------
>>>
>>>             MSG: Removing score value(s)
>>>
>>>             ---------------------------------------------------
>>>
>>>             Can't locate object method "remove_tags" via package
>>>
>>>             "Bio::SeqFeature::Similarity" at
>>>
>>>             /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/  
>>> Generic.pm  line
>>>
>>>             393, <GEN1> line 5.
>>>
>>>
>>>
>>>
>>>
>>>           I'm just learning the ropes...
>>>
>>>
>>>
>>>           --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~  
>>> 2009-05-11
>>>
>>>           15:25:55.000000000 +0100
>>>
>>>           +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>>
>>>           11:52:41.000000000 +0100
>>>
>>>           @@ -390,7 +390,7 @@
>>>
>>>           }
>>>
>>>           if ($self->has_tag('score')) {
>>>
>>>           $self->warn("Removing score value(s)");
>>>
>>>           - $self->remove_tags('score');
>>>
>>>           + $self->remove_tag('score');
>>>
>>>           }
>>>
>>>           $self->add_tag_value('score',$value);
>>>
>>>           }
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>             Anyone seen this before?
>>>
>>>
>>>
>>>             Cheers,
>>>
>>>             Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>             2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>
>>>
>>>
>>>               Thanks for the info guys, I think I was naively   
>>> hoping that the
>>>
>>>               feature would know how to cast itself as a   
>>> 'SeqFeature' (GFF).
>>>
>>>
>>>
>>>               I think I understand the problem better now, so  
>>> I'll  try to summarise:
>>>
>>>
>>>
>>>               There is no standard way to encode a HSP as a  
>>> feature  (not least
>>>
>>>               because there are two choices about which sequence   
>>> (query or the  hit)
>>>
>>>               it should be attached to). BioPerl will try, but  
>>> the  result will  not
>>>
>>>               be "well structured" SeqFeatures or "well formed" GFF.
>>>
>>>
>>>
>>>
>>>
>>>               From what I read I guess it should be possible to   
>>> standardize this
>>>
>>>               mapping (based on something in one of the examples  
>>> or  the 'search2gff'
>>>
>>>               script), assuming you specify weather you want   
>>> features put on the
>>>
>>>               query or on the hit.
>>>
>>>
>>>
>>>               At some point last year I was trying out the   
>>> bp_search2gff.pl and my
>>>
>>>               own code to write a GFF file for loading and  
>>> viewing  by Gbrowse.  At
>>>
>>>               that time I gave up, as nothing seemed to be  
>>> working.  I was hoping
>>>
>>>               that doing this at a lower level (i.e. never  
>>> writing  any GFF myself)
>>>
>>>               it would stand a better chance of working.
>>>
>>>
>>>
>>>               Also I was thinking that Gbrowse, if given a  
>>> SeqFeature::Store,  could
>>>
>>>               autoconfigure its interface to some degree. I guess   
>>> its back to  the
>>>
>>>               docs ;-)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>               I'll keep trying and see if I can get anywhere.
>>>
>>>
>>>
>>>               Thanks again,
>>>
>>>               Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>               References for the above:
>>>
>>>
>>>
>>>               2009/5/11 Jason Stajich <jason at bioperl.org>:
>>>
>>>
>>>
>>>                 otherwise you need to be converting the HSPs into  
>>> seqfeatures  with the
>>>
>>>                 right associated information (i.e. the tag/value   
>>> pairs that are  in the 9th
>>>
>>>                 column) in order to have well structured data in   
>>> the database.
>>>
>>>
>>>
>>>                 You can get the individual features from the   
>>> feature pair with
>>>
>>>                 $hsp->query or $hsp->hit which can also be passed   
>>> to a GFF writer (or call
>>>
>>>                 $hsp->hit->gff_string). Note that since the data   
>>> storage is not structured
>>>
>>>                 in a GFF3 like-way this won't immediately produce   
>>> well formed  GFF3 for the
>>>
>>>                 9th column.
>>>
>>>
>>>
>>>
>>>
>>>               2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>>
>>>
>>>
>>>                 The main problem is the mapping is subjective  
>>> based  on what your
>>>
>>>                 reference sequence is within the BLAST run (e.g.   
>>> whether it is  the query or
>>>
>>>                 the hit), and is something that can't be   
>>> automatically discerned. I ended
>>>
>>>                 up rolling my own with SeqFeature::Store (just   
>>> mapped the relevant data to
>>>
>>>                 Bio::DB::SeqFeatures), but I have long wanted to   
>>> fix up the relevant scripts
>>>
>>>                 to integrate my changes in, just haven't had the  
>>> time
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>           _______________________________________________
>>>
>>>           Bioperl-l mailing list
>>>
>>>           Bioperl-l at lists.open-bio.org
>>>
>>>           http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>       _______________________________________________
>>>
>>>       Bioperl-l mailing list
>>>
>>>       Bioperl-l at lists.open-bio.org
>>>
>>>       http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>     _______________________________________________
>>>
>>>     Bioperl-l mailing list
>>>
>>>     Bioperl-l at lists.open-bio.org
>>>
>>>     http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>>   _______________________________________________
>>>   Bioperl-l mailing list
>>>   Bioperl-l at lists.open-bio.org
>>>   http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>> Jason Stajich
>>> jason at bioperl.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> Jason Stajich
>> jason at bioperl.org
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list