[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)

Mark A. Jensen maj at fortinbras.us
Tue May 12 22:00:41 EDT 2009


I dislike my patch, because it doesn't get to the bottom of why 
data members associated with numbers of conserved sites return
from eval's undefined; it seems clear from the code that this is 
unexpected. Please excuse my naivete--why would this happen 
only in the blasttable format, and why hasn't this thing clucked before?
  ----- Original Message ----- 
  From: Jason Stajich 
  To: Mark A. Jensen 
  Cc: Chris Fields ; BioPerl List ; Dan Bolser 
  Sent: Tuesday, May 12, 2009 7:07 PM
  Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)


  I really don't think tile_hsps should be used on BLAST data folks, it is a pretty blind approach. 
  If you really want the right answer you need to do -links with WU-BLAST  or FASTA.
  Been discussed a few times on the mailing list.


  Good to fix the code bug I guess to avoid the warnings, but unless you are going to walk through all the HSPs and extract the consistent paths wrt query I think you'll have loops, etc in there which will make Hit->percent_id non-accurate. 


  -jason


  On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:


    Patch below (to SearchUtils.pm) fixes the non-numeric warnings on Dan's data, but something deeper may be going on.
    Also get many of the following warnings, haven't looked at it closely:

    --------------------- WARNING ---------------------
    MSG: Removing score value(s)
    ---------------------------------------------------

    PATCH:

    Index: SearchUtils.pm
    ===================================================================
    --- SearchUtils.pm (revision 15674)
    +++ SearchUtils.pm (working copy)
    @@ -252,8 +252,8 @@
           }

           $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
    -        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_->{'iden'};
    -        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_->{'cons'};
    +        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_->{'iden'} || 0;
    +        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_->{'cons'} || 0;
           $qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
       }

    @@ -407,9 +407,12 @@
               };
               if($@) { warn "\a\n$@\n"; }
               else {
    +  # make sure numerical
    +  $_->{'iden'} ||= 0;
    +  $_->{'cons'} ||= 0;
                   $_->{'start'} = $start; # Assign a new start coordinate to the contig
    -                $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
    -                $_->{'cons'} += $numCons;
    +                $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved.
    +                $_->{'cons'} += ($numCons||0);
                   push(@{$_->{hsps}}, $hsp);
                   $overlap     = 1;
               }
    @@ -424,9 +427,13 @@
               };
               if($@) { warn "\a\n$@\n"; }
               else {
    +  # make sure numerical
    +  $_->{'iden'} ||= 0;
    +  $_->{'cons'} ||= 0;
    +
                   $_->{'stop'}  = $stop; # Assign a new stop coordinate to the contig
    -                $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
    -                $_->{'cons'} += $numCons;
    +                $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved.
    +                $_->{'cons'} += ($numCons||0);
                   push(@{$_->{hsps}}, $hsp);
                   $overlap    = 1;
               }
    @@ -461,8 +468,8 @@
                           };
                           if($@) { warn "\a\n$@\n"; }
                           else {
    -                            $ids  += $these_ids;
    -                            $cons += $these_cons;
    +                            $ids  += ($these_ids||0);
    +                            $cons += ($these_cons||0);
                           }

                           last if $hsp_start == $u_start;
    @@ -490,8 +497,8 @@
                           };
                           if($@) { warn "\a\n$@\n"; }
                           else {
    -                            $ids  += $these_ids;
    -                            $cons += $these_cons;
    +                            $ids  += ($these_ids||0);
    +                            $cons += ($these_cons||0);
                           }

                           last if $hsp_end == $u_stop;

    ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu>
    To: "Mark A. Jensen" <maj at fortinbras.us>
    Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com>
    Sent: Tuesday, May 12, 2009 10:04 AM
    Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)



      More complicated than that, I'm afraid.  We should try to fix that at  the source of the problem.



      This appears to stem from SearchUtils HSP tiling, which in turn  utilizes HSPI::matches(), which in turn checks num_identical/ num_conserved.  My guess is, since this is blasttable format, one of  these isn't set and thus is returning the wrong value.  I'll attempt  to track it down today, but it may take some time.



      chris



      On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:



        This sounds like a



        $sum = eval join( '+', @a);



        problem, which can be fixed with



        $sum = eval join('+', map { $_ || () } @a) ;



        MAJ

        ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com>

        To: "Chris Fields" <cjfields at illinois.edu>

        Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>

        Sent: Tuesday, May 12, 2009 9:17 AM

        Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' fromSearchIO?)





        2009/5/12 Chris Fields <cjfields at illinois.edu>:

          Fixed that in svn. We're all still learning the ropes...



        In that case, I'm seeing multiple instances of...



        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 256

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 412

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 429

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 465

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 473

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 494

        Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 502





        Hmm... I was about to go on to complain about the weird GFF that I was

        seeing, but suddenly it looks OK. My bioperl install must think your

        standing over my shoulder and is therefore behaving itself!





        Thanks again for all the help,

        Dan.









          chris



          On May 12, 2009, at 5:55 AM, Dan Bolser wrote:



            2009/5/12 Dan Bolser <dan.bolser at gmail.com>:



              Unfortunately bp_search2gff.pl is giving me errors:



              bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered  -f

              blasttable -o BlastResults/blast_table_filtered.gff -t hit

              --match --target --component



              --------------------- WARNING ---------------------

              MSG: Removing score value(s)

              ---------------------------------------------------

              Can't locate object method "remove_tags" via package

              "Bio::SeqFeature::Similarity" at

              /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm  line

              393, <GEN1> line 5.





            I'm just learning the ropes...



            --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11

            15:25:55.000000000 +0100

            +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12

            11:52:41.000000000 +0100

            @@ -390,7 +390,7 @@

            }

            if ($self->has_tag('score')) {

            $self->warn("Removing score value(s)");

            - $self->remove_tags('score');

            + $self->remove_tag('score');

            }

            $self->add_tag_value('score',$value);

            }











              Anyone seen this before?



              Cheers,

              Dan.







              2009/5/12 Dan Bolser <dan.bolser at gmail.com>:



                Thanks for the info guys, I think I was naively hoping that the

                feature would know how to cast itself as a 'SeqFeature' (GFF).



                I think I understand the problem better now, so I'll try to  summarise:



                There is no standard way to encode a HSP as a feature (not least

                because there are two choices about which sequence (query or the  hit)

                it should be attached to). BioPerl will try, but the result will  not

                be "well structured" SeqFeatures or "well formed" GFF.





                From what I read I guess it should be possible to standardize this

                mapping (based on something in one of the examples or the  'search2gff'

                script), assuming you specify weather you want features put on the

                query or on the hit.



                At some point last year I was trying out the bp_search2gff.pl  and my

                own code to write a GFF file for loading and viewing by Gbrowse.  At

                that time I gave up, as nothing seemed to be working. I was hoping

                that doing this at a lower level (i.e. never writing any GFF  myself)

                it would stand a better chance of working.



                Also I was thinking that Gbrowse, if given a SeqFeature::Store,  could

                autoconfigure its interface to some degree. I guess its back to  the

                docs ;-)







                I'll keep trying and see if I can get anywhere.



                Thanks again,

                Dan.







                References for the above:



                2009/5/11 Jason Stajich <jason at bioperl.org>:



                  otherwise you need to be converting the HSPs into seqfeatures  with the

                  right associated information (i.e. the tag/value pairs that are  in the 9th

                  column) in order to have well structured data in the database.



                  You can get the individual features from the feature pair with

                  $hsp->query or $hsp->hit which can also be passed to a GFF  writer (or call

                  $hsp->hit->gff_string). Note that since the data storage is not structured

                  in a GFF3 like-way this won't immediately produce well formed  GFF3 for the

                  9th column.





                2009/5/11 Chris Fields <cjfields at illinois.edu>:



                  The main problem is the mapping is subjective based on what your

                  reference sequence is within the BLAST run (e.g. whether it is  the query or

                  the hit), and is something that can't be automatically  discerned. I ended

                  up rolling my own with SeqFeature::Store (just mapped the  relevant data to

                  Bio::DB::SeqFeatures), but I have long wanted to fix up the  relevant scripts

                  to integrate my changes in, just haven't had the time







            _______________________________________________

            Bioperl-l mailing list

            Bioperl-l at lists.open-bio.org

            http://lists.open-bio.org/mailman/listinfo/bioperl-l







        _______________________________________________

        Bioperl-l mailing list

        Bioperl-l at lists.open-bio.org

        http://lists.open-bio.org/mailman/listinfo/bioperl-l







      _______________________________________________

      Bioperl-l mailing list

      Bioperl-l at lists.open-bio.org

      http://lists.open-bio.org/mailman/listinfo/bioperl-l




    _______________________________________________
    Bioperl-l mailing list
    Bioperl-l at lists.open-bio.org
    http://lists.open-bio.org/mailman/listinfo/bioperl-l



  Jason Stajich
  jason at bioperl.org











More information about the Bioperl-l mailing list