[Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)

Mark A. Jensen maj at fortinbras.us
Thu May 14 10:07:45 EDT 2009


I'll patch in the fix with a cluck that these are undefined, just so's it 
doesn't slip
through Bioperl's fingers (if no objections in the next 10min).
----- Original Message ----- 
From: "Jason Stajich" <jason at bioperl.org>
To: "Mark A. Jensen" <maj at fortinbras.us>; "Dan Bolser" <dan.bolser at gmail.com>
Cc: "Chris Fields" <cjfields at illinois.edu>; "BioPerl List" 
<bioperl-l at lists.open-bio.org>
Sent: Wednesday, May 13, 2009 2:34 AM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)


> Not looking at the code so I can't be sure, but I think since there is  no 
> sequence in the blasttable format and I think %cons or %id is  calculated from 
> some of the numbers that are taken from underlying  sequence data in the 
> alignment.  I don't know why that would trip up  tile_hsps necessarily but 
> maybe that's not where the errors are coming  from.
>
> I've generally solved these conversions with simpler perl scripts  rather than 
> searchio since I usually want speed for this kind of stuff  -- so I convert 
> all the data into blast m9 format  (FASTA,SSEARCH,WUBLAST, BLAST) and then 
> have a simple parser that  converts the m9 (without searchio) into the 
> appropriate GFF with or  without additional bells and whistles.
>
> Dan ---
> I am slowly cleaning up and migrating more of the pipelines that we  use to 
> generate polished GFF for loading into Gbrowse.  You should  just set a goal 
> of making good GFF3 and many of the visualization and  query issues go away.
>
> For genomes I tend to have a separate scaffold file which is just the  1 line 
> per chromosome/scaffold/contig listing.
> Then a file for each analysis ie. Protein to genome BLAST, EST to  genome, 
> remapped Pfam domains to genome coordinates, gene predictions,  etc. Keep it 
> simple at first you know what you've updated when  something stops working the 
> way you expect.
>
> -jason
> On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote:
>
>> I dislike my patch, because it doesn't get to the bottom of why
>> data members associated with numbers of conserved sites return
>> from eval's undefined; it seems clear from the code that this is
>> unexpected. Please excuse my naivete--why would this happen
>> only in the blasttable format, and why hasn't this thing clucked  before?
>>  ----- Original Message -----
>>  From: Jason Stajich
>>  To: Mark A. Jensen
>>  Cc: Chris Fields ; BioPerl List ; Dan Bolser
>>  Sent: Tuesday, May 12, 2009 7:07 PM
>>  Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 
>> 'features'fromSearchIO?)
>>
>>
>>  I really don't think tile_hsps should be used on BLAST data folks,  it is a 
>> pretty blind approach.
>>  If you really want the right answer you need to do -links with WU- BLAST  or 
>> FASTA.
>>  Been discussed a few times on the mailing list.
>>
>>
>>  Good to fix the code bug I guess to avoid the warnings, but unless  you are 
>> going to walk through all the HSPs and extract the  consistent paths wrt 
>> query I think you'll have loops, etc in there  which will make 
>> Hit->percent_id non-accurate.
>>
>>
>>  -jason
>>
>>
>>  On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>>
>>
>>    Patch below (to SearchUtils.pm) fixes the non-numeric warnings on  Dan's 
>> data, but something deeper may be going on.
>>    Also get many of the following warnings, haven't looked at it  closely:
>>
>>    --------------------- WARNING ---------------------
>>    MSG: Removing score value(s)
>>    ---------------------------------------------------
>>
>>    PATCH:
>>
>>    Index: SearchUtils.pm
>>    ===================================================================
>>    --- SearchUtils.pm (revision 15674)
>>    +++ SearchUtils.pm (working copy)
>>    @@ -252,8 +252,8 @@
>>           }
>>
>>           $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
>> >{'stop'} - $_->{'start'} + 1;
>>    -        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_-
>> >{'iden'};
>>    -        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_-
>> >{'cons'};
>>    +        $qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_-
>> >{'iden'} || 0;
>>    +        $qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_-
>> >{'cons'} || 0;
>>           $qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
>>       }
>>
>>    @@ -407,9 +407,12 @@
>>               };
>>               if($@) { warn "\a\n$@\n"; }
>>               else {
>>    +  # make sure numerical
>>    +  $_->{'iden'} ||= 0;
>>    +  $_->{'cons'} ||= 0;
>>                   $_->{'start'} = $start; # Assign a new start  coordinate to 
>> the contig
>>    -                $_->{'iden'} += $numID; # and add new data to 
>> #identical, #conserved.
>>    -                $_->{'cons'} += $numCons;
>>    +                $_->{'iden'} += ($numID||0); # and add new data  to 
>> #identical, #conserved.
>>    +                $_->{'cons'} += ($numCons||0);
>>                   push(@{$_->{hsps}}, $hsp);
>>                   $overlap     = 1;
>>               }
>>    @@ -424,9 +427,13 @@
>>               };
>>               if($@) { warn "\a\n$@\n"; }
>>               else {
>>    +  # make sure numerical
>>    +  $_->{'iden'} ||= 0;
>>    +  $_->{'cons'} ||= 0;
>>    +
>>                   $_->{'stop'}  = $stop; # Assign a new stop  coordinate to 
>> the contig
>>    -                $_->{'iden'} += $numID; # and add new data to 
>> #identical, #conserved.
>>    -                $_->{'cons'} += $numCons;
>>    +                $_->{'iden'} += ($numID||0); # and add new data  to 
>> #identical, #conserved.
>>    +                $_->{'cons'} += ($numCons||0);
>>                   push(@{$_->{hsps}}, $hsp);
>>                   $overlap    = 1;
>>               }
>>    @@ -461,8 +468,8 @@
>>                           };
>>                           if($@) { warn "\a\n$@\n"; }
>>                           else {
>>    -                            $ids  += $these_ids;
>>    -                            $cons += $these_cons;
>>    +                            $ids  += ($these_ids||0);
>>    +                            $cons += ($these_cons||0);
>>                           }
>>
>>                           last if $hsp_start == $u_start;
>>    @@ -490,8 +497,8 @@
>>                           };
>>                           if($@) { warn "\a\n$@\n"; }
>>                           else {
>>    -                            $ids  += $these_ids;
>>    -                            $cons += $these_cons;
>>    +                            $ids  += ($these_ids||0);
>>    +                            $cons += ($these_cons||0);
>>                           }
>>
>>                           last if $hsp_end == $u_stop;
>>
>>    ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>> >
>>    To: "Mark A. Jensen" <maj at fortinbras.us>
>>    Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" 
>> <dan.bolser at gmail.com
>> >
>>    Sent: Tuesday, May 12, 2009 10:04 AM
>>    Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 
>> 'features'fromSearchIO?)
>>
>>
>>
>>      More complicated than that, I'm afraid.  We should try to fix  that at 
>> the source of the problem.
>>
>>
>>
>>      This appears to stem from SearchUtils HSP tiling, which in  turn 
>> utilizes HSPI::matches(), which in turn checks num_identical/  num_conserved. 
>> My guess is, since this is blasttable format, one  of  these isn't set and 
>> thus is returning the wrong value.  I'll  attempt  to track it down today, 
>> but it may take some time.
>>
>>
>>
>>      chris
>>
>>
>>
>>      On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>
>>
>>
>>        This sounds like a
>>
>>
>>
>>        $sum = eval join( '+', @a);
>>
>>
>>
>>        problem, which can be fixed with
>>
>>
>>
>>        $sum = eval join('+', map { $_ || () } @a) ;
>>
>>
>>
>>        MAJ
>>
>>        ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
>> >
>>
>>        To: "Chris Fields" <cjfields at illinois.edu>
>>
>>        Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>
>>        Sent: Tuesday, May 12, 2009 9:17 AM
>>
>>        Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting  'features' 
>> fromSearchIO?)
>>
>>
>>
>>
>>
>>        2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>
>>          Fixed that in svn. We're all still learning the ropes...
>>
>>
>>
>>        In that case, I'm seeing multiple instances of...
>>
>>
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 256
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 412
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 429
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 465
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 473
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 494
>>
>>        Argument "" isn't numeric in addition (+) at Bio/Search/ 
>> SearchUtils.pm line 502
>>
>>
>>
>>
>>
>>        Hmm... I was about to go on to complain about the weird GFF  that I 
>> was
>>
>>        seeing, but suddenly it looks OK. My bioperl install must  think your
>>
>>        standing over my shoulder and is therefore behaving itself!
>>
>>
>>
>>
>>
>>        Thanks again for all the help,
>>
>>        Dan.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>          chris
>>
>>
>>
>>          On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>
>>
>>
>>            2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>
>>
>>
>>              Unfortunately bp_search2gff.pl is giving me errors:
>>
>>
>>
>>              bp_search2gff.pl --version 3 -i BlastResults/ 
>> blast_table_filtered  -f
>>
>>              blasttable -o BlastResults/blast_table_filtered.gff -t  hit
>>
>>              --match --target --component
>>
>>
>>
>>              --------------------- WARNING ---------------------
>>
>>              MSG: Removing score value(s)
>>
>>              ---------------------------------------------------
>>
>>              Can't locate object method "remove_tags" via package
>>
>>              "Bio::SeqFeature::Similarity" at
>>
>>              /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ 
>> Generic.pm  line
>>
>>              393, <GEN1> line 5.
>>
>>
>>
>>
>>
>>            I'm just learning the ropes...
>>
>>
>>
>>            --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>>
>>            15:25:55.000000000 +0100
>>
>>            +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>
>>            11:52:41.000000000 +0100
>>
>>            @@ -390,7 +390,7 @@
>>
>>            }
>>
>>            if ($self->has_tag('score')) {
>>
>>            $self->warn("Removing score value(s)");
>>
>>            - $self->remove_tags('score');
>>
>>            + $self->remove_tag('score');
>>
>>            }
>>
>>            $self->add_tag_value('score',$value);
>>
>>            }
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>              Anyone seen this before?
>>
>>
>>
>>              Cheers,
>>
>>              Dan.
>>
>>
>>
>>
>>
>>
>>
>>              2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>
>>
>>
>>                Thanks for the info guys, I think I was naively  hoping that 
>> the
>>
>>                feature would know how to cast itself as a  'SeqFeature' 
>> (GFF).
>>
>>
>>
>>                I think I understand the problem better now, so I'll  try to 
>> summarise:
>>
>>
>>
>>                There is no standard way to encode a HSP as a feature  (not 
>> least
>>
>>                because there are two choices about which sequence  (query or 
>> the  hit)
>>
>>                it should be attached to). BioPerl will try, but the  result 
>> will  not
>>
>>                be "well structured" SeqFeatures or "well formed" GFF.
>>
>>
>>
>>
>>
>>                From what I read I guess it should be possible to  standardize 
>> this
>>
>>                mapping (based on something in one of the examples or  the 
>> 'search2gff'
>>
>>                script), assuming you specify weather you want  features put 
>> on the
>>
>>                query or on the hit.
>>
>>
>>
>>                At some point last year I was trying out the  bp_search2gff.pl 
>> and my
>>
>>                own code to write a GFF file for loading and viewing  by 
>> Gbrowse.  At
>>
>>                that time I gave up, as nothing seemed to be working.  I was 
>> hoping
>>
>>                that doing this at a lower level (i.e. never writing  any GFF 
>> myself)
>>
>>                it would stand a better chance of working.
>>
>>
>>
>>                Also I was thinking that Gbrowse, if given a 
>> SeqFeature::Store,  could
>>
>>                autoconfigure its interface to some degree. I guess  its back 
>> to  the
>>
>>                docs ;-)
>>
>>
>>
>>
>>
>>
>>
>>                I'll keep trying and see if I can get anywhere.
>>
>>
>>
>>                Thanks again,
>>
>>                Dan.
>>
>>
>>
>>
>>
>>
>>
>>                References for the above:
>>
>>
>>
>>                2009/5/11 Jason Stajich <jason at bioperl.org>:
>>
>>
>>
>>                  otherwise you need to be converting the HSPs into 
>> seqfeatures  with the
>>
>>                  right associated information (i.e. the tag/value  pairs that 
>> are  in the 9th
>>
>>                  column) in order to have well structured data in  the 
>> database.
>>
>>
>>
>>                  You can get the individual features from the  feature pair 
>> with
>>
>>                  $hsp->query or $hsp->hit which can also be passed  to a GFF 
>> writer (or call
>>
>>                  $hsp->hit->gff_string). Note that since the data  storage is 
>> not structured
>>
>>                  in a GFF3 like-way this won't immediately produce  well 
>> formed  GFF3 for the
>>
>>                  9th column.
>>
>>
>>
>>
>>
>>                2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>
>>
>>
>>                  The main problem is the mapping is subjective based  on what 
>> your
>>
>>                  reference sequence is within the BLAST run (e.g.  whether it 
>> is  the query or
>>
>>                  the hit), and is something that can't be  automatically 
>> discerned. I ended
>>
>>                  up rolling my own with SeqFeature::Store (just  mapped the 
>> relevant data to
>>
>>                  Bio::DB::SeqFeatures), but I have long wanted to  fix up the 
>> relevant scripts
>>
>>                  to integrate my changes in, just haven't had the time
>>
>>
>>
>>
>>
>>
>>
>>            _______________________________________________
>>
>>            Bioperl-l mailing list
>>
>>            Bioperl-l at lists.open-bio.org
>>
>>            http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>
>>
>>
>>        _______________________________________________
>>
>>        Bioperl-l mailing list
>>
>>        Bioperl-l at lists.open-bio.org
>>
>>        http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>
>>
>>
>>      _______________________________________________
>>
>>      Bioperl-l mailing list
>>
>>      Bioperl-l at lists.open-bio.org
>>
>>      http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>    _______________________________________________
>>    Bioperl-l mailing list
>>    Bioperl-l at lists.open-bio.org
>>    http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>  Jason Stajich
>>  jason at bioperl.org
>>
>>
>>
>>
>>
>>
>>
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> 




More information about the Bioperl-l mailing list