[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
Jason Stajich
jason at bioperl.org
Wed May 13 02:34:56 EDT 2009
Not looking at the code so I can't be sure, but I think since there is
no sequence in the blasttable format and I think %cons or %id is
calculated from some of the numbers that are taken from underlying
sequence data in the alignment. I don't know why that would trip up
tile_hsps necessarily but maybe that's not where the errors are coming
from.
I've generally solved these conversions with simpler perl scripts
rather than searchio since I usually want speed for this kind of stuff
-- so I convert all the data into blast m9 format
(FASTA,SSEARCH,WUBLAST, BLAST) and then have a simple parser that
converts the m9 (without searchio) into the appropriate GFF with or
without additional bells and whistles.
Dan ---
I am slowly cleaning up and migrating more of the pipelines that we
use to generate polished GFF for loading into Gbrowse. You should
just set a goal of making good GFF3 and many of the visualization and
query issues go away.
For genomes I tend to have a separate scaffold file which is just the
1 line per chromosome/scaffold/contig listing.
Then a file for each analysis ie. Protein to genome BLAST, EST to
genome, remapped Pfam domains to genome coordinates, gene predictions,
etc. Keep it simple at first you know what you've updated when
something stops working the way you expect.
-jason
On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote:
> I dislike my patch, because it doesn't get to the bottom of why
> data members associated with numbers of conserved sites return
> from eval's undefined; it seems clear from the code that this is
> unexpected. Please excuse my naivete--why would this happen
> only in the blasttable format, and why hasn't this thing clucked
> before?
> ----- Original Message -----
> From: Jason Stajich
> To: Mark A. Jensen
> Cc: Chris Fields ; BioPerl List ; Dan Bolser
> Sent: Tuesday, May 12, 2009 7:07 PM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
> 'features'fromSearchIO?)
>
>
> I really don't think tile_hsps should be used on BLAST data folks,
> it is a pretty blind approach.
> If you really want the right answer you need to do -links with WU-
> BLAST or FASTA.
> Been discussed a few times on the mailing list.
>
>
> Good to fix the code bug I guess to avoid the warnings, but unless
> you are going to walk through all the HSPs and extract the
> consistent paths wrt query I think you'll have loops, etc in there
> which will make Hit->percent_id non-accurate.
>
>
> -jason
>
>
> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>
>
> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on
> Dan's data, but something deeper may be going on.
> Also get many of the following warnings, haven't looked at it
> closely:
>
> --------------------- WARNING ---------------------
> MSG: Removing score value(s)
> ---------------------------------------------------
>
> PATCH:
>
> Index: SearchUtils.pm
> ===================================================================
> --- SearchUtils.pm (revision 15674)
> +++ SearchUtils.pm (working copy)
> @@ -252,8 +252,8 @@
> }
>
> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
> >{'stop'} - $_->{'start'} + 1;
> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
> >{'iden'};
> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
> >{'cons'};
> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
> >{'iden'} || 0;
> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
> >{'cons'} || 0;
> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
> }
>
> @@ -407,9 +407,12 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> + # make sure numerical
> + $_->{'iden'} ||= 0;
> + $_->{'cons'} ||= 0;
> $_->{'start'} = $start; # Assign a new start
> coordinate to the contig
> - $_->{'iden'} += $numID; # and add new data to
> #identical, #conserved.
> - $_->{'cons'} += $numCons;
> + $_->{'iden'} += ($numID||0); # and add new data
> to #identical, #conserved.
> + $_->{'cons'} += ($numCons||0);
> push(@{$_->{hsps}}, $hsp);
> $overlap = 1;
> }
> @@ -424,9 +427,13 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> + # make sure numerical
> + $_->{'iden'} ||= 0;
> + $_->{'cons'} ||= 0;
> +
> $_->{'stop'} = $stop; # Assign a new stop
> coordinate to the contig
> - $_->{'iden'} += $numID; # and add new data to
> #identical, #conserved.
> - $_->{'cons'} += $numCons;
> + $_->{'iden'} += ($numID||0); # and add new data
> to #identical, #conserved.
> + $_->{'cons'} += ($numCons||0);
> push(@{$_->{hsps}}, $hsp);
> $overlap = 1;
> }
> @@ -461,8 +468,8 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> - $ids += $these_ids;
> - $cons += $these_cons;
> + $ids += ($these_ids||0);
> + $cons += ($these_cons||0);
> }
>
> last if $hsp_start == $u_start;
> @@ -490,8 +497,8 @@
> };
> if($@) { warn "\a\n$@\n"; }
> else {
> - $ids += $these_ids;
> - $cons += $these_cons;
> + $ids += ($these_ids||0);
> + $cons += ($these_cons||0);
> }
>
> last if $hsp_end == $u_stop;
>
> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
> >
> To: "Mark A. Jensen" <maj at fortinbras.us>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com
> >
> Sent: Tuesday, May 12, 2009 10:04 AM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
> 'features'fromSearchIO?)
>
>
>
> More complicated than that, I'm afraid. We should try to fix
> that at the source of the problem.
>
>
>
> This appears to stem from SearchUtils HSP tiling, which in
> turn utilizes HSPI::matches(), which in turn checks num_identical/
> num_conserved. My guess is, since this is blasttable format, one
> of these isn't set and thus is returning the wrong value. I'll
> attempt to track it down today, but it may take some time.
>
>
>
> chris
>
>
>
> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>
>
>
> This sounds like a
>
>
>
> $sum = eval join( '+', @a);
>
>
>
> problem, which can be fixed with
>
>
>
> $sum = eval join('+', map { $_ || () } @a) ;
>
>
>
> MAJ
>
> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
> >
>
> To: "Chris Fields" <cjfields at illinois.edu>
>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>
> Sent: Tuesday, May 12, 2009 9:17 AM
>
> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
> 'features' fromSearchIO?)
>
>
>
>
>
> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>
> Fixed that in svn. We're all still learning the ropes...
>
>
>
> In that case, I'm seeing multiple instances of...
>
>
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 256
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 412
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 429
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 465
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 473
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 494
>
> Argument "" isn't numeric in addition (+) at Bio/Search/
> SearchUtils.pm line 502
>
>
>
>
>
> Hmm... I was about to go on to complain about the weird GFF
> that I was
>
> seeing, but suddenly it looks OK. My bioperl install must
> think your
>
> standing over my shoulder and is therefore behaving itself!
>
>
>
>
>
> Thanks again for all the help,
>
> Dan.
>
>
>
>
>
>
>
>
>
> chris
>
>
>
> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>
>
>
> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>
>
>
> Unfortunately bp_search2gff.pl is giving me errors:
>
>
>
> bp_search2gff.pl --version 3 -i BlastResults/
> blast_table_filtered -f
>
> blasttable -o BlastResults/blast_table_filtered.gff -t
> hit
>
> --match --target --component
>
>
>
> --------------------- WARNING ---------------------
>
> MSG: Removing score value(s)
>
> ---------------------------------------------------
>
> Can't locate object method "remove_tags" via package
>
> "Bio::SeqFeature::Similarity" at
>
> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/
> Generic.pm line
>
> 393, <GEN1> line 5.
>
>
>
>
>
> I'm just learning the ropes...
>
>
>
> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>
> 15:25:55.000000000 +0100
>
> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>
> 11:52:41.000000000 +0100
>
> @@ -390,7 +390,7 @@
>
> }
>
> if ($self->has_tag('score')) {
>
> $self->warn("Removing score value(s)");
>
> - $self->remove_tags('score');
>
> + $self->remove_tag('score');
>
> }
>
> $self->add_tag_value('score',$value);
>
> }
>
>
>
>
>
>
>
>
>
>
>
> Anyone seen this before?
>
>
>
> Cheers,
>
> Dan.
>
>
>
>
>
>
>
> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>
>
>
> Thanks for the info guys, I think I was naively
> hoping that the
>
> feature would know how to cast itself as a
> 'SeqFeature' (GFF).
>
>
>
> I think I understand the problem better now, so I'll
> try to summarise:
>
>
>
> There is no standard way to encode a HSP as a feature
> (not least
>
> because there are two choices about which sequence
> (query or the hit)
>
> it should be attached to). BioPerl will try, but the
> result will not
>
> be "well structured" SeqFeatures or "well formed" GFF.
>
>
>
>
>
> From what I read I guess it should be possible to
> standardize this
>
> mapping (based on something in one of the examples or
> the 'search2gff'
>
> script), assuming you specify weather you want
> features put on the
>
> query or on the hit.
>
>
>
> At some point last year I was trying out the
> bp_search2gff.pl and my
>
> own code to write a GFF file for loading and viewing
> by Gbrowse. At
>
> that time I gave up, as nothing seemed to be working.
> I was hoping
>
> that doing this at a lower level (i.e. never writing
> any GFF myself)
>
> it would stand a better chance of working.
>
>
>
> Also I was thinking that Gbrowse, if given a
> SeqFeature::Store, could
>
> autoconfigure its interface to some degree. I guess
> its back to the
>
> docs ;-)
>
>
>
>
>
>
>
> I'll keep trying and see if I can get anywhere.
>
>
>
> Thanks again,
>
> Dan.
>
>
>
>
>
>
>
> References for the above:
>
>
>
> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>
>
>
> otherwise you need to be converting the HSPs into
> seqfeatures with the
>
> right associated information (i.e. the tag/value
> pairs that are in the 9th
>
> column) in order to have well structured data in
> the database.
>
>
>
> You can get the individual features from the
> feature pair with
>
> $hsp->query or $hsp->hit which can also be passed
> to a GFF writer (or call
>
> $hsp->hit->gff_string). Note that since the data
> storage is not structured
>
> in a GFF3 like-way this won't immediately produce
> well formed GFF3 for the
>
> 9th column.
>
>
>
>
>
> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>
>
>
> The main problem is the mapping is subjective based
> on what your
>
> reference sequence is within the BLAST run (e.g.
> whether it is the query or
>
> the hit), and is something that can't be
> automatically discerned. I ended
>
> up rolling my own with SeqFeature::Store (just
> mapped the relevant data to
>
> Bio::DB::SeqFeatures), but I have long wanted to
> fix up the relevant scripts
>
> to integrate my changes in, just haven't had the time
>
>
>
>
>
>
>
> _______________________________________________
>
> Bioperl-l mailing list
>
> Bioperl-l at lists.open-bio.org
>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
>
>
>
>
> _______________________________________________
>
> Bioperl-l mailing list
>
> Bioperl-l at lists.open-bio.org
>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
>
>
>
>
> _______________________________________________
>
> Bioperl-l mailing list
>
> Bioperl-l at lists.open-bio.org
>
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>
>
>
>
Jason Stajich
jason at bioperl.org
More information about the Bioperl-l
mailing list