[Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
Mark A. Jensen
maj at fortinbras.us
Tue May 12 22:00:41 EDT 2009
I dislike my patch, because it doesn't get to the bottom of why
data members associated with numbers of conserved sites return
from eval's undefined; it seems clear from the code that this is
unexpected. Please excuse my naivete--why would this happen
only in the blasttable format, and why hasn't this thing clucked before?
----- Original Message -----
From: Jason Stajich
To: Mark A. Jensen
Cc: Chris Fields ; BioPerl List ; Dan Bolser
Sent: Tuesday, May 12, 2009 7:07 PM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
I really don't think tile_hsps should be used on BLAST data folks, it is a pretty blind approach.
If you really want the right answer you need to do -links with WU-BLAST or FASTA.
Been discussed a few times on the mailing list.
Good to fix the code bug I guess to avoid the warnings, but unless you are going to walk through all the HSPs and extract the consistent paths wrt query I think you'll have loops, etc in there which will make Hit->percent_id non-accurate.
-jason
On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
Patch below (to SearchUtils.pm) fixes the non-numeric warnings on Dan's data, but something deeper may be going on.
Also get many of the following warnings, haven't looked at it closely:
--------------------- WARNING ---------------------
MSG: Removing score value(s)
---------------------------------------------------
PATCH:
Index: SearchUtils.pm
===================================================================
--- SearchUtils.pm (revision 15674)
+++ SearchUtils.pm (working copy)
@@ -252,8 +252,8 @@
}
$qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
- $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'};
- $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'};
+ $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'} || 0;
+ $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'} || 0;
$qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
}
@@ -407,9 +407,12 @@
};
if($@) { warn "\a\n$@\n"; }
else {
+ # make sure numerical
+ $_->{'iden'} ||= 0;
+ $_->{'cons'} ||= 0;
$_->{'start'} = $start; # Assign a new start coordinate to the contig
- $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
- $_->{'cons'} += $numCons;
+ $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved.
+ $_->{'cons'} += ($numCons||0);
push(@{$_->{hsps}}, $hsp);
$overlap = 1;
}
@@ -424,9 +427,13 @@
};
if($@) { warn "\a\n$@\n"; }
else {
+ # make sure numerical
+ $_->{'iden'} ||= 0;
+ $_->{'cons'} ||= 0;
+
$_->{'stop'} = $stop; # Assign a new stop coordinate to the contig
- $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
- $_->{'cons'} += $numCons;
+ $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved.
+ $_->{'cons'} += ($numCons||0);
push(@{$_->{hsps}}, $hsp);
$overlap = 1;
}
@@ -461,8 +468,8 @@
};
if($@) { warn "\a\n$@\n"; }
else {
- $ids += $these_ids;
- $cons += $these_cons;
+ $ids += ($these_ids||0);
+ $cons += ($these_cons||0);
}
last if $hsp_start == $u_start;
@@ -490,8 +497,8 @@
};
if($@) { warn "\a\n$@\n"; }
else {
- $ids += $these_ids;
- $cons += $these_cons;
+ $ids += ($these_ids||0);
+ $cons += ($these_cons||0);
}
last if $hsp_end == $u_stop;
----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu>
To: "Mark A. Jensen" <maj at fortinbras.us>
Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com>
Sent: Tuesday, May 12, 2009 10:04 AM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
More complicated than that, I'm afraid. We should try to fix that at the source of the problem.
This appears to stem from SearchUtils HSP tiling, which in turn utilizes HSPI::matches(), which in turn checks num_identical/ num_conserved. My guess is, since this is blasttable format, one of these isn't set and thus is returning the wrong value. I'll attempt to track it down today, but it may take some time.
chris
On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
This sounds like a
$sum = eval join( '+', @a);
problem, which can be fixed with
$sum = eval join('+', map { $_ || () } @a) ;
MAJ
----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com>
To: "Chris Fields" <cjfields at illinois.edu>
Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
Sent: Tuesday, May 12, 2009 9:17 AM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' fromSearchIO?)
2009/5/12 Chris Fields <cjfields at illinois.edu>:
Fixed that in svn. We're all still learning the ropes...
In that case, I'm seeing multiple instances of...
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 256
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 412
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 429
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 465
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 473
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 494
Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line 502
Hmm... I was about to go on to complain about the weird GFF that I was
seeing, but suddenly it looks OK. My bioperl install must think your
standing over my shoulder and is therefore behaving itself!
Thanks again for all the help,
Dan.
chris
On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
Unfortunately bp_search2gff.pl is giving me errors:
bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f
blasttable -o BlastResults/blast_table_filtered.gff -t hit
--match --target --component
--------------------- WARNING ---------------------
MSG: Removing score value(s)
---------------------------------------------------
Can't locate object method "remove_tags" via package
"Bio::SeqFeature::Similarity" at
/local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line
393, <GEN1> line 5.
I'm just learning the ropes...
--- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
15:25:55.000000000 +0100
+++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
11:52:41.000000000 +0100
@@ -390,7 +390,7 @@
}
if ($self->has_tag('score')) {
$self->warn("Removing score value(s)");
- $self->remove_tags('score');
+ $self->remove_tag('score');
}
$self->add_tag_value('score',$value);
}
Anyone seen this before?
Cheers,
Dan.
2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
Thanks for the info guys, I think I was naively hoping that the
feature would know how to cast itself as a 'SeqFeature' (GFF).
I think I understand the problem better now, so I'll try to summarise:
There is no standard way to encode a HSP as a feature (not least
because there are two choices about which sequence (query or the hit)
it should be attached to). BioPerl will try, but the result will not
be "well structured" SeqFeatures or "well formed" GFF.
From what I read I guess it should be possible to standardize this
mapping (based on something in one of the examples or the 'search2gff'
script), assuming you specify weather you want features put on the
query or on the hit.
At some point last year I was trying out the bp_search2gff.pl and my
own code to write a GFF file for loading and viewing by Gbrowse. At
that time I gave up, as nothing seemed to be working. I was hoping
that doing this at a lower level (i.e. never writing any GFF myself)
it would stand a better chance of working.
Also I was thinking that Gbrowse, if given a SeqFeature::Store, could
autoconfigure its interface to some degree. I guess its back to the
docs ;-)
I'll keep trying and see if I can get anywhere.
Thanks again,
Dan.
References for the above:
2009/5/11 Jason Stajich <jason at bioperl.org>:
otherwise you need to be converting the HSPs into seqfeatures with the
right associated information (i.e. the tag/value pairs that are in the 9th
column) in order to have well structured data in the database.
You can get the individual features from the feature pair with
$hsp->query or $hsp->hit which can also be passed to a GFF writer (or call
$hsp->hit->gff_string). Note that since the data storage is not structured
in a GFF3 like-way this won't immediately produce well formed GFF3 for the
9th column.
2009/5/11 Chris Fields <cjfields at illinois.edu>:
The main problem is the mapping is subjective based on what your
reference sequence is within the BLAST run (e.g. whether it is the query or
the hit), and is something that can't be automatically discerned. I ended
up rolling my own with SeqFeature::Store (just mapped the relevant data to
Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts
to integrate my changes in, just haven't had the time
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Jason Stajich
jason at bioperl.org
More information about the Bioperl-l
mailing list