[Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)
Mark A. Jensen
maj at fortinbras.us
Thu May 14 10:07:45 EDT 2009
I'll patch in the fix with a cluck that these are undefined, just so's it
doesn't slip
through Bioperl's fingers (if no objections in the next 10min).
----- Original Message -----
From: "Jason Stajich" <jason at bioperl.org>
To: "Mark A. Jensen" <maj at fortinbras.us>; "Dan Bolser" <dan.bolser at gmail.com>
Cc: "Chris Fields" <cjfields at illinois.edu>; "BioPerl List"
<bioperl-l at lists.open-bio.org>
Sent: Wednesday, May 13, 2009 2:34 AM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)
> Not looking at the code so I can't be sure, but I think since there is no
> sequence in the blasttable format and I think %cons or %id is calculated from
> some of the numbers that are taken from underlying sequence data in the
> alignment. I don't know why that would trip up tile_hsps necessarily but
> maybe that's not where the errors are coming from.
>
> I've generally solved these conversions with simpler perl scripts rather than
> searchio since I usually want speed for this kind of stuff -- so I convert
> all the data into blast m9 format (FASTA,SSEARCH,WUBLAST, BLAST) and then
> have a simple parser that converts the m9 (without searchio) into the
> appropriate GFF with or without additional bells and whistles.
>
> Dan ---
> I am slowly cleaning up and migrating more of the pipelines that we use to
> generate polished GFF for loading into Gbrowse. You should just set a goal
> of making good GFF3 and many of the visualization and query issues go away.
>
> For genomes I tend to have a separate scaffold file which is just the 1 line
> per chromosome/scaffold/contig listing.
> Then a file for each analysis ie. Protein to genome BLAST, EST to genome,
> remapped Pfam domains to genome coordinates, gene predictions, etc. Keep it
> simple at first you know what you've updated when something stops working the
> way you expect.
>
> -jason
> On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote:
>
>> I dislike my patch, because it doesn't get to the bottom of why
>> data members associated with numbers of conserved sites return
>> from eval's undefined; it seems clear from the code that this is
>> unexpected. Please excuse my naivete--why would this happen
>> only in the blasttable format, and why hasn't this thing clucked before?
>> ----- Original Message -----
>> From: Jason Stajich
>> To: Mark A. Jensen
>> Cc: Chris Fields ; BioPerl List ; Dan Bolser
>> Sent: Tuesday, May 12, 2009 7:07 PM
>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
>> 'features'fromSearchIO?)
>>
>>
>> I really don't think tile_hsps should be used on BLAST data folks, it is a
>> pretty blind approach.
>> If you really want the right answer you need to do -links with WU- BLAST or
>> FASTA.
>> Been discussed a few times on the mailing list.
>>
>>
>> Good to fix the code bug I guess to avoid the warnings, but unless you are
>> going to walk through all the HSPs and extract the consistent paths wrt
>> query I think you'll have loops, etc in there which will make
>> Hit->percent_id non-accurate.
>>
>>
>> -jason
>>
>>
>> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>>
>>
>> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on Dan's
>> data, but something deeper may be going on.
>> Also get many of the following warnings, haven't looked at it closely:
>>
>> --------------------- WARNING ---------------------
>> MSG: Removing score value(s)
>> ---------------------------------------------------
>>
>> PATCH:
>>
>> Index: SearchUtils.pm
>> ===================================================================
>> --- SearchUtils.pm (revision 15674)
>> +++ SearchUtils.pm (working copy)
>> @@ -252,8 +252,8 @@
>> }
>>
>> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
>> >{'stop'} - $_->{'start'} + 1;
>> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
>> >{'iden'};
>> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
>> >{'cons'};
>> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
>> >{'iden'} || 0;
>> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
>> >{'cons'} || 0;
>> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
>> }
>>
>> @@ -407,9 +407,12 @@
>> };
>> if($@) { warn "\a\n$@\n"; }
>> else {
>> + # make sure numerical
>> + $_->{'iden'} ||= 0;
>> + $_->{'cons'} ||= 0;
>> $_->{'start'} = $start; # Assign a new start coordinate to
>> the contig
>> - $_->{'iden'} += $numID; # and add new data to
>> #identical, #conserved.
>> - $_->{'cons'} += $numCons;
>> + $_->{'iden'} += ($numID||0); # and add new data to
>> #identical, #conserved.
>> + $_->{'cons'} += ($numCons||0);
>> push(@{$_->{hsps}}, $hsp);
>> $overlap = 1;
>> }
>> @@ -424,9 +427,13 @@
>> };
>> if($@) { warn "\a\n$@\n"; }
>> else {
>> + # make sure numerical
>> + $_->{'iden'} ||= 0;
>> + $_->{'cons'} ||= 0;
>> +
>> $_->{'stop'} = $stop; # Assign a new stop coordinate to
>> the contig
>> - $_->{'iden'} += $numID; # and add new data to
>> #identical, #conserved.
>> - $_->{'cons'} += $numCons;
>> + $_->{'iden'} += ($numID||0); # and add new data to
>> #identical, #conserved.
>> + $_->{'cons'} += ($numCons||0);
>> push(@{$_->{hsps}}, $hsp);
>> $overlap = 1;
>> }
>> @@ -461,8 +468,8 @@
>> };
>> if($@) { warn "\a\n$@\n"; }
>> else {
>> - $ids += $these_ids;
>> - $cons += $these_cons;
>> + $ids += ($these_ids||0);
>> + $cons += ($these_cons||0);
>> }
>>
>> last if $hsp_start == $u_start;
>> @@ -490,8 +497,8 @@
>> };
>> if($@) { warn "\a\n$@\n"; }
>> else {
>> - $ids += $these_ids;
>> - $cons += $these_cons;
>> + $ids += ($these_ids||0);
>> + $cons += ($these_cons||0);
>> }
>>
>> last if $hsp_end == $u_stop;
>>
>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>> >
>> To: "Mark A. Jensen" <maj at fortinbras.us>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser"
>> <dan.bolser at gmail.com
>> >
>> Sent: Tuesday, May 12, 2009 10:04 AM
>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
>> 'features'fromSearchIO?)
>>
>>
>>
>> More complicated than that, I'm afraid. We should try to fix that at
>> the source of the problem.
>>
>>
>>
>> This appears to stem from SearchUtils HSP tiling, which in turn
>> utilizes HSPI::matches(), which in turn checks num_identical/ num_conserved.
>> My guess is, since this is blasttable format, one of these isn't set and
>> thus is returning the wrong value. I'll attempt to track it down today,
>> but it may take some time.
>>
>>
>>
>> chris
>>
>>
>>
>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>
>>
>>
>> This sounds like a
>>
>>
>>
>> $sum = eval join( '+', @a);
>>
>>
>>
>> problem, which can be fixed with
>>
>>
>>
>> $sum = eval join('+', map { $_ || () } @a) ;
>>
>>
>>
>> MAJ
>>
>> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
>> >
>>
>> To: "Chris Fields" <cjfields at illinois.edu>
>>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>
>> Sent: Tuesday, May 12, 2009 9:17 AM
>>
>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'
>> fromSearchIO?)
>>
>>
>>
>>
>>
>> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>
>> Fixed that in svn. We're all still learning the ropes...
>>
>>
>>
>> In that case, I'm seeing multiple instances of...
>>
>>
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 256
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 412
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 429
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 465
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 473
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 494
>>
>> Argument "" isn't numeric in addition (+) at Bio/Search/
>> SearchUtils.pm line 502
>>
>>
>>
>>
>>
>> Hmm... I was about to go on to complain about the weird GFF that I
>> was
>>
>> seeing, but suddenly it looks OK. My bioperl install must think your
>>
>> standing over my shoulder and is therefore behaving itself!
>>
>>
>>
>>
>>
>> Thanks again for all the help,
>>
>> Dan.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> chris
>>
>>
>>
>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>
>>
>>
>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>
>>
>>
>> Unfortunately bp_search2gff.pl is giving me errors:
>>
>>
>>
>> bp_search2gff.pl --version 3 -i BlastResults/
>> blast_table_filtered -f
>>
>> blasttable -o BlastResults/blast_table_filtered.gff -t hit
>>
>> --match --target --component
>>
>>
>>
>> --------------------- WARNING ---------------------
>>
>> MSG: Removing score value(s)
>>
>> ---------------------------------------------------
>>
>> Can't locate object method "remove_tags" via package
>>
>> "Bio::SeqFeature::Similarity" at
>>
>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/
>> Generic.pm line
>>
>> 393, <GEN1> line 5.
>>
>>
>>
>>
>>
>> I'm just learning the ropes...
>>
>>
>>
>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11
>>
>> 15:25:55.000000000 +0100
>>
>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>
>> 11:52:41.000000000 +0100
>>
>> @@ -390,7 +390,7 @@
>>
>> }
>>
>> if ($self->has_tag('score')) {
>>
>> $self->warn("Removing score value(s)");
>>
>> - $self->remove_tags('score');
>>
>> + $self->remove_tag('score');
>>
>> }
>>
>> $self->add_tag_value('score',$value);
>>
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Anyone seen this before?
>>
>>
>>
>> Cheers,
>>
>> Dan.
>>
>>
>>
>>
>>
>>
>>
>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>
>>
>>
>> Thanks for the info guys, I think I was naively hoping that
>> the
>>
>> feature would know how to cast itself as a 'SeqFeature'
>> (GFF).
>>
>>
>>
>> I think I understand the problem better now, so I'll try to
>> summarise:
>>
>>
>>
>> There is no standard way to encode a HSP as a feature (not
>> least
>>
>> because there are two choices about which sequence (query or
>> the hit)
>>
>> it should be attached to). BioPerl will try, but the result
>> will not
>>
>> be "well structured" SeqFeatures or "well formed" GFF.
>>
>>
>>
>>
>>
>> From what I read I guess it should be possible to standardize
>> this
>>
>> mapping (based on something in one of the examples or the
>> 'search2gff'
>>
>> script), assuming you specify weather you want features put
>> on the
>>
>> query or on the hit.
>>
>>
>>
>> At some point last year I was trying out the bp_search2gff.pl
>> and my
>>
>> own code to write a GFF file for loading and viewing by
>> Gbrowse. At
>>
>> that time I gave up, as nothing seemed to be working. I was
>> hoping
>>
>> that doing this at a lower level (i.e. never writing any GFF
>> myself)
>>
>> it would stand a better chance of working.
>>
>>
>>
>> Also I was thinking that Gbrowse, if given a
>> SeqFeature::Store, could
>>
>> autoconfigure its interface to some degree. I guess its back
>> to the
>>
>> docs ;-)
>>
>>
>>
>>
>>
>>
>>
>> I'll keep trying and see if I can get anywhere.
>>
>>
>>
>> Thanks again,
>>
>> Dan.
>>
>>
>>
>>
>>
>>
>>
>> References for the above:
>>
>>
>>
>> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>>
>>
>>
>> otherwise you need to be converting the HSPs into
>> seqfeatures with the
>>
>> right associated information (i.e. the tag/value pairs that
>> are in the 9th
>>
>> column) in order to have well structured data in the
>> database.
>>
>>
>>
>> You can get the individual features from the feature pair
>> with
>>
>> $hsp->query or $hsp->hit which can also be passed to a GFF
>> writer (or call
>>
>> $hsp->hit->gff_string). Note that since the data storage is
>> not structured
>>
>> in a GFF3 like-way this won't immediately produce well
>> formed GFF3 for the
>>
>> 9th column.
>>
>>
>>
>>
>>
>> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>
>>
>>
>> The main problem is the mapping is subjective based on what
>> your
>>
>> reference sequence is within the BLAST run (e.g. whether it
>> is the query or
>>
>> the hit), and is something that can't be automatically
>> discerned. I ended
>>
>> up rolling my own with SeqFeature::Store (just mapped the
>> relevant data to
>>
>> Bio::DB::SeqFeatures), but I have long wanted to fix up the
>> relevant scripts
>>
>> to integrate my changes in, just haven't had the time
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>>
>> Bioperl-l mailing list
>>
>> Bioperl-l at lists.open-bio.org
>>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>>
>> Bioperl-l mailing list
>>
>> Bioperl-l at lists.open-bio.org
>>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>>
>> Bioperl-l mailing list
>>
>> Bioperl-l at lists.open-bio.org
>>
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>> Jason Stajich
>> jason at bioperl.org
>>
>>
>>
>>
>>
>>
>>
>
> Jason Stajich
> jason at bioperl.org
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list