[Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?)
Chris Fields
cjfields at illinois.edu
Thu May 14 10:14:54 EDT 2009
Go for it (as long as it passes regression tests).
chris
On May 14, 2009, at 9:07 AM, Mark A. Jensen wrote:
> I'll patch in the fix with a cluck that these are undefined, just
> so's it doesn't slip
> through Bioperl's fingers (if no objections in the next 10min).
> ----- Original Message ----- From: "Jason Stajich" <jason at bioperl.org>
> To: "Mark A. Jensen" <maj at fortinbras.us>; "Dan Bolser" <dan.bolser at gmail.com
> >
> Cc: "Chris Fields" <cjfields at illinois.edu>; "BioPerl List" <bioperl-l at lists.open-bio.org
> >
> Sent: Wednesday, May 13, 2009 2:34 AM
> Subject: Re: [Bioperl-l] SearchIO to GFF (was:
> Getting'features'fromSearchIO?)
>
>
>> Not looking at the code so I can't be sure, but I think since there
>> is no sequence in the blasttable format and I think %cons or %id
>> is calculated from some of the numbers that are taken from
>> underlying sequence data in the alignment. I don't know why that
>> would trip up tile_hsps necessarily but maybe that's not where the
>> errors are coming from.
>>
>> I've generally solved these conversions with simpler perl scripts
>> rather than searchio since I usually want speed for this kind of
>> stuff -- so I convert all the data into blast m9 format
>> (FASTA,SSEARCH,WUBLAST, BLAST) and then have a simple parser that
>> converts the m9 (without searchio) into the appropriate GFF with
>> or without additional bells and whistles.
>>
>> Dan ---
>> I am slowly cleaning up and migrating more of the pipelines that
>> we use to generate polished GFF for loading into Gbrowse. You
>> should just set a goal of making good GFF3 and many of the
>> visualization and query issues go away.
>>
>> For genomes I tend to have a separate scaffold file which is just
>> the 1 line per chromosome/scaffold/contig listing.
>> Then a file for each analysis ie. Protein to genome BLAST, EST to
>> genome, remapped Pfam domains to genome coordinates, gene
>> predictions, etc. Keep it simple at first you know what you've
>> updated when something stops working the way you expect.
>>
>> -jason
>> On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote:
>>
>>> I dislike my patch, because it doesn't get to the bottom of why
>>> data members associated with numbers of conserved sites return
>>> from eval's undefined; it seems clear from the code that this is
>>> unexpected. Please excuse my naivete--why would this happen
>>> only in the blasttable format, and why hasn't this thing clucked
>>> before?
>>> ----- Original Message -----
>>> From: Jason Stajich
>>> To: Mark A. Jensen
>>> Cc: Chris Fields ; BioPerl List ; Dan Bolser
>>> Sent: Tuesday, May 12, 2009 7:07 PM
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
>>> 'features'fromSearchIO?)
>>>
>>>
>>> I really don't think tile_hsps should be used on BLAST data
>>> folks, it is a pretty blind approach.
>>> If you really want the right answer you need to do -links with WU-
>>> BLAST or FASTA.
>>> Been discussed a few times on the mailing list.
>>>
>>>
>>> Good to fix the code bug I guess to avoid the warnings, but
>>> unless you are going to walk through all the HSPs and extract
>>> the consistent paths wrt query I think you'll have loops, etc in
>>> there which will make Hit->percent_id non-accurate.
>>>
>>>
>>> -jason
>>>
>>>
>>> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote:
>>>
>>>
>>> Patch below (to SearchUtils.pm) fixes the non-numeric warnings
>>> on Dan's data, but something deeper may be going on.
>>> Also get many of the following warnings, haven't looked at it
>>> closely:
>>>
>>> --------------------- WARNING ---------------------
>>> MSG: Removing score value(s)
>>> ---------------------------------------------------
>>>
>>> PATCH:
>>>
>>> Index: SearchUtils.pm
>>>
>>> ===================================================================
>>> --- SearchUtils.pm (revision 15674)
>>> +++ SearchUtils.pm (working copy)
>>> @@ -252,8 +252,8 @@
>>> }
>>>
>>> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_-
>>> >{'stop'} - $_->{'start'} + 1;
>>> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
>>> >{'iden'};
>>> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
>>> >{'cons'};
>>> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_-
>>> >{'iden'} || 0;
>>> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_-
>>> >{'cons'} || 0;
>>> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand;
>>> }
>>>
>>> @@ -407,9 +407,12 @@
>>> };
>>> if($@) { warn "\a\n$@\n"; }
>>> else {
>>> + # make sure numerical
>>> + $_->{'iden'} ||= 0;
>>> + $_->{'cons'} ||= 0;
>>> $_->{'start'} = $start; # Assign a new start
>>> coordinate to the contig
>>> - $_->{'iden'} += $numID; # and add new data to
>>> #identical, #conserved.
>>> - $_->{'cons'} += $numCons;
>>> + $_->{'iden'} += ($numID||0); # and add new
>>> data to #identical, #conserved.
>>> + $_->{'cons'} += ($numCons||0);
>>> push(@{$_->{hsps}}, $hsp);
>>> $overlap = 1;
>>> }
>>> @@ -424,9 +427,13 @@
>>> };
>>> if($@) { warn "\a\n$@\n"; }
>>> else {
>>> + # make sure numerical
>>> + $_->{'iden'} ||= 0;
>>> + $_->{'cons'} ||= 0;
>>> +
>>> $_->{'stop'} = $stop; # Assign a new stop
>>> coordinate to the contig
>>> - $_->{'iden'} += $numID; # and add new data to
>>> #identical, #conserved.
>>> - $_->{'cons'} += $numCons;
>>> + $_->{'iden'} += ($numID||0); # and add new
>>> data to #identical, #conserved.
>>> + $_->{'cons'} += ($numCons||0);
>>> push(@{$_->{hsps}}, $hsp);
>>> $overlap = 1;
>>> }
>>> @@ -461,8 +468,8 @@
>>> };
>>> if($@) { warn "\a\n$@\n"; }
>>> else {
>>> - $ids += $these_ids;
>>> - $cons += $these_cons;
>>> + $ids += ($these_ids||0);
>>> + $cons += ($these_cons||0);
>>> }
>>>
>>> last if $hsp_start == $u_start;
>>> @@ -490,8 +497,8 @@
>>> };
>>> if($@) { warn "\a\n$@\n"; }
>>> else {
>>> - $ids += $these_ids;
>>> - $cons += $these_cons;
>>> + $ids += ($these_ids||0);
>>> + $cons += ($these_cons||0);
>>> }
>>>
>>> last if $hsp_end == $u_stop;
>>>
>>> ----- Original Message ----- From: "Chris Fields" <cjfields at illinois.edu
>>> >
>>> To: "Mark A. Jensen" <maj at fortinbras.us>
>>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>; "Dan Bolser" <dan.bolser at gmail.com
>>> >
>>> Sent: Tuesday, May 12, 2009 10:04 AM
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
>>> 'features'fromSearchIO?)
>>>
>>>
>>>
>>> More complicated than that, I'm afraid. We should try to fix
>>> that at the source of the problem.
>>>
>>>
>>>
>>> This appears to stem from SearchUtils HSP tiling, which in
>>> turn utilizes HSPI::matches(), which in turn checks
>>> num_identical/ num_conserved. My guess is, since this is
>>> blasttable format, one of these isn't set and thus is returning
>>> the wrong value. I'll attempt to track it down today, but it
>>> may take some time.
>>>
>>>
>>>
>>> chris
>>>
>>>
>>>
>>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote:
>>>
>>>
>>>
>>> This sounds like a
>>>
>>>
>>>
>>> $sum = eval join( '+', @a);
>>>
>>>
>>>
>>> problem, which can be fixed with
>>>
>>>
>>>
>>> $sum = eval join('+', map { $_ || () } @a) ;
>>>
>>>
>>>
>>> MAJ
>>>
>>> ----- Original Message ----- From: "Dan Bolser" <dan.bolser at gmail.com
>>> >
>>>
>>> To: "Chris Fields" <cjfields at illinois.edu>
>>>
>>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>>>
>>> Sent: Tuesday, May 12, 2009 9:17 AM
>>>
>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting
>>> 'features' fromSearchIO?)
>>>
>>>
>>>
>>>
>>>
>>> 2009/5/12 Chris Fields <cjfields at illinois.edu>:
>>>
>>> Fixed that in svn. We're all still learning the ropes...
>>>
>>>
>>>
>>> In that case, I'm seeing multiple instances of...
>>>
>>>
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 256
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 412
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 429
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 465
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 473
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 494
>>>
>>> Argument "" isn't numeric in addition (+) at Bio/Search/
>>> SearchUtils.pm line 502
>>>
>>>
>>>
>>>
>>>
>>> Hmm... I was about to go on to complain about the weird GFF
>>> that I was
>>>
>>> seeing, but suddenly it looks OK. My bioperl install must
>>> think your
>>>
>>> standing over my shoulder and is therefore behaving itself!
>>>
>>>
>>>
>>>
>>>
>>> Thanks again for all the help,
>>>
>>> Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> chris
>>>
>>>
>>>
>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>>>
>>>
>>>
>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>
>>>
>>>
>>> Unfortunately bp_search2gff.pl is giving me errors:
>>>
>>>
>>>
>>> bp_search2gff.pl --version 3 -i BlastResults/
>>> blast_table_filtered -f
>>>
>>> blasttable -o BlastResults/blast_table_filtered.gff -
>>> t hit
>>>
>>> --match --target --component
>>>
>>>
>>>
>>> --------------------- WARNING ---------------------
>>>
>>> MSG: Removing score value(s)
>>>
>>> ---------------------------------------------------
>>>
>>> Can't locate object method "remove_tags" via package
>>>
>>> "Bio::SeqFeature::Similarity" at
>>>
>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/
>>> Generic.pm line
>>>
>>> 393, <GEN1> line 5.
>>>
>>>
>>>
>>>
>>>
>>> I'm just learning the ropes...
>>>
>>>
>>>
>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~
>>> 2009-05-11
>>>
>>> 15:25:55.000000000 +0100
>>>
>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12
>>>
>>> 11:52:41.000000000 +0100
>>>
>>> @@ -390,7 +390,7 @@
>>>
>>> }
>>>
>>> if ($self->has_tag('score')) {
>>>
>>> $self->warn("Removing score value(s)");
>>>
>>> - $self->remove_tags('score');
>>>
>>> + $self->remove_tag('score');
>>>
>>> }
>>>
>>> $self->add_tag_value('score',$value);
>>>
>>> }
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Anyone seen this before?
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2009/5/12 Dan Bolser <dan.bolser at gmail.com>:
>>>
>>>
>>>
>>> Thanks for the info guys, I think I was naively
>>> hoping that the
>>>
>>> feature would know how to cast itself as a
>>> 'SeqFeature' (GFF).
>>>
>>>
>>>
>>> I think I understand the problem better now, so
>>> I'll try to summarise:
>>>
>>>
>>>
>>> There is no standard way to encode a HSP as a
>>> feature (not least
>>>
>>> because there are two choices about which sequence
>>> (query or the hit)
>>>
>>> it should be attached to). BioPerl will try, but
>>> the result will not
>>>
>>> be "well structured" SeqFeatures or "well formed" GFF.
>>>
>>>
>>>
>>>
>>>
>>> From what I read I guess it should be possible to
>>> standardize this
>>>
>>> mapping (based on something in one of the examples
>>> or the 'search2gff'
>>>
>>> script), assuming you specify weather you want
>>> features put on the
>>>
>>> query or on the hit.
>>>
>>>
>>>
>>> At some point last year I was trying out the
>>> bp_search2gff.pl and my
>>>
>>> own code to write a GFF file for loading and
>>> viewing by Gbrowse. At
>>>
>>> that time I gave up, as nothing seemed to be
>>> working. I was hoping
>>>
>>> that doing this at a lower level (i.e. never
>>> writing any GFF myself)
>>>
>>> it would stand a better chance of working.
>>>
>>>
>>>
>>> Also I was thinking that Gbrowse, if given a
>>> SeqFeature::Store, could
>>>
>>> autoconfigure its interface to some degree. I guess
>>> its back to the
>>>
>>> docs ;-)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I'll keep trying and see if I can get anywhere.
>>>
>>>
>>>
>>> Thanks again,
>>>
>>> Dan.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> References for the above:
>>>
>>>
>>>
>>> 2009/5/11 Jason Stajich <jason at bioperl.org>:
>>>
>>>
>>>
>>> otherwise you need to be converting the HSPs into
>>> seqfeatures with the
>>>
>>> right associated information (i.e. the tag/value
>>> pairs that are in the 9th
>>>
>>> column) in order to have well structured data in
>>> the database.
>>>
>>>
>>>
>>> You can get the individual features from the
>>> feature pair with
>>>
>>> $hsp->query or $hsp->hit which can also be passed
>>> to a GFF writer (or call
>>>
>>> $hsp->hit->gff_string). Note that since the data
>>> storage is not structured
>>>
>>> in a GFF3 like-way this won't immediately produce
>>> well formed GFF3 for the
>>>
>>> 9th column.
>>>
>>>
>>>
>>>
>>>
>>> 2009/5/11 Chris Fields <cjfields at illinois.edu>:
>>>
>>>
>>>
>>> The main problem is the mapping is subjective
>>> based on what your
>>>
>>> reference sequence is within the BLAST run (e.g.
>>> whether it is the query or
>>>
>>> the hit), and is something that can't be
>>> automatically discerned. I ended
>>>
>>> up rolling my own with SeqFeature::Store (just
>>> mapped the relevant data to
>>>
>>> Bio::DB::SeqFeatures), but I have long wanted to
>>> fix up the relevant scripts
>>>
>>> to integrate my changes in, just haven't had the
>>> time
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>
>>> Bioperl-l mailing list
>>>
>>> Bioperl-l at lists.open-bio.org
>>>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>
>>> Bioperl-l mailing list
>>>
>>> Bioperl-l at lists.open-bio.org
>>>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>>
>>> Bioperl-l mailing list
>>>
>>> Bioperl-l at lists.open-bio.org
>>>
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>>> Jason Stajich
>>> jason at bioperl.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> Jason Stajich
>> jason at bioperl.org
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list