[Bioperl-l] Problem Parsing BLAST output to annotate FASTA file

Fields, Christopher J cjfields at illinois.edu
Wed Feb 20 18:24:58 UTC 2013


If this is meant to be something done using the same FASTA files for a bunch of BLAST reports, might be worth setting up a flat file index and using that to look up and grab the sequences; it should be a LOT faster, just the first pass (generation of the initial index) would take a little time.  Look at Bio::DB::Fasta for an example.

chris

On Feb 20, 2013, at 11:00 AM, Andreas Leimbach <andreas.leimbach at uni-wuerzburg.de>
 wrote:

> Hey Ann,
> 
> damn, it 's not my best day ... Anyways, I wouldn't work with List::MoreUtils's each_array function, as this assumes that the blast hits and the nucleotide queries are in the same order (as Adam pointed out). Rather use a hash which associates a key to a certain value. Also, the hash can be used to skip sequences that have no hits.
> Here's my new version:
> 
> my %hits;
> my $hit_desc;
> my $search_in = Bio::SearchIO->new(-format => 'blastxml', -file =>
> "$ARGV[0]");
> while (my $result = $search_in->next_result) {
> while (my $hit = $result->next_hit) {
> while (my $hsp = $hit->next_hsp) {
> $hits{$result->query_description} = $hit->description; # hash: associate query_desc (key) with hit_desc (value)
> last; # jump out of the while loop; this should resolve getting only the first hit
> }
> last; # see above
> }
> }
> 
> 
> my $seqio = Bio::SeqIO->new(-format => 'fasta', -file => "$ARGV[1]");
> while (my $seqobj = $seqio->next_seq) {
> if ($hits{$seqobj->display_id}) { # only true if display_id associated with hit_desc and should skip seqs without hits
> print ">$hits{$seqobj->display_id}\n";
> my $nuc = $seqobj->seq();
> print $nuc, "\n";
> }
> }
> 
> Cheers,
> Andreas
> 
> P.S.: I redirected your mail to the BioPerl mailing list, others might profit from my mistakes ;-) ...
> 
> --
> Andreas Leimbach
> Universität Münster
> Institut für Hygiene
> Mendelstr. 7
> D-48149 Münster
> Germany
> 
> Tel.: +49 (0)551 39 3843
> E-Mail: andreas.leimbach at uni-wuerzburg.de
> 
> On 20.2.13 17:35, Ann Gregory wrote:
>> Hi Andreas,
>> 
>> Thanks for you help! I don't understand how this gets the first blast hit:
>> 
>> if ($hit->description eq $hit_desc) { # Only want the first blast hit
>> next;
>> }
>> 
>> I tried this and seems to be working...but I can't get the 1st blast hit
>> or skip the sequences that had no hits. Do you know any quick fixes?
>> 
>> *
>> use warnings;
>> use strict;
>> use Bio::SearchIO;
>> use Bio::SeqIO;
>> use List::MoreUtils qw(each_array);
>> 
>> my $search_in = Bio::SearchIO->new(-format => 'blastxml', -file =>
>> "$ARGV[0]");
>> my @ids;
>> while (my $result = $search_in->next_result) {
>> while (my $hit = $result->next_hit) {
>> while (my $hsp = $hit->next_hsp) {
>> my $match = $result->num_hits;
>> push(@ids, $qd);
>> }
>> }
>> }
>> }
>> 
>> my $seqio = Bio::SeqIO->new(-format => 'fasta', -file => "$ARGV[1]");
>> my @seqs;
>> while (my $seqobj = $seqio->next_seq) {
>> my $nuc = $seqobj->seq();
>> push(@seqs, $nuc);
>> }
>> 
>> my $it = each_array(@ids, at seqs);
>> while(my($ids,$seqs)=$it->()){
>> print $ids, "\n", $seqs, "\n";
>> }
>> *
>> 
>> Thanks again!
>> ~Ann
>> 
>> On Wed, Feb 20, 2013 at 9:24 AM, Andreas Leimbach
>> <andreas.leimbach at uni-wuerzburg.de
>> <mailto:andreas.leimbach at uni-wuerzburg.de>> wrote:
>> 
>>    oops, I just realized I had one loop to much in there. Adam is
>>    correct. Sorry.
>> 
>>    The last part of the code I send you should look like this:
>> 
>> 
>>    my $seqio = Bio::SeqIO->new(-format => 'fasta', -file => "$ARGV[1]");
>>    while (my $seqobj = $seqio->next_seq) {
>>    print ">$hits{$seqobj->display_id}\__n";
>> 
>>    my $nuc = $seqobj->seq();
>>    print $nuc, "\n";
>>    }
>> 
>> 
>>    Cheers,
>>    Andreas
>> 
>> 
>>    --
>>    Andreas Leimbach
>>    Universität Münster
>>    Institut für Hygiene
>>    Mendelstr. 7
>>    D-48149 Münster
>>    Germany
>> 
>>    Tel.: +49 (0)551 39 3843 <tel:%2B49%20%280%29551%2039%203843>
>>    E-Mail: andreas.leimbach at uni-__wuerzburg.de
>>    <mailto:andreas.leimbach at uni-wuerzburg.de>
>> 
>>    On 20.2.13 06:20, Ann Gregory wrote:
>> 
>>        Hi BioPerl,
>> 
>>        I am having issues with a BioPerl script. I have a blastxml file
>>        from a
>>        blastx blast and the original multifasta file containing the
>>        original
>>        nucleotides sequences.
>> 
>>        I want to take the blast result (ie. the blast description) and
>>        annotate my
>>        multifasta file.
>> 
>>        I have written 2 while loops that extract the blast descriptions
>>        as well as
>>        the nucleotide sequence from the multifasta file.
>> 
>>        My problem is that I cannot incorporate one of the while loops
>>        into the
>>        other without loosing the loop property of one of the loops. I
>>        would like
>>        to take the 1st blast description, then the 1st nucleotide
>>        sequence, then
>>        the 2nd blast description, then the 2nd nucleotide sequence and so
>>        on...just can figure out how to alternate the results.
>> 
>>        See script below:
>> 
>> 
>>        use warnings;
>>        use strict;
>>        use Bio::SearchIO;
>>        use Bio::SeqIO;
>> 
>> 
>>        my $search_in = Bio::SearchIO->new(-format => 'blastxml', -file =>
>>        "$ARGV[0]");
>>        while (my $result = $search_in->next_result) {
>>        while (my $hit = $result->next_hit) {
>>        while (my $hsp = $hit->next_hsp) {
>>        my $qd = $hit->description;
>>        print $qd, "\n";
>>        }
>>        }
>>        }
>> 
>>        my $seqio = Bio::SeqIO->new(-format => 'fasta', -file =>
>>        "$ARGV[1]");
>>        while (my $seqobj = $seqio->next_seq) {
>>        my $nuc = $seqobj->seq();
>>        print $nuc, "\n";
>>        }--
>>        Ann (Nina) Gregory
>>        Graduate Student
>>        Rich Lab / Sullivan Lab
>>        Soil, Water, Environmental Science Department
>>        University of Arizona
>>        _________________________________________________
>>        Bioperl-l mailing list
>>        Bioperl-l at lists.open-bio.org <mailto:Bioperl-l at lists.open-bio.org>
>>        http://lists.open-bio.org/__mailman/listinfo/bioperl-l
>>        <http://lists.open-bio.org/mailman/listinfo/bioperl-l>
>> 
>> 
>> 
>> 
>> --
>> Ann (Nina) Gregory
>> Graduate Student
>> Rich Lab / Sullivan Lab
>> Soil, Water, Environmental Science Department
>> University of Arizona
>> 
>> 
>> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l





More information about the Bioperl-l mailing list