[Bioperl-l] How to parse BLAST output - all hits of each query in new file

Smithies, Russell Russell.Smithies at agresearch.co.nz
Wed Nov 25 19:07:26 UTC 2009


Hi Tim,
Here's some code for a job I'm working on at the moment that contains all the bits you'll probably need.
It's extracting 2 species-specific databases from nr (based on tax ids), doing a blast, then parsing the results and creating a substitution matrix. I was initially using Bio::DB::Eutilities to query and retrieve sequences but I kept getting errors and time-outs from NCBI when pulling back large numbers of sequences.
It should give you a rough idea of how to run Bio::Tools::Run::StandAloneBlast, Bio::DB::Fasta and Bio::SearchIO.

Email me direct if you want further explaination as it's not well commented ;-)

Russell Smithies 

Bioinformatics Applications Developer 
T +64 3 489 9085 
E  russell.smithies at agresearch.co.nz 

Invermay  Research Centre 
Puddle Alley, 
Mosgiel, 
New Zealand 
T  +64 3 489 3809   
F  +64 3 489 9174  
www.agresearch.co.nz

=======================================

#!/usr/local/bin/perl

use strict;
use Bio::Tools::Run::StandAloneBlast;
use Bio::SeqIO;
use Bio::SearchIO;
use Bio::DB::Fasta;

use Storable;

# Parameters: <query> <subject> <number or percentage of searches>
# Percentage can be specified as either 20p, 20P or 20%
# So for 20% of rice sequences blasted against oil palm:
#    4530 51953 20p   (4530=rice,51953=oil_palm, 20p=20%)
# Or for 20 searches:
#      4530 51953 20
#
my ( $q, $s, $c ) = @ARGV;

my $nr = "/data/databases/flatfile/illuminati_blastdata/nr";
my $tax_file = "/data/anonftp/pub/mirror/taxonomy/gi_taxid_prot.dmp.gz";
my $tmp = "/tmp/tax";


my %stats      = ();
my $total_subs = 0;

my $min_hsp_len      = 0;
my $min_hsp_identity = 0;
my $num_searches     = $c || 10;
my $blast_e          = '1e-6';
my $count            = 0;

# check if all the fasta and blast files exist
# if not, extract new fasta and re-formatdb the database
foreach my $t ( $q, $s ) {
  foreach ( map { "$tmp/$t.$_" } qw(faa list phr pin psq) ) {
    unless ( -e $_ ) {
      print "Creating database for $t\n";
      &create_database($t);
      last;
    }
  }
}

my @params = (
               -database => "$tmp/$q",
               -program  => 'blastp',
               -e        => $blast_e,
               -outfile  => "$tmp/blast.out",
               -v        => '1',
               -b        => '1'
);
my $factory = Bio::Tools::Run::StandAloneBlast->new(@params) or die $!;

# load the query sequences into a db
# makes it easier to randomly access them
my $db = Bio::DB::Fasta->new( "$tmp", -glob => "$s.faa", -reindex => 1 );

my @ids      = $db->ids;
my $id_count = $#ids;
exit "No sequences\n" unless $id_count;

# if a percentage is requested, calculate
# the required number of searches
if ( $num_searches =~ m/(\d+)[pP%]/ ) {
  $num_searches = int( ( $1 / 100 ) * $id_count );
  warn
"Searching random $1 percent ($num_searches) of $id_count sequences from taxid $q\n";
}

my $summary_file = "$tmp/".$$."_summary.txt";
open( OUT, ">", $summary_file ) or die $!;
print OUT
"#Summary of $num_searches random blast searches from taxid $q against taxid $s.\n";
print OUT "#Parameters used were:\n";
print OUT "#blast_e: $blast_e\n";
print OUT "#min_hsp_len: $min_hsp_len\n";
print OUT "#min_hsp_identity: $min_hsp_identity\n";
print OUT "\n";

while ( my $seq = $db->get_Seq_by_id( $ids[ rand($#ids) ] ) ) {
  next unless $seq;

  warn "Processing ", $seq->id, "\n";
  eval {
    my $blast_report = $factory->blastall($seq);
    sleep 5;
  };

  my $blast_in = new Bio::SearchIO( -format => "blast", -file => "$tmp/blast.out" );

  while ( my $result = $blast_in->next_result ) {
    if ( $result->num_hits <= 0 ) {
      warn "No hits for ", $result->query_accession, "\n";
      print OUT "No hits for ", $result->query_accession, "\n";
      next;
    }
    $count++;
    while ( my $hit = $result->next_hit ) {
      while ( my $hsp = $hit->next_hsp ) {
        warn sprintf( "%s had %s hsp%s\n",
                      $result->query_accession, $hit->num_hsps,
                      $hit->num_hsps > 1 ? "s" : "" );
        print OUT sprintf( "%s had %s hsp%s\n",
                      $result->query_accession, $hit->num_hsps,
                      $hit->num_hsps > 1 ? "s" : "" );

        # http://www.bioperl.org/wiki/HOWTO:SearchIO#Table_of_Methods
        if ( $hsp->length('total') > $min_hsp_len ) {
          if ( $hsp->percent_identity >= $min_hsp_identity ) {
            my @query_string = split '', $hsp->query_string;
            my @homol_string = split '', $hsp->homology_string;
            my @hit_string   = split '', $hsp->hit_string;
            for ( my $i = 0; $i < $#query_string; $i++ ) {
              next unless $homol_string[$i] =~ /\+/;
              $stats{ $query_string[$i] }{ $hit_string[$i] }++;
              $total_subs++;
            }
          }
        }
      }
    }
  }
  unlink '$tmp/blast.out' if -e '$tmp/blast.out';
  last if $count >= $num_searches;
}



# create summary frequency list
my %summary = ();
for my $query ( keys %stats ) {
  for my $hit ( keys %{ $stats{$query} } ) {
    $summary{"$query->$hit"} =
      sprintf( "%6f", $stats{$query}{$hit} / $total_subs );
  }
}

print OUT "\n";

# sort by decending frequencies and print to summary file
foreach my $k ( sort { $summary{$b} <=> $summary{$a} } keys %summary ) {
  print OUT "$k\t", $summary{$k}, "\n" unless $k =~ /TOTAL/;
}

print OUT "\n\n";

# print substitution matrix
my $i     = 0;
my @prots = qw(A R N D C Q E G H I L K M F P S T W Y V);
my $sep   = "\t";

print OUT sprintf( "%7s %s", $_, $sep ) foreach ( "       ", @prots );
print OUT "\n";

foreach my $x (@prots) {
  print OUT sprintf( "%7s|%s", $prots[ $i++ ], $sep );
  foreach my $y (@prots) {
    my $val =
      defined( $stats{$x}{$y} )
      ? sprintf( "%0.6f", $stats{$x}{$y} / $total_subs )
      : "--------";
    print OUT sprintf( "%s%s", $val, $sep );
  }
  print OUT "\n";
}
close OUT;


open(IN, $summary_file) or die $!;
print $_ while(<IN>);
close IN;


# extract sequences from nr database based on taxid.
sub create_database {
  my $txid      = shift;
  my %hash      = ();
  my $gi_stored = "/tmp/gi.dat";

  if ( -e $gi_stored ) {
    %hash = %{ retrieve($gi_stored) };
  }
  else {
    open( TXID, "zcat $tax_file | " ) or die $!;
    while (<TXID>) {
      chomp;
      my ( $gi, $tx ) = split( "\t", $_ );
      push( @{ $hash{$tx} }, $gi );
    }
    close TXID;

    store( \%hash, $gi_stored );
  }

  my $txlist = "$tmp/$txid.list";
  my $txseq  = "$tmp/$txid.faa";
	
	die "No sequences found for taxid $txid\n" unless defined( @{ $hash{$txid} });
	my $num_seqs =  scalar( @{ $hash{$txid} });
	warn "Found $num_seqs sequences for taxid $txid in $tax_file\n";

  open OUT, ">", $txlist or die $!;
  print OUT "$_\n" foreach ( @{ $hash{$txid} } );
  close OUT;

  my $cmd = "fastacmd -d $nr -i $txlist -t T -o $txseq 2>/dev/null";
  system $cmd;

  my $count = `grep -c '>' $txseq`;
  $count =~ s/\n//;
	warn "Could only extract $count sequences from $nr\n";

  $cmd = "formatdb -p T -i $tmp/$txid.faa -n $tmp/$txid -l $tmp/formatdb.log";
  system $cmd;

  $cmd = "fastacmd -d $tmp/$txid -I";
  system $cmd;

  warn "Check the formatdb.log for any errors\n";
}


=======================================






> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Tim
> Sent: Thursday, 26 November 2009 6:41 a.m.
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] How to parse BLAST output - all hits of each query in
> new file
> 
> Dear bioperl users,
> 
> I am a real newbie and have - maybe a very trivial - question.
> 
> I searched the mailing list archive and many howtos but I have not found
> a concrete answer to my problem. So hopefully you can help me :)
> 
> Background: I use the latest Bioperl version (installed it two weeks
> before).
> When I use Bio::Tools::Run::StandAloneBlast to BLAST one fasta file
> including different sequences, I get a BLAST output with many queries
> each having several hits / sbjcts.
> 
> My problem is how to parse *all* hits of *one* query into a single new
> file. And this for all the queries I have in my BLAST output file.
> 
> Or is it better the other way round; first to make fasta files with only
> single sequences inside and BLAST each file? But how can I automize that
> using Bioperl?
> 
> I tried Bio::SearchIO but can only parse all queries and their
> respective hits in only one file...
> I think iteration is also necessary here, but I do not really know how
> to include that into Bio::SearchIO.
> Or do I have to use Module:Bio::Index::Blast?
> 
> I can index a file (see below), but I have no idea what comes next...
> 
> ###How I index a file...
> 
> #!/usr/bin/perl -w
> 
> $ENV{BIOPERL_INDEX_TYPE} = "SDBM_File";
> 
> use Bio::Index::Fasta;
> 
> 
> $file_name = "8_to_BLAST_two_seq_index.fasta";
> $id = "48882";
> $inx = Bio::Index::Fasta->new (-filename => $file_name . ".idx",
> -write_flag => 1);
> $inx->make_index($file_name);
> 
> 
> Hopefully, you can give me at least hints what to look for.
> 
> A big THANKS in advance!
> 
> Cheers,
> 
> Tim
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================




More information about the Bioperl-l mailing list