[Bioperl-l] blast output -> blast -m8 output

Amir Karger akarger at CGR.Harvard.edu
Fri Dec 23 11:24:17 EST 2005


I'm writing a script that will take regular blast output and translate it to
blast -m8 tabular form. (The reverse transform won't work without re-doing
the alignments.)
I've attached the blast output for running 3 sequences against month.aa.
Below are the script, the script output, and the blast -m8 output. (Output
is the same for bioperl-1.4 and 1.5-RC1.)

So there are two questions:

1) To get the number of openings, I count /-+/g in $hsp->query_string .
$hsp->hit_string. Is this really the best way to do it? (A $hsp->ngaps
method would be nicer :) 
2) Why does blast -m8 use longer IDs than regular blast? (See second column)
I guess this is an NCBI question, really. Other than the difference in IDs
and a couple space characters, the two outputs are the same.
3) Do you think this will work on all blast outputs, or will something
break?

Thanks,
-Amir Karger
---------
Here's the script:
#!/usr/local/bin/perl -w

use strict;

use Bio::SearchIO;

my $in = new Bio::SearchIO(-format => 'blast', -file => $ARGV[0]);
while (my $result = $in->next_result ) {
    while (my $hit = $result->next_hit ) {
        while (my $hsp = $hit->next_hsp ) {
            my $pcid = $hsp->percent_identity;
            my $len = $hsp->length('total');
            my $mismatches = $len - $hsp->num_identical -
$hsp->gaps('total');
            # Find runs of '-' in query, hit strings
            my @gap_list = (($hsp->query_string . $hsp->hit_string) =~
/-+/g);
            my $ngaps = @gap_list;
            my @fields = (
                $result->query_name, $hit->name, sprintf("%.2f",$pcid),
                $len, $mismatches, $ngaps,
                $hsp->start('query'), $hsp->end('query'),
                $hsp->start('hit'), $hsp->end('hit'),
                $hsp->evalue, $hsp->bits
            );
            print join("\t", @fields), "\n";
        }
    }
}

----------
Here's the output from running perl change_blast_to_tab.pl seqs.blp

Bacteriophage_1[M19348] ref|NP_037061.1|        40.32   62      33      1
28      89      1050    1107    6e-05   46.6
Bacteriophage_1[M19348] ref|XP_193814.5|        48.89   45      17      1
57      95      320     364     0.001   42.7
Bacteriophage_1[M19348] ref|XP_912463.1|        48.89   45      17      1
57      95      866     910     0.001   42.7
Bacteriophage_1[M19348] ref|XP_619329.2|        48.89   45      17      1
57      95      676     720     0.001   42.7
C.elegans_1_[Z49071]    ref|XP_917828.1|        29.61   412     242     9
40      410     52      456     6e-43   173
C.elegans_1_[Z49071]    gb|AAI10184.1|  31.99   347     213     11      40
373     53      389     6e-42   169

-----------
Here's the -m8 output.

Bacteriophage_1[M19348] gi|6978677|ref|NP_037061.1|     40.32   62      33
1       28      89      1050    1107    6e-05   46.6
Bacteriophage_1[M19348] gi|82958039|ref|XP_193814.5|    48.89   45      17
1       57      95      320     364     0.001   42.7
Bacteriophage_1[M19348] gi|82958037|ref|XP_912463.1|    48.89   45      17
1       57      95      866     910     0.001   42.7
Bacteriophage_1[M19348] gi|82957449|ref|XP_619329.2|    48.89   45      17
1       57      95      676     720     0.001   42.7
C.elegans_1_[Z49071]    gi|82802536|ref|XP_917828.1|    29.61   412     242
9       40      410     52      456     6e-43    173
C.elegans_1_[Z49071]    gi|82571607|gb|AAI10184.1|      31.99   347     213
11      40      373     53      389     6e-42    169

-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqs.blp
Type: application/octet-stream
Size: 10400 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20051223/a95c381f/seqs-0001.obj


More information about the Bioperl-l mailing list