[Bioperl-l] Experiences from a newbie

Jason Stajich jason at cgt.duhs.duke.edu
Sat Dec 13 10:14:28 EST 2003


On Fri, 12 Dec 2003, Andreas Bernauer wrote:

> Brian Osborne wrote:
> > Andreas,
> >
> > >As I din't want to look at the actual sequences and only needed the
> > >hits themselves, I told blast to omit the detailed listing of the
> > >alignments.  This turned out to be a mistake, as for some reason that
> > >I can never think of, I can only get HITS in $result, when I include

The reason would be the parser expects a certain subset of the many
possible blast formats.  I have improved the ability of the
SearchIO::blast to read some of the variants of BLAST but I have not tried
every possible parameter combination.  As with all of the modules, we have
developed the tools that work for us, if there are things you would like
to see supported, feature requests with and example report at
bugzilla.open-bio.org is the best way to document what you would like.


If you want just query,hit,evalue,query start, query end..., etc I would
suggest you just use the -m 8 output from NCBI blast and use the
SearchIO::blasttable - it is simpler - BLAST even runs a bit faster, and
will use up less disk space.

If you are using WU-BLAST I typically integrate a script with
SearchIO as a wrapper around a BLAST process to grab this simple stuff
as well.

> > >the alignments (which are accessed via HSP which I never use).  Again
> > >something that I cannot access with my logic.


> >
> > I couldn't understand this. Could you restate this? or tell us exactly what
> > information you wanted to get from the Hit object?
>
> 0Sure:
>
> Let's have a look at this excerpt from a BLAST search result:
>
>
> BLASTP 2.2.1 [Apr-13-2001]
>
> ## header omitted
>
> ## This is the HIT LIST:
>                                                                    Score     E
> Sequences producing significant alignments:                        (bits)  Value
>
> gi|33236978|ref|AAP99047.1| DNA polymerase III beta subunit [ Pr...    63  2e-10
> gi|33860561|ref|NP_892122.1| DNA polymerase III, beta chain [Pro...    63  2e-10
> ## ... etc. etc.   ...
> gi|33237342|ref|AAP99410.1| Septum formation inhibitor [ Prochlo...    31  0.60
>
>
> ## this is the DETAIL LIST:
>
> >gi|33236978|ref|AAP99047.1| DNA polymerase III beta subunit [
>            Prochlorococcus marinus subsp. marinus str. CCMP1375 ]
>           Length = 385
>
>  Score = 62.8 bits (151), Expect = 2e-10
>  Identities = 71/348 (20%), Positives = 153/348 (43%), Gaps = 45/348 (12%)
>
> Query: 38  VKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEE---IITIQEKDQTLLVKTKKTSINLN 94
>            ++ +L+ +I+    G++ V  K+F ++I+ +  E    ++  +  + + +K+K  +  +
> Sbjct: 55  IQTSLSASIE--SSGAITVPSKLFGEIISKLSSESSITLSTDDSSEQVNLKSKSGNYQVR 112
>
> ## etc. etc.
>
> Query: 315 NISFNPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVP 362
>             I+FN   LL+ ++  E+N I   F   +   + T   E     +++P
> Sbjct: 333 QIAFNSRYLLEGLKIIETNTILLKFNAPTTPAIFTPNDETNFVYLVMP 380
>
>
> >gi|33860561|ref|NP_892122.1| DNA polymerase III, beta chain
>            [Prochlorococcus marinus subsp. pastoris str. CCMP1378]
>           Length = 385
>
>  Score = 62.8 bits (151), Expect = 2e-10
>  Identities = 74/323 (22%), Positives = 150/323 (45%), Gaps = 38/323 (11%)
>
> Query: 36  FSVKCNLNKNID--ILEQGSLIVKGKIFNDLINGIKEEI---ITIQEKDQTLLVKTKKTS 90
>            F +   +  + D  + + G++ +  K+ ++++N +  E    + + E    +L+K+ + S
> Sbjct: 49  FDLNLGIQTSFDATVNKSGAITIPSKLLSEIVNKLPSETPVSLDVDESSDNILIKSDRGS 108
>
> ## etc. etc.
>
>
> I only need the gi number of the protein that was hit, the score and
> the evalue.  I don't need the alignment or anything else from the
> DETAIL list.  So I told blast to omit the detail list.  (Idea: I am
> searching through several whole genomes, so this would also save me a
> significant amount of drive space.)  But when I read the blast file
> with the hit list omitted, bioperl didn't report any hits at all,
> although the file still contained the hit list.
>
> I found this counterintuitive, as I expected the HIT object to give me
> the members of the HIT list, and the HSP object to give me more
> information about the alignment, i.e. informations about the members
> of the DETAIL list.
>
> When I told blast, no to omit the DETAIL list, everything worked fine.
>
Because we make connections between this summary list and the details the
parser initially expected to find the details below to 'finish off
building' the Hit objects.  I have since added some code on the main trunk
which I think handles this case okay - I believe - I haven't been
testing that aspect of late.   If you write a test for t/SearchIO.t and
a test report I can make sure it supported in future versions.


But remember - in all of these cases you have LESS information so HSP
objects will not be constructed - you can only call the methods

$hit->name, $hit->description, $hit->significance, $hit->raw_score.

There will be no hit->length as that is not listed in the summary part of
the report.

> As I've written previously, I don't use the HSP object at all
(get_gi > extracts the gi number, $blastId refers to the blast $file):
>
[for the more advanced folks]
You can also use a speedup if you only want
to get hits - where you can attach a different kind of event listener than
the default ResultBuilder - see SearchIO::FastHitEventBuilder which won't
even build HSP objects in the first place for you.


As for what you actually want from the report below as below, I would
strongly suggest you just use -m 8 output from BLAST as you'll just get
exactly what you want in column format and you can either use
SearchIO::blasttable to parse it or write your own that looks something
like this:

 while(<>) {
   my ($qname,$hname, ... ) = split(/\t/,$_);
 }


> use Bio::SearchIO;
> # ...omitted for breviety...
> my $report = new Bio::SearchIO (-format => 'blast',
>       			  -file   => $file);
>
> while (my $result = $report->next_result ) {
>   print "Next ($blastId) result (" . $result->num_hits . " hits)\n";
>   my $hitNo = 0;
>   while (my $hit = $result->next_hit) {
>     $hitNo++;
>     printf "%s\t%s\t%s\t%d\t%d\n",
>       get_gi($result->query_name), get_gi($hit->name),
>           $hit->significance, $blastId, $hitNo;
>   }
> }
>
>
>
>
> Andreas.
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu


More information about the Bioperl-l mailing list