[Biojava-l] Blast parser?
Simon Brocklehurst
simon.brocklehurst@CambridgeAntibody.com
Wed, 16 Feb 2000 18:41:21 +0000
Forwarded message from Terry (my fault 'cos I replied to him, and not to
the list as well).
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com
Date: Wed, 16 Feb 2000 09:37:03 -0800 (PST)
From: Jeffrey Chang <jchang@SMI.Stanford.EDU>
Sender: jchang@crg-gw.Stanford.EDU
To: Simon Brocklehurst <simon.brocklehurst@cambridgeantibody.com>
cc: biojava-l@biojava.org
Subject: Re: [Biojava-l] status? Blast parser?
Hello all,
On Wed, 16 Feb 2000, Simon Brocklehurst wrote:
>
> It's not hard to add SAX functionality to our systems, and if the
> consensus view of people on the list is that we should go ahead and
> use CAT's code as the basis for the biojava BLAST parser, we will
> definitely implement that fairly quickly. If we get to that point,
> we'll need to agree on some standard meta data conventions and I'll
> post a proposal (e.g. we are keen for this software to work with
> software that has generic BLAST-like output (e.g. HMMER) with the
> minimum of effort, so our proposal would probably reflect that).
The biopython parsers are built around a SAX-like event model similar to
the one you've described. The design discussions are archived in the list
threads from November and December 1999:
http://www.biopython.org/pipermail/biopython/1999-November/thread.html
http://www.biopython.org/pipermail/biopython/1999-December/thread.html
The final design is documented within the CVS tree, but it's relatively
long, so I won't post it here. Basically, it's built around a
Scanner/Consumer model where a Scanner object goes through a stream,
recognizes content, and passes it into a Consumer object that does the
final processing. Then, a Parser object contains both a Scanner and a
Consumer, and thus has the ability to take an input stream and process
it into some final data structure.
It's relatively flexible, as you can substitute different consumers
depending on what kind of data you want.
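To make the idea concrete, here is a minimal sketch of the Scanner/Consumer/Parser split in Python. It is not the actual biopython code: the class names, the `feed` and `parse` methods, and the two-line header format it recognizes are assumptions made up for illustration. Only the event names (`version`, `query_info`) come from the event list in this message.

```python
import io


class HeaderScanner:
    """Hypothetical scanner: walks a stream and fires events on a consumer."""

    def feed(self, handle, consumer):
        consumer.start_header()
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("BLAST"):
                consumer.version(line)
            elif line.startswith("Query="):
                consumer.query_info(line)
        consumer.end_header()


class HeaderRecordConsumer:
    """Builds a dict from the header events (one possible final structure)."""

    def __init__(self):
        self.data = {}

    def start_header(self):
        self.data = {}

    def version(self, line):
        # "BLASTP 2.0.10 [...]" -> keep everything after the program name.
        self.data["version"] = line.split(None, 1)[1]

    def query_info(self, line):
        self.data["query"] = line[len("Query="):].strip()

    def end_header(self):
        pass


class EventLogConsumer:
    """Alternative consumer: just records which events were fired."""

    def __init__(self):
        self.data = []

    def start_header(self):
        self.data.append("start_header")

    def version(self, line):
        self.data.append("version")

    def query_info(self, line):
        self.data.append("query_info")

    def end_header(self):
        self.data.append("end_header")


class Parser:
    """Pairs a scanner with a consumer and returns the consumer's result."""

    def __init__(self, scanner, consumer):
        self.scanner = scanner
        self.consumer = consumer

    def parse(self, handle):
        self.scanner.feed(handle, self.consumer)
        return self.consumer.data


SAMPLE = "BLASTP 2.0.10 [Aug-26-1999]\nQuery= my_seq\n"

record = Parser(HeaderScanner(), HeaderRecordConsumer()).parse(io.StringIO(SAMPLE))
events = Parser(HeaderScanner(), EventLogConsumer()).parse(io.StringIO(SAMPLE))
```

The same scanner drives both consumers. Swapping the consumer changes what comes out of `parse` without touching the code that recognizes the format.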
We've already decided upon a meta-data convention for BLAST content. My
feeling is that you'll run into trouble if you try to have one standard
for all similarity algorithms; you'll be better off creating a standard
specifically for each algorithm, and then a more general one that they can
map into.
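One way such a mapping could look, as a rough sketch only: an adapter consumer translates algorithm-specific event names into a shared vocabulary and drops events that have no general equivalent. The generic names (`hit_name`, `hit_length`, `hit_score`), the `event(name, value)` interface, and both classes are hypothetical and are not part of any agreed convention. The BLAST-side names `title`, `length`, `score`, and `strand` are taken from the event list in this message.

```python
# Hypothetical mapping from BLAST-specific events to a generic vocabulary.
# Events missing from the mapping (e.g. "strand") have no generic equivalent.
BLAST_TO_GENERIC = {
    "title": "hit_name",
    "length": "hit_length",
    "score": "hit_score",
}


class GenericConsumer:
    """Collects events expressed in the (hypothetical) general standard."""

    def __init__(self):
        self.records = []

    def event(self, name, value):
        self.records.append((name, value))


class MappingConsumer:
    """Adapter: receives algorithm-specific events, forwards generic ones."""

    def __init__(self, mapping, target):
        self.mapping = mapping
        self.target = target

    def event(self, name, value):
        generic_name = self.mapping.get(name)
        if generic_name is not None:
            self.target.event(generic_name, value)


generic = GenericConsumer()
adapter = MappingConsumer(BLAST_TO_GENERIC, generic)

# Events as a BLAST-specific scanner might emit them (values are made up).
adapter.event("title", "example_hit")
adapter.event("strand", "Plus / Plus")
adapter.event("score", 42)
```

Each algorithm keeps its own full event set, and only the mapping table has to change to plug another BLAST-like format into the general standard.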
Jeff
BLAST Scanners produce the following events:

SECTION NAME
    EVENT NAME                     COMMENTS

header
    version
    reference
    query_info
    database_info

descriptions
    round                          psi blast
    model_sequences                psi blast
    nonmodel_sequences             psi blast
    converged                      psi blast
    description
    no_hits

alignment
    multalign                      master-slave
    title                          pairwise
    length                         pairwise

hsp
    score                          pairwise
    identities                     pairwise
    strand                         pairwise, blastn
    frame                          pairwise, blastx, tblastn, tblastx
    query                          pairwise
    align                          pairwise
    sbjct                          pairwise

database_report
    database
    posted_date
    num_letters_in_database
    num_sequences_in_database
    num_letters_searched           RESERVED. Currently unused. I've never
    num_sequences_searched         RESERVED. seen it, but it's in blastool.c.
    ka_params
    gapped                         not blastp
    ka_params_gap                  gapped mode (not tblastx)

parameters
    matrix
    gap_penalties                  gapped mode (not tblastx)
    num_hits
    num_sequences
    num_extends
    num_good_extends
    num_seqs_better_e
    hsps_no_gap                    gapped (not tblastx) and not blastn
    hsps_prelim_gapped             gapped (not tblastx) and not blastn
    hsps_prelim_gap_attempted      gapped (not tblastx) and not blastn
    hsps_gapped                    gapped (not tblastx) and not blastn
    query_length
    database_length
    effective_hsp_length
    effective_query_length
    effective_database_length
    effective_search_space
    effective_search_space_used
    frameshift                     blastx or tblastn or tblastx
    threshold
    window_size
    dropoff_1st_pass
    gap_x_dropoff
    gap_x_dropoff_final            gapped (not tblastx) and not blastn
    gap_trigger
    blast_cutoff