[Biojava-l] BLAST Parser for extracting all BLAST data?

Richard HOLLAND hollandr at gis.a-star.edu.sg
Sun Jun 26 11:33:14 EDT 2005


BioJava's BLAST framework parses files and fires events for every piece of information it finds. The SeqSimilarityAdapter class is an example of how to catch these events and construct basic BLAST result objects (SimpleSeqSimilaritySearchHit). However, those objects are not comprehensive and do not record full details of every hit.

If you want the kind of detail you mention below, you will have to write your own content handler for BLAST parsing and pass it to the BLASTLikeSAXParser before parsing a file. This event handler should implement the ContentHandler interface. Look at the source of SeqSimilarityAdapter for guidance. Your handler will then receive events for every part of the file, from which you can construct your own custom BLAST result objects.
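A minimal sketch of such a handler, written against the standard org.xml.sax API only. The class name HspDetailHandler is hypothetical, and the element name "HSPSummary" is an assumption about what the parser emits; check it against the events your BioJava version actually fires before relying on it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps every attribute of every HSP summary element.
class HspDetailHandler extends DefaultHandler {
    // ASSUMPTION: the per-HSP summary element is called "HSPSummary".
    static final String HSP_SUMMARY = "HSPSummary";

    private final List<Map<String, String>> summaries = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Fall back to the qualified name when no local name is reported.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        int colon = name.indexOf(':');
        if (colon >= 0) {
            name = name.substring(colon + 1); // drop a namespace prefix, if any
        }
        if (!HSP_SUMMARY.equals(name)) {
            return; // not an HSP summary, ignore
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String attrName = attrs.getLocalName(i);
            if (attrName == null || attrName.isEmpty()) {
                attrName = attrs.getQName(i);
            }
            values.put(attrName, attrs.getValue(i));
        }
        summaries.add(values);
    }

    // One map per HSP summary seen, in file order.
    List<Map<String, String>> getSummaries() {
        return summaries;
    }
}

// Attaching it (needs BioJava on the classpath, so shown as a comment):
//   BLASTLikeSAXParser parser = new BLASTLikeSAXParser();
//   parser.setContentHandler(new HspDetailHandler());
//   parser.parse(new InputSource(new FileReader("blast.out")));
```

Storing the attributes as plain name/value maps avoids hard-coding attribute names you have not yet confirmed; once you know which ones carry the identities, positives and gaps, you can copy them into a proper result class.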

If you're not sure which tag names to listen for in your ContentHandler, the easiest thing to do is run the parser once with a handler that dumps every event, and see what you get.
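Something along these lines would do for the dump. It is only a sketch using the standard org.xml.sax classes; the class name EventDumpHandler is made up for illustration, and you would set it on the parser with setContentHandler in place of your real handler.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each distinct element name and the
// attribute names seen on it, then prints the lot at end of document.
class EventDumpHandler extends DefaultHandler {
    private final Map<String, Set<String>> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Prefer the qualified name; fall back to the local name if it is empty.
        String name = (qName == null || qName.isEmpty()) ? localName : qName;
        Set<String> attrNames = seen.computeIfAbsent(name, k -> new LinkedHashSet<>());
        for (int i = 0; i < attrs.getLength(); i++) {
            String attrName = attrs.getQName(i);
            if (attrName == null || attrName.isEmpty()) {
                attrName = attrs.getLocalName(i);
            }
            attrNames.add(attrName);
        }
    }

    @Override
    public void endDocument() {
        // One line per element name, followed by its attribute names.
        seen.forEach((element, attrNames) -> System.out.println(element + " " + attrNames));
    }

    Map<String, Set<String>> getSeen() {
        return seen;
    }
}
```

Collecting distinct names instead of printing every event keeps the output short even for a large BLAST report.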

cheers,
Richard


-----Original Message-----
From:	biojava-l-bounces at portal.open-bio.org on behalf of Y D Sun
Sent:	Sun 6/26/2005 5:42 PM
To:	biojava-l at biojava.org
Cc:	
Subject:	[Biojava-l] BLAST Parser for extracting all BLAST data?

Hi,

I want to extract all data from BLASTP results. In the following hit,
for example, I need to get the lengths of the query and subject proteins,
the identities (all three values: 54, 124 and 43%), the positives (79,
124 and 63%), and the gaps (3, 124 and 2%). Can the
BLASTLikeSAXParser extract all of this information? I can't find any
methods in the SeqSimilaritySearchHit and SeqSimilaritySearchSubHit APIs to
retrieve these data. Does BioJava provide any methods for this purpose?

Thanks,

George


BLASTP 2.2.5 [Nov-16-2002]

Query= Prot0001
         (138 letters)

Database: /work/nys1/fasta/protein/AE000782.pro.fasta
           2407 sequences; 662,866 total letters

Searching.....done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

Prot0002                                                           100   1e-23
Prot0003                                                            74   2e-15
Prot0004                                                            43   3e-06

>Prot0002
          Length = 138

 Score =  100 bits (250), Expect = 1e-23
 Identities = 54/124 (43%), Positives = 79/124 (63%), Gaps = 3/124 (2%)

Query: 18  NARTKFTDIAKTLNLTEAAIRKRIKKLEENQIIKRYSIDIDYKKLGYNMAIIGLDIDMDY 77
           NAR   T IAK LN+TEAA+RKRI  LE  + I  Y   I+YKK+G + ++ G+D+D D
Sbjct: 15  NARIPKTRIAKELNVTEAAVRKRIANLERREEILGYKAIINYKKVGLSASLTGVDVDPDK 74

Query: 78  FPKIIKELEKRKEFLHIYSSAGDHDIMVIAIYK---DLEEIYNYLKNLKGVKRVCPAIII 134
             K+++EL+  +    ++ + GDH IM   I K   +L EI+  +  ++GVKRVCP+II
Sbjct: 75  LWKVVEELKDLESVKSLWLTTGDHTIMAEIIAKSVQELSEIHQKIAEMEGVKRVCPSIIT 134

Query: 135 DQIK 138
           D +K
Sbjct: 135 DIVK 138

_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l