[Biojava-l] BLAST Parser for extracting all BLAST data?
Richard HOLLAND
hollandr at gis.a-star.edu.sg
Sun Jun 26 11:33:14 EDT 2005
BioJava's BLAST framework parses files and fires an event for every piece of information it finds. The SeqSimilarityAdapter class is an example of how to catch these events and construct basic BLAST result objects (SimpleSeqSimilarityHit). However, those objects are not comprehensive and do not record the full details of every hit.
If you want the kind of detail you mention below, you will have to write your own content handler for BLAST parsing and pass it to the BLASTLikeSAXParser when parsing a file. This event handler should implement the ContentHandler interface. Look at the source of SeqSimilarityAdapter for guidance. You will then receive events for every part of the file, from which you can construct your own custom BLAST result objects to describe them.
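As a rough illustration of the handler idea, here is a minimal sketch in Java. It uses only the SAX classes that ship with the JDK, so it runs without BioJava. The element names (Hit, HSPSummary), the attribute names (id, identities, positives, gaps, alignmentLength) and the SAMPLE document are all hypothetical stand-ins, not the names the BLAST parser really emits; substitute whatever names you actually see in the event stream. The numbers are the ones from the Prot0002 hit quoted below.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// DefaultHandler implements org.xml.sax.ContentHandler with empty methods,
// so only the callbacks of interest need to be overridden.
class HspCollector extends DefaultHandler {

    /** Custom result object holding the figures for one alignment. */
    static final class HspStats {
        final String hitId;
        final int identities;
        final int positives;
        final int gaps;
        final int alignmentLength;

        HspStats(String hitId, int identities, int positives, int gaps, int alignmentLength) {
            this.hitId = hitId;
            this.identities = identities;
            this.positives = positives;
            this.gaps = gaps;
            this.alignmentLength = alignmentLength;
        }

        /** Truncated whole-number percentage, e.g. 54 of 124 gives 43. */
        int percentOf(int count) {
            return 100 * count / alignmentLength;
        }
    }

    // Hypothetical element names: replace with the ones the parser really fires.
    static final String HIT = "Hit";
    static final String SUMMARY = "HSPSummary";

    // Hypothetical stand-in for the event stream, built from the quoted hit.
    static final String SAMPLE =
        "<Report><Hit id='Prot0002'>"
        + "<HSPSummary identities='54' positives='79' gaps='3' alignmentLength='124'/>"
        + "</Hit></Report>";

    private final List<HspStats> results = new ArrayList<>();
    private String currentHitId;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (HIT.equals(qName)) {
            // Remember which hit the following summaries belong to.
            currentHitId = atts.getValue("id");
        } else if (SUMMARY.equals(qName)) {
            results.add(new HspStats(
                currentHitId,
                Integer.parseInt(atts.getValue("identities")),
                Integer.parseInt(atts.getValue("positives")),
                Integer.parseInt(atts.getValue("gaps")),
                Integer.parseInt(atts.getValue("alignmentLength"))));
        }
    }

    List<HspStats> getResults() {
        return results;
    }

    /** Drives the handler with the JDK's XML parser over an XML string. */
    static List<HspStats> collect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        HspCollector handler = new HspCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getResults();
    }

    public static void main(String[] args) throws Exception {
        for (HspStats s : collect(SAMPLE)) {
            System.out.println(s.hitId + " identities " + s.identities + "/"
                + s.alignmentLength + " (" + s.percentOf(s.identities) + "%)");
        }
    }
}
```

With BioJava the same handler would be handed to the BLAST parser instead of the JDK's XML reader; the handler class itself should not need to change apart from the names it listens for.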
If you're not sure what tag names to listen for in your ContentHandler, the easiest thing to do is to run the parser once with a handler that simply dumps every tag name it receives, and see what you get.
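A tag-dumping handler along those lines could look like the sketch below. Again it only uses the JDK's SAX classes, and the sample XML in main is a made-up stand-in so the example can run on its own; it is not what the BLAST parser emits. The class and method names (TagDumper, dump) are mine, not BioJava's.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TagDumper extends DefaultHandler {

    // Element name -> attribute names seen on it, in order of first appearance.
    private final Map<String, Set<String>> seen = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Set<String> attributeNames = seen.computeIfAbsent(qName, k -> new TreeSet<>());
        for (int i = 0; i < atts.getLength(); i++) {
            attributeNames.add(atts.getQName(i));
        }
    }

    Map<String, Set<String>> getSeen() {
        return seen;
    }

    /** Runs the dumper over an XML string using the JDK's SAX parser. */
    static Map<String, Set<String>> dump(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TagDumper handler = new TagDumper();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getSeen();
    }

    public static void main(String[] args) throws Exception {
        // Made-up input; with BioJava the events would come from the BLAST parser.
        String sample =
            "<Report>"
            + "<Hit id='Prot0002'><Summary score='100'/></Hit>"
            + "<Hit id='Prot0003'><Summary score='74' expect='2e-15'/></Hit>"
            + "</Report>";
        dump(sample).forEach((tag, attrs) -> System.out.println(tag + " " + attrs));
    }
}
```

Each distinct tag is printed once with the union of attribute names seen on it, which is usually enough to decide what a real handler should listen for.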
cheers,
Richard
-----Original Message-----
From: biojava-l-bounces at portal.open-bio.org on behalf of Y D Sun
Sent: Sun 6/26/2005 5:42 PM
To: biojava-l at biojava.org
Cc:
Subject: [Biojava-l] BLAST Parser for extracting all BLAST data?
Hi,
I want to extract all data from BLASTP results. In the following hit,
for example, I need to get the lengths of query and subject proteins,
the identities (including all data 54, 124 and 43%), the positives (all
data 79, 124 and 63%), and the gaps (3, 124 and 2%). Can the
BLASTLikeSAXParser extract all of this information? I can't find any
methods in the SeqSimilaritySearchHit and SeqSimilaritySearchSubHit APIs
to retrieve these data. Does BioJava provide any methods for this purpose?
Thanks,
George
BLASTP 2.2.5 [Nov-16-2002]
Query= Prot0001
(138 letters)
Database: /work/nys1/fasta/protein/AE000782.pro.fasta
2407 sequences; 662,866 total letters
Searching.....done
                                                    Score    E
Sequences producing significant alignments:        (bits)  Value
Prot0002                                             100   1e-23
Prot0003                                              74   2e-15
Prot0004                                              43   3e-06
>Prot0002
Length = 138
Score = 100 bits (250), Expect = 1e-23
Identities = 54/124 (43%), Positives = 79/124 (63%), Gaps = 3/124 (2%)
Query: 18  NARTKFTDIAKTLNLTEAAIRKRIKKLEENQIIKRYSIDIDYKKLGYNMAIIGLDIDMDY 77
           NAR   T IAK LN+TEAA+RKRI  LE  + I  Y   I+YKK+G + ++ G+D+D D
Sbjct: 15  NARIPKTRIAKELNVTEAAVRKRIANLERREEILGYKAIINYKKVGLSASLTGVDVDPDK 74
Query: 78  FPKIIKELEKRKEFLHIYSSAGDHDIMVIAIYK---DLEEIYNYLKNLKGVKRVCPAIII 134
             K+++EL+  +    ++ + GDH IM   I K   +L EI+  +  ++GVKRVCP+II
Sbjct: 75  LWKVVEELKDLESVKSLWLTTGDHTIMAEIIAKSVQELSEIHQKIAEMEGVKRVCPSIIT 134
Query: 135 DQIK 138
           D +K
Sbjct: 135 DIVK 138
_______________________________________________
Biojava-l mailing list - Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l