[Biojava-dev] blast parsing continued
Doug Rusch
drusch@tcag.org
Fri, 15 Nov 2002 14:28:21 -0500
Yes, an XML parser would be best, if I didn't find that the NCBI BLAST XML output option tends to core dump on me. In any case, here is my modification of the BlastLikeDataSetCollection DTD, which I call BlastLikeResultSetCollection. See if this fits your expectations, David; if not, we can thrash out what should be changed.
<!-- BlastLikeResultSetCollection DTD - this is a heavily modified version
of the BlastLikeDataSetCollection DTD. It is currently under
development but hopefully will serve as a unified DTD for
a variety of analysis tools including :
o BLAST (NCBI)
o WU-BLAST (Washington University)
o HMMER (Washington University)
o DBA (Sanger Center)
o Genewise (Sanger Center)
o Sim4 (Pennsylvania State University)
NB This DTD covers output from the above software, when run
in modes such that the detailed output is based around
pairwise alignments.
This is as opposed to other output formats such as ASN.1
The root element is a BlastLikeResultSetCollection. This is
described towards the end of the DTD.
================================================================
The BlastLikeDataSetCollection DTD is Copyright 1999, 2000, 2001 Cambridge
Antibody Technology Group plc (CAT). All Rights Reserved.
The BlastLikeResultSetCollection DTD is Copyright 2002 The Center for the
Advancement of Genomics (TCAG). All Rights Reserved.
Version 0.5
Author List for BlastLikeResultSetCollection:
Primary Author: Douglas Rusch (TCAG)
Author List for BlastLikeDataSetCollection:
Primary Author: Simon Brocklehurst (CAT)
Other Authors: Colin H. Hardman (CAT)
Stuart Johnson (CAT)
Tim Dilks (CAT)
Keith James (Sanger Center)
================================================================ -->
<!-- PARAMETER ENTITY DECLARATIONS
============================= -->
<!-- ELEMENT DECLARATIONS
==================== -->
<!-- The RawOutput element is used to represent sections of the
output from programs "as is". This enables information from
software to be represented, without being parsed in detail.
-->
<!ELEMENT biojava:RawOutput (#PCDATA)>
<!ATTLIST biojava:RawOutput
xml:space (default|preserve) #IMPLIED >
<!-- ================================================================ -->
<!-- Elements for Query, Subject, and Database information -->
<!-- Changes include the addition of the description or definition -->
<!-- line and the length (in letters) of the subject and query -->
<!-- sequences. For the database, length in letters and number of -->
<!-- sequences has been added. -->
<!-- Why is there a metadata field? How is this supposed to be used?? -->
<!-- Parsers seem to ignore this attribute. -->
<!ELEMENT biojava:QueryInfo EMPTY>
<!ATTLIST biojava:QueryInfo
id CDATA #REQUIRED
desc CDATA #IMPLIED
length CDATA #IMPLIED
metadata CDATA #REQUIRED >
<!ELEMENT biojava:SbjctInfo EMPTY>
<!ATTLIST biojava:SbjctInfo
id CDATA #REQUIRED
desc CDATA #IMPLIED
length CDATA #IMPLIED
metaData CDATA #REQUIRED >
<!ELEMENT biojava:DatabaseInfo EMPTY>
<!ATTLIST biojava:DatabaseInfo
name CDATA #REQUIRED
letters CDATA #IMPLIED
entries CDATA #IMPLIED
metadata CDATA #REQUIRED >
<!-- ================================================================ -->
<!-- Mainly HSPSummary related information derived from HitSummary. -->
<!-- Neither of these names seems correct, perhaps MatchSummary -->
<!-- would be best. Changes include removing a count of HSPs and -->
<!-- reading frame. Reading frame is easily derived from the -->
<!-- coordinates of the alignment. Also removed sumProbability value -->
<!-- though this should probably be kept. Added similarity count. -->
<!ELEMENT biojava:HSPSummary EMPTY>
<!ATTLIST biojava:HSPSummary
score CDATA #REQUIRED
bitScore CDATA #IMPLIED
expectValue CDATA #IMPLIED
identical CDATA #IMPLIED
alignmentLength CDATA #IMPLIED
similar CDATA #IMPLIED
pValue CDATA #IMPLIED
sumPValues CDATA #IMPLIED >
<!-- ================================================================ -->
<!-- Elements for Query, Subject, and Match alignment information -->
<!ELEMENT biojava:QuerySequence (#PCDATA)>
<!ATTLIST biojava:QuerySequence
begin CDATA #REQUIRED
end CDATA #REQUIRED
strand CDATA #REQUIRED
type CDATA #IMPLIED
gaps CDATA #IMPLIED >
<!-- A MatchConsensus element represents the consensus information
present in a pairwise alignment produced by Blast-like programs
(i.e. the middle line of the alignment). -->
<!ELEMENT biojava:MatchConsensus (#PCDATA)>
<!ATTLIST biojava:MatchConsensus
xml:space (default|preserve) #IMPLIED >
<!ELEMENT biojava:SbjctSequence (#PCDATA)>
<!ATTLIST biojava:SbjctSequence
begin CDATA #REQUIRED
end CDATA #REQUIRED
strand CDATA #REQUIRED
type CDATA #IMPLIED
gaps CDATA #IMPLIED >
<!-- The BlastLikeAlignment element represents information from the
pairwise alignments produced by Blast-like programs. Rather than
representing the alignment simply as preformatted raw text, it
separates out the information into a QuerySequence, a MatchConsensus
and a SbjctSequence. -->
<!ELEMENT biojava:BlastLikeAlignment (biojava:QuerySequence,
biojava:MatchConsensus,
biojava:SbjctSequence) >
<!ELEMENT biojava:HSP (biojava:HSPSummary, biojava:BlastLikeAlignment?)>
<!-- HSPCollections model related groups of HSPs. For example, this
allows all plus strand HSPs to be grouped separately from all
minus strand HSPs -->
<!ELEMENT biojava:HSPCollection (biojava:HSP+)>
<!-- A hit, besides containing the subject and alignment information
should also hold things like frameshifts where it is assumed that
a frameshift terminates a given match or HSP -->
<!ELEMENT biojava:Hit (biojava:SbjctInfo, biojava:HSPCollection+)>
<!ATTLIST biojava:Hit >
<!ELEMENT biojava:Detail (biojava:Hit*)>
<!-- ================================================================ -->
<!-- Statistics found at end of blast -->
<!ELEMENT biojava:KAStats EMPTY>
<!ATTLIST biojava:KAStats
K CDATA #REQUIRED
H CDATA #REQUIRED
lambda CDATA #REQUIRED >
<!ELEMENT biojava:GappedKAStats EMPTY>
<!ATTLIST biojava:GappedKAStats
K CDATA #REQUIRED
H CDATA #REQUIRED
lambda CDATA #REQUIRED >
<!ELEMENT biojava:SearchMatrix EMPTY>
<!ATTLIST biojava:SearchMatrix
name CDATA #REQUIRED
matchScore CDATA #IMPLIED
mismatchScore CDATA #IMPLIED >
<!ELEMENT biojava:GapPenalties EMPTY>
<!ATTLIST biojava:GapPenalties
gapOpen CDATA #REQUIRED
gapExtend CDATA #REQUIRED >
<!ELEMENT biojava:SearchSpaceStats EMPTY>
<!ATTLIST biojava:SearchSpaceStats
effectiveSpace CDATA #REQUIRED
usedSpace CDATA #REQUIRED >
<!ELEMENT biojava:Statistics (biojava:KAStats,
biojava:GappedKAStats,
biojava:SearchMatrix,
biojava:GapPenalties,
biojava:SearchSpaceStats)>
<!-- ================================================================ -->
<!-- Relating to overall results of searches -->
<!ELEMENT biojava:Header (biojava:RawOutput?, biojava:QueryInfo?, biojava:DatabaseInfo?)>
<!ELEMENT biojava:BlastLikeResultSet (biojava:Header,
biojava:Summary?,
biojava:Detail?,
biojava:Statistics?)>
<!ATTLIST biojava:BlastLikeResultSet
program CDATA #REQUIRED
version CDATA #REQUIRED>
<!-- A BlastLikeResultSetCollection contains data from groups of results
obtained from bioinformatics software that produces Blast-like
output. For example, it can model the output from Blast run on
multiple sequences. Or it could be used to group together analyses
on a single sequence obtained from multiple programs. -->
<!ELEMENT biojava:BlastLikeResultSetCollection (biojava:BlastLikeResultSet+) >
<!ATTLIST biojava:BlastLikeResultSetCollection
xmlns CDATA #FIXED ""
xmlns:biojava CDATA #FIXED "http://www.biojava.org" >
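For illustration, a minimal instance document of the kind the declarations above are meant to describe might look like the sketch below. Every value in it (ids, program name and version, scores, coordinates, statistics, and the "plus" strand token) is hypothetical; the DTD types these attributes as CDATA and does not fix a vocabulary for them. The optional Summary element is left out because this version of the DTD does not declare it.

```xml
<?xml version="1.0"?>
<!-- Hypothetical instance; all ids, names, scores and coordinates are made up. -->
<biojava:BlastLikeResultSetCollection xmlns:biojava="http://www.biojava.org">
  <biojava:BlastLikeResultSet program="blastn" version="2.2.4">
    <biojava:Header>
      <biojava:QueryInfo id="query1" desc="example query" length="20" metadata="none"/>
      <biojava:DatabaseInfo name="exampledb" letters="1000000" entries="500" metadata="none"/>
    </biojava:Header>
    <biojava:Detail>
      <biojava:Hit>
        <!-- metaData is spelled as in the SbjctInfo declaration -->
        <biojava:SbjctInfo id="sbjct1" desc="example subject" length="500" metaData="none"/>
        <biojava:HSPCollection>
          <biojava:HSP>
            <biojava:HSPSummary score="20" bitScore="40.1" expectValue="1e-05" alignmentLength="20"/>
            <biojava:BlastLikeAlignment>
              <biojava:QuerySequence begin="1" end="20" strand="plus">ACGTACGTACGTACGTACGT</biojava:QuerySequence>
              <biojava:MatchConsensus xml:space="preserve">||||||||||||||||||||</biojava:MatchConsensus>
              <biojava:SbjctSequence begin="101" end="120" strand="plus">ACGTACGTACGTACGTACGT</biojava:SbjctSequence>
            </biojava:BlastLikeAlignment>
          </biojava:HSP>
        </biojava:HSPCollection>
      </biojava:Hit>
    </biojava:Detail>
    <biojava:Statistics>
      <biojava:KAStats K="0.711" H="1.31" lambda="1.37"/>
      <biojava:GappedKAStats K="0.711" H="1.31" lambda="1.37"/>
      <biojava:SearchMatrix name="blastn matrix" matchScore="1" mismatchScore="-3"/>
      <biojava:GapPenalties gapOpen="5" gapExtend="2"/>
      <biojava:SearchSpaceStats effectiveSpace="490000" usedSpace="490000"/>
    </biojava:Statistics>
  </biojava:BlastLikeResultSet>
</biojava:BlastLikeResultSetCollection>
```

The nesting follows the content models: one Hit carries a SbjctInfo and one or more HSPCollections, and each HSP pairs an HSPSummary with an optional BlastLikeAlignment.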
-----Original Message-----
From: David Huen [mailto:smh1008@cus.cam.ac.uk]
Sent: Fri 11/15/02 11:49 AM
To: Doug Rusch; Keith James
Cc: biojava-dev@biojava.org
Subject: Re: [Biojava-dev] blast parsing continued
Could I have a copy of whatever DTD you might settle upon please?
I have an NCBI Blast XML parser that I use that I'd like to check in and an
adaptor to implement to make the events match those expected by downstream
builders.
Regards,
David Huen