[Biojava-dev] blast parsing continued
Matthew Pocock
matthew_pocock@yahoo.co.uk
Sat, 16 Nov 2002 14:09:06 +0000
Hi Doug,
You've persuaded me ;-)
Just a handful of comments. Firstly, you said that your code uses the 1.4
regex classes. If it does, then please put it under src-1.4 rather than
src in the dev tree. Secondly, whatever happens, we need to make sure
that there is a functional blast parsing solution for Java 1.2 and 1.3
users. Thirdly, I'm a little worried about us ending up with two
incompatible blast DTDs. If we modify the DTD for one parser then, if
humanly possible, could we make sure the other one emits the same
SAX events? I know this may not be trivial, but it would make me happier.
Lastly, I can't remember if you have a CVS account, but I guess it's
time to sort you out with one so that you can work on this without
mailing source around the list all the time. Welcome aboard :-)
Matthew
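
For readers wondering what the "1.4 regex stuff" looks like in practice, here is a minimal sketch of pulling HSP scores out of one line of NCBI-style text output with java.util.regex (available from Java 1.4 onwards). HspScoreLine is a hypothetical helper, not part of BioJava and not Doug's code; the line format is an assumption based on typical NCBI BLAST output, and the map keys borrow the attribute names from the HSPSummary element in the quoted DTD. It is written for a current JDK (generics), so it is only an illustration of the dependency, not something that would compile on 1.4 as-is.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of BioJava): parses one HSP score line. */
final class HspScoreLine {
    // Assumed line shape: " Score = 87.4 bits (44), Expect = 2e-17"
    // The optional "(n)" after Expect covers lines such as "Expect(2) = ...".
    private static final Pattern SCORE = Pattern.compile(
        "Score\\s*=\\s*([0-9.]+)\\s*bits\\s*\\((\\d+)\\),"
        + "\\s*Expect(?:\\(\\d+\\))?\\s*=\\s*([0-9eE.+-]+)");

    private HspScoreLine() {}

    /**
     * Returns attribute name to value, using the DTD's HSPSummary names,
     * or an empty map if the line does not look like a score line.
     */
    static Map<String, String> parse(String line) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = SCORE.matcher(line);
        if (m.find()) {
            attrs.put("bitScore", m.group(1));     // value before "bits"
            attrs.put("score", m.group(2));        // raw score in parentheses
            attrs.put("expectValue", m.group(3));  // kept as text, not parsed
        }
        return attrs;
    }
}
```

The values are kept as strings on purpose: the DTD declares them as CDATA, and leaving numeric conversion to downstream code avoids guessing how forms such as "e-100" should be handled.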
Doug Rusch wrote:
> Yes, an XML parser would be best if I didn't find that the NCBI blast XML output option tends to core dump on me. In any case, here is my modification of the BlastLikeDataSetCollection DTD, which I call BlastLikeResultSetCollection. See if this fits with your expectations, David; if not, we can thrash out what should be changed.
>
> <!-- BlastLikeResultSetCollection DTD - this is a heavily modified version
> of the BlastLikeDataSetCollection DTD. It is currently under
> development but hopefully will serve as a unified DTD for
> a variety of analysis tools including :
>
> o BLAST (NCBI)
> o WU-BLAST (Washington University)
> o HMMER (Washington University)
> o DBA (Sanger Center)
> o Genewise (Sanger Center)
> o Sim4 (Pennsylvania State University)
>
> NB This DTD covers output from the above software, when run
> in modes such that the detailed output is based around
> pairwise alignments.
>
> This is as opposed to other output formats such as ASN.1
>
> The root element is a BlastLikeResultSetCollection. This is
> described towards the end of the DTD.
> ================================================================
> The BlastLikeDataSetCollection DTD is Copyright 1999, 2000, 2001 Cambridge
> Antibody Technology Group plc (CAT). All Rights Reserved.
>
> The BlastLikeResultSetCollection DTD is Copyright 2002 The Center for the
> Advancement of Genomics (TCAG). All Rights Reserved.
>
> Version 0.5
>
> Author List for BlastLikeResultSetCollection:
> Primary Author: Douglas Rusch (TCAG)
>
> Author List for BlastLikeDataSetCollection:
> Primary Author: Simon Brocklehurst (CAT)
> Other Authors: Colin H. Hardman (CAT)
> Stuart Johnson (CAT)
> Tim Dilks (CAT)
> Keith James (Sanger Center)
> ================================================================ -->
>
> <!-- PARAMETER ENTITY DECLARATIONS
> ============================= -->
>
> <!-- ELEMENT DECLARATIONS
> ==================== -->
>
> <!-- The RawOutput element is used to represent sections of the
> output from programs "as is". This enables information from
> software to be represented, without being parsed in detail.
> -->
> <!ELEMENT biojava:RawOutput (#PCDATA)>
> <!ATTLIST biojava:RawOutput
> xml:space (default|preserve) #IMPLIED >
>
> <!-- ================================================================ -->
> <!-- Elements for Query, Subject, and Database information -->
> <!-- Changes include the addition of the description or definition -->
> <!-- line and the length (in letters) of the subject and query -->
> <!-- sequences. For the database, length in letters and number of -->
> <!-- sequences has been added. -->
> <!-- Why is there a metadata field? How is this supposed to be used?? -->
> <!-- Parsers seem to ignore this attribute. -->
>
> <!ELEMENT biojava:QueryInfo EMPTY>
> <!ATTLIST biojava:QueryInfo
> id CDATA #REQUIRED
> desc CDATA #IMPLIED
> length CDATA #IMPLIED
> metadata CDATA #REQUIRED >
>
> <!ELEMENT biojava:SbjctInfo EMPTY>
> <!ATTLIST biojava:SbjctInfo
> id CDATA #REQUIRED
> desc CDATA #IMPLIED
> length CDATA #IMPLIED
> metaData CDATA #REQUIRED >
>
> <!ELEMENT biojava:DatabaseInfo EMPTY>
> <!ATTLIST biojava:DatabaseInfo
> name CDATA #REQUIRED
> letters CDATA #IMPLIED
> entries CDATA #IMPLIED
> metadata CDATA #REQUIRED >
>
> <!-- ================================================================ -->
> <!-- Mainly HSPSummary related information derived from HitSummary. -->
> <!-- Neither of these names seems correct, perhaps MatchSummary -->
> <!-- would be best. Changes include removing a count of HSPs and -->
> <!-- reading frame. Reading frame is easily derived from the -->
> <!-- coordinates of the alignment. Also removed sumProbability value -->
> <!-- though this should probably be kept. Added similarity count. -->
>
> <!ELEMENT biojava:HSPSummary EMPTY>
> <!ATTLIST biojava:HSPSummary
> score CDATA #REQUIRED
> bitScore CDATA #IMPLIED
> expectValue CDATA #IMPLIED
> identical CDATA #IMPLIED
> alignmentLength CDATA #IMPLIED
> similar CDATA #IMPLIED
> pValue CDATA #IMPLIED
> sumPValues CDATA #IMPLIED >
>
> <!-- ================================================================ -->
> <!-- Elements for Query, Subject, and Match alignment information -->
>
> <!ELEMENT biojava:QuerySequence (#PCDATA)>
> <!ATTLIST biojava:QuerySequence
> begin CDATA #REQUIRED
> end CDATA #REQUIRED
> strand CDATA #REQUIRED
> type CDATA #IMPLIED
> gaps CDATA #IMPLIED >
>
> <!-- A MatchConsensus element represents the consensus information
> present in a pairwise alignment produced by Blast-like programs
> (i.e. the middle line of the alignment). -->
>
> <!ELEMENT biojava:MatchConsensus (#PCDATA)>
> <!ATTLIST biojava:MatchConsensus
> xml:space (default|preserve) #IMPLIED >
>
>
> <!ELEMENT biojava:SbjctSequence (#PCDATA)>
> <!ATTLIST biojava:SbjctSequence
> begin CDATA #REQUIRED
> end CDATA #REQUIRED
> strand CDATA #REQUIRED
> type CDATA #IMPLIED
> gaps CDATA #IMPLIED >
>
> <!-- The BlastLikeAlignment element represents information from the
> pairwise alignments produced by Blast-like programs. Rather than
> representing the alignment simply as preformatted raw text, it
> separates out the information into a QuerySequence, a MatchConsensus
> and a SbjctSequence. -->
>
> <!ELEMENT biojava:BlastLikeAlignment (biojava:QuerySequence,
> biojava:MatchConsensus,
> biojava:SbjctSequence) >
>
> <!ELEMENT biojava:HSP (biojava:HSPSummary, biojava:BlastLikeAlignment?)>
>
> <!-- HSPCollections model related groups of HSPs. For example, this
> allows all plus strand HSPs to be grouped separately from all
> minus strand HSPs -->
>
> <!ELEMENT biojava:HSPCollection (biojava:HSP+)>
>
> <!-- A hit, besides containing the subject and alignment information
> should also hold things like frameshifts where it is assumed that
> a frameshift terminates a given match or HSP -->
>
> <!ELEMENT biojava:Hit (biojava:SbjctInfo, biojava:HSPCollection+)>
> <!ATTLIST biojava:Hit >
>
> <!ELEMENT biojava:Detail (biojava:Hit*)>
>
> <!-- ================================================================ -->
> <!-- Statistics found at end of blast -->
>
> <!ELEMENT biojava:KAStats EMPTY>
> <!ATTLIST biojava:KAStats
> K CDATA #REQUIRED
> H CDATA #REQUIRED
> lambda CDATA #REQUIRED >
>
> <!ELEMENT biojava:GappedKAStats EMPTY>
> <!ATTLIST biojava:GappedKAStats
> K CDATA #REQUIRED
> H CDATA #REQUIRED
> lambda CDATA #REQUIRED >
>
> <!ELEMENT biojava:SearchMatrix EMPTY>
> <!ATTLIST biojava:SearchMatrix
> name CDATA #REQUIRED
> matchScore CDATA #IMPLIED
> mismatchScore CDATA #IMPLIED >
>
> <!ELEMENT biojava:GapPenalties EMPTY>
> <!ATTLIST biojava:GapPenalties
> gapOpen CDATA #REQUIRED
> gapExtend CDATA #REQUIRED >
>
> <!ELEMENT biojava:SearchSpaceStats EMPTY>
> <!ATTLIST biojava:SearchSpaceStats
> effectiveSpace CDATA #REQUIRED
> usedSpace CDATA #REQUIRED >
>
> <!ELEMENT biojava:Statistics (biojava:KAStats,
> biojava:GappedKAStats,
> biojava:SearchMatrix,
> biojava:GapPenalties,
> biojava:SearchSpaceStats)>
>
> <!-- ================================================================ -->
> <!-- Relating to overall results of searches -->
>
> <!ELEMENT biojava:Header (biojava:RawOutput?, biojava:QueryInfo?, biojava:DatabaseInfo?)>
>
> <!ELEMENT biojava:BlastLikeResultSet (biojava:Header,
> biojava:Summary?,
> biojava:Detail?,
> biojava:Statistics?)>
> <!ATTLIST biojava:BlastLikeResultSet
> program CDATA #REQUIRED
> version CDATA #REQUIRED>
>
> <!-- A BlastLikeResultSetCollection contains data from groups of results
> obtained from bioinformatics software that produces Blast-like
> output. For example, it can model the output from Blast run on
> multiple sequences. Or it could be used to group together analyses
> on a single sequence obtained from multiple programs. -->
>
> <!ELEMENT biojava:BlastLikeResultSetCollection (biojava:BlastLikeResultSet+) >
> <!ATTLIST biojava:BlastLikeResultSetCollection
> xmlns CDATA #FIXED ""
> xmlns:biojava CDATA #FIXED "http://www.biojava.org" >
>
>
> -----Original Message-----
> From: David Huen [mailto:smh1008@cus.cam.ac.uk]
> Sent: Fri 11/15/02 11:49 AM
> To: Doug Rusch; Keith James
> Cc: biojava-dev@biojava.org
> Subject: Re: [Biojava-dev] blast parsing continued
> Could I have a copy of whatever DTD you might settle upon please?
>
> I have a NCBI Blast XML parser that I use that I'd like to check in and an
> adaptor to implement to make the events match those expected by downstream
> builders.
>
> Regards,
> David Huen
>
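
One way to check that two parsers (or a parser plus an adaptor) really do emit the same SAX events is to record each event stream in a canonical form and compare the recordings. The following is only a sketch under stated assumptions: SaxEventRecorder is a hypothetical test helper, not an existing BioJava class, and it deliberately ignores namespace URIs and attribute order, which may or may not be what downstream builders care about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical test helper: records SAX events as canonical strings. */
final class SaxEventRecorder extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        flushText();
        // Sort attributes by name so their order does not affect comparison.
        Map<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.put(atts.getQName(i), atts.getValue(i));
        }
        events.add("start " + qName + " " + sorted);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Parsers may split text across several calls, so coalesce it.
        text.append(ch, start, length);
    }

    private void flushText() {
        if (text.length() > 0) {
            events.add("text " + text);
            text.setLength(0);
        }
    }

    /** Events recorded so far, in document order. */
    List<String> events() {
        return events;
    }
}
```

Passing one recorder to each parser as its ContentHandler and comparing the two events() lists would show where the streams first diverge. Coalescing character data matters here because the alignment sequences arrive as text and different parsers are free to chunk them differently.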
--
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk