[Biojava-l] Biojava XML Binding (BJXB)

Emig, Robin Robin.Emig@maxygen.com
Mon, 6 May 2002 09:03:52 -0700


Although we have had similar problems, I'd like to know what information you need is lost when exporting to the file formats you mention. For the most part you can recover what you need, especially if you REALLY mean that you don't want to be tied to biojava/java at either end of the process. I'd just hate to create YAF (Yet Another Format) instead of modifying or using one that already exists, and to create extra work trying to make it "not biojava bound" yet "containing biojava info".
-Robin

	-----Original Message----- 
	From: Schreiber, Mark [mailto:mark.schreiber@agresearch.co.nz] 
	Sent: Sun 5/5/2002 9:05 PM 
	To: biojava-l@biojava.org 
	Cc: 
	Subject: [Biojava-l] Biojava XML Binding (BJXB)
	
	

	Hi -
	
	I would like to propose and formalise a schema for binding biojava
	objects, especially sequence objects, to XML. The current binding of
	Biojava objects to other formats such as GFF, GenBank, EMBL, Game and
	Agave is inadequate, as details are lost in the reading and writing of
	these objects. While it is useful for biojava to read and write these
	formats, the only way to currently capture everything about a biojava
	object is to serialize it as a binary stream. The advantage of
	serializing to an XML document is that the XML can be constructed and
	edited using a text editor or programmatic processes on a machine
	(possibly a legacy system) with no Biojava installation and no
	requirement for a JVM. The XML can also be transported via HTTP/SOAP.
	The DTD could also be used as a base for anyone who needs a richer
	schema that maps well to Biojava.
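
As a rough illustration of the "no JVM" point, here is a minimal sketch in
Python (standard library only) that reads an abbreviated copy of the demo
document from this message. The function name read_sequences and the shape
of the returned dictionaries are assumptions made for the example; they are
not part of any proposed BJXB API.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the demo document from this message (DOCTYPE, features
# and most of the residues omitted to keep the example short).
BJXB = """
<seq_db class="org.biojava.bio.seq.db.HashSequenceDB">
  <sequence class="org.biojava.bio.seq.impl.SimpleSequence">
    <id name="fooase_est" urn="embl:UA000933"/>
    <symbol_list class="org.biojava.bio.seq.SimpleSymbolList" alphabet="DNA">
      accggtatgaccagaggaccc
      atatagggacaaacc
    </symbol_list>
    <annotation class="org.biojava.bio.SimpleAnnotation">
      <entry key="organism" value="Homo Sapiens"/>
      <entry key="seq_type" value="EST"/>
    </annotation>
  </sequence>
</seq_db>
"""


def read_sequences(xml_text):
    """Return one plain dict per <sequence> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.findall("sequence"):
        symbols = seq.find("symbol_list")
        entries = seq.find("annotation").findall("entry")
        records.append({
            "name": seq.find("id").get("name"),
            "alphabet": symbols.get("alphabet"),
            # Whitespace inside <symbol_list> is layout only, so drop it.
            "residues": "".join(symbols.text.split()),
            "annotation": {e.get("key"): e.get("value") for e in entries},
        })
    return records


records = read_sequences(BJXB)
print(records[0]["name"], records[0]["alphabet"], len(records[0]["residues"]))
```

Nothing here knows about Biojava classes; the class attributes are simply
carried along as text, which is the property that makes the format usable
on a machine without Java.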
	
	Why not use JAXB? Two reasons. First, JAXB requires java at both ends
	of the serialization/deserialization procedure. Second, JAXB doesn't
	play well with many biojava objects due to their use of factory
	methods, private and protected constructors, and singleton Alphabets.
	Actually this was all inspired by my inability to get JAXB to work
	with biojava.
	
	I have included a demo xml file and a simple dtd. Obviously there is a
	lot of room for expansion of the DTD to include more biojava concepts;
	however, I thought I would start with a typical use case with a rather
	nasty feature structure. Currently there is no read or write ability,
	but StAX looks like an obvious choice, although I suspect there might
	be a need for a lot of reflection code in the handlers! I am no StAX
	expert, so if someone feels particularly inspired in the next 24 hours
	to knock out a quick handler, that would be cool.
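
To give a feel for what an event-driven handler has to do with the nested
feature structure, here is a small sketch using Python's xml.sax. It is not
the StAX API and does no reflection: it only rebuilds the feature hierarchy
as plain dictionaries, using a stack of open features. The class name
FeatureTreeHandler and the dictionary layout are assumptions made for this
example, and the input is a cut-down skeleton of the demo features
(annotations, locations and the nested sequence are left out).

```python
import xml.sax


class FeatureTreeHandler(xml.sax.ContentHandler):
    """Rebuilds nested <feature> elements from SAX events (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.top_level = []  # features directly under <feature_holder>
        self._stack = []     # features currently open, innermost last

    def startElement(self, name, attrs):
        if name != "feature":
            return
        feature = {
            "type": attrs.get("type"),
            "source": attrs.get("source"),
            "children": [],
        }
        # Attach to the innermost open feature, or to the top level if none.
        siblings = self._stack[-1]["children"] if self._stack else self.top_level
        siblings.append(feature)
        self._stack.append(feature)

    def endElement(self, name):
        if name == "feature":
            self._stack.pop()


# Skeleton of the feature structure in the demo document.
FEATURES = b"""<feature_holder>
  <feature source="auto translation" type="predicted peptide">
    <feature source="experimental evidence" type="SNP"/>
  </feature>
  <feature source="experimental" type="PolyA tail"/>
</feature_holder>"""

handler = FeatureTreeHandler()
xml.sax.parseString(FEATURES, handler)
for feature in handler.top_level:
    print(feature["type"], [child["type"] for child in feature["children"]])
```

A real handler would additionally have to turn each class attribute into
the corresponding biojava object, which is where the reflection code would
come in.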
	
	Comments and Flames welcome.
	
	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE seq_db SYSTEM "bjxb.dtd">
	
	<seq_db class="org.biojava.bio.seq.db.HashSequenceDB">
	  <sequence class="org.biojava.bio.seq.impl.SimpleSequence">
	    <id name="fooase_est" urn="embl:UA000933"/>
	    <symbol_list class="org.biojava.bio.seq.SimpleSymbolList"
	                 alphabet="DNA">
	      accggtatgaccagaggacccatatagggacaaaccaaaaaaaaagcccacagcgcgttgagacagg
	      gggacacacccatatttaagaggacaccaaccccccccaaagagagagatnaaaaanaaana
	    </symbol_list>
	    <annotation class="org.biojava.bio.SimpleAnnotation">
	      <entry key="organism" value="Homo Sapiens"/>
	      <entry key="seq_type" value="EST"/>
	      <entry key="date" value="19/11/2001"/>
	    </annotation>
	    <feature_holder>
	      <feature class="org.biojava.bio.seq.genomic.TranslatedRegion"
	               source="auto translation"
	               type="predicted peptide">
	        <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
	        <location value="[7..27]"/>
	        <sequence class="org.biojava.bio.seq.impl.SimpleSequence">
	          <id name="fooase"/>
	          <symbol_list class="org.biojava.bio.seq.SimpleSymbolList"
	                       alphabet="PROTEIN">
	            MTRGPI*
	          </symbol_list>
	          <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
	        </sequence>
	        <feature class="org.biojava.bio.seq.impl.SimpleFeature"
	                 source="experimental evidence"
	                 type="SNP">
	          <annotation class="org.biojava.bio.SimpleAnnotation">
	            <entry key="SNP_type" value="g:c"/>
	          </annotation>
	          <location value="14"/>
	        </feature>
	      </feature>
	      <feature class="org.biojava.bio.seq.impl.SimpleFeature"
	               source="experimental"
	               type="PolyA tail">
	        <annotation class="org.biojava.bio.Annotation.EmptyAnnotation"/>
	        <location value="[119..131]"/>
	      </feature>
	    </feature_holder>
	  </sequence>
	</seq_db>
	
	<?xml version="1.0" encoding="UTF-8" ?>
	<!ELEMENT id EMPTY >
	<!ATTLIST id urn NMTOKEN #IMPLIED >
	<!ATTLIST id name NMTOKEN #REQUIRED >
	
	<!ELEMENT feature_holder ( feature* ) >
	
	<!ELEMENT annotation ( entry* ) >
	<!ATTLIST annotation class NMTOKEN #REQUIRED >
	
	<!ELEMENT sequence ( id, symbol_list, annotation, feature_holder? ) >
	<!ATTLIST sequence class NMTOKEN #REQUIRED >
	
	<!ELEMENT seq_db ( sequence+ ) >
	<!ATTLIST seq_db class NMTOKEN #REQUIRED >
	
	<!ELEMENT symbol_list ( #PCDATA ) >
	<!ATTLIST symbol_list class NMTOKEN #REQUIRED >
	<!ATTLIST symbol_list alphabet NMTOKEN #REQUIRED >
	
	<!ELEMENT location EMPTY >
	<!ATTLIST location value CDATA #REQUIRED >
	
	<!ELEMENT entry EMPTY >
	<!ATTLIST entry key NMTOKEN #REQUIRED >
	<!ATTLIST entry value CDATA #REQUIRED >
	
	<!ELEMENT feature ( annotation, location, sequence?, feature? ) >
	<!ATTLIST feature type CDATA #REQUIRED >
	<!ATTLIST feature source CDATA #REQUIRED >
	<!ATTLIST feature class NMTOKEN #REQUIRED >
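
The element content models in this DTD can also be checked without a
validating parser. The sketch below, in Python, transcribes the content
models by hand into regular expressions over the sequence of child element
names and reports any element whose children do not fit. It checks element
order only, not attributes, and the names CONTENT_MODELS and violations, as
well as the two small test documents, are made up for the example. Note
that the DTD as written ("feature?") allows at most one feature nested
inside another.

```python
import re
import xml.etree.ElementTree as ET

# Content models transcribed by hand from the DTD; each child name is
# followed by a single space so that names cannot run together.
# Elements declared EMPTY or (#PCDATA) have no child elements, hence "".
CONTENT_MODELS = {
    "seq_db": r"(sequence )+",
    "sequence": r"id symbol_list annotation (feature_holder )?",
    "feature_holder": r"(feature )*",
    "feature": r"annotation location (sequence )?(feature )?",
    "annotation": r"(entry )*",
    "id": r"",
    "location": r"",
    "entry": r"",
    "symbol_list": r"",
}


def violations(element, path=""):
    """Return a list of messages for elements whose children break the model."""
    here = f"{path}/{element.tag}"
    children = "".join(child.tag + " " for child in element)
    found = []
    if re.fullmatch(CONTENT_MODELS.get(element.tag, ""), children) is None:
        found.append(f"{here}: unexpected children [{children.strip()}]")
    for child in element:
        found.extend(violations(child, here))
    return found


# One nested feature: allowed by "feature?".
GOOD = """<feature_holder>
  <feature class="a" source="s" type="t">
    <annotation class="c"/><location value="[7..9]"/>
    <feature class="a" source="s" type="SNP">
      <annotation class="c"/><location value="8"/>
    </feature>
  </feature>
</feature_holder>"""

# Two nested features: not allowed by "feature?".
BAD = """<feature_holder>
  <feature class="a" source="s" type="t">
    <annotation class="c"/><location value="[7..9]"/>
    <feature class="a" source="s" type="SNP">
      <annotation class="c"/><location value="8"/>
    </feature>
    <feature class="a" source="s" type="SNP">
      <annotation class="c"/><location value="9"/>
    </feature>
  </feature>
</feature_holder>"""

print(violations(ET.fromstring(GOOD)))
print(violations(ET.fromstring(BAD)))
```

If more than one nested feature per feature is wanted, the model would need
to become "feature*" in the DTD.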
	
	
	Mark Schreiber
	Bioinformatics
	AgResearch Invermay
	PO Box 50034
	Mosgiel
	New Zealand
	
	PH:   +64 3 489 9175
	FAX:  +64 3 489 3739
	
	