[Biojava-l] SAX, DOM, XPath and Flat files

Andy Yates ayates at ebi.ac.uk
Fri Nov 30 09:26:15 UTC 2007


I think I've seen XPath used in other people's code on a Java 1.5 
code-base (one of the guys I work with, in fact). I've used Java's DOM 
before & it really isn't very nice; it's quite verbose. I'd prefer it if 
there were a better alternative/wrapper around the XML parsers, just to 
cut down on code chatter.
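
For what it's worth, here is a minimal sketch of the sort of thin wrapper I mean, using only the javax.xml.parsers and javax.xml.xpath classes that ship with Java 5. The class name XPathSketch, the select() helper and the <hits>/<hit>/<id> element names are all made up for illustration; none of them come from BioJava or from any real file format.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: hides the DOM + XPath boilerplate behind one call.
public class XPathSketch {

    /** Parses the XML into a DOM and returns the text of every node the expression matches. */
    public static List<String> select(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample document, not a real BLAST or GenBank record.
        String xml = "<hits><hit><id>P12345</id></hit><hit><id>Q67890</id></hit></hits>";
        System.out.println(select(xml, "/hits/hit/id")); // prints [P12345, Q67890]
    }
}
```

This still builds the whole DOM in memory, so it only covers the "easy coding experience" end of the trade-off, not the huge-file case.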

Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these 
Java tools & I think I've gone cross-eyed with the sheer number of 
acronyms! You've gotta love something which seems to add a letter to ER 
& that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the 
list know of a good ASN.1 parser for Java, and should we support it 
(considering NCBI generate their DTD & XML from the ASN.1 
representation)?

Andy

Mark Schreiber wrote:
> Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> not XQuery although XPath is probably more important for this use.
> 
> The DOM model is a direct implementation of the W3C standard which
> makes it a little awkward from a java point of view but it is usable.
> 
> Java 6 has StAX (the other one).
> 
> There are a few Java APIs for parsing ASN.1, mostly developed for the
> telco industry. I've never really looked into which is best (anyone
> experienced with this?) but we could probably use one to work directly
> off NCBI ASN.1.
> 
> - Mark
> 
> On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Mark,
>>
>> Okay that sounds like a perfectly sensible way to deal with this. Is
>> this kind of parsing model supported in Java5? I only ask as I've not
>> done a lot of XML parsing with Java5; more with things like XOM (which I
>> think offers a DOM only representation but I'm probably wrong).
>>
>> That's good. There's not much point in having a format & a DTD/XSD and
>> then having your files not conform to it.
>>
>> I was thinking the exact same thing about ASN.1 (well that & it looks
>> bleeding horrible to parse, but that is an uneducated look at the format,
>> which I'm sure is as parsable as JSON & the like).
>>
>> When it comes to flat file parsers I would be happier to provide
>> implementations of the more common formats where a viable alternative is
>> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
>> similar output to the above have a chance to write their own
>> parsers/formatters. This is very similar to the current situation but we
>> just need to remove dependencies on statically located data structures
>> (don't get rid of them completely just give users an option to not use
>> them).
>>
>> I'm not sure how much automatically generated parsers would help us. I
>> guess it depends on the data model(s) we use if they are auto-parser
>> friendly (which normally means POJO/JavaBean conventions including the
>> no-args constructor).
>>
>> Cool I don't want to exclude flat file parsers completely (if only
>> because my group has an interest in BioJava being able to read & write
>> flat files) :)
>>
>> They decided to have HUPO-PSI Format instead :)
>>
>> Andy
>>
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> I think in most cases huge XML files in bioinformatics result from a
>>> single XML containing multiple repetitive elements, e.g. a BLAST XML
>>> output with several hits or a GenBankXML with many Sequences.  A nice
>>> approach I have seen for dealing with these is to use SAX to read over
>>> the file and, every time it comes to one of those repeated elements,
>>> delegate to a DOM object.  You then parse the bits of the DOM you want
>>> with XPath or
>>> convert to objects or something and then when you are finished with
>>> that entry everything gets garbage collected and the SAX parser moves
>>> to the next element and repeats the whole process.  This is a hybrid
>>> of event based parsing and object-model based parsing which could let
>>> you efficiently deal with huge files.
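
A rough sketch of the hybrid approach described in the paragraph above, under some loud assumptions: the class name SaxDomSketch, the RecordHandler callback and the <hits>/<hit>/<id> element names are invented for illustration and are not part of BioJava or any real format. It uses only the SAX, DOM and XPath classes bundled with the JDK. A SAX handler skips everything outside the repeated element, builds a small DOM tree for each one, hands it to a callback, and then drops its reference so the fragment can be garbage collected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDomSketch {

    /** Hypothetical callback: invoked once per record with that record's DOM fragment. */
    public interface RecordHandler {
        void handle(Element record) throws Exception;
    }

    /** SAX handler that builds a throwaway DOM tree for each element named recordName. */
    static class FragmentBuilder extends DefaultHandler {
        private final String recordName;
        private final RecordHandler callback;
        private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        private Document doc;  // non-null only while inside a record
        private Node current;  // node currently receiving children

        FragmentBuilder(String recordName, RecordHandler callback) {
            this.recordName = recordName;
            this.callback = callback;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (doc == null) {
                if (!qName.equals(recordName)) return; // outside any record: ignore
                try {
                    doc = factory.newDocumentBuilder().newDocument();
                } catch (ParserConfigurationException e) {
                    throw new SAXException(e);
                }
                current = doc;
            }
            Element element = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (doc != null) current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (doc == null) return;
            current = current.getParentNode();
            if (current == doc) { // the record element itself has just closed
                try {
                    callback.handle(doc.getDocumentElement());
                } catch (Exception e) {
                    throw new SAXException(e);
                }
                doc = null;     // drop the fragment so it can be collected
                current = null;
            }
        }
    }

    /** Streams the input with SAX and calls back with one DOM fragment per record. */
    public static void parse(InputSource in, String recordName, RecordHandler callback)
            throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new FragmentBuilder(recordName, callback));
    }

    public static void main(String[] args) throws Exception {
        // Invented sample document standing in for a large multi-record file.
        String xml = "<hits><hit><id>P12345</id></hit><hit><id>Q67890</id></hit></hits>";
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final List<String> ids = new ArrayList<String>();
        parse(new InputSource(new StringReader(xml)), "hit", new RecordHandler() {
            public void handle(Element record) throws Exception {
                ids.add(xpath.evaluate("id", record)); // XPath relative to this record only
            }
        });
        System.out.println(ids);
    }
}
```

Memory use is then bounded by the largest single record rather than by the whole file. The sketch ignores namespaces, which a real parser would probably need to handle.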
>>>
>>> I think the BLAST XML has improved substantially, at least in terms of
>>> validating against its own DTD.  The DTD itself may not be the best
>>> design but that is always a matter of taste and if you are using XPath
>>> to get the relevant bits you don't need to make a SAX parser jump
>>> through hoops to get them.
>>>
>>> I agree we will have to keep flat file parsers but we should strongly
>>> encourage the use of XML where possible. It is simply easier to deal
>>> with. Most biological flat-files were designed for Fortran and mainly
>>> for human consumption. There is no obvious validation mechanism.
>>> Notably everything in NCBI is derived from ASN.1, what you see in the
>>> flatfile is produced from there. I tend to think this means that the
>>> ASN.1 is the holy gospel and what you get in the flat file is some
>>> translation.  Ideally NCBI files should be parsed from the ASN.1 where
>>> you can guarantee validation, the more practical alternative is to use
>>> the XML which you can at least validate against a DTD.
>>>
>>> With XML we (Biojava) can say if it validates we will parse it and if
>>> it doesn't we may not.  With flat files there are so many dodgy
>>> variants we cannot say anything.  Because XML DTDs (or XSDs) have
>>> versions it also makes it much easier to have parsers for different
>>> versions and the parsing machinery can figure out which is needed.
>>> With flat files it is anyone's guess what version you are dealing with.
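
To make the "if it validates we will parse it" idea a bit more concrete, here is a small sketch using the JDK's validating DOM parser. The class name DtdValidationSketch, the isValid() helper and the toy inline DTD are assumptions made up for this example; a real BLAST or GenBank file would reference its published DTD instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DtdValidationSketch {

    /**
     * Returns true if the document is valid against the DTD named in its DOCTYPE.
     * Input that is not even well-formed raises a SAXParseException instead.
     */
    public static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // switch on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        final boolean[] valid = { true };
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { valid[0] = false; } // validity error
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return valid[0];
    }

    public static void main(String[] args) throws Exception {
        // Toy DTD, invented for illustration.
        String dtd = "<!DOCTYPE hits [<!ELEMENT hits (hit*)><!ELEMENT hit (#PCDATA)>]>";
        System.out.println(isValid(dtd + "<hits><hit>a</hit></hits>")); // prints true
        System.out.println(isValid(dtd + "<hits><oops/></hits>"));      // prints false
    }
}
```

A parser front end could run a check like this first and refuse, or fall back to a more lenient mode, when it fails.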
>>>
>>> Finally parsers can be auto-generated for XML if you have the DTD or
>>> XSD. This often doesn't give you an ideal parser but it can be a
>>> useful starting point for rapid development.
>>>
>>> For Biojava v 3 I think we should concentrate on XML parsers first and
>>> flat files second. <sigh>if only Fasta had an XML format</sigh>
>>>
>>> - Mark
>>>
>>> On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> I was always under the impression that blast's XML output was nearly as
>>>> hard to parse as the flat file format but I do agree that if we can use
>>>> XML whenever we can it would make writing parsers a lot easier
>>>> (especially if there are SAX based XPath libraries available). Actually
>>>> this brings up a good question about development of this type of parser.
>>>> The majority of XPath supporting libraries are DOM based which will mean
>>>> large memory usage in some situations but overall providing an easier
>>>> coding experience (and hopefully reduce our chances of creating bugs).
>>>> Or should we code to the edge cases of someone trying to parse a 1GB
>>>> XML? Personally I'd favour the former.
>>>>
>>>> Going back to the original topic there are going to be situations where
>>>> people want the flat file parsers/writers & I think it's a valid point
>>>> to say this is where BioJava is meant to come in & help a developer.
>>>> After all, XML is a computer science problem whereas parsing an EMBL flat
>>>> file or blast output is a bioinformatics problem.
>>>>
>>>> Andy
>>>>
>>>>
>>>> Mark Schreiber wrote:
>>>>> For a long time now my feeling has been that we should *only* support
>>>>> the XML version of blast output.  The other formats are too brittle to
>>>>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
>>>>> may be an extreme view but the power of generic XML parsers and things
>>>>> like XPath really makes the XML formats look very attractive.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>> I think Groovy have adopted a similar system recently & have guidelines
>>>>>> for how each module should behave (dependencies, build system etc). This
>>>>>> enforces the idea that a module whilst not part of the core project must
>>>>>> behave in the same manner the core does. I do like the idea that we can
>>>>>> have a core biojava & things get added around it & it might encourage
>>>>>> other users to start developing their own modules for any
>>>>>> formats/purpose they want.
>>>>>>
>>>>>> Richard Holland wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA1
>>>>>>>
>>>>>>>> What format options are there from blast? Just thinking if it supports
>>>>>>>> CIGAR or something like that are we better providing a parser for that
>>>>>>>> format & saying that we do not support the traditional blast output?
>>>>>>>> That said it doesn't help us when that format changes so maybe what is
>>>>>>>> needed is a way to push out parser changes without requiring a full
>>>>>>>> biojava release (v3 discussion) ...
>>>>>>> Exactly! So the modular idea would work nicely here - we could have a
>>>>>>> blast module and only update that single module (which would be its own
>>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
>>>>>>> would no longer happen, except maybe for some kind of core BioJava
>>>>>>> module. Everything would be done in terms of individual module+JAR
>>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
>>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Richard
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>


