[Biojava-l] SAX, DOM, XPath and Flat files

Fri Nov 30 18:30:50 UTC 2007

There's a potential gotcha with XPath parsing.  The XPath implementation
that ships with the Java 5 & 6 JDKs performs a DOM parse on the whole
document, even if you pass it a specific starting node in the document.  I
stumbled across this one the hard way when using the hybrid approach that
you mention.  Another XPath implementation, such as Saxon, may solve this.
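
A minimal sketch of that hybrid approach, under some assumptions: SAX
streams the file, and each repeated element is copied into its own small
DOM Document before XPath is run, so XPath only ever sees one record's
tree.  The class name RecordSplitter, the element name Hit and the path
/Hit/Hit_id are made up for illustration, and the code ignores namespaces,
comments and processing instructions.  Whether this fully avoids the JDK
behaviour described above has not been measured here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: SAX over the file, one small DOM per record. */
class RecordSplitter extends DefaultHandler {
    private final String recordName;   // element that marks one record
    private final String expression;   // XPath evaluated against each record
    private final List<String> results = new ArrayList<String>();
    private final DocumentBuilder builder;
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private Document doc;   // DOM for the current record, null between records
    private Node current;   // insertion point inside doc

    RecordSplitter(String recordName, String expression) throws Exception {
        this.recordName = recordName;
        this.expression = expression;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(recordName)) return;  // still outside any record
            doc = builder.newDocument();            // fresh, standalone document
            current = doc;
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (doc == null) return;
        current = current.getParentNode();
        if (current == doc) {  // the record's root element has just closed
            try {
                results.add(xpath.evaluate(expression, doc));
            } catch (XPathExpressionException ex) {
                throw new SAXException(ex);
            }
            doc = null;        // drop the record so it can be garbage collected
            current = null;
        }
    }

    /** Returns the string value of the expression for every record found. */
    static List<String> extract(String xml, String recordName, String expression)
            throws Exception {
        RecordSplitter handler = new RecordSplitter(recordName, expression);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Hits><Hit><Hit_id>a</Hit_id></Hit>"
                   + "<Hit><Hit_id>b</Hit_id></Hit></Hits>";
        System.out.println(extract(xml, "Hit", "/Hit/Hit_id"));
    }
}
```

Only one record's DOM is alive at a time, so memory use should track the
largest record rather than the whole file.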

One other problem I've noticed is that the NCBI XML doesn't always parse.
I've reported this to them, and they've promised to address it. It usually
occurs when submitters put unescaped characters into text fields such as
author lists in PubMed. NCBI doesn't always use CDATA blocks around text,
and as soon as the parser hits one of these characters it throws an
exception.
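
To illustrate the failure mode with a made-up example (the Author element
and its content are hypothetical, not taken from a real NCBI record): a
bare ampersand in a text field makes the document ill-formed, while the
escaped form and the CDATA form both parse.  The helper name
wellFormednessError is an assumption for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class EscapeCheck {
    /** Returns null if the XML parses, otherwise the parser's complaint. */
    static String wellFormednessError(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps the parser from
        // printing its own message to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Bare ampersand: the parser throws.
        System.out.println(wellFormednessError("<Author>Smith & Jones</Author>"));
        // Escaped ampersand: parses.
        System.out.println(wellFormednessError("<Author>Smith &amp; Jones</Author>"));
        // CDATA section: parses.
        System.out.println(
                wellFormednessError("<Author><![CDATA[Smith & Jones]]></Author>"));
    }
}
```

A check like this could be run on each record before handing it to the
real parser, so that one bad entry is reported rather than aborting the
whole file.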

I've also noticed a tendency (in other code bases) for developers to use
several different parsers, usually whichever parser each is most familiar
with.  The problem is that they often introduce parser-specific code into
the code base, so you end up with dependencies on numerous different
parsers, and a potential configuration problem if you're passing the XML
parser as a run-time configuration parameter.  The external parsers I've
seen used most often are JDOM and DOM4J.  The usual way around this is to
write to an interface, but that requires some additional vigilance.
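
A rough sketch of what "writing to an interface" could look like.  All of
the names here (FieldReader, JaxpFieldReader, FieldReaders) are
hypothetical and not part of BioJava.  The interface exposes only Strings,
so no JDOM, DOM4J or W3C DOM types leak into calling code, and the
implementation is chosen by class name at run time.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical parser-neutral interface: callers see Strings only. */
interface FieldReader {
    String read(String xml, String path) throws Exception;
}

/** One implementation that uses only the JAXP classes in the JDK. */
class JaxpFieldReader implements FieldReader {
    public String read(String xml, String path) throws Exception {
        return XPathFactory.newInstance().newXPath()
                .evaluate(path, new InputSource(new StringReader(xml)));
    }
}

class FieldReaders {
    /** Run-time configuration: pick the implementation by class name. */
    static FieldReader load(String className) throws Exception {
        return (FieldReader) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        FieldReader reader = load(JaxpFieldReader.class.getName());
        String xml = "<Article><Author>Smith</Author></Article>";
        System.out.println(reader.read(xml, "/Article/Author"));
    }
}
```

A JDOM- or DOM4J-backed class could implement the same interface, and only
that one class would carry the extra dependency.  The vigilance is in
keeping parser types out of the interface's method signatures.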

Just a few things to watch out for as we move forward.

Mark (the other one) :-)

On Nov 30, 2007 1:26 AM, Andy Yates <ayates at ebi.ac.uk> wrote:

> I think I've seen XPath hanging around in other people's code in a 1.5
> code-base (in fact one of the guys I work with). I've used Java's DOM
> before & it really isn't very nice & quite verbose. I'd prefer if there
> was a better alternative/wrapper around the XML parsers just to cut down
> on code chatter.
>
> Wow I've just visited http://asn1.elibel.tm.fr/links/ looking for these
> Java tools & I think I've gone cross-eyed with the sheer number of
> acronyms! You've gotta love something which seems to add a letter to ER
> & that's a new acronym (e.g. BER, DER, PER and XER). Does anyone on the
> list know of an ASN.1 parser for Java that's good and should we support
> it (considering NCBI generate their DTD & XML from the ASN.1
> representation).
>
> Andy
>
> Mark Schreiber wrote:
> > Java 5 SDK has both SAX and DOM as standard. I think it has XPath but
> > not XQuery although XPath is probably more important for this use.
> >
> > The DOM model is a direct implementation of the W3C standard which
> > makes it a little awkward from a Java point of view but it is usable.
> >
> > Java 6 has StAX (the other one).
> >
> > There are a few Java APIs for parsing ASN.1, mostly developed for the
> > telco industry. I've never really looked into which is best (anyone
> > experienced with this?) but we could probably use one to work directly
> > off NCBI ASN.1.
> >
> > - Mark
> >
> > On Nov 28, 2007 10:29 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> Hi Mark,
> >>
> >> Okay that sounds like a perfectly sensible way to deal with this. Is
> >> this kind of parsing model supported in Java5? I only ask as I've not
> >> done a lot of XML parsing with Java5; more with things like XOM (which I
> >> think offers a DOM only representation but I'm probably wrong).
> >>
> >> That's good. There's not a huge point to have a format & a DTD/XSD and
> >> then have your files not conform to it.
> >>
> >> I was thinking the exact same thing about ASN.1 (well that & it looks
> >> bleeding horrible to parse but that is an uneducated look at the format
> >> which I'm sure is as parsable as JSON & the like).
> >>
> >> When it comes to flat file parsers I would be happier to provide
> >> implementations of the more common formats where a viable alternative is
> >> not available e.g. UniProt, EMBL, Genbank etc. Then groups which provide
> >> similar output to the above have a chance to write their own
> >> parsers/formatters. This is very similar to the current situation but we
> >> just need to remove dependencies on statically located data structures
> >> (don't get rid of them completely just give users an option to not use
> >> them).
> >>
> >> I'm not sure how much automatically generated parsers would help us. I
> >> guess it depends on the data model(s) we use if they are auto-parser
> >> friendly (which normally means POJO/JavaBean conventions including the
> >> no-args constructor).
> >>
> >> Cool I don't want to exclude flat file parsers completely (if only
> >> because my group has an interest in BioJava being able to read & write
> >> flat files) :)
> >>
> >> They decided to have HUPO-PSI Format instead :)
> >>
> >> Andy
> >>
> >>
> >> Mark Schreiber wrote:
> >>> Hi -
> >>>
> >>> I think in most cases huge XML files in bioinformatics result from a
> >>> single XML containing multiple repetitive elements, e.g. a BLAST XML
> >>> output with several hits or a GenBankXML with many Sequences.  A nice
> >>> approach I have seen for dealing with these is to use SAX to read over
> >>> the file and every time it comes to an element it delegates to a DOM
> >>> object.  You then parse the bits of the DOM you want with XPath or
> >>> convert to objects or something and then when you are finished with
> >>> that entry everything gets garbage collected and the SAX parser moves
> >>> to the next element and repeats the whole process.  This is a hybrid
> >>> of event based parsing and object-model based parsing which could let
> >>> you efficiently deal with huge files.
> >>>
> >>> I think the BLAST XML has improved substantially, at least in terms of
> >>> validating against its own DTD.  The DTD itself may not be the best
> >>> design but that is always a matter of taste and if you are using XPath
> >>> to get the relevant bits you don't need to make a SAX parser jump
> >>> through hoops to get them.
> >>>
> >>> I agree we will have to keep flat file parsers but we should strongly
> >>> encourage the use of XML where possible. It is simply easier to deal
> >>> with. Most biological flat-files were designed for Fortran and mainly
> >>> for human consumption. There is no obvious validation mechanism.
> >>> Notably everything in NCBI is derived from ASN.1, what you see in the
> >>> flatfile is produced from there. I tend to think this means that the
> >>> ASN.1 is the holy gospel and what you get in the flat file is some
> >>> translation.  Ideally NCBI files should be parsed from the ASN.1 where
> >>> you can guarantee validation, the more practical alternative is to use
> >>> the XML which you can at least validate against a DTD.
> >>>
> >>> With XML we (Biojava) can say if it validates we will parse it and if
> >>> it doesn't we may not.  With flat files there are so many dodgy
> >>> variants we cannot say anything.  Because XML DTDs (or XSDs) have
> >>> versions it also makes it much easier to have parsers for different
> >>> versions and the parsing machinery can figure out which is needed.
> >>> With flat files it is anyone's guess what version you are dealing with.
> >>>
> >>> Finally parsers can be auto-generated for XML if you have the DTD or
> >>> XSD. This often doesn't give you an ideal parser but it can be a
> >>> useful starting point for rapid development.
> >>>
> >>> For Biojava v 3 I think we should concentrate on XML parsers first and
> >>> flat files second. <sigh>if only Fasta had an XML format</sigh>
> >>>
> >>> - Mark
> >>>
> >>> On Nov 27, 2007 11:16 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> I was always under the impression that blast's XML output was nearly as
> >>>> hard to parse as the flat file format but I do agree that if we can use
> >>>> XML whenever we can it would make writing parsers a lot easier
> >>>> (especially if there are SAX based XPath libraries available). Actually
> >>>> this brings up a good question about development of this type of parser.
> >>>> The majority of XPath supporting libraries are DOM based which will mean
> >>>> large memory usage in some situations but overall providing an easier
> >>>> coding experience (and hopefully reduce our chances of creating bugs).
> >>>> Or should we code to the edge cases of someone trying to parse a 1GB
> >>>> XML? Personally I'd favour the former.
> >>>>
> >>>> Going back to the original topic there are going to be situations where
> >>>> people want the flat file parsers/writers & I think it's a valid point
> >>>> to say this is where BioJava is meant to come in & help a developer.
> >>>> After all XML is a computer science problem whereas parsing an EMBL flat
> >>>> file or blast output is a bioinformatics problem.
> >>>>
> >>>> Andy
> >>>>
> >>>>
> >>>> Mark Schreiber wrote:
> >>>>> For a long time now my feeling has been that we should *only* support
> >>>>> the XML version of blast output.  The other formats are too brittle to
> >>>>> be easy to parse.  I also feel similarly about Genbank, EMBL, etc. That
> >>>>> may be an extreme view but the power of generic XML parsers and things
> >>>>> like XPath etc really make these formats look very attractive.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>>
> >>>>> On Nov 27, 2007 7:47 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>>>> I think Groovy have adopted a similar system recently & have guidelines
> >>>>>> for how each module should behave (dependencies, build system etc). This
> >>>>>> enforces the idea that a module whilst not part of the core project must
> >>>>>> behave in the same manner the core does. I do like the idea that we can
> >>>>>> have a core biojava & things get added around it & it might encourage
> >>>>>> other users to start developing their own modules for any
> >>>>>> formats/purpose they want.
> >>>>>>
> >>>>>> Richard Holland wrote:
> >>>>>>>
> >>>>>>>> What format options are there from blast? Just thinking if it supports
> >>>>>>>> CIGAR or something like that are we better providing a parser for that
> >>>>>>>> format & saying that we do not support the traditional blast output?
> >>>>>>>> That said it doesn't help us when that format changes so maybe what is
> >>>>>>>> needed is a way to push out parser changes without requiring a full
> >>>>>>>> biojava release (v3 discussion) ...
> >>>>>>> Exactly! So the modular idea would work nicely here - we could have a
> >>>>>>> blast module and only update that single module (which would be its own
> >>>>>>> JAR) whenever the format changes. In a way, BioJava releases as such
> >>>>>>> would no longer happen, except maybe for some kind of core BioJava
> >>>>>>> module. Everything would be done in terms of individual module+JAR
> >>>>>>> releases instead - one for Genbank, one for BioSQL, one for NEXUS, one
> >>>>>>> for Phylogenetic tools, one for translation/transcription, etc. etc.
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Richard
> >>>>>> _______________________________________________
> >>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>