[Biojava-l] Possible Submission
Simon Brocklehurst
simon.brocklehurst@CambridgeAntibody.com
Mon, 15 Oct 2001 19:01:47 +0100
Robert Hubley wrote:
> I have developed a parsing framework called LSAX that I would
> like to submit to BioJava. It was inspired by the work of Cambridge
> Antibody Technology (Simon Brocklehurst et al.) on the BioJava
> BlastLikeSaxParser. The idea is the same -- create a bridge
> between XML applications and Non-XML data. The difference
> between the CAT parser and LSAX is in the design of the raw
> file parser. I use LEX (actually JFLEX) to tokenize the raw
> data files and generate Start, Data, and End SAX events. I have
> developed two parsers using this framework: an NCBI Blast and
> a Fasta parser. The advantage to using LEX is that you can specify
> the rules of your parser at a high level with regular expressions. The
> actual parser is then auto-generated using JFLEX and is often
> faster than a parser you would write by hand.
>
> Let me know if you would like to include this in BioJava,
Robert,
Sounds good to me - this sounds like a similar idea to Andrew Dalke's
Martel package in biopython. This would be a valuable addition to biojava, I
think. I don't know if your parsing framework does this already, but it
would be really cool if it were SAX2 compliant (as opposed to SAX1).
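To illustrate what SAX2 compliance would mean in practice, here is a minimal sketch, assuming nothing about the actual LSAX API. It reports one FASTA-style record through org.xml.sax.ContentHandler, the SAX2 interface that replaced SAX1's DocumentHandler and passes namespace URI, local name and qualified name to startElement. The class name Sax2EmitSketch, the "Sequence" element and its "id" attribute are hypothetical names chosen for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class Sax2EmitSketch {

    // Emit start, data and end events for one record via the SAX2 ContentHandler.
    // "Sequence" and "id" are hypothetical names, not taken from LSAX or BioJava.
    static void emitRecord(ContentHandler handler, String id, String residues)
            throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        // SAX2 signature: namespace URI, local name, qualified name, type, value
        atts.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement("", "Sequence", "Sequence", atts);
        char[] chars = residues.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "Sequence", "Sequence");
    }

    // Render the events as text so they can be inspected without an XML serializer.
    static String render(String id, String residues) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                out.append("<").append(qName)
                   .append(" id=\"").append(atts.getValue("id")).append("\">");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append(">");
            }
        };
        try {
            collector.startDocument();
            emitRecord(collector, id, residues);
            collector.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("seq1", "ACGT"));
    }
}
```

Because emitRecord only depends on ContentHandler, any SAX2 consumer (a serializer, a filter chain, an application handler) could be plugged in where the collecting handler is used here.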
I also have a question. Can it cope with specifying formats that span multiple
lines? Or is it limited to treating non-XML files as being essentially
record-based, i.e. dealing with single lines at a time? Sometimes, it's
useful (and necessary) to be able to read ahead several lines of a file
before actually parsing, i.e. emitting SAX events.
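As a concrete picture of the read-ahead case, here is a minimal hand-written sketch, not LSAX or JFlex output. It buffers every line of a multi-line FASTA-style record and only emits SAX events once the next header (or end of input) shows the record is complete. The class name ReadAheadSketch, the "Sequence" element and the "id" attribute are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ReadAheadSketch {

    // Read ahead through a whole record before emitting anything.
    static void parse(BufferedReader in, ContentHandler handler)
            throws IOException, SAXException {
        List<String> buffered = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            // Skip any text that precedes the first ">" header line.
            if (buffered.isEmpty() && !line.startsWith(">")) continue;
            // A new header means the buffered record is complete: emit it now.
            if (line.startsWith(">") && !buffered.isEmpty()) {
                emit(buffered, handler);
                buffered.clear();
            }
            if (!line.trim().isEmpty()) buffered.add(line);
        }
        if (!buffered.isEmpty()) emit(buffered, handler); // last record at EOF
    }

    // Turn one buffered record (header line plus sequence lines) into SAX events.
    static void emit(List<String> record, ContentHandler handler) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", record.get(0).substring(1).trim());
        StringBuilder residues = new StringBuilder();
        for (String l : record.subList(1, record.size())) residues.append(l.trim());
        handler.startElement("", "Sequence", "Sequence", atts);
        char[] chars = residues.toString().toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "Sequence", "Sequence");
    }

    // Collect events as "id=residues;" so the result is easy to check.
    static String toText(String input) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                out.append(atts.getValue("id")).append("=");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append(";");
            }
        };
        try {
            parse(new BufferedReader(new StringReader(input)), collector);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText(">a\nAC\nGT\n>b\nTT\n"));
    }
}
```

The point of the design is that no event leaves the parser until the whole record has been seen, so the sequence split over two lines arrives as a single characters call.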
Simon
--
Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com