[Biojava-dev] blast parsing continued

Doug Rusch drusch@tcag.org
Thu, 14 Nov 2002 14:18:43 -0500


I have no problem adding my solution to the repository; however, I don't have the permissions to do that. Also, I should point out that the changes have rather large consequences.

I like the design of the StAX-based parsing, but maintaining that design while adding the defline and length of the query and subject sequences requires changing the DTD. Such changes ripple outwards until most of the parsing code has to be modified. I am still not satisfied with the DTD and need to add components that would be optional (such as the stats collected at the end of the blast report).

I also rewrote the parser, for two reasons. One, having a parser that tries to do it all (parse HMMER and blast) is a maintenance nightmare. Two, the current parser was overly complicated and difficult to modify. In the end it was easier to write a new parser. My working version only parses NCBI Blast (it may work for wu-blast, but I haven't tested it) and uses the regex classes available in Java 1.4. While perhaps not the prettiest code I have ever written, it is much clearer what the parser is doing, and it should be easy to modify should NCBI Blast output change in some significant way.
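
To give a feel for the regex-based style described above, here is a minimal sketch and not the parser itself. It assumes the classic NCBI Blast text layout, where the query appears as a "Query= ..." line followed by a "(N letters)" line. The class name BlastHeaderSketch and both method names are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pulls the query id, defline and length out of
// two header lines using java.util.regex (available since Java 1.4).
class BlastHeaderSketch {
    // Assumed layout: "Query= <id> <definition line>"
    private static final Pattern QUERY =
        Pattern.compile("^Query=\\s*(\\S+)\\s*(.*)$");
    // Assumed layout: "(245 letters)", possibly with thousands separators
    private static final Pattern LETTERS =
        Pattern.compile("^\\s*\\(([\\d,]+) letters\\)\\s*$");

    // Returns {id, defline}, or null if the line is not a query line.
    static String[] parseQueryLine(String line) {
        Matcher m = QUERY.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    // Returns the sequence length, or -1 if the line does not match.
    static int parseLetters(String line) {
        Matcher m = LETTERS.matcher(line);
        if (!m.matches()) {
            return -1;
        }
        return Integer.parseInt(m.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        String[] q = parseQueryLine("Query= seq1 hypothetical protein");
        int len = parseLetters("         (245 letters)");
        System.out.println(q[0] + " | " + q[1] + " | " + len);
    }
}
```

Keeping one small pattern per line type is what makes this style easy to adjust: if the report format changes, only the affected pattern needs to be touched.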

Ideally, I would also like to enhance the documentation and put together some example programs and unit tests before I check anything in. I would be happy to share what I have now, but I would like to do it in a way that does not cause confusion.

Doug


To: biojava-dev@biojava.org
Subject: Re: [Biojava-dev] blast parsing continued
From: Keith James <kdj@sanger.ac.uk>
Date: 13 Nov 2002 21:42:27 +0000

>>>>> "Doug" == Doug Rusch <drusch@tcag.org> writes:

[...]

    Doug> I think the use of sequenceDBs was a better approach than
    Doug> using just queryID, databaseID, and subjectID. Minimally, if
    Doug> you look at blast output, there are 3 valuable attributes of
    Doug> a sequence. The id, the definition line, and the length of
    Doug> the sequence. The problem comes that there is no such thing
    Doug> as a sequence-less Sequence object. I tested and implemented
    Doug> an approach that makes a VirtualSequence object that is
    Doug> built with a SymbolList.EMPTY and has an overridden
    Doug> getLength method. This allows the parser to keep all
    Doug> the valuable information you might have about a sequence you
    Doug> see in a blast output while allowing you to use all the
    Doug> functionality of the sequenceDB classes.

    Doug> What are everyone else's opinions on this?

This sounds like a fine idea. It's certainly better than the dummy
objects which I was using.
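
For readers following along, the idea Doug describes might look roughly like the sketch below. This is an illustration only and not his implementation: the types SymbolsSketch, EmptySymbols and VirtualSequenceSketch are hypothetical stand-ins and are not the BioJava interfaces. The only thing taken from the message is the shape of the approach: an object backed by an empty symbol list that overrides its length to report the value seen in the blast output.

```java
// Hypothetical stand-in for a symbol list; NOT the BioJava interface.
interface SymbolsSketch {
    int getLength();
    String seqString();
}

// An empty symbol list: no residues, length zero.
class EmptySymbols implements SymbolsSketch {
    public int getLength() {
        return 0;
    }

    public String seqString() {
        return "";
    }
}

// A "sequence-less" sequence: carries the id, definition line and
// reported length from the blast output, but holds no residues.
class VirtualSequenceSketch extends EmptySymbols {
    private final String id;
    private final String defline;
    private final int reportedLength;

    VirtualSequenceSketch(String id, String defline, int reportedLength) {
        this.id = id;
        this.defline = defline;
        this.reportedLength = reportedLength;
    }

    String getId() {
        return id;
    }

    String getDefline() {
        return defline;
    }

    // Overridden so callers see the length from the report rather
    // than the length of the (empty) underlying symbol list.
    @Override
    public int getLength() {
        return reportedLength;
    }

    public static void main(String[] args) {
        VirtualSequenceSketch s =
            new VirtualSequenceSketch("seq1", "hypothetical protein", 245);
        System.out.println(s.getId() + " length=" + s.getLength());
    }
}
```

One consequence worth noting in this sketch is that the reported length and the actual symbol content disagree by design, so any code that walks the symbols up to getLength() would need to be aware of that.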

Almost all the feedback I've had previously was from people finding
that the original API was hard to use. If we can roll your solution
into the dist then we can have the best of both worlds (by the magic
of CVS).

Keith

--

- Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -