[Biojava-dev] blast parsing slowness

Mon, 2 Dec 2002 15:47:59 -0500

This is a good topic for consideration with BioJava2.

The circumstances are these: I have my BLAST parser working in my personal experimental biojava package. The BLAST data I am parsing was generated by blasting 1 Mb human genomic chunks against small sequences (basically ESTs), so one query and many different subjects. I compared the Java code against a home-brewed Perl BLAST parser. The biojava parser was much slower (at least an order of magnitude) than the Perl code. This isn't quite a fair test, because the designs of the two parsers are completely different, but if anything I would still expect Java to be faster than Perl.

I profiled the code and found that the vast majority of the processing time was being spent in org.biojava.utils.ChangeSupport.growIfNecessary. Every time the parser creates an alignment (org.biojava.bio.program.ssbind.BlastLikeSearchBuilder.makeSubHit), it adds a changeListener (via org.biojava.bio.symbol.SimpleSymbolList.addListener) to the generic alphabet it is using for alignments. So it ends up adding many thousands of change listeners to the alphabet and, to add insult to injury, the listeners are all ALWAYS_VETO. This poor alphabet has thousands of listeners telling it not to change.
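
To make the pattern concrete, here is a minimal sketch of what I think is happening. The class names (SharedAlphabet, SubHit, Listener) are hypothetical stand-ins, not the real BioJava classes, and the list-based storage is an assumption; the only point is that the listener count on the one shared object grows with the number of alignments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a change listener; not the BioJava interface.
interface Listener {
    boolean allows(String change);
}

// Hypothetical stand-in for the single alphabet shared by all alignments.
class SharedAlphabet {
    // One shared listener instance that rejects every change.
    static final Listener ALWAYS_VETO = change -> false;

    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener l) {
        listeners.add(l);
    }

    int listenerCount() {
        return listeners.size();
    }
}

// Hypothetical stand-in for one parsed alignment.
class SubHit {
    SubHit(SharedAlphabet alphabet) {
        // Every alignment registers another veto listener on the same alphabet.
        alphabet.addListener(SharedAlphabet.ALWAYS_VETO);
    }
}

class ListenerGrowthDemo {
    // Builds `hits` alignments and reports how many listeners the alphabet holds.
    static int listenersAfter(int hits) {
        SharedAlphabet alphabet = new SharedAlphabet();
        for (int i = 0; i < hits; i++) {
            new SubHit(alphabet);
        }
        return alphabet.listenerCount();
    }

    public static void main(String[] args) {
        System.out.println(listenersAfter(10000));
    }
}
```

In this model the alphabet stores one identical entry per alignment, so the bookkeeping cost scales with the size of the BLAST report even though no listener ever says anything new.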

Is this really what was intended? I get the impression that the ALWAYS_VETO changeListener is a special case. Perhaps ALWAYS_VETO listeners should just be tracked by a counter? Should alphabets be changeable at all? I do not know what use cases prompted this design, but is there any consensus on a fix?
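
The counter idea could look roughly like the sketch below. This is not the existing ChangeSupport code; CountingChangeSupport and ChangeVetoer are names I made up for illustration, and it assumes the veto listener is a single shared instance that can be recognised by identity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener type; not the BioJava ChangeListener interface.
interface ChangeVetoer {
    boolean allows(String change);
}

// Hypothetical replacement for the listener bookkeeping.
class CountingChangeSupport {
    // Assumed to be one shared instance, so it can be compared with ==.
    static final ChangeVetoer ALWAYS_VETO = change -> false;

    private int vetoCount = 0;                       // veto registrations
    private final List<ChangeVetoer> others = new ArrayList<>();

    void addListener(ChangeVetoer l) {
        if (l == ALWAYS_VETO) {
            vetoCount++;                             // count it, do not store it
        } else {
            others.add(l);
        }
    }

    void removeListener(ChangeVetoer l) {
        if (l == ALWAYS_VETO) {
            if (vetoCount > 0) {
                vetoCount--;
            }
        } else {
            others.remove(l);
        }
    }

    // A change is allowed only if no veto is registered and no other listener objects.
    boolean changeAllowed(String change) {
        if (vetoCount > 0) {
            return false;
        }
        for (ChangeVetoer l : others) {
            if (!l.allows(change)) {
                return false;
            }
        }
        return true;
    }

    // Number of listener objects actually held in the list.
    int storedListeners() {
        return others.size();
    }
}
```

With this arrangement, adding thousands of ALWAYS_VETO listeners costs one integer increment each and nothing has to grow, while ordinary listeners behave as before.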

Doug Rusch
drusch@tcag.org