[Biojava-dev] Initial impressions...

Len Trigg len at reeltwo.com
Wed Jul 2 18:58:41 EDT 2003


Hi all,

We've just been evaluating BioJava for some bioinformatics work and
have been going through a few simple examples. I thought I'd share our
first impressions, in case they're a useful datapoint.

It's been hard going initially, as the BioJava APIs are quite large,
and it sometimes feels like a case of "chase the javadocs" from class
to class in order to find out how to get things done. In some cases the
javadocs are pretty sparse and we've had to look through source code
for examples (DNATools is quite instructive). One case where we were
initially confused is that we thought there should be an easy
way to get from a Symbol to its one-character name (something like
aSymbol.getAsChar()). We've now found out that you have to go via
aSymbol.getAlphabet().getTokenization("token").tokenizeSymbol(aSymbol);

I have been impressed with how easy it is to parse FASTA files, and
have used both the method to load all the sequences into a SequenceDB,
and the low-memory method that returns a SequenceIterator (great for
large sequence files).
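
The low-memory approach comes down to yielding one record at a time
instead of holding the whole file in memory. Below is a minimal,
library-free sketch of that idea. The class name FastaStream and the
String[] {name, residues} record shape are hypothetical, chosen only
for illustration; this is not BioJava's SequenceIterator API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical streaming FASTA reader: one record in memory at a time. */
public class FastaStream implements Iterator<String[]> {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, or null at end of input

    public FastaStream(BufferedReader in) {
        this.in = in;
        try {
            // Skip anything before the first '>' header line.
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    /** Returns {name, residues} for the next record. */
    @Override
    public String[] next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String name = pendingHeader;
        pendingHeader = null;
        StringBuilder residues = new StringBuilder();
        try {
            // Accumulate sequence lines until the next header or end of input.
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                    break;
                }
                residues.append(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] { name, residues.toString() };
    }
}
```

Only the record currently being read is buffered, so memory use is
bounded by the longest single sequence rather than by the file size.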

Another thing we tried out was to show the suffix trees for a
sequence. One confusing thing here is that there seem to be a couple
of different independent implementations of suffix trees in
BioJava. The SuffixTree documentation doesn't explain how you are
supposed to navigate the tree (in particular that child nodes are
indexed by symbol, rather than as a list of children, so you have to
get an AlphabetIndex to find out where you are).
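
To illustrate the navigation style described above, here is a minimal,
library-free sketch in which each node keeps an array of children
addressed by symbol index rather than a list of children. Everything
in it (the ALPHABET string, indexOf(), the Node class) is a
hypothetical stand-in and not BioJava's SuffixTree or AlphabetIndex
API. It is also a plain suffix trie, not a compressed suffix tree.

```java
/** Hypothetical suffix trie whose children are addressed by symbol index. */
public class IndexedSuffixTrie {
    // Stand-in for an alphabet index: a symbol's position here is its index.
    static final String ALPHABET = "acgt";

    static final class Node {
        // Slot i holds the child reached via symbol i, or null if absent.
        final Node[] children = new Node[ALPHABET.length()];
        int count; // number of suffixes that pass through this node
    }

    private final Node root = new Node();

    /** Maps a symbol to its child slot; plays the role of an index lookup. */
    public static int indexOf(char symbol) {
        int i = ALPHABET.indexOf(symbol);
        if (i < 0) {
            throw new IllegalArgumentException("Not in alphabet: " + symbol);
        }
        return i;
    }

    /** Inserts every suffix of seq into the trie. */
    public void addSuffixes(String seq) {
        for (int start = 0; start < seq.length(); start++) {
            Node node = root;
            for (int i = start; i < seq.length(); i++) {
                int idx = indexOf(seq.charAt(i));
                if (node.children[idx] == null) {
                    node.children[idx] = new Node();
                }
                node = node.children[idx];
                node.count++;
            }
        }
    }

    /** Walks down by symbol index and returns how often word occurs. */
    public int countOccurrences(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children[indexOf(word.charAt(i))];
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        IndexedSuffixTrie trie = new IndexedSuffixTrie();
        trie.addSuffixes("gattaca");
        System.out.println(trie.countOccurrences("a"));  // 3
        System.out.println(trie.countOccurrences("ta")); // 1
    }
}
```

The point of the sketch is that you cannot simply iterate over a
node's children and read off their labels: to know which symbol a
child represents, you need the same index mapping that was used to
store it.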

The UkkonenSuffixTree has a different API to that of the regular
SuffixTree, and the printTree() method outputs characters that don't
correspond to the regular symbol representations. Maybe the author of
this class wasn't aware of how to get the representations of Symbols
either :-). I have a patch to contribute that addresses this
(attached).

Parsing a BLAST output file was also easy; however, I had to use
"lazy" mode to work with our files (from NCBI BLAST 2.2.1), and I have
not yet figured out how to extract a) the length of the query
sequence, and b) the frame of the hits. Any suggestions here?

That's about it at the moment. Soon I intend to look into GFF file
handling and BioJava/BioSQL integration. Overall I think there is a
tonne of useful functionality in BioJava -- I look forward to working
with the BioJava project and hope to be able to make some useful
contributions.


Cheers,
Len Trigg.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: UkkonenSuffixTree.patch
Type: application/octet-stream
Size: 4355 bytes
Desc: not available
Url : http://pw600a.bioperl.org/pipermail/biojava-dev/attachments/20030702/0bbcadb4/UkkonenSuffixTree.lha

