[Biojava-dev] Initial impressions...

Matthew Pocock matthew_pocock at yahoo.co.uk
Thu Jul 3 14:03:18 EDT 2003


Hi,

Len Trigg wrote:
> Hi all,
> 
> We've just been evaluating BioJava for some bioinformatics work and
> have been going through a few simple examples. I thought I'd share our
> first impressions, in case they're a useful datapoint.

Great. Always nice to hear war-stories.

> 
> It's been hard going initially, as the BioJava apis are quite huge,
> and it sometimes feels like a case of "chase the javadocs" from class
> to class in order to find out how to get things done. In some cases the
> javadocs are pretty sparse and we've had to look through source code
> for examples (DNATools is quite instructive). One case where we were

Yes. This is a general criticism of BioJava. I think we need to put up 
flashing lights saying that you should only need to read the interface 
docs and the *Tools or *Utils classes to do a lot of things.

> initially confused, is that we thought that there should be an easy
> way to get from a Symbol to its one-character name (something like
> aSymbol.getAsChar()). We've now found out that you have to go via
> aSymbol.getAlphabet().getTokenization("token").tokenizeSymbol(aSymbol);

We need to make this process much easier. Unfortunately, getAsChar() 
doesn't really work for us because we can have symbols for things that 
don't have a single-char representation, such as codons. However, you 
shouldn't have to end up going through 20 function calls either.
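
To make the codon point concrete, here is a minimal sketch in plain Java. 
It does not use BioJava at all: Sym, TokenTable and TokenSketch are 
hypothetical stand-ins invented for this illustration. The only idea it 
shows is that a tokenizer returning a String can serve both one-letter 
DNA symbols and three-letter codons, which a char-returning method 
could not.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a symbol (NOT a BioJava class): just a named object.
final class Sym {
    final String name;
    Sym(String name) { this.name = name; }
}

// Hypothetical token table: maps each symbol to a string token of any length.
final class TokenTable {
    private final Map<Sym, String> tokens = new LinkedHashMap<>();

    TokenTable put(Sym s, String token) {
        tokens.put(s, token);
        return this;
    }

    // Returns a String rather than a char, so multi-letter tokens fit too.
    String tokenize(Sym s) {
        String t = tokens.get(s);
        if (t == null) {
            throw new IllegalArgumentException("No token for symbol: " + s.name);
        }
        return t;
    }

    // Joins the tokens of several symbols, placing the separator between them.
    String tokenizeAll(List<Sym> syms, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < syms.size(); i++) {
            if (i > 0) sb.append(separator);
            sb.append(tokenize(syms.get(i)));
        }
        return sb.toString();
    }
}

final class TokenSketch {
    // Single-letter tokens: one character per symbol happens to be enough here.
    static String demoDna() {
        Sym a = new Sym("adenine");
        Sym c = new Sym("cytosine");
        TokenTable dna = new TokenTable().put(a, "a").put(c, "c");
        return dna.tokenizeAll(Arrays.asList(a, c, a), "");
    }

    // Codon tokens: each symbol needs three letters, so a char cannot hold it.
    static String demoCodons() {
        Sym atg = new Sym("(adenine thymine guanine)");
        Sym taa = new Sym("(thymine adenine adenine)");
        TokenTable codons = new TokenTable().put(atg, "atg").put(taa, "taa");
        return codons.tokenizeAll(Arrays.asList(atg, taa), " ");
    }

    public static void main(String[] args) {
        System.out.println(demoDna());
        System.out.println(demoCodons());
    }
}
```

A convenience method on top of something like this would hide the lookup 
chain from the caller while still coping with multi-letter tokens.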

Is there a "BioJava in Anger" example of getting letters from symbols?

> 
> I have been impressed with how easy it is to parse FASTA files, and
> have used both the method to load all the sequences into a SequenceDB,
> and the low-memory method that returns a SequenceIterator (great for
> large sequence files).

Thanks. This sort of thing works reasonably efficiently for the richer 
formats as well, such as EMBL.
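
For anyone new to the two loading styles Len mentions, here is a rough 
sketch of the difference in plain Java. It is not BioJava code and does 
not stand in for its SequenceIterator or SequenceDB: FastaRecord, 
FastaStream and FastaSketch are hypothetical names made up for this 
example. The streaming variant keeps only one record in memory at a 
time; the load-all variant simply drains the stream into a list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical record type (NOT a BioJava class).
final class FastaRecord {
    final String header;
    final String residues;
    FastaRecord(String header, String residues) {
        this.header = header;
        this.residues = residues;
    }
}

// Streams records one at a time; only the current record is held in memory.
final class FastaStream implements Iterator<FastaRecord> {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, without '>'

    FastaStream(BufferedReader in) {
        this.in = in;
        this.pendingHeader = readUntilHeader(null);
    }

    // Reads lines until the next '>' header or end of input. Lines before
    // the header are appended to seq when seq is non-null, else skipped.
    private String readUntilHeader(StringBuilder seq) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) return line.substring(1).trim();
                if (seq != null) seq.append(line.trim());
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public FastaRecord next() {
        if (pendingHeader == null) throw new NoSuchElementException();
        String header = pendingHeader;
        StringBuilder seq = new StringBuilder();
        pendingHeader = readUntilHeader(seq);
        return new FastaRecord(header, seq.toString());
    }
}

final class FastaSketch {
    // Low-memory style: the caller pulls records as needed.
    static FastaStream open(String text) {
        return new FastaStream(new BufferedReader(new StringReader(text)));
    }

    // Load-everything style: convenient, but memory grows with the file.
    static List<FastaRecord> loadAll(String text) {
        List<FastaRecord> all = new ArrayList<>();
        FastaStream it = open(text);
        while (it.hasNext()) all.add(it.next());
        return all;
    }
}
```

In real use the reader would wrap a file rather than a string; the 
string input here just keeps the sketch self-contained.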

> 
> Another thing we tried out was to show the suffix trees for a
> sequence. One confusing thing here is that there seem to be a couple
> of different independent implementations of suffix trees in
> BioJava. The SuffixTree documentation doesn't explain how you are
> supposed to navigate the tree (in particular that child nodes are
> indexed by symbol, rather than as a list of children, so you have to
> get an AlphabetIndex to find out where you are).

I'll take a look at the docs. To be honest, this is very old code and 
hasn't recently been bashed very hard by the core team.
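
Until the docs improve, the navigation scheme Len describes (children 
stored per symbol slot rather than as a list) can be pictured with the 
following sketch. This is plain Java, not the BioJava SuffixTree: 
CharIndex, TrieNode and SuffixTrieSketch are hypothetical names, chars 
stand in for symbols, and the structure is a simple depth-limited trie 
rather than a compressed suffix tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical alphabet index: maps each symbol (a char here) to an array slot.
final class CharIndex {
    private final String symbols;
    CharIndex(String symbols) { this.symbols = symbols; }

    int size() { return symbols.length(); }

    int indexOf(char c) {
        int i = symbols.indexOf(c);
        if (i < 0) throw new IllegalArgumentException("Symbol not in alphabet: " + c);
        return i;
    }

    char symbolAt(int slot) { return symbols.charAt(slot); }
}

final class TrieNode {
    final TrieNode[] children; // slot i belongs to symbol i; null means no child
    int count;                 // number of inserted windows passing through here
    TrieNode(int alphabetSize) { children = new TrieNode[alphabetSize]; }
}

final class SuffixTrieSketch {
    final CharIndex index;
    final TrieNode root;

    SuffixTrieSketch(CharIndex index) {
        this.index = index;
        this.root = new TrieNode(index.size());
    }

    // Inserts every suffix of seq, truncated to at most maxDepth symbols.
    void addSequence(String seq, int maxDepth) {
        for (int start = 0; start < seq.length(); start++) {
            TrieNode node = root;
            int end = Math.min(seq.length(), start + maxDepth);
            for (int i = start; i < end; i++) {
                int slot = index.indexOf(seq.charAt(i));
                if (node.children[slot] == null) {
                    node.children[slot] = new TrieNode(index.size());
                }
                node = node.children[slot];
                node.count++;
            }
        }
    }

    // Walks down by symbol slot; returns 0 if the word was never inserted.
    int countOf(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children[index.indexOf(word.charAt(i))];
            if (node == null) return 0;
        }
        return node.count;
    }

    // To list a node's children you must visit every slot and ask the
    // index which symbol that slot stands for.
    List<Character> childSymbols(TrieNode node) {
        List<Character> present = new ArrayList<>();
        for (int slot = 0; slot < node.children.length; slot++) {
            if (node.children[slot] != null) present.add(index.symbolAt(slot));
        }
        return present;
    }
}
```

The childSymbols method is the part that tends to surprise people: 
without the index you have array positions but no way to tell which 
symbol each one represents.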

> 
> The UkkonenSuffixTree has a different API to that of the regular
> SuffixTree, and the printTree() method outputs characters that don't
> correspond to the regular symbol representations. Maybe the author of
> this class wasn't aware of how to get the representations of Symbols
> either :-). I have a patch to contribute that addresses this
> (attached).

Francois, would you mind looking at this patch?

> 
> Parsing a BLAST output file was also easy, however, I had to use
> "lazy" mode to work with our files (from NCBI BLAST 2.2.1), and I have
> not yet figured out how to extract a) the length of the query
> sequence, and b) the frame of the hits. Any suggestions here?

Is that information in the annotation attached to the 
SeqSimilaritySearchSubHit or the SeqSimilaritySearchResult?

> 
> That's about it at the moment. Soon I intend to look into GFF file
> handling and BioJava/BioSQL integration. Overall I think there is a
> tonne of useful functionality in BioJava -- I look forward to working
> with the BioJava project and hope to be able to make some useful
> contributions.

Good luck with BioSQL and GFF. These are parts of the library that I use 
daily. Oh, and for the GFF, start off by using GFFTools.

> 
> 
> Cheers,
> Len Trigg.


-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk


