[Biojava-dev] Newbie design question

Schreiber, Mark mark.schreiber at agresearch.co.nz
Fri Jul 18 11:48:35 EDT 2003


Hi Chris -
 
> I recently ported sputnik (http://abajian.net/sputnik) to 
> Java, mostly because I got tired of people emailing me with 
> problems compiling and running it on different platforms.  
> Sputnik is a utility to search DNA sequence for 
> microsatellite repeats (which are basically repeated patterns 
> of 2-5 nucleotides with the occasional error, e.g.
> "CACACACACCACACACAACCACACACA")
> 

First up, thanks for making Sputnik, it's great. I have used it a lot
and also did a version in Java for very much the same reason.

> I got two surprises:
> 
> - On a large test file, Sun 1.4 outperformed the C (gcc) 
> version by almost exactly 2x (cpu time).  This is with a 
> direct port of the algorithm, no attempt at optimization, 
> same unloaded CPU/OS.  Java rocks.
>

Probably because Sputnik is very loop intensive, so the HotSpot compiler
ends up compiling much of the program into native code. It's still a
pretty surprising result though!
 
> - The biojava Fasta file parsing was really easy to use and 

Glad you found it easy to use. 

> SymbolList was measurably faster than my own trivial 
> StringBuffer implementation.
> 

Probably because Java is slow at String manipulation (even with
StringBuffer), most likely due to all the copying and object creation
that goes on with Strings. BioJava uses flyweight objects as Symbols, so
there is not much object creation going on with SymbolLists.
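
As a rough illustration of the flyweight idea, here is a minimal sketch
in plain Java. The class names (Nucleotide, FlyweightDemo) are made up
for this example and are not BioJava's Symbol or SymbolList classes; the
point is only that every position in the list refers to one of four
shared objects, so parsing allocates no new symbol objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a symbol type; NOT BioJava's Symbol interface. */
final class Nucleotide {
    // One shared instance per base, created once when the class loads.
    private static final Map<Character, Nucleotide> POOL = new HashMap<>();
    static {
        for (char c : "acgt".toCharArray()) {
            POOL.put(c, new Nucleotide(c));
        }
    }

    private final char token;

    private Nucleotide(char token) {
        this.token = token;
    }

    /** Returns the single shared instance for the given base (case-insensitive). */
    static Nucleotide of(char c) {
        Nucleotide n = POOL.get(Character.toLowerCase(c));
        if (n == null) {
            throw new IllegalArgumentException("Not a DNA base: " + c);
        }
        return n;
    }

    char token() {
        return token;
    }
}

final class FlyweightDemo {
    /** Builds a list of references to the shared instances. */
    static List<Nucleotide> parse(String dna) {
        List<Nucleotide> symbols = new ArrayList<>(dna.length());
        for (int i = 0; i < dna.length(); i++) {
            symbols.add(Nucleotide.of(dna.charAt(i)));
        }
        return symbols;
    }

    public static void main(String[] args) {
        List<Nucleotide> s = parse("CACACACACCACACACAACCACACACA");
        // Positions 0 and 2 are both 'C', so they are the very same object.
        System.out.println(s.get(0) == s.get(2)); // prints "true"
    }
}
```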

> What's the best way to return the results of the search?  
> Adding Features?  The original utility produced an ugly & 
> difficult to parse report to stdout.  I can imagine at least 
> three output "formats" of the results, including other 
> biojava apps (i.e. a java class), a display utility and 
> possibly an XML format file to be imported into other systems.
> 

One possibility would be to add features to the Sequences you are
testing. If the features followed one of the EMBL/GenBank feature
labels that would be ideal; otherwise just put in your own values for
feature type etc. Other useful output formats are GFF and XFF. Both are
supported by BioJava and can be generated easily from Sequences with
Features using GFFTools or XFFTools.
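
To give a feel for what a GFF record for a repeat hit might look like,
here is a small sketch that formats one tab-separated GFF2-style line by
hand. It deliberately does not use BioJava's GFFTools; the RepeatHit
class, the "sputnik" source label, and the choice of "repeat_region" as
the feature type with an "rpt_unit" attribute are all assumptions made
for illustration.

```java
/** Hypothetical record of one repeat hit; coordinates are 1-based, inclusive. */
final class RepeatHit {
    final String seqName;
    final int start;
    final int end;
    final String unit;
    final int score;

    RepeatHit(String seqName, int start, int end, String unit, int score) {
        this.seqName = seqName;
        this.start = start;
        this.end = end;
        this.unit = unit;
        this.score = score;
    }

    /** Formats the hit as one tab-separated, GFF2-style line. */
    String toGffLine() {
        return String.join("\t",
                seqName,                       // seqname
                "sputnik",                     // source (assumed label)
                "repeat_region",               // feature type (assumed choice)
                Integer.toString(start),       // start
                Integer.toString(end),         // end
                Integer.toString(score),       // score
                "+",                           // strand (assumed forward)
                ".",                           // frame: not applicable
                "rpt_unit \"" + unit + "\"");  // attributes (assumed tag)
    }

    public static void main(String[] args) {
        // Made-up hit, purely to show the line layout.
        System.out.println(new RepeatHit("seq1", 10, 36, "ca", 17).toGffLine());
    }
}
```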

> Where does it belong in the class hierarchy (if it does)?  I 
> realize that these are questions I could answer myself, but 
> the docs are a bit daunting and even then it takes a while to 
> learn the local "style" of a package.  What would be a good 
> interface for the SatelliteFinder class? 
> I'm hoping to get some design advice at this stage to make it 
> more intuitive and consistent with the rest of biojava.
> 

My advice would be to put the object(s) that do the prediction/annotation
into something like org.biojava.bio.program.sputnik and put a demo class
with a main method into the demos directory. As for an interface, there
is a useful one called org.biojava.bio.seq.SequenceAnnotator that is not
used as widely as it should be. It's pretty easy to implement.
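
The general shape of an annotator, one method that takes a sequence and
hands it back with features attached, can be sketched without BioJava on
the classpath. Everything below is a hypothetical stand-in: AnnotatedSeq
and Annotator are not BioJava types, and the exact signature of
SequenceAnnotator should be checked in the BioJava Javadoc. The search
itself is a naive scan for perfect tandem repeats, not Sputnik's
error-tolerant algorithm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a sequence that can carry features. */
final class AnnotatedSeq {
    final String name;
    final String residues;
    final List<String> features = new ArrayList<>();

    AnnotatedSeq(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }
}

/** Hypothetical single-method annotator interface (not a BioJava type). */
interface Annotator {
    AnnotatedSeq annotate(AnnotatedSeq seq);
}

/** Naive finder for perfect tandem repeats; NOT Sputnik's algorithm. */
final class NaiveSatelliteFinder implements Annotator {
    private final int unitLength;
    private final int minCopies;

    NaiveSatelliteFinder(int unitLength, int minCopies) {
        if (unitLength < 1 || minCopies < 2) {
            throw new IllegalArgumentException("unitLength >= 1 and minCopies >= 2 required");
        }
        this.unitLength = unitLength;
        this.minCopies = minCopies;
    }

    @Override
    public AnnotatedSeq annotate(AnnotatedSeq seq) {
        String s = seq.residues;
        int i = 0;
        while (i + unitLength <= s.length()) {
            int j = i + unitLength;
            // Extend the run while each base matches the base one unit earlier.
            while (j < s.length() && s.charAt(j) == s.charAt(j - unitLength)) {
                j++;
            }
            int copies = (j - i) / unitLength;
            if (copies >= minCopies) {
                // Record 1-based inclusive coordinates; the run may end in a partial unit.
                seq.features.add("repeat_region " + (i + 1) + ".." + j);
                i = j;
            } else {
                i++;
            }
        }
        return seq;
    }

    public static void main(String[] args) {
        AnnotatedSeq seq = new AnnotatedSeq("demo", "GGCACACACATT");
        new NaiveSatelliteFinder(2, 3).annotate(seq);
        System.out.println(seq.features); // prints "[repeat_region 3..10]"
    }
}
```

Keeping the finder behind a single annotate-style method means callers
do not need to know how hits are detected, and the same features could
later be written out in whatever report format is wanted.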

- Mark

Mark Schreiber PhD
AgResearch Joint Bioinformatics Institute
School of Biological Sciences
University of Auckland
3a Symonds St
Private Bag 92-019 Auckland New Zealand
 
PH:   +64 9 3737599 ext 84290
FAX:  +64 9 3737414
