[Biojava-l] seqString/subStr bottleneck in SymbolList
Keith James
kdj@sanger.ac.uk
02 Jul 2001 11:28:54 +0100
>>>>> "Thomas" == Thomas Down <td2@sanger.ac.uk> writes:
[...]
Thomas> This worries me... The subStr method ought to run in
Thomas> linear time w.r.t the size of the window you're
Thomas> requesting, and constant w.r.t. the overall length of the
Thomas> sequence.
Thomas> My first thought is to wonder if there might be a
Thomas> SymbolList.seqString().substring(x, y) somewhere where you
Thomas> really want a SymbolList.subStr(x, y)...
Nope, afraid not. The Artemis sequence object simply wraps a
SymbolList and the subsequence method is
public String getSubSequence(int index1, int index2)
{
String subSeq = "";
try
{
// symbols is a BioJava SymbolList (a SimpleSequence)
subSeq = symbols.subStr(index1, index2);
// Using JDK method
// subSeq = sequence.substring(index1 - 1, index2);
}
catch (IndexOutOfBoundsException ioe)
{
ioe.printStackTrace();
}
// print static counter
callCount++;
System.err.println("count " + callCount);
return subSeq;
}
>> The cause turned out to be subStr() in AbstractSymbolList which
>> makes 4 method calls for each base in the subsequence
>> (symbolAt(), length(), getToken(), append()) when creating a
>> readable (i.e. string) representation of the sequence. Is this
>> something that is worth looking at in the BioJava core?
Thomas> When I've looked at this in the past, this process hasn't
Thomas> been too slow (actually, I did once try optimizing
Thomas> stringification for one particular special case, and ended
Thomas> up slowing things down!). Possibly if your java VM isn't
Thomas> handling method inlining well (you are using the Fast VM
Thomas> on Compaqs, aren't you?) it'll slow down.
Yes, fast VM on the Compaqs, but I get the same symptom with the Sun
and Blackdown ports on Linux. I added printing a static counter -
calling the BioJava subStr() I can count 3-4 seconds between
increments, while using the java.lang.String.substring they whizz past
too fast to see.
It's not really a big deal though, as I can just cache the String.
--
-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA