[Biojava-l] Evolutionary distances

Wed Oct 24 12:28:21 UTC 2007

Yes a very good point & one I was going to make before hand but forgot :)

Also not to mention that micro-benchmarks/profiling in Java are 
notorious for giving false results due to VM warmup & JIT compilation 
optimisations. There is a framework hosted on Java.net somewhere which 
can perform VM warmups and code iterations to produce more accurate 
benchmarking results; but the name escapes me at the moment.

However looking at this particular code I get the feeling that this is 
about as fast as its going to get without someone doing bitwise XOR 
operations or some C code ... that's not an open invitation for people 
to start recoding this in C :). At the end of the day the key to 
optimisation is to ask the question "is it fast enough already?". If it 
is then there's no point :)

Andy

Mark Schreiber wrote:
> Hi -
> 
>>From experience the best way to optimize java code is to run a
> profiler. The one in Netbeans is quite good.
> 
> The reason is that the hotspot or JIT compilers might natively compile
> the part of the code that you think is slow and actually make it
> faster than something else which becomes the bottle neck. Using a good
> profiler you can detect how much time is spent in each method and pin
> point some candidate methods for optimization. You can also see if
> there is a burden due to creation of lots of objects.
> 
> - Mark
> 
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Our code is very similar but not identical. The original programmer
>> shortcutted a lot of else if conditions by considering if the two bases
>> were equal or not. It can then calculate the transitional changes &
>> assume the rest are transversional.
>>
>> In terms of speed of both pieces of code I can't see an obvious way to
>> speed it up. Probably in our code removing the 10 or so calls to
>> String.charAt() with a two calls & referencing those chars might help
>> but in all honesty I cannot say.
>>
>> Andy
>>
>> Richard Holland wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Thanks.
>>>
>>> Your code is similar to the code we have in
>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>> see if it is identical, but it probably is.
>>>
>>> You can call our code like this:
>>>
>>>  // import statement for biojava phylo stuff
>>>  import org.biojavax.bio.phylo.*;
>>>
>>>  // ...rest of code goes here
>>>
>>>  // call Kimura2P
>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>  String seq2 = ...;
>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>
>>> Note that our implementation expects sequence strings to be in upper
>>> case, so you'll need to make sure your data is upper case or has been
>>> converted to upper case before calling our method.
>>>
>>> cheers,
>>> Richard
>>>
>>> vineith kaul wrote:
>>>> This is what I have .....Thanks a lot  fr the help.
>>>>
>>>>
>>>> //Method to calculate the Kimura 2 parameter distance
>>>> public static double K2P(String sequence1,String sequence2){
>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>
>>>>
>>>>         char[] seq1array=sequence1.toCharArray();
>>>>         char[] seq2array=sequence2.toCharArray();
>>>>
>>>>         for(int i=0;i<seq1array.length;i++){
>>>>                                 // Number of aligned sites
>>>>                 if(((seq1array[i]=='a') ||
>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>
>>>>                         numberOfAlignedSites++;
>>>>                 }
>>>>
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                                 q++;
>>>>                         }
>>>>                 else
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                                 q++;
>>>>                         }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                                         q++;
>>>>                                 }
>>>>
>>>>
>>>>
>>>>
>>>>         }
>>>>
>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>> (((double)q)/numberOfAlignedSites);
>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>          return dist;
>>>> }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>
>>>>     You should take a look at the latest 1.5 release, in the
>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>     Kimura2P
>>>>     is already implemented here, in
>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>
>>>>     If you can't find code that will do what you want, but have written some
>>>>     before, then please do feel free to contribute it. Even if it is
>>>>     slow, I'm
>>>>     sure someone out there will be able to help optimise it!
>>>>
>>>>     cheers,
>>>>     Richard
>>>>
>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>     > Hi,
>>>>     >
>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>     > slow and I am not even sure if thats right.
>>>>     > --
>>>>     > Vineith Kaul
>>>>     > Masters Student Bioinformatics
>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>     > Georgia Tech, Atlanta
>>>>     > _______________________________________________
>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>     >
>>>>
>>>>
>>>>     --
>>>>     Richard Holland
>>>>     BioMart ( http://www.biomart.org/)
>>>>     EMBL-EBI
>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Vineith Kaul
>>>> Masters Student Bioinformatics
>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>> Georgia Tech, Atlanta
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>>
>>> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
>>> 4iKvsyBj2uznhhjTF9EYDFE=
>>> =LALE
>>> -----END PGP SIGNATURE-----
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>