From matthew.pocock at ncl.ac.uk  Tue Oct  2 18:14:01 2007
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue, 2 Oct 2007 23:14:01 +0100
Subject: [Biojava-l] Biojava Question.
In-Reply-To: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
References: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
Message-ID: <200710022314.02018.matthew.pocock@ncl.ac.uk>

This is very strange. This sort of error nearly always happens because of a
misconfigured classpath. Could you send me:

The html of the page that causes the problem

The URL of the jars the page should be referencing

A URL that I can point my browser at that causes the problem

It is difficult to debug something like this without the program actually
in front of me.
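
One way to narrow down a classpath problem like this is to check, with the
same classpath or archive list the applet uses, whether the class named in
the stack trace can be loaded at all. The following is only a sketch: the
class name ClasspathProbe and the helper isLoadable are made up for this
illustration, and it assumes nothing about how the applet is deployed.

```java
class ClasspathProbe {
    /** Returns true if the named class can be located by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initialisers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a subclass of LinkageError
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the reported stack trace
        String name = args.length > 0
                ? args[0]
                : "org.biojava.bio.gui.sequence.SequenceRenderer";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If this reports false when run against the jars the page references, the jar
containing the class is probably not being found.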

Matthew

On Tuesday 02 October 2007, abhi232 at cc.gatech.edu wrote:
> Respected Sir,
>
> I am sorry if I sent you a direct mail but this is a kind of emergency and
> I am not getting any substantial response from the biojava mailing
> community.
> I am a graduate student at Georgia Institute of Technology. We are working on
> creating a Teaceviewer applet for viewing the Sequence using the biojava
> library.
> I am able to create the applet using netbeans and run it there.
> The error comes when I upload it on net. I am getting this particular
> error.
>
> java.lang.NoClassDefFoundError:
> org/biojava/bio/gui/sequence/SequenceRenderer at
> java.lang.Class.getDeclaredConstructors0(Native Method)
> 	at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
> 	at java.lang.Class.getConstructor0(Unknown Source)
> 	at java.lang.Class.newInstance0(Unknown Source)
> 	at java.lang.Class.newInstance(Unknown Source)
> 	at sun.applet.AppletPanel.createApplet(Unknown Source)
> 	at sun.plugin.AppletViewer.createApplet(Unknown Source)
> 	at sun.applet.AppletPanel.runLoader(Unknown Source)
> 	at sun.applet.AppletPanel.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
>
> I am getting an error only for the SequenceRenderer class. Even if I comment
> that out, it still gives me an error.
>
> I have set the classpath as well as the path variables and also I am
> giving the archive field in the applet code so as the biojava library will
> be available.
>
> Is there any particular thing required which I probably am missing?
> Please guide me on this topic.
> I would really appreciate your gesture.
> Thanks a lot in advance.


From elmh06 at yahoo.ca  Wed Oct  3 14:27:36 2007
From: elmh06 at yahoo.ca (El Mabrouk M)
Date: Wed, 3 Oct 2007 14:27:36 -0400 (EDT)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
Message-ID: <975012.12435.qm@web37310.mail.mud.yahoo.com>

Hi!  
 
I have just started to learn biojava. I have written a small
program that writes a sequence to a fasta file with the help of the biojavax method
 
RichSequence.IOTools.writeFasta(seqOut, s1, ns);  
I have got the error "cannot find symbol".
I'm using biojava 1.5, jdk 1.6 and netbeans.
What can be done to fix this problem?

This is what I tried:

import org.biojava.bio.seq.*;
import java.io.*;
import org.biojava.bio.symbol.SymbolList;
import org.biojavax.RichObjectFactory;
import javax.xml.stream.events.Namespace;
import org.biojavax.bio.seq.RichSequence;

public class SeqFastaF {
    public static void main(String[] args) {
        SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1");
        Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0");
        try {
            OutputStream seqOut = System.out;
            Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
            RichSequence.IOTools.writeFasta(seqOut,s1,ns); 
        } catch (IOException ex) {
            //io error
            ex.printStackTrace();
        }
    }
}

Error:
cannot find symbol
symbol  : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace)
location: class org.biojavax.bio.seq.RichSequence.IOTools



From markjschreiber at gmail.com  Wed Oct  3 19:20:31 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 4 Oct 2007 07:20:31 +0800
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID: <93b45ca50710031620m35495bfey8ec111177c6201f@mail.gmail.com>

Hi -

This is a compilation error. It occurs because the biojava writeFasta
method expects a Namespace object from the org.biojavax package, but
netbeans has guessed that you wanted a Namespace object from the
javax.xml.stream.events package and has imported that for you.

If you remove that import (javax.xml.stream.events.Namespace) and
import org.biojavax.Namespace instead, it should compile.
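
The underlying point is that two types sharing a simple name are still
unrelated types, so a method declared for one will not accept the other. A
minimal, self-contained sketch of that situation follows. BioLike and XmlLike
are hypothetical stand-ins invented for this illustration (they are not the
real org.biojavax or javax.xml.stream.events types), so it runs without
BioJava on the classpath.

```java
class NamespaceClash {
    // Stand-ins for two unrelated interfaces that share the simple name "Namespace".
    static class BioLike { interface Namespace { String getName(); } }
    static class XmlLike { interface Namespace { String getPrefix(); } }

    // Like the write method discussed in the thread, this accepts only one of the two types.
    static String write(BioLike.Namespace ns) {
        return "namespace=" + ns.getName();
    }

    // True only if an XmlLike.Namespace could be passed where a BioLike.Namespace is expected.
    static boolean interchangeable() {
        return BioLike.Namespace.class.isAssignableFrom(XmlLike.Namespace.class);
    }

    public static void main(String[] args) {
        System.out.println(write(() -> "lcl"));
        System.out.println("interchangeable: " + interchangeable());
    }
}
```

Passing an XmlLike.Namespace to write(...) would be rejected by the compiler
with the same kind of "cannot find symbol" message, because no overload
exists for that parameter type.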

- Mark


From md5 at sanger.ac.uk  Wed Oct  3 19:05:43 2007
From: md5 at sanger.ac.uk (Mutlu Dogruel)
Date: Thu, 4 Oct 2007 00:05:43 +0100 (BST)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID: <Pine.LNX.4.64.0710040001150.22143@cbi4c.internal.sanger.ac.uk>


Hi, try using import org.biojavax.Namespace instead of 
javax.xml.stream.events.Namespace;

Also, you should handle the IllegalSymbolException that
DNATools.createDNASequence may throw.

Cheers,
mutlu




From su24 at st-andrews.ac.uk  Thu Oct  4 10:43:23 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu,  4 Oct 2007 15:43:23 +0100
Subject: [Biojava-l] WriteFasta
Message-ID: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>


Dear All,

I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
trying to break up Fasta Files of whole organisms into one file per gene for
further analysis. However the writeFasta method appears to append the
characters
"??

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From holland at ebi.ac.uk  Thu Oct  4 11:23:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 04 Oct 2007 16:23:10 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
Message-ID: <4705055E.5070401@ebi.ac.uk>

SeqIOTools is deprecated.

Try RichSequence.IOTools.writeFasta() instead to see if that helps.

e.g.:

RichSequence.IOTools.writeFasta(
	System.out,
	seq,
	RichObjectFactory.getDefaultNamespace()
	);

where seq is either a Sequence or a SequenceIterator.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear All,
> 
> I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
> trying to break up Fasta Files of whole organisms into one file per gene for
> further analysis. However the writeFasta method appears to append the
> characters
> "??
> 

From su24 at st-andrews.ac.uk  Thu Oct  4 11:23:52 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu,  4 Oct 2007 16:23:52 +0100
Subject: [Biojava-l] (no subject)
Message-ID: <1191511432.4705058825b79@webmail.st-andrews.ac.uk>


Dear All,

I'm sorry, the use of those characters seems to have truncated the previous email I
sent. To complete my question: I was wondering about possible causes for
this addition of random characters and whether there is a way to stop it from
occurring.

Thanking you again

Saif
-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK



From su24 at st-andrews.ac.uk  Fri Oct  5 06:06:25 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri,  5 Oct 2007 11:06:25 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <4705055E.5070401@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
Message-ID: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>

Dear Richard,

I have tried the RichSequence.IOTools.writeFasta method and this method is still
adding the characters "??" to the front of each write. I am using a
FileOutputStream and a Sequence object as inputs to the method, like so:


 Sequence seq; // read in from File
 FileOutputStream f = new FileOutputStream(fileName);

 try {
     RichSequence.IOTools.writeFasta(
             f,
             seq,
             RichObjectFactory.getDefaultNamespace());
 }


Thanks a lot for your time

Sincerely,

Saif





From holland at ebi.ac.uk  Fri Oct  5 06:13:36 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 11:13:36 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060E50.2070405@ebi.ac.uk>

Where are the input sequences coming from? i.e. what method are you
using to construct them or read them from a file.

Also, what do you mean by the 'front' of each write? Could you send me
an example of an entire FASTA file containing the problem? (It'd be best
to attach the file to an email to me personally as this list will not
accept attachments, and copying-and-pasting from a text editor to an
email client may obscure the underlying problem).
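
Since a text editor or mail client can hide what is really at the start of a
file, it can help to look at the raw bytes directly. The sketch below prints
the first few bytes of a file as hex. The class name HexPeek and the helper
firstBytesHex are invented for this example, and the file path is taken from
the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexPeek {
    /** Formats the first `limit` bytes of `data` as space-separated two-digit hex. */
    static String firstBytesHex(byte[] data, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(limit, data.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(' ');
            // Mask to 0..255 so negative bytes print as two hex digits
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the FASTA file to inspect
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(firstBytesHex(data, 16));
    }
}
```

A well-formed FASTA file should start with 3e, the byte for the '>' symbol.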

It'd be good also to see your entire code from the point the sequences
are read or created to the point where they are written out. Or, a
sample program which exhibits the same behaviour would suffice.

I suspect that the sequences themselves contain the incorrect data,
although technically this should be impossible as the sequence alphabet
should prevent it.

We recently had an issue reported here regarding BioJava not being able
to do certain sequence tasks on platforms using non-Western-European
character mappings. If your machine is running such a mapping, try it
again on a machine with an English or other Western European language
set up by default. If it works there but not on your machine, then
this'll be the same problem. (There is no solution yet, but at least
you'll know what's wrong).

cheers,
Richard


From ayates at ebi.ac.uk  Fri Oct  5 06:16:02 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 11:16:02 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060EE2.2000909@ebi.ac.uk>

Is it possible for you to send us the code which you're trying to run &
the sequence you are trying to write out? If it is sent to us in a
form we can drop into an IDE & run, that would help us a lot.

Thanks,

Andy Yates


From holland at ebi.ac.uk  Fri Oct  5 08:10:58 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 13:10:58 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191584372.4706227437594@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
	<47060E50.2070405@ebi.ac.uk>
	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
	<47061FDD.1070806@ebi.ac.uk>
	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
Message-ID: <470629D2.6020709@ebi.ac.uk>

Great, thanks.

The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.

This is a hex dump of the file:

00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|


The four extra characters are hex #ac #ed #00 #05 and these are showing
as question marks in your text editor because that's how text editors
handle unprintable characters.

Does anyone recognise these characters? There is no code in BioJava
which writes anything like this, in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?
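
For what it is worth, the byte sequence ac ed 00 05 is also the header that
java.io.ObjectOutputStream writes when it is constructed: STREAM_MAGIC
(0xACED) followed by STREAM_VERSION (5), both defined in
java.io.ObjectStreamConstants. Whether an ObjectOutputStream is wrapped
around the file anywhere in this code path is an assumption that nothing in
this thread confirms, so the sketch below only shows one place those four
bytes can come from. The class name SerializationHeaderDemo is made up for
the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class SerializationHeaderDemo {
    /** Hex of the bytes an ObjectOutputStream emits before any object is written. */
    static String headerHex() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            // The constructor writes the serialization stream header
            ObjectOutputStream oos = new ObjectOutputStream(buf);
            oos.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(headerHex());
    }
}
```

If the output matches the first four bytes of the hex dump, it would be worth
checking whether any stream wrapping the FileOutputStream is an
ObjectOutputStream.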

I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:

System.getProperty("file.encoding");

to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.
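
A small sketch of that check, for convenience. The standard system property
is named file.encoding, and Charset.defaultCharset() reports the charset the
JVM actually uses by default. The class name EncodingCheck is invented for
this example.

```java
import java.nio.charset.Charset;

class EncodingCheck {
    /** Describes the JVM's configured and effective default encodings. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The encoding can usually be overridden for a single run by starting the JVM
with -Dfile.encoding=UTF-8.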

We've had internationalisation problems before with BioJava. Hopefully
these will be addressed in future development, but there is no current
activity in that area due to lack of resources. In the meantime the best
workaround is to set every setting you can find to a Western European
character set/character mapping and UTF-8 file encoding, in the hope
that it will all match up nicely and work.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
> 
> The input file is just the entire set of RefSeq proteins for Arabidopsis thaliana
> and is too large for me to send as an attachment. But it can be downloaded from
> NCBI using the query "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> 
> Cheers,
> 
> Saif
> 
> 
> 
> Quoting Richard Holland <holland at ebi.ac.uk>:
> 
> Interesting. Could you send your input file as well?
> 
> cheers,
> Richard
> 
> Saif Ur-Rehman wrote:
>>>> Dear Richard,
>>>>
>>>> The sequences are being read by SeqIOTools.readFastaProtein. The code from
>>>> read to write is as follows. Essentially the program wants to read in a
>>>> fasta file containing all the protein sequences in a given organism and
>>>> split them up into one file per protein.
>>>>
>>>>
>>>> BufferedReader br = null;
>>>> try
>>>> {
>>>>     br = new BufferedReader(new FileReader(filename));
>>>> }
>>>> catch (FileNotFoundException e1)
>>>> {
>>>>     e1.printStackTrace();
>>>> }
>>>>
>>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
>>>> while (stream.hasNext())
>>>> {
>>>>     try
>>>>     {
>>>>         Sequence seq = stream.nextSequence();
>>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/" + name
>>>>                 + "/" + seq.getName());
>>>>
>>>>         try
>>>>         {
>>>>             scriptFile1.createNewFile();
>>>>         }
>>>>         catch (IOException e1)
>>>>         {
>>>>             e1.printStackTrace();
>>>>         }
>>>>
>>>>         try
>>>>         {
>>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
>>>>             BufferedWriter out = new BufferedWriter(fstream);
>>>>
>>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
>>>>
>>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
>>>>
>>>>             try {
>>>>                 RichSequence.IOTools.writeFasta(
>>>>                         f,
>>>>                         rs,
>>>>                         RichObjectFactory.getDefaultNamespace());
>>>>             }
>>>>             catch (IOException ioe) {}
>>>>
>>>> An example of an outputted fasta file from this code is attached.
>>>>
>>>>
>>>>
>>>> Thanks a lot for your time.
>>>>
>>>> Saif
>>>>
>>>>

From ayates at ebi.ac.uk  Fri Oct  5 08:28:43 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 13:28:43 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>	<4705055E.5070401@ebi.ac.uk>	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>	<47060E50.2070405@ebi.ac.uk>	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>	<47061FDD.1070806@ebi.ac.uk>	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
	<470629D2.6020709@ebi.ac.uk>
Message-ID: <47062DFB.6040201@ebi.ac.uk>

I've done a quick search & it seems as if U+ACED is a Hangul (Korean)
syllable & the other (U+0005) is just a non-printing control character.
Something is getting confused quite badly here.
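
One more thing worth ruling out, offered as a guess rather than a diagnosis:
the four bytes ac ed 00 05 are also exactly the stream header (magic number
0xACED, version 5) that java.io.ObjectOutputStream writes as soon as it is
constructed. Whether an ObjectOutputStream is involved anywhere in this write
path is an assumption on my part; nothing in the thread confirms it. The
sketch below (class and method names are made up) only shows where such bytes
can come from:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows the bytes an ObjectOutputStream emits
// before any object has been written to it.
public class StreamHeaderDemo {

    /** Returns whatever reaches the sink after merely constructing and flushing the stream. */
    static byte[] headerBytes() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(sink);
            oos.flush(); // push the buffered stream header into the sink
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Formats bytes as space-separated lower-case hex, like the dump in this thread. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(headerBytes())); // prints "ac ed 00 05"
    }
}
```

If that did turn out to be the cause, the header would show up whenever the
OutputStream gets wrapped in an ObjectOutputStream before the FASTA text is
written, regardless of the platform encoding.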

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Great, thanks.
> 
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
> 
> This is a hex dump of the file:
> 
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
> 
> 
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
> 
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
> 
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
> 
> System.getProperty("file.encoding");
> 
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
> 
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
> 
> cheers,
> Richard
> 
>

From su24 at st-andrews.ac.uk  Fri Oct  5 09:44:29 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri,  5 Oct 2007 14:44:29 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
	<47060E50.2070405@ebi.ac.uk>
	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
	<47061FDD.1070806@ebi.ac.uk>
	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
	<470629D2.6020709@ebi.ac.uk>
Message-ID: <1191591869.47063fbd22461@webmail.st-andrews.ac.uk>

Setting the System properties solved the problem.

Thanks a lot,

Saif

Quoting Richard Holland <holland at ebi.ac.uk>:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Great, thanks.
>
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
>
> This is a hex dump of the file:
>
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first > symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8. If it is, then chances
> are that this is the problem.
>
> We've had internationalisation problems before with BioJava. Hopefully
> these will be addressed in future development, but there is no current
> activity in that area due to lack of resources. In the meantime the best
> workaround is to set every setting you can find to a Western European
> character set/character mapping and UTF-8 file encoding, in the hope
> that it will all match up nicely and work.
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
> > Dear Richard,
> >
> > The input file is just the entire set of RefSeq proteins for Arabidopsis
> > thaliana and is too large for me to send as an attachment. But it can be
> > downloaded from NCBI using the query
> > "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> >
> > Cheers,
> >
> > Saif
> >
> >
> >
> > Quoting Richard Holland <holland at ebi.ac.uk>:
> >
> > Interesting. Could you send your input file as well?
> >
> > cheers,
> > Richard
> >
> > Saif Ur-Rehman wrote:
> >>>> Dear Richard,
> >>>>
> >>>> The sequences are being read by SeqIOTools.readFastaProtein. The code
> >>>> from read to write is as follows. Essentially the program wants to read
> >>>> in a FASTA file containing all the protein sequences in a given organism
> >>>> and split them up into one file per protein.
> >>>>
> >>>>
> >>>> BufferedReader br = null;
> >>>> try
> >>>> {
> >>>>     br = new BufferedReader(new FileReader(filename));
> >>>> }
> >>>> catch (FileNotFoundException e1)
> >>>> {
> >>>>     e1.printStackTrace();
> >>>> }
> >>>>
> >>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
> >>>> while (stream.hasNext())
> >>>> {
> >>>>     try
> >>>>     {
> >>>>         Sequence seq = stream.nextSequence();
> >>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/"
> >>>>                 + name + "/" + seq.getName());
> >>>>
> >>>>         try
> >>>>         {
> >>>>             scriptFile1.createNewFile();
> >>>>         }
> >>>>         catch (IOException e1)
> >>>>         {
> >>>>             e1.printStackTrace();
> >>>>         }
> >>>>
> >>>>         try
> >>>>         {
> >>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
> >>>>             BufferedWriter out = new BufferedWriter(fstream);
> >>>>
> >>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
> >>>>
> >>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
> >>>>
> >>>>             try {
> >>>>                 RichSequence.IOTools.writeFasta(
> >>>>                         f,
> >>>>                         rs,
> >>>>                         RichObjectFactory.getDefaultNamespace()
> >>>>                         );
> >>>>             }
> >>>>             catch (IOException ioe) {}
> >>>>
> >>>> An example of an outputted fasta file from this code is attached.
> >>>>
> >>>>
> >>>>
> >>>> Thanks a lot for your time.
> >>>>
> >>>> Saif
> >>>>
> >>>>
> >>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>
> >>>> Where are the input sequences coming from? i.e. what method are you
> >>>> using to construct them or read them from a file.
> >>>>
> >>>> Also, what do you mean by the 'front' of each write? Could you send me
> >>>> an example of an entire FASTA file containing the problem? (It'd be best
> >>>> to attach the file to an email to me personally as this list will not
> >>>> accept attachments, and copying-and-pasting from a text editor to an
> >>>> email client may obscure the underlying problem).
> >>>>
> >>>> It'd be good also to see your entire code from the point the sequences
> >>>> are read or created to the point where they are written out. Or, a
> >>>> sample program which exhibits the same behaviour would suffice.
> >>>>
> >>>> I suspect that the sequences themselves contain the incorrect data,
> >>>> although technically this should be impossible as the sequence alphabet
> >>>> should prevent it.
> >>>>
> >>>> We recently had an issue reported here regarding BioJava not being able
> >>>> to do certain sequence tasks on platforms using non-Western-European
> >>>> character mappings. If your machine is running such a mapping, try it
> >>>> again on a machine with an English or other Western European language
> >>>> set up by default. If it works there but not on your machine, then
> >>>> this'll be the same problem. (There is no solution yet, but at least
> >>>> you'll know what's wrong).
> >>>>
> >>>> cheers,
> >>>> Richard
> >>>>
> >>>> Saif Ur-Rehman wrote:
> >>>>>>> Dear Richard,
> >>>>>>>
> >>>>>>> I have tried the RichSequence.IOTools.writeFasta method and this
> >>>>>>> method is still appending the characters "??" to the front of each
> >>>>>>> write. I am using a FileOutputStream and a Sequence object as inputs
> >>>>>>> to the method, like so.
> >>>>>>>
> >>>>>>>
> >>>>>>>  Sequence seq; // read in from File
> >>>>>>>  FileOutputStream f =new FileOutputStream (fileName);
> >>>>>>>
> >>>>>>>
> >>>>>>> 			   try{
> >>>>>>>
> >>>>>>> 			    	RichSequence.IOTools.writeFasta(f,
> >>>>>>> 			    	        seq,
> >>>>>>> 			    	        RichObjectFactory.getDefaultNamespace()
> >>>>>>> 			    	        );
> >>>>>>>
> >>>>>>>
> >>>>>>> 			    }
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks a lot for your time
> >>>>>>>
> >>>>>>> Sincerely,
> >>>>>>>
> >>>>>>> Saif
> >>>>>>>
> >>>>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
> >>>>>>>
> >>>>>>> SeqIOTools is deprecated.
> >>>>>>>
> >>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
> >>>>>>>
> >>>>>>> e.g.:
> >>>>>>>
> >>>>>>> RichSequence.IOTools.writeFasta(
> >>>>>>> 	System.out,
> >>>>>>> 	seq,
> >>>>>>> 	RichObjectFactory.getDefaultNamespace()
> >>>>>>> 	);
> >>>>>>>
> >>>>>>> where seq is either a Sequence or a SequenceIterator.
> >>>>>>>
> >>>>>>> cheers,
> >>>>>>> Richard
> >>>>>>>
> >>>>>>> Saif Ur-Rehman wrote:
> >>>>>>>>>> Dear All,
> >>>>>>>>>>
> >>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() method. I am
> >>>>>>>>>> currently trying to break up FASTA files of whole organisms into one
> >>>>>>>>>> file per gene for further analysis. However, the writeFasta method
> >>>>>>>>>> appears to append the characters
> >>>>>>>>>> "??
> >>>>>>>>>>
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
> pxAPAybISoRQgbvQ1wyzqVg=
> =MS7P
> -----END PGP SIGNATURE-----
>


-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From sanbiogene at yahoo.co.in  Sat Oct  6 05:23:11 2007
From: sanbiogene at yahoo.co.in (sandeep telkar)
Date: Sat, 6 Oct 2007 10:23:11 +0100 (BST)
Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM
Message-ID: <121992.19693.qm@web94408.mail.in2.yahoo.com>

Dear friends,
   Sandeep here.
I want to learn BioJava and I am a beginner, but where can I download
an .exe installation file for it, like the one for JDK 6 from the Sun
website?

Please suggest anything other than the following URL:
       http://biojava.org/wiki/BioJava:Download

And please tell me in which directory I have to save the program.
I am not getting any clear idea.

Please help me.
                                       - Sandeep

Sandeep Telkar,
  M.Sc Bioinformatics.



From su24 at st-andrews.ac.uk  Sat Oct  6 14:04:28 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Sat,  6 Oct 2007 19:04:28 +0100
Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM
In-Reply-To: <121992.19693.qm@web94408.mail.in2.yahoo.com>
References: <121992.19693.qm@web94408.mail.in2.yahoo.com>
Message-ID: <1191693868.4707ce2caae97@webmail.st-andrews.ac.uk>

Hi,

You need to download the JAR files from
http://biojava.org/wiki/BioJava:Download. You can then use the file
biojava-1.5.jar. Just include it in the build path as an external JAR if you're
using an IDE like NetBeans or Eclipse, or add it to your classpath if working
from the command line. You can then import the BioJava classes and use them.
Hope that helps.
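
A quick way to check that the JAR really is visible to your program is to ask
the JVM to load one BioJava class by name. This is only a sketch: the class
name ClasspathCheck is made up, and org.biojava.bio.seq.DNATools is simply
used as a probe on the assumption that you are on the 1.x release.

```java
// Hypothetical helper: reports whether a class can be found on the classpath.
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' = locate the class without running its static initialisers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe class is an assumption; pass another fully qualified name as args[0].
        String probe = args.length > 0 ? args[0] : "org.biojava.bio.seq.DNATools";
        System.out.println(probe
                + (isOnClasspath(probe) ? " found" : " NOT found on the classpath"));
    }
}
```

If it reports "NOT found", the JAR is not on the classpath that the program is
actually being run with, whatever the IDE project settings say.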

Cheers,

Saif


Quoting sandeep telkar <sanbiogene at yahoo.co.in>:

> Dear friends,
>    Sandeep here...
>        I wanna learn biojava n now i am beginner.but
> from where to download its exe installation file as
> like that of JDK6 fron sun website....
>
> please suggest me any thing other than the following
> url:
>        http://biojava.org/wiki/BioJava:Download
>
> N plese tell in which directory i have to save the
> program.....
>             I am not getting any clear idea ..
>
>         please help me..
>                                        - Sandeep
>
> Sandeep Telkar,
>   M.Sc Bioinformatics.
>
>
>


-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From vineith at gmail.com  Wed Oct 10 00:44:22 2007
From: vineith at gmail.com (vineith kaul)
Date: Wed, 10 Oct 2007 00:44:22 -0400
Subject: [Biojava-l] case-sensitive sequences
Message-ID: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>

Hi,

I want to read in a sequence whose alphabet (nucleotides) is case-sensitive.
Basically, I want to replace only the lower-case 'a', 'g', 't' and 'c' with
blanks. I saw a similar post earlier but couldn't understand much of it. Can
someone help me with this?

-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From holland at ebi.ac.uk  Wed Oct 10 04:06:16 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 10 Oct 2007 09:06:16 +0100
Subject: [Biojava-l] case-sensitive sequences
In-Reply-To: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>
References: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>
Message-ID: <470C87F8.8020502@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

You can use SoftMaskedAlphabet with the BioJavaX parsers to get the
desired effect.

By default, a soft masked character is one in lower case. The code below
will detect these. If you have other search criteria you can modify the
soft masked detection criteria to match this instead. To do that, add a
second parameter to the call to SoftMaskedAlphabet.getInstance() and use
it to pass in an instance of SoftMaskedAlphabet.MaskingDetector (see the
JavaDocs to see how this should work).

Hope this helps!


// Set up a soft-masked alphabet.
SoftMaskedAlphabet sma =
	SoftMaskedAlphabet.getInstance(DNATools.getDNA());
SymbolTokenization stok = sma.getTokenization("token");

// Set up sequence parsing.
BufferedReader input = ....;
	// Get your sequences from somewhere
RichSequenceFormat format = new FastaFormat();
	// Or Genbank etc.
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.FACTORY;
	// See Javadocs for alternative factories.
Namespace ns = RichObjectFactory.getDefaultNamespace();
	// See Javadocs for alternative namespaces.

// Parse the sequences.
RichStreamReader seqsIn =
	new RichStreamReader(input, format,  stok, factory, ns);


// Find the soft-masked symbols in the sequences.
while (seqsIn.hasNext()) {
  RichSequence seq = seqsIn.nextRichSequence();

  // Iterate over symbols in sequence.
  for (Iterator i = seq.iterator(); i.hasNext(); ) {

     Symbol sym = (Symbol)i.next();

     // Is this symbol masked?
     if (sma.isMasked(sym)) {
        // Yes it is so deal with it.
        .......
     } else {
        // No it isn't, so deal with that instead.
        .......
     }
  }
}
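
If all that is needed is the plain-text operation from the question (blank
out lower-case a, c, g and t), it can also be done on an ordinary String
without any BioJava classes. A minimal sketch, assuming the sequence is
already held in memory as a String; the class and method names are made up:

```java
// Hypothetical helper, independent of BioJava: works on plain Strings only.
public class LowerCaseMasker {

    /** Replaces lower-case a, c, g, t with a blank and leaves every other character as it is. */
    static String blankLowerCase(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            boolean softMasked = c == 'a' || c == 'c' || c == 'g' || c == 't';
            out.append(softMasked ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Brackets make the blanks visible in the output.
        System.out.println("[" + blankLowerCase("ACgtTTaN") + "]"); // prints [AC  TT N]
    }
}
```

The SoftMaskedAlphabet route above is the better fit once you need real
Symbol objects; the String version is only a shortcut for simple filtering.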


cheers,
Richard

vineith kaul wrote:
> Hi,
> 
> I want to read in a sequence which has case sensitive
> alphabets(nucleotides).Basically I want to replace only small
> 'a,g,t,c' with blanks .Although I saw a similar post earlier but
> couldn't understand much.Can someone help me with this ?
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHDIf44C5LeMEKA/QRAmuNAJ426M/UgInqDG5rG6w+F+qoMdVzPQCfZo1S
nAS5v8jSFBX5WCuB5UmzczQ=
=Sicc
-----END PGP SIGNATURE-----

From vineith at gmail.com  Sun Oct 14 13:21:45 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 14 Oct 2007 13:21:45 -0400
Subject: [Biojava-l] Java to Perl
Message-ID: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>

Is there some tool by which we can convert a complete Java Code to  a
Perl code ?

-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From davidfeitosa at gmail.com  Sun Oct 14 13:57:47 2007
From: davidfeitosa at gmail.com (David Barbosa Feitosa)
Date: Sun, 14 Oct 2007 14:57:47 -0300
Subject: [Biojava-l] Java to Perl
In-Reply-To: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
Message-ID: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>

Vineith

I do not know of one, but if you need to execute Perl code inside Java code,
Java 6 (codename Mustang) makes it possible to execute script code inside the
Java Virtual Machine.

The default scripting engine is Rhino, for JavaScript, but as it is a
specification, if a Perl engine exists, you can plug it into the JVM and
execute your Perl code.

More info about the available engines and how to install one:

https://scripting.dev.java.net/
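
To see which engines a particular JVM actually has registered, you can ask
the javax.script API directly. A small sketch (the class name is made up; the
list may well be empty, and it will not contain a Perl engine unless one has
been added to the classpath):

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

// Hypothetical helper: lists the script engines visible to this JVM.
public class ListScriptEngines {

    /** Returns one "engine name (language)" entry per registered engine factory. */
    static List<String> engineNames() {
        List<String> names = new ArrayList<String>();
        for (ScriptEngineFactory f : new ScriptEngineManager().getEngineFactories()) {
            names.add(f.getEngineName() + " (" + f.getLanguageName() + ")");
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = engineNames();
        if (names.isEmpty()) {
            System.out.println("No script engines registered in this JVM.");
        }
        for (String n : names) {
            System.out.println(n);
        }
    }
}
```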

Maybe it can help you,

David.

2007/10/14, vineith kaul <vineith at gmail.com>:
>
> Is there some tool by which we can convert a complete Java Code to  a
> Perl code ?
>
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
>

From ayates at ebi.ac.uk  Mon Oct 15 04:15:33 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Mon, 15 Oct 2007 09:15:33 +0100
Subject: [Biojava-l] Java to Perl
In-Reply-To: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
	<93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
Message-ID: <471321A5.5090600@ebi.ac.uk>

Unfortunately to my knowledge there is no Perl/Java scripting interface. 
Apparently for some reason Perl is not trendy enough to warrant a port 
(which is a pity).

In response to Vineith's original question such a tool really wouldn't 
work. Good Perl code is very different to good Java code. If you did get 
something that would work you'd probably end up with quite verbose & 
inefficient Perl code (not to mention the problems that would arise with 
Perl objects having no access modifiers, using inside-out objects, 
converting 3rd party libraries etc).

Two options do spring to mind if you need code available in both languages:

* Make one of the pieces of code a "black box" where you read results 
from STDOUT (works well enough calling a Java program from Perl).

* Write the common code in C

Out of these two options if you want the code replicated in a 1-1 
fashion then C is your only option. Otherwise the first idea is the 
easiest to work with.
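
The "black box" idea works in the other direction as well. A minimal sketch
from the Java side, assuming the other program is an ordinary command-line
tool that prints its results to STDOUT; the class name is made up and the
command is whatever you pass on the command line:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs an external program and collects what it prints.
public class BlackBoxRunner {

    /** Runs the command and returns its output (stdout and stderr merged), one entry per line. */
    static List<String> run(List<String> command) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // read a single merged stream so the child cannot block on stderr
            Process p = pb.start();
            List<String> lines = new ArrayList<String>();
            BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
            try {
                String line;
                while ((line = r.readLine()) != null) {
                    lines.add(line);
                }
            } finally {
                r.close();
            }
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IllegalStateException("command exited with status " + exit);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the command", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java BlackBoxRunner <command> [args...]");
            return;
        }
        for (String line : run(Arrays.asList(args))) {
            System.out.println(line);
        }
    }
}
```

You then parse the returned lines in whatever format the two programs have
agreed on; that agreement is the only real coupling between them.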

As David did mention there are other scripting engines available 
(Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy 
your scripting needs whilst remaining in a Java environment (Groovy hits 
that nice sweet spot for a Java inspired scripting language).

Andy

P.S. This really isn't a Biojava question ...

David Barbosa Feitosa wrote:
> Vineith
> 
> I do not know of one, but if you need to execute Perl code inside Java code,
> Java 6 (codename Mustang) makes it possible to execute script code inside the
> Java Virtual Machine.
> 
> The default scripting engine is Rhino, for JavaScript, but as it is a
> specification, if a Perl engine exists, you can plug it into the JVM and
> execute your Perl code.
> 
> More info about the available engines and how to install one:
> 
> https://scripting.dev.java.net/
> 
> Maybe it can help you,
> 
> David.
> 
> 2007/10/14, vineith kaul <vineith at gmail.com>:
>> Is there some tool by which we can convert a complete Java Code to  a
>> Perl code ?
>>
>> --
>> Vineith Kaul
>> Masters Student Bioinformatics
>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>> Georgia Tech, Atlanta

From phidias51 at gmail.com  Mon Oct 15 10:57:06 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Mon, 15 Oct 2007 07:57:06 -0700
Subject: [Biojava-l] Java to Perl
In-Reply-To: <471321A5.5090600@ebi.ac.uk>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
	<93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
	<471321A5.5090600@ebi.ac.uk>
Message-ID: <6e1d61f50710150757p6ba25c1ck9466baa5f8273bc2@mail.gmail.com>

The original post indicated that they wanted to go from Java to Perl.  Doing
a quick Google search yielded a lot of hits for tools going from Perl to
Java.  Just out of curiosity, was there some reason you wanted to create Perl
code from Java code?

There are a couple of projects which supposedly provide PERL-scripting
support inside Java to one extent or another.  The first is called Sleep (
http://sleep.hick.org/) which is described as being a PERL-like plugin for
the Java 6 scripting engine.  There's also a BSF plugin called BSF Perl (
http://bsfperl.sf.net) and another BSF plugin called PerlScript which is
part of ActiveState's ActivePerl distribution.

I don't have any first-hand experience with any of these, so please don't
construe anything I say as an endorsement of these technologies.  Although
none of these solutions will convert PERL code into Java or vice-versa, they
may allow you to run Perl inside a VM.

Hope this helps,

Mark

On 10/15/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>
> Unfortunately to my knowledge there is no Perl/Java scripting interface.
> Apparently for some reason Perl is not trendy enough to warrant a port
> (which is a pity).
>
> In response to Vineith's original question such a tool really wouldn't
> work. Good Perl code is very different to good Java code. If you did get
> something that would work you'd probably end up with quite verbose &
> inefficient Perl code (not to mention the problems that would arise with
> Perl objects having no access modifiers, using inside-out objects,
> converting 3rd party libraries etc).
>
> Two options do spring to mind if you need code available in both languages:
>
> * Make one of the pieces of code a "black box" where you read results
> from STDOUT (works well enough calling a Java program from Perl).
>
> * Write the common code in C
>
> Out of these two options if you want the code replicated in a 1-1
> fashion then C is your only option. Otherwise the first idea is the
> easiest to work with.
>
> As David did mention there are other scripting engines available
> (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy
> your scripting needs whilst remaining in a Java environment (Groovy hits
> that nice sweet spot for a Java inspired scripting language).
>
> Andy
>
> P.S. This really isn't a Biojava question ...
>
> David Barbosa Feitosa wrote:
> > Vineith
> >
> > I do not know of one, but if you need to execute Perl code inside Java
> > code, Java 6 (codename Mustang) makes it possible to execute script code
> > inside the Java Virtual Machine.
> >
> > The default scripting engine is Rhino, for JavaScript, but as it is a
> > specification, if a Perl engine exists, you can plug it into the JVM
> > and execute your Perl code.
> >
> > More info about the available engines and how to install one:
> >
> > https://scripting.dev.java.net/
> >
> > Maybe it can help you,
> >
> > David.
> >
> > 2007/10/14, vineith kaul <vineith at gmail.com>:
> >> Is there some tool by which we can convert a complete Java Code to  a
> >> Perl code ?
> >>
> >> --
> >> Vineith Kaul
> >> Masters Student Bioinformatics
> >> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >> Georgia Tech, Atlanta
>

From vineith at gmail.com  Sun Oct 21 12:30:48 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 21 Oct 2007 12:30:48 -0400
Subject: [Biojava-l] Evolutionary distances
Message-ID: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>

Hi,

Are there functions to calculate evolutionary pairwise distances like
Kimura2P, Finkelstein etc. in BioJava?
I did write something on my own, but on large sequences it runs terribly
slow and I am not even sure if that's right.
-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From holland at ebi.ac.uk  Mon Oct 22 08:06:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 22 Oct 2007 13:06:57 +0100 (BST)
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
Message-ID: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>

You should take a look at the latest 1.5 release, in the
org.biojavax.bio.phylo packages. This is the beginning of some
phylogenetics code that will perform tasks such as the ones you describe.
The future plan is to extend it to cover a wider range of use cases.
Kimura2P is already implemented there, in
org.biojavax.bio.phylo.MultipleHitCorrection.

If you can't find code that will do what you want, but have written some
before, then please do feel free to contribute it. Even if it is slow, I'm
sure someone out there will be able to help optimise it!

cheers,
Richard

On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> Hi,
>
> Are there functions to calculate evolutionary pairwise distances like
> Kimura2P, Finkelstein, etc. in BioJava?
> I did write something on my own, but on large sequences it runs terribly
> slow and I am not even sure if that's right.
>


-- 
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK


From vineith at gmail.com  Tue Oct 23 02:59:29 2007
From: vineith at gmail.com (vineith kaul)
Date: Tue, 23 Oct 2007 02:59:29 -0400
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
Message-ID: <f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>

This is what I have. Thanks a lot for the help.


//Method to calculate the Kimura 2 parameter distance
public static double K2P(String sequence1,String sequence2){
        // p = transitional differences (A<->G, T<->C);
        // q = transversional differences (A/G<->C/T)
        long p=0,q=0,numberOfAlignedSites=0;


        char[] seq1array=sequence1.toCharArray();
        char[] seq2array=sequence2.toCharArray();

        for(int i=0;i<seq1array.length;i++){
                                // Number of aligned sites
                if(((seq1array[i]=='a') ||
(seq1array[i]=='A')||(seq1array[i]=='g') ||
(seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
(seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
(seq2array[i]=='A')||(seq2array[i]=='c') ||
(seq2array[i]=='C')||(seq2array[i]=='t') ||
(seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {

                        numberOfAlignedSites++;
                }

                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                                q++;
                        }
                else
                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                                q++;
                        }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                                        q++;
                                }


        }

         double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
(((double)q)/numberOfAlignedSites);
         double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
         System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
         double dist = (-0.5 * Math.log(P)) - (0.25 * Math.log(Q));
         return dist;
}
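
For checking the last few lines of the method, it may help to write the
computation out. As a sketch, writing n for numberOfAlignedSites (the symbols
n and d are introduced here only for illustration), the value returned is the
usual Kimura two-parameter distance:

```latex
d = -\tfrac{1}{2}\,\ln\!\left(1 - \frac{2p}{n} - \frac{q}{n}\right)
    - \tfrac{1}{4}\,\ln\!\left(1 - \frac{2q}{n}\right)
```

Both logarithm arguments have to be positive, so a highly diverged pair (for
example q/n >= 0.5) yields infinity or NaN rather than a usable distance.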


-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From ozgur7 at gmail.com  Tue Oct 23 14:17:29 2007
From: ozgur7 at gmail.com (Ozgur Ozturk)
Date: Tue, 23 Oct 2007 11:17:29 -0700
Subject: [Biojava-l] problem with CookBook:Blast:Parser
Message-ID: <a662a84f0710231117y548fc2f0q471f5f1868a91777@mail.gmail.com>

Hi,
   I am receiving the following error when I use BlastParser code from the
cookbook <http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser> :

org.xml.sax.SAXException: Could not recognise the format of this file as one supported by the framework.
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182)
    at org.arabidopsis.test.BlastParser.main(BlastParser.java:44)

I have generated the xml file using this command:
blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 > tempresult.xml

Then pass it to BlastParser:
BlastParser tempresult.xml

Thanks for your help in advance,
-- 
Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370

From ozgur7 at gmail.com  Tue Oct 23 16:24:49 2007
From: ozgur7 at gmail.com (Ozgur Ozturk)
Date: Tue, 23 Oct 2007 13:24:49 -0700
Subject: [Biojava-l] Problem Solved Re: problem with CookBook:Blast:Parser
Message-ID: <a662a84f0710231324i63eb5bb0xfefc551f507f81ab@mail.gmail.com>

Hi,
Another piece of code in the demos (BioJava/biojava-1.5/demos/blastxml)
could handle my XML file.
I guess the problem is solved. Thanks.
(But if the BlastParser code from the cookbook
<http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser> is
deprecated, you may want to update it.)
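
One way to avoid handing a file to the wrong parser is to look at how it
starts before choosing one. The following is only a rough sketch under the
assumption that the cookbook parser expects the plain-text BLAST report,
whereas "-m 7" writes XML. The class BlastFormatSniffer and its method are
hypothetical helpers, not part of BioJava, and the sample strings are
illustrative only.

```java
/** Hypothetical helper: guess whether a BLAST report is XML or plain text. */
final class BlastFormatSniffer {

    /** True if the first non-blank characters of the report look like XML. */
    static boolean looksLikeXml(String startOfFile) {
        return startOfFile.trim().startsWith("<");
    }

    public static void main(String[] args) {
        // Illustrative beginnings of the two kinds of report.
        String xmlHead = "<?xml version=\"1.0\"?>\n<BlastOutput>";
        String textHead = "BLASTP 2.2.17\n\nQuery= tester";

        System.out.println(looksLikeXml(xmlHead));   // true
        System.out.println(looksLikeXml(textHead));  // false
    }
}
```

A caller could use the result to pick between the text parser from the
cookbook and the code in demos/blastxml.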

Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370


From holland at ebi.ac.uk  Wed Oct 24 03:52:24 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 08:52:24 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
Message-ID: <471EF9B8.7020609@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks.

Your code is similar to the code we have in
org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
see if it is identical, but it probably is.

You can call our code like this:

 // import statement for biojava phylo stuff
 import org.biojavax.bio.phylo.*;

 // ...rest of code goes here

 // call Kimura2P
 String seq1 = ...; // Get seq1 and seq2 from somewhere
 String seq2 = ...;
 double result = MultipleHitCorrection.Kimura2P(seq1, seq2);

Note that our implementation expects sequence strings to be in upper
case, so you'll need to make sure your data is upper case or has been
converted to upper case before calling our method.
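
A minimal sketch of that preparation step follows. The helper name normalise
and the trimming of surrounding whitespace are assumptions added for
illustration. The BioJava call itself is shown only as a comment so that the
snippet compiles without the library on the classpath.

```java
import java.util.Locale;

/** Hypothetical helper for preparing sequence strings before the call. */
final class UpperCaseInput {

    /** Trim surrounding whitespace and upper-case with a fixed locale. */
    static String normalise(String rawSequence) {
        // Locale.ROOT keeps the result independent of the machine's default locale.
        return rawSequence.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String seq1 = normalise("acgtTGca");
        String seq2 = normalise(" acgtcgca\n");
        System.out.println(seq1 + " " + seq2);

        // With BioJava 1.5 available, the normalised strings would then be passed on:
        // double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
    }
}
```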

cheers,
Richard

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
4iKvsyBj2uznhhjTF9EYDFE=
=LALE
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk  Wed Oct 24 04:09:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 09:09:13 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EF9B8.7020609@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>		<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk>
Message-ID: <471EFDA9.1090706@ebi.ac.uk>

Our code is very similar but not identical. The original programmer 
shortcut a lot of the else-if conditions by first checking whether the two 
bases are equal. The code can then count the transitional changes and 
assume the rest are transversional.

In terms of speed, I can't see an obvious way to speed up either piece of 
code. In our code, replacing the 10 or so calls to String.charAt() with 
two calls and reusing those chars might help, but in all honesty I 
cannot say.
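
The shortcut described above might look roughly like the following. This is
only a sketch of the idea and not the actual BioJava source: the class
K2PSketch and its helper names are invented for illustration, and skipping
positions that are not A, C, G or T is an assumption carried over from the
K2P method posted earlier in this thread. Each position is read once into a
local char, which is the charAt() change suggested above.

```java
import java.util.Locale;

/** Sketch of a Kimura 2-parameter distance using the equal/not-equal shortcut. */
final class K2PSketch {

    /** True for the four unambiguous DNA bases (upper case only). */
    static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    /** True for purines (A, G); the other two bases (C, T) are pyrimidines. */
    static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    static double distance(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("sequences must be aligned to the same length");
        }
        char[] a = s1.toUpperCase(Locale.ROOT).toCharArray();
        char[] b = s2.toUpperCase(Locale.ROOT).toCharArray();

        long transitions = 0, transversions = 0, sites = 0;
        for (int i = 0; i < a.length; i++) {
            char x = a[i], y = b[i];                 // read each position once
            if (!isBase(x) || !isBase(y)) continue;  // skip gaps and ambiguity codes
            sites++;
            if (x == y) continue;                    // identical bases: nothing to count
            if (isPurine(x) == isPurine(y)) {
                transitions++;                       // A<->G or C<->T
            } else {
                transversions++;                     // every other mismatch
            }
        }
        if (sites == 0) return Double.NaN;           // no comparable positions

        double p = (double) transitions / sites;
        double q = (double) transversions / sites;
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }

    public static void main(String[] args) {
        System.out.println(distance("ACGTACGTAC", "ACGTACGTAT"));
    }
}
```

Whether this is measurably faster than the long else-if chain would need to
be checked by measurement.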

Andy


From markjschreiber at gmail.com  Wed Oct 24 07:59:04 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 19:59:04 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EFDA9.1090706@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
Message-ID: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>

Hi -

From experience, the best way to optimize Java code is to run a
profiler. The one in NetBeans is quite good.

The reason is that the HotSpot or JIT compilers might natively compile
the part of the code that you think is slow and actually make it
faster than something else, which then becomes the bottleneck. Using a good
profiler you can detect how much time is spent in each method and
pinpoint some candidate methods for optimization. You can also see if
there is a burden due to the creation of lots of objects.
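
Where no profiler is at hand, a crude timing loop can at least show whether a
change matters once the JIT has warmed up. The following is a rough sketch
and no substitute for a profiler. The class CrudeTimer, the stand-in workload
countDifferences and the run counts are all made up for illustration.

```java
import java.util.Arrays;

/** Crude wall-clock timing with a warm-up phase. */
final class CrudeTimer {

    /** Keeps results alive so the JIT cannot discard the timed work. */
    static volatile long sink;

    /** Hypothetical stand-in workload: count positions where two arrays differ. */
    static long countDifferences(char[] a, char[] b) {
        int n = Math.min(a.length, b.length);
        long differences = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) differences++;
        }
        return differences;
    }

    /** Average nanoseconds per call, measured after the warm-up runs. */
    static double averageNanos(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) task.run();  // give the JIT a chance to compile
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) task.run();
        return (System.nanoTime() - start) / (double) timedRuns;
    }

    public static void main(String[] args) {
        char[] a = new char[1_000_000];
        char[] b = new char[1_000_000];
        Arrays.fill(a, 'A');
        Arrays.fill(b, 'A');
        b[42] = 'G';

        double nanos = averageNanos(() -> sink = countDifferences(a, b), 20, 100);
        System.out.printf("%.0f ns per call, %d difference(s)%n", nanos, sink);
    }
}
```

The numbers vary from run to run and between machines, so they are only
useful for comparing two variants under the same conditions.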

- Mark

On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> Our code is very similar but not identical. The original programmer
> shortcut a lot of the else-if conditions by first checking whether the two
> bases are equal. The code can then count the transitional changes and
> assume the rest are transversional.
>
> In terms of speed, I can't see an obvious way to speed up either piece of
> code. In our code, replacing the 10 or so calls to String.charAt() with
> two calls and reusing those chars might help, but in all honesty I
> cannot say.
>
> Andy
>
> Richard Holland wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Thanks.
> >
> > Your code is similar to the code we have in
> > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > see if it is identical, but it probably is.
> >
> > You can call our code like this:
> >
> >  // import statement for biojava phylo stuff
> >  import org.biojavax.bio.phylo.*;
> >
> >  // ...rest of code goes here
> >
> >  // call Kimura2P
> >  String seq1 = ...; // Get seq1 and seq2 from somewhere
> >  String seq2 = ...;
> >  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> >
> > Note that our implementation expects sequence strings to be in upper
> > case, so you'll need to make sure your data is upper case or has been
> > converted to upper case before calling our method.
> >
> > cheers,
> > Richard
> >
> > vineith kaul wrote:
> >> This is what I have .....Thanks a lot  fr the help.
> >>
> >>
> >> //Method to calculate the Kimura 2 parameter distance
> >> public static double K2P(String sequence1,String sequence2){
> >>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> >> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> >>
> >>
> >>         char[] seq1array=sequence1.toCharArray();
> >>         char[] seq2array=sequence2.toCharArray();
> >>
> >>         for(int i=0;i<seq1array.length;i++){
> >>                                 // Number of aligned sites
> >>                 if(((seq1array[i]=='a') ||
> >> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> >> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> >> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> >> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> >> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> >> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>
> >>                         numberOfAlignedSites++;
> >>                 }
> >>
> >>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>                         p++;
> >>                 }
> >>                 else
> >>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>                         p++;
> >>                 }
> >>                 else
> >>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>                         p++;
> >>                 }
> >>                 else
> >>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>                         p++;
> >>                 }
> >>                 else
> >>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>                                 q++;
> >>                         }
> >>                 else
> >>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>                                 q++;
> >>                         }
> >>                 else
> >>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>                                         q++;
> >>                                 }
> >>                 else
> >>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>                                         q++;
> >>                                 }
> >>                 else
> >>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>                                         q++;
> >>                                 }
> >>                 else
> >>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>                                         q++;
> >>                                 }
> >>                 else
> >>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>                                         q++;
> >>                                 }
> >>                 else
> >>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>                                         q++;
> >>                                 }
> >>
> >>
> >>
> >>
> >>         }
> >>
> >>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> >> (((double)q)/numberOfAlignedSites);
> >>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> >>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> >>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> >>          return dist;
> >> }
> >>
> >> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> >> <mailto:holland at ebi.ac.uk>> wrote:
> >>
> >>     You should take a look at the latest 1.5 release, in the
> >>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> >>     phylogenetics code that will perform tasks as you describe. The future
> >>     plan is to extend this code to cover a wider range of use cases.
> >>     Kimura2P
> >>     is already implemented here, in
> >>     org.biojavax.bio.phylo.MultipleHitCorrection.
> >>
> >>     If you can't find code that will do what you want, but have written some
> >>     before, then please do feel free to contribute it. Even if it is
> >>     slow, I'm
> >>     sure someone out there will be able to help optimise it!
> >>
> >>     cheers,
> >>     Richard
> >>
> >>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> >>     > Hi,
> >>     >
> >>     > Are there functions to calculate evolutionary pairwise distances like
> >>     > Kimura2P,Finkelstein etc in Biojava
> >>     > I did write smthng on my own but on large sequences it runs terribly
> >>     > slow and I am not even sure if thats right.
> >>     > --
> >>     > Vineith Kaul
> >>     > Masters Student Bioinformatics
> >>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>     > Georgia Tech, Atlanta
> >>     > _______________________________________________
> >>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> >>     <mailto:Biojava-l at lists.open-bio.org>
> >>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>     >
> >>
> >>
> >>     --
> >>     Richard Holland
> >>     BioMart ( http://www.biomart.org/)
> >>     EMBL-EBI
> >>     Hinxton, Cambridgeshire CB10 1SD, UK
> >>
> >>
> >>
> >>
> >> --
> >> Vineith Kaul
> >> Masters Student Bioinformatics
> >> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >> Georgia Tech, Atlanta
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> > 4iKvsyBj2uznhhjTF9EYDFE=
> > =LALE
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk  Wed Oct 24 08:28:21 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 13:28:21 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
Message-ID: <471F3A65.50202@ebi.ac.uk>

Yes, a very good point & one I was going to make beforehand but forgot :)

Not to mention that micro-benchmarks/profiling in Java are 
notorious for giving misleading results due to VM warmup & JIT compilation 
optimisations. There is a framework hosted on Java.net somewhere which 
can perform VM warmups and code iterations to produce more accurate 
benchmarking results, but the name escapes me at the moment.
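
The idea behind such a harness can be sketched in a few lines. This is
only an illustration under stated assumptions: WarmupBench and
averageNanos are made-up names, not the Java.net framework mentioned
above, and a real harness does a good deal more. The sketch runs the
task untimed first so the JIT has a chance to compile it, then times a
batch of calls and reports the average.

```java
import java.util.function.IntSupplier;

final class WarmupBench {
    /**
     * Hypothetical helper: runs the task warmupRuns times untimed, then
     * returns the average nanoseconds per call over timedRuns calls.
     * timedRuns is assumed to be greater than zero.
     */
    static double averageNanos(IntSupplier task, int warmupRuns, int timedRuns) {
        long sink = 0; // accumulate results so the JIT cannot drop the calls
        for (int i = 0; i < warmupRuns; i++) {
            sink += task.getAsInt();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            sink += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never true; keeps sink live
        }
        return (double) elapsed / timedRuns;
    }

    public static void main(String[] args) {
        String a = "ACGTACGTAC", b = "GCTTACGTAC";
        // Stand-in workload: count mismatching positions.
        IntSupplier task = () -> {
            int diff = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) diff++;
            }
            return diff;
        };
        System.out.println(averageNanos(task, 20000, 100000) + " ns/call");
    }
}
```

The warmup and iteration counts in main are arbitrary; the timings will
vary from machine to machine and run to run.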

However, looking at this particular code I get the feeling that this is 
about as fast as it's going to get without someone doing bitwise XOR 
operations or some C code ... that's not an open invitation for people 
to start recoding this in C :). At the end of the day the key to 
optimisation is to ask the question "is it fast enough already?". If it 
is then there's no point :)
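
To make the XOR remark concrete, here is a minimal sketch. It assumes a
2-bit encoding picked purely for illustration (A=0, G=1, C=2, T=3), so
that the two purines differ only in the low bit and so do the two
pyrimidines. With that encoding the XOR of two codes is 0 for identical
bases, 1 for a transition and 2 or 3 for a transversion. K2PSketch and
its methods are hypothetical names, not part of BioJava. The distance
uses the same formula as the posted K2P method; unlike the posted code
it stops at the end of the shorter sequence.

```java
import java.util.Arrays;

final class K2PSketch {
    // Lookup table: ASCII char -> 2-bit base code, -1 for anything else.
    private static final byte[] CODE = new byte[128];
    static {
        Arrays.fill(CODE, (byte) -1);
        CODE['A'] = CODE['a'] = 0; // purine
        CODE['G'] = CODE['g'] = 1; // purine
        CODE['C'] = CODE['c'] = 2; // pyrimidine
        CODE['T'] = CODE['t'] = 3; // pyrimidine
    }

    static int code(char c) {
        return c < 128 ? CODE[c] : -1;
    }

    /** Returns {alignedSites, transitions, transversions}. */
    static long[] count(String s1, String s2) {
        long n = 0, p = 0, q = 0;
        int len = Math.min(s1.length(), s2.length());
        for (int i = 0; i < len; i++) {
            int a = code(s1.charAt(i));
            int b = code(s2.charAt(i));
            if (a < 0 || b < 0) {
                continue; // gap or ambiguity symbol: not an aligned site
            }
            n++;
            int x = a ^ b;
            if (x == 1) {
                p++;      // A<->G or C<->T
            } else if (x != 0) {
                q++;      // purine <-> pyrimidine
            }
        }
        return new long[] {n, p, q};
    }

    /** Kimura 2-parameter distance; NaN if there are no aligned sites. */
    static double distance(String s1, String s2) {
        long[] c = count(s1, s2);
        double P = (double) c[1] / c[0];
        double Q = (double) c[2] / c[0];
        return -0.5 * Math.log(1.0 - 2.0 * P - Q)
             - 0.25 * Math.log(1.0 - 2.0 * Q);
    }
}
```

Whether this is actually faster than the if/else chain is exactly the
sort of thing a profiler should decide.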

Andy

Mark Schreiber wrote:
> Hi -
> 
> From experience the best way to optimize java code is to run a
> profiler. The one in Netbeans is quite good.
> 
> The reason is that the hotspot or JIT compilers might natively compile
> the part of the code that you think is slow and actually make it
> faster than something else which becomes the bottle neck. Using a good
> profiler you can detect how much time is spent in each method and pin
> point some candidate methods for optimization. You can also see if
> there is a burden due to creation of lots of objects.
> 
> - Mark
> 
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Our code is very similar but not identical. The original programmer
>> shortcutted a lot of else if conditions by considering if the two bases
>> were equal or not. It can then calculate the transitional changes &
>> assume the rest are transversional.
>>
>> In terms of speed of both pieces of code I can't see an obvious way to
>> speed it up. Probably in our code replacing the 10 or so calls to
>> String.charAt() with two calls & referencing those chars might help
>> but in all honesty I cannot say.
>>
>> Andy
>>
>> Richard Holland wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Thanks.
>>>
>>> Your code is similar to the code we have in
>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>> see if it is identical, but it probably is.
>>>
>>> You can call our code like this:
>>>
>>>  // import statement for biojava phylo stuff
>>>  import org.biojavax.bio.phylo.*;
>>>
>>>  // ...rest of code goes here
>>>
>>>  // call Kimura2P
>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>  String seq2 = ...;
>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>
>>> Note that our implementation expects sequence strings to be in upper
>>> case, so you'll need to make sure your data is upper case or has been
>>> converted to upper case before calling our method.
>>>
>>> cheers,
>>> Richard
>>>
>>> vineith kaul wrote:
>>>> This is what I have. Thanks a lot for the help.
>>>>
>>>>
>>>> //Method to calculate the Kimura 2 parameter distance
>>>> public static double K2P(String sequence1,String sequence2){
>>>>         // p = transitional differences (A<->G & T<->C);
>>>>         // q = transversional differences (A/G<-->C/T)
>>>>         long p=0,q=0,numberOfAlignedSites=0;
>>>>
>>>>
>>>>         char[] seq1array=sequence1.toCharArray();
>>>>         char[] seq2array=sequence2.toCharArray();
>>>>
>>>>         for(int i=0;i<seq1array.length;i++){
>>>>                                 // Number of aligned sites
>>>>                 if(((seq1array[i]=='a') ||
>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>
>>>>                         numberOfAlignedSites++;
>>>>                 }
>>>>
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                         p++;
>>>>                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                                 q++;
>>>>                         }
>>>>                 else
>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                                 q++;
>>>>                         }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>                                         q++;
>>>>                                 }
>>>>                 else
>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>                                         q++;
>>>>                                 }
>>>>
>>>>
>>>>
>>>>
>>>>         }
>>>>
>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>> (((double)q)/numberOfAlignedSites);
>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>          return dist;
>>>> }
>>>>
>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>
>>>>     You should take a look at the latest 1.5 release, in the
>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>     Kimura2P
>>>>     is already implemented here, in
>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>
>>>>     If you can't find code that will do what you want, but have written some
>>>>     before, then please do feel free to contribute it. Even if it is
>>>>     slow, I'm
>>>>     sure someone out there will be able to help optimise it!
>>>>
>>>>     cheers,
>>>>     Richard
>>>>
>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>     > Hi,
>>>>     >
>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>     > slow and I am not even sure if thats right.
>>>>     > --
>>>>     > Vineith Kaul
>>>>     > Masters Student Bioinformatics
>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>     > Georgia Tech, Atlanta
>>>>     > _______________________________________________
>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>     >
>>>>
>>>>
>>>>     --
>>>>     Richard Holland
>>>>     BioMart ( http://www.biomart.org/)
>>>>     EMBL-EBI
>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Vineith Kaul
>>>> Masters Student Bioinformatics
>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>> Georgia Tech, Atlanta
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.2.2 (GNU/Linux)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>>>
>>> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
>>> 4iKvsyBj2uznhhjTF9EYDFE=
>>> =LALE
>>> -----END PGP SIGNATURE-----
>>

From markjschreiber at gmail.com  Wed Oct 24 09:19:25 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:19:25 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F3A65.50202@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
Message-ID: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>

Another important consideration after optimization is whether the task
can be multithreaded. Almost all modern computers have at least 2 cores,
so if the algorithm can be parallelized you will get some performance
bonus on most machines.

Modern JVMs will automagically try to use idle CPUs to execute new
threads spawned by the programmer.

- Mark

On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> Yes, a very good point & one I was going to make beforehand but forgot :)
>
> Not to mention that micro-benchmarks/profiling in Java are
> notorious for giving misleading results due to VM warmup & JIT compilation
> optimisations. There is a framework hosted on Java.net somewhere which
> can perform VM warmups and code iterations to produce more accurate
> benchmarking results, but the name escapes me at the moment.
>
> However, looking at this particular code I get the feeling that this is
> about as fast as it's going to get without someone doing bitwise XOR
> operations or some C code ... that's not an open invitation for people
> to start recoding this in C :). At the end of the day the key to
> optimisation is to ask the question "is it fast enough already?". If it
> is then there's no point :)
>
> Andy
>

From holland at ebi.ac.uk  Wed Oct 24 09:33:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:33:53 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
Message-ID: <471F49C1.9070901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.
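
As a rough sketch of that divide-and-combine scheme, assuming the
caller supplies N (at least 1): each task counts aligned sites,
transitions and transversions over its own slice, and the totals are
summed before the distance formula from the posted K2P method is
applied once. ParallelK2P and its method names are made up for
illustration and are not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelK2P {
    // 2-bit code per base (illustrative choice); -1 for anything else.
    static int code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'G': return 1;
            case 'C': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    /** Counts {aligned, transitions, transversions} over [from, to). */
    static long[] countRange(String s1, String s2, int from, int to) {
        long n = 0, p = 0, q = 0;
        for (int i = from; i < to; i++) {
            int a = code(s1.charAt(i)), b = code(s2.charAt(i));
            if (a < 0 || b < 0) continue;   // not an aligned site
            n++;
            int x = a ^ b;
            if (x == 1) p++;                // transition
            else if (x != 0) q++;           // transversion
        }
        return new long[] {n, p, q};
    }

    /** Splits the input into nThreads chunks and sums the partial counts. */
    static long[] count(String s1, String s2, int nThreads) {
        int len = Math.min(s1.length(), s2.length());
        int chunk = (len + nThreads - 1) / nThreads; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<long[]>> parts = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int from = Math.min(len, t * chunk);
                final int to = Math.min(len, from + chunk);
                Callable<long[]> job = () -> countRange(s1, s2, from, to);
                parts.add(pool.submit(job));
            }
            long[] total = new long[3];
            for (Future<long[]> f : parts) {
                long[] c = f.get();
                for (int k = 0; k < 3; k++) total[k] += c[k];
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Final calculation on the combined counts. */
    static double distance(String s1, String s2, int nThreads) {
        long[] c = count(s1, s2, nThreads);
        double P = (double) c[1] / c[0];
        double Q = (double) c[2] / c[0];
        return -0.5 * Math.log(1.0 - 2.0 * P - Q)
             - 0.25 * Math.log(1.0 - 2.0 * Q);
    }
}
```

Creating a pool per call keeps the sketch short; for short sequences
the thread start-up cost will likely outweigh any gain.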

But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.
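
One way this could look, as a sketch only: take N from an optional
system property if the user has set a sensible value, otherwise fall
back to the number of processors the JVM reports. The property name
"biojava.threads" and the class ParallelismHint are invented here for
illustration; no such BioJava setting exists as far as this sketch
assumes. Runtime.availableProcessors() is the standard JDK call.

```java
final class ParallelismHint {
    // Hypothetical property name, used only for this illustration.
    static final String PROPERTY = "biojava.threads";

    /**
     * Picks N: a valid positive integer hint wins, otherwise the
     * available processor count, and never less than 1.
     */
    static int resolve(String hint, int available) {
        if (hint != null) {
            try {
                int n = Integer.parseInt(hint.trim());
                if (n >= 1) {
                    return n;
                }
            } catch (NumberFormatException e) {
                // Unparseable hint: ignore it and use the fallback.
            }
        }
        return Math.max(1, available);
    }

    /** Reads the hint and the processor count at runtime. */
    static int threads() {
        return resolve(System.getProperty(PROPERTY),
                       Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        // e.g. java -Dbiojava.threads=2 ParallelismHint
        System.out.println("N = " + threads());
    }
}
```

Keeping resolve() separate from the system lookups makes the policy
easy to check without touching real system properties.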

cheers,
Richard

Mark Schreiber wrote:
> Another important consideration after optimization is whether the task
> can be multithreaded. Almost all modern computers have at least 2 cores,
> so if the algorithm can be parallelized you will get some performance
> bonus on most machines.
> 
> Modern JVMs will automagically try to use idle CPUs to execute new
> threads spawned by the programmer.
> 
> - Mark
> 
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Yes a very good point & one I was going to make before hand but forgot :)
>>
>> Also not to mention that micro-benchmarks/profiling in Java are
>> notorious for giving false results due to VM warmup & JIT compilation
>> optimisations. There is a framework hosted on Java.net somewhere which
>> can perform VM warmups and code iterations to produce more accurate
>> benchmarking results; but the name escapes me at the moment.
>>
>> However looking at this particular code I get the feeling that this is
>> about as fast as its going to get without someone doing bitwise XOR
>> operations or some C code ... that's not an open invitation for people
>> to start recoding this in C :). At the end of the day the key to
>> optimisation is to ask the question "is it fast enough already?". If it
>> is then there's no point :)
>>
>> Andy
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> >From experience the best way to optimize java code is to run a
>>> profiler. The one in Netbeans is quite good.
>>>
>>> The reason is that the hotspot or JIT compilers might natively compile
>>> the part of the code that you think is slow and actually make it
>>> faster than something else which becomes the bottle neck. Using a good
>>> profiler you can detect how much time is spent in each method and pin
>>> point some candidate methods for optimization. You can also see if
>>> there is a burden due to creation of lots of objects.
>>>
>>> - Mark
>>>
>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> Our code is very similar but not identical. The original programmer
>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>> were equal or not. It can then calculate the transitional changes &
>>>> assume the rest are transversional.
>>>>
>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>> String.charAt() with a two calls & referencing those chars might help
>>>> but in all honesty I cannot say.
>>>>
>>>> Andy
>>>>
>>>> Richard Holland wrote:
> Thanks.
> 
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
> 
> You can call our code like this:
> 
>  // import statement for biojava phylo stuff
>  import org.biojavax.bio.phylo.*;
> 
>  // ...rest of code goes here
> 
>  // call Kimura2P
>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>  String seq2 = ...;
>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> 
> Note that our implementation expects sequence strings to be in upper
> case, so you'll need to make sure your data is upper case or has been
> converted to upper case before calling our method.
> 
> cheers,
> Richard
> 
> vineith kaul wrote:
>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>
>>>>>>>
>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>
>>>>>>>
>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>
>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>                                 // Number of aligned sites
>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>
>>>>>>>                         numberOfAlignedSites++;
>>>>>>>                 }
>>>>>>>
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>          return dist;
>>>>>>> }
>>>>>>>
>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>
>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>     Kimura2P
>>>>>>>     is already implemented here, in
>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>
>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>     slow, I'm
>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>
>>>>>>>     cheers,
>>>>>>>     Richard
>>>>>>>
>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>     > Hi,
>>>>>>>     >
>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>     > --
>>>>>>>     > Vineith Kaul
>>>>>>>     > Masters Student Bioinformatics
>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>     > _______________________________________________
>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>     >
>>>>>>>
>>>>>>>
>>>>>>>     --
>>>>>>>     Richard Holland
>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>     EMBL-EBI
>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vineith Kaul
>>>>>>> Masters Student Bioinformatics
>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> Georgia Tech, Atlanta
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----

From markjschreiber at gmail.com  Wed Oct 24 09:41:16 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:41:16 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
Message-ID: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>

I'm not aware of a way to determine the number of CPUs within a
program, although possibly it is one of the environment variables
available from System.

Even if it can't be determined there could be a method argument to
specify the number of threads to spawn.
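
A minimal sketch of what such a method argument might feed into, assuming
the chunk-per-thread scheme Richard describes in the quoted message below.
ChunkSplitter and split() are hypothetical names, not BioJava API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of BioJava): splits the index range
// [0, length) into contiguous chunks, one per requested thread.
class ChunkSplitter {

    // Returns {startInclusive, endExclusive} pairs covering [0, length).
    static List<int[]> split(int length, int threads) {
        if (length < 0 || threads < 1) {
            throw new IllegalArgumentException(
                    "length must be >= 0 and threads must be >= 1");
        }
        // Never create more chunks than there are sites to process.
        int n = Math.min(threads, Math.max(length, 1));
        int base = length / n;   // minimum chunk size
        int extra = length % n;  // the first 'extra' chunks get one more site
        List<int[]> chunks = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < n; i++) {
            int end = start + base + (i < extra ? 1 : 0);
            chunks.add(new int[] {start, end});
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // The caller supplies the thread count as a plain argument.
        for (int[] c : split(10, 3)) {
            System.out.println(c[0] + ".." + c[1]);
        }
    }
}
```

The caller decides the thread count; the helper only works out which slice
of the alignment each thread would get.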

- Mark

On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
> > Another important consideration after optimization is can the task be
> > multithreaded?  Almost all modern computers have at least 2 cores. So
> > if the algorithm can be parallelized you will get some performance
> > bonus on most machines.
> >
> > Modern JVM's will automagically try to use idle CPU's to execute new
> > threads spawned by the programmer.
> >
> > - Mark
> >
> > On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> Yes a very good point & one I was going to make before hand but forgot :)
> >>
> >> Also not to mention that micro-benchmarks/profiling in Java are
> >> notorious for giving false results due to VM warmup & JIT compilation
> >> optimisations. There is a framework hosted on Java.net somewhere which
> >> can perform VM warmups and code iterations to produce more accurate
> >> benchmarking results; but the name escapes me at the moment.
> >>
> >> However looking at this particular code I get the feeling that this is
> >> about as fast as its going to get without someone doing bitwise XOR
> >> operations or some C code ... that's not an open invitation for people
> >> to start recoding this in C :). At the end of the day the key to
> >> optimisation is to ask the question "is it fast enough already?". If it
> >> is then there's no point :)
> >>
> >> Andy
> >>
> >> Mark Schreiber wrote:
> >>> Hi -
> >>>
> >>> From experience the best way to optimize java code is to run a
> >>> profiler. The one in Netbeans is quite good.
> >>>
> >>> The reason is that the hotspot or JIT compilers might natively compile
> >>> the part of the code that you think is slow and actually make it
> >>> faster than something else which becomes the bottle neck. Using a good
> >>> profiler you can detect how much time is spent in each method and pin
> >>> point some candidate methods for optimization. You can also see if
> >>> there is a burden due to creation of lots of objects.
> >>>
> >>> - Mark
> >>>
> >>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> >>>> Our code is very similar but not identical. The original programmer
> >>>> shortcutted a lot of else if conditions by considering if the two bases
> >>>> were equal or not. It can then calculate the transitional changes &
> >>>> assume the rest are transversional.
> >>>>
> >>>> In terms of speed of both pieces of code I can't see an obvious way to
> >>>> speed it up. Probably in our code removing the 10 or so calls to
> >>>> String.charAt() with a two calls & referencing those chars might help
> >>>> but in all honesty I cannot say.
> >>>>
> >>>> Andy
> >>>>
> >>>> Richard Holland wrote:
> > Thanks.
> >
> > Your code is similar to the code we have in
> > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > see if it is identical, but it probably is.
> >
> > You can call our code like this:
> >
> >  // import statement for biojava phylo stuff
> >  import org.biojavax.bio.phylo.*;
> >
> >  // ...rest of code goes here
> >
> >  // call Kimura2P
> >  String seq1 = ...; // Get seq1 and seq2 from somewhere
> >  String seq2 = ...;
> >  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> >
> > Note that our implementation expects sequence strings to be in upper
> > case, so you'll need to make sure your data is upper case or has been
> > converted to upper case before calling our method.
> >
> > cheers,
> > Richard
> >
> > vineith kaul wrote:
> >>>>>>> This is what I have .....Thanks a lot  fr the help.
> >>>>>>>
> >>>>>>>
> >>>>>>> //Method to calculate the Kimura 2 parameter distance
> >>>>>>> public static double K2P(String sequence1,String sequence2){
> >>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> >>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> >>>>>>>
> >>>>>>>
> >>>>>>>         char[] seq1array=sequence1.toCharArray();
> >>>>>>>         char[] seq2array=sequence2.toCharArray();
> >>>>>>>
> >>>>>>>         for(int i=0;i<seq1array.length;i++){
> >>>>>>>                                 // Number of aligned sites
> >>>>>>>                 if(((seq1array[i]=='a') ||
> >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>>>>
> >>>>>>>                         numberOfAlignedSites++;
> >>>>>>>                 }
> >>>>>>>
> >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>>>>                         p++;
> >>>>>>>                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>>>>                         p++;
> >>>>>>>                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>>>>                         p++;
> >>>>>>>                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>>>>                         p++;
> >>>>>>>                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>>>>                                 q++;
> >>>>>>>                         }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>>>>                                 q++;
> >>>>>>>                         }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>                 else
> >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>>>>                                         q++;
> >>>>>>>                                 }
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>         }
> >>>>>>>
> >>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> >>>>>>> (((double)q)/numberOfAlignedSites);
> >>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> >>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> >>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> >>>>>>>          return dist;
> >>>>>>> }
> >>>>>>>
> >>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> >>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
> >>>>>>>
> >>>>>>>     You should take a look at the latest 1.5 release, in the
> >>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> >>>>>>>     phylogenetics code that will perform tasks as you describe. The future
> >>>>>>>     plan is to extend this code to cover a wider range of use cases.
> >>>>>>>     Kimura2P
> >>>>>>>     is already implemented here, in
> >>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
> >>>>>>>
> >>>>>>>     If you can't find code that will do what you want, but have written some
> >>>>>>>     before, then please do feel free to contribute it. Even if it is
> >>>>>>>     slow, I'm
> >>>>>>>     sure someone out there will be able to help optimise it!
> >>>>>>>
> >>>>>>>     cheers,
> >>>>>>>     Richard
> >>>>>>>
> >>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> >>>>>>>     > Hi,
> >>>>>>>     >
> >>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
> >>>>>>>     > Kimura2P,Finkelstein etc in Biojava
> >>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
> >>>>>>>     > slow and I am not even sure if thats right.
> >>>>>>>     > --
> >>>>>>>     > Vineith Kaul
> >>>>>>>     > Masters Student Bioinformatics
> >>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>>>>>>     > Georgia Tech, Atlanta
> >>>>>>>     > _______________________________________________
> >>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> >>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
> >>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>>>>     >
> >>>>>>>
> >>>>>>>
> >>>>>>>     --
> >>>>>>>     Richard Holland
> >>>>>>>     BioMart ( http://www.biomart.org/)
> >>>>>>>     EMBL-EBI
> >>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Vineith Kaul
> >>>>>>> Masters Student Bioinformatics
> >>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>> _______________________________________________
> >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----
>

From markjschreiber at gmail.com  Wed Oct 24 09:48:00 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:48:00 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>

It appears it is as simple as:

Runtime.getRuntime().availableProcessors();
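
A small sketch of how that call might be combined with the method
argument suggested in the previous message. ThreadCount and resolve() are
hypothetical names used only for illustration, not BioJava API:

```java
// Hypothetical helper (not part of BioJava): picks a worker-thread count,
// letting the caller override the default reported by the JVM.
class ThreadCount {

    // requested <= 0 means "no preference": fall back to the number of
    // processors the JVM reports, which is documented to be at least 1.
    static int resolve(int requested) {
        if (requested > 0) {
            return requested;
        }
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("JVM reports "
                + Runtime.getRuntime().availableProcessors() + " processor(s)");
        System.out.println("Default thread count: " + resolve(0));
        System.out.println("Explicit thread count: " + resolve(4));
    }
}
```

The reported value can change while the JVM is running, so it is probably
best read when the work is about to be divided rather than cached.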

- Mark

On 10/24/07, Mark Schreiber <markjschreiber at gmail.com> wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program, although possibly it is one of the environment variables
> available from System.
>
> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.
>
> - Mark
>
> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This particular code could easily be parallelised - given N threads, you
> > can simply divide the input into N chunks and get each thread to process
> > 1/Nth of the input. You then combine the output of each thread to do the
> > final calculation.
> >
> > But, it'd be bad practice to always fork a predetermined N threads for a
> > given task. It'd be much better to somehow be able to ask 'how parallel
> > can I make this?' at runtime by checking system resources, or maybe get
> > the parallel-savvy user to set an optional BioJava-wide parallelisation
> > hint. N could then be determined and the task divided appropriately.
> >
> > cheers,
> > Richard
> >
> > Mark Schreiber wrote:
> > > Another important consideration after optimization is can the task be
> > > multithreaded?  Almost all modern computers have at least 2 cores. So
> > > if the algorithm can be parallelized you will get some performance
> > > bonus on most machines.
> > >
> > > Modern JVM's will automagically try to use idle CPU's to execute new
> > > threads spawned by the programmer.
> > >
> > > - Mark
> > >
> > > On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> > >> Yes a very good point & one I was going to make before hand but forgot :)
> > >>
> > >> Also not to mention that micro-benchmarks/profiling in Java are
> > >> notorious for giving false results due to VM warmup & JIT compilation
> > >> optimisations. There is a framework hosted on Java.net somewhere which
> > >> can perform VM warmups and code iterations to produce more accurate
> > >> benchmarking results; but the name escapes me at the moment.
> > >>
> > >> However looking at this particular code I get the feeling that this is
> > >> about as fast as its going to get without someone doing bitwise XOR
> > >> operations or some C code ... that's not an open invitation for people
> > >> to start recoding this in C :). At the end of the day the key to
> > >> optimisation is to ask the question "is it fast enough already?". If it
> > >> is then there's no point :)
> > >>
> > >> Andy
> > >>
> > >> Mark Schreiber wrote:
> > >>> Hi -
> > >>>
> > >>> From experience the best way to optimize java code is to run a
> > >>> profiler. The one in Netbeans is quite good.
> > >>>
> > >>> The reason is that the hotspot or JIT compilers might natively compile
> > >>> the part of the code that you think is slow and actually make it
> > >>> faster than something else which becomes the bottle neck. Using a good
> > >>> profiler you can detect how much time is spent in each method and pin
> > >>> point some candidate methods for optimization. You can also see if
> > >>> there is a burden due to creation of lots of objects.
> > >>>
> > >>> - Mark
> > >>>
> > >>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> > >>>> Our code is very similar but not identical. The original programmer
> > >>>> shortcutted a lot of else if conditions by considering if the two bases
> > >>>> were equal or not. It can then calculate the transitional changes &
> > >>>> assume the rest are transversional.
> > >>>>
> > >>>> In terms of speed of both pieces of code I can't see an obvious way to
> > >>>> speed it up. Probably in our code removing the 10 or so calls to
> > >>>> String.charAt() with a two calls & referencing those chars might help
> > >>>> but in all honesty I cannot say.
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>> Richard Holland wrote:
> > > Thanks.
> > >
> > > Your code is similar to the code we have in
> > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > > see if it is identical, but it probably is.
> > >
> > > You can call our code like this:
> > >
> > >  // import statement for biojava phylo stuff
> > >  import org.biojavax.bio.phylo.*;
> > >
> > >  // ...rest of code goes here
> > >
> > >  // call Kimura2P
> > >  String seq1 = ...; // Get seq1 and seq2 from somewhere
> > >  String seq2 = ...;
> > >  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> > >
> > > Note that our implementation expects sequence strings to be in upper
> > > case, so you'll need to make sure your data is upper case or has been
> > > converted to upper case before calling our method.
> > >
> > > cheers,
> > > Richard
> > >
> > > vineith kaul wrote:
> > >>>>>>> This is what I have .....Thanks a lot  fr the help.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> //Method to calculate the Kimura 2 parameter distance
> > >>>>>>> public static double K2P(String sequence1,String sequence2){
> > >>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> > >>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>         char[] seq1array=sequence1.toCharArray();
> > >>>>>>>         char[] seq2array=sequence2.toCharArray();
> > >>>>>>>
> > >>>>>>>         for(int i=0;i<seq1array.length;i++){
> > >>>>>>>                                 // Number of aligned sites
> > >>>>>>>                 if(((seq1array[i]=='a') ||
> > >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> > >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> > >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> > >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> > >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> > >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>
> > >>>>>>>                         numberOfAlignedSites++;
> > >>>>>>>                 }
> > >>>>>>>
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                                 q++;
> > >>>>>>>                         }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                                 q++;
> > >>>>>>>                         }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>         }
> > >>>>>>>
> > >>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> > >>>>>>> (((double)q)/numberOfAlignedSites);
> > >>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> > >>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> > >>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> > >>>>>>>          return dist;
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> > >>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
> > >>>>>>>
> > >>>>>>>     You should take a look at the latest 1.5 release, in the
> > >>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> > >>>>>>>     phylogenetics code that will perform tasks as you describe. The future
> > >>>>>>>     plan is to extend this code to cover a wider range of use cases.
> > >>>>>>>     Kimura2P
> > >>>>>>>     is already implemented here, in
> > >>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
> > >>>>>>>
> > >>>>>>>     If you can't find code that will do what you want, but have written some
> > >>>>>>>     before, then please do feel free to contribute it. Even if it is
> > >>>>>>>     slow, I'm
> > >>>>>>>     sure someone out there will be able to help optimise it!
> > >>>>>>>
> > >>>>>>>     cheers,
> > >>>>>>>     Richard
> > >>>>>>>
> > >>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> > >>>>>>>     > Hi,
> > >>>>>>>     >
> > >>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
> > >>>>>>>     > Kimura2P,Finkelstein etc in Biojava
> > >>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
> > >>>>>>>     > slow and I am not even sure if thats right.
> > >>>>>>>     > --
> > >>>>>>>     > Vineith Kaul
> > >>>>>>>     > Masters Student Bioinformatics
> > >>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> > >>>>>>>     > Georgia Tech, Atlanta
> > >>>>>>>     > _______________________________________________
> > >>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> > >>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
> > >>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >>>>>>>     >
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>     --
> > >>>>>>>     Richard Holland
> > >>>>>>>     BioMart ( http://www.biomart.org/)
> > >>>>>>>     EMBL-EBI
> > >>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Vineith Kaul
> > >>>>>>> Masters Student Bioinformatics
> > >>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> > >>>>>>> Georgia Tech, Atlanta
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >>>> _______________________________________________
> > >>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >>>>
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> > IEyRleSs1+AziCvfhcES8wI=
> > =uLDm
> > -----END PGP SIGNATURE-----
> >
>

From ayates at ebi.ac.uk  Wed Oct 24 09:49:22 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:22 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
Message-ID: <471F4D62.3030900@ebi.ac.uk>

Of course, parallelisation only helps if the task is not limited by
something else such as memory, IO or a database (which this one
wouldn't be). There's also the scenario where thread startup takes
longer than running the code in serial :). Not to mention that Java
concurrency isn't an easy thing to write correctly.

I'd prefer the model promoted in Java 5, where you have pools of threads
& pass in instances of Callable (the successor to Runnable: call() can
return a result or throw an exception, and the pool hands back a Future
for each one). You pass in a list of these callables, wait for them all
to finish & grab the results. You can have as many callables as you like
& the thread pool will process them as & when a thread becomes free.
Combine this with the reported number of processors/cores on the machine
& make that the default size of the pool (assuming you're making it
parallel because you're flat-lining a processor).

Say:

int processorCount = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(processorCount);

Executors is the java.util.concurrent factory class for thread pools, so
you get the idea :). Of course someone may not want to parallelise a job
(I quite like having dual cores, as a runaway process can take out one
but I can still run top & kill the thing).
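
To make the pool-of-Callables idea concrete, here is a minimal sketch
applied to the Kimura 2-parameter count discussed in this thread. It
assumes Java 8 or later (for the lambda). The class name ParallelK2P and
its methods are made up for illustration and are not part of BioJava.
Each Callable counts aligned sites, transitions and transversions in one
chunk, and the partial counts are summed before the final calculation.
It also assumes threads >= 1 and at least one aligned site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration only; not a BioJava class.
class ParallelK2P {

    // Returns {aligned sites, transitions, transversions} for positions [from, to).
    static long[] countRange(String s1, String s2, int from, int to) {
        long n = 0, p = 0, q = 0;
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            // Skip gaps and ambiguity codes: only A, C, G, T count as aligned.
            if ("ACGT".indexOf(a) < 0 || "ACGT".indexOf(b) < 0) continue;
            n++;
            if (a == b) continue;
            boolean aIsPurine = (a == 'A' || a == 'G');
            boolean bIsPurine = (b == 'A' || b == 'G');
            // Same chemical class means a transition, otherwise a transversion.
            if (aIsPurine == bIsPurine) p++; else q++;
        }
        return new long[] {n, p, q};
    }

    // Splits the sequences into at most 'threads' chunks and sums the partial counts.
    static double k2p(String s1, String s2, int threads) {
        int len = Math.min(s1.length(), s2.length());
        int chunk = Math.max(1, (len + threads - 1) / threads);
        List<Callable<long[]>> tasks = new ArrayList<>();
        for (int start = 0; start < len; start += chunk) {
            final int from = start;
            final int to = Math.min(len, start + chunk);
            tasks.add(() -> countRange(s1, s2, from, to));
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long n = 0, p = 0, q = 0;
        try {
            // invokeAll blocks until every task has finished.
            for (Future<long[]> f : pool.invokeAll(tasks)) {
                long[] c = f.get();
                n += c[0];
                p += c[1];
                q += c[2];
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        double P = (double) p / n; // proportion of transitions
        double Q = (double) q / n; // proportion of transversions
        return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(k2p("AAGTACGTAA", "GACTACGTAA", cores));
    }
}
```

For short sequences the cost of creating the pool will probably outweigh
any gain, so something like this only makes sense for long alignments or
when the pool is reused across many pairwise comparisons.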

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>> Another important consideration after optimization is can the task be
>> multithreaded?  Almost all modern computers have at least 2 cores. So
>> if the algorithm can be parallelized you will get some performance
>> bonus on most machines.
>>
>> Modern JVMs will automagically try to use idle CPUs to execute new
>> threads spawned by the programmer.
>>
>> - Mark
>>
>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Yes, a very good point & one I was going to make beforehand but forgot :)
>>>
>>> Also not to mention that micro-benchmarks/profiling in Java are
>>> notorious for giving false results due to VM warmup & JIT compilation
>>> optimisations. There is a framework hosted on Java.net somewhere which
>>> can perform VM warmups and code iterations to produce more accurate
>>> benchmarking results; but the name escapes me at the moment.
>>>
>>> However looking at this particular code I get the feeling that this is
>>> about as fast as it's going to get without someone doing bitwise XOR
>>> operations or some C code ... that's not an open invitation for people
>>> to start recoding this in C :). At the end of the day the key to
>>> optimisation is to ask the question "is it fast enough already?". If it
>>> is then there's no point :)
>>>
>>> Andy
>>>
>>> Mark Schreiber wrote:
>>>> Hi -
>>>>
>>>> From experience, the best way to optimize Java code is to run a
>>>> profiler. The one in NetBeans is quite good.
>>>>
>>>> The reason is that the HotSpot or JIT compilers might natively compile
>>>> the part of the code that you think is slow and actually make it
>>>> faster than something else, which then becomes the bottleneck. Using a
>>>> good profiler you can detect how much time is spent in each method and
>>>> pinpoint some candidate methods for optimization. You can also see if
>>>> there is a burden due to the creation of lots of objects.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Our code is very similar but not identical. The original programmer
>>>>> short-cut a lot of the else-if conditions by first checking whether the
>>>>> two bases are equal. It can then count the transitional changes &
>>>>> assume the rest are transversional.
>>>>>
>>>>> In terms of speed, I can't see an obvious way to speed up either piece
>>>>> of code. In our code, replacing the 10 or so calls to String.charAt()
>>>>> with two calls & reusing those chars might help, but in all honesty
>>>>> I cannot say.
>>>>>
>>>>> Andy
>>>>>
>>>>> Richard Holland wrote:
>> Thanks.
>>
>> Your code is similar to the code we have in
>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>> see if it is identical, but it probably is.
>>
>> You can call our code like this:
>>
>>  // import statement for biojava phylo stuff
>>  import org.biojavax.bio.phylo.*;
>>
>>  // ...rest of code goes here
>>
>>  // call Kimura2P
>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>  String seq2 = ...;
>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>
>> Note that our implementation expects sequence strings to be in upper
>> case, so you'll need to make sure your data is upper case or has been
>> converted to upper case before calling our method.
>>
>> cheers,
>> Richard
>>
>> vineith kaul wrote:
>>>>>>>> This is what I have. Thanks a lot for the help.
>>>>>>>>
>>>>>>>>
>>>>>>>> // Method to calculate the Kimura 2-parameter distance
>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>         // p = transitional differences (A<->G & T<->C);
>>>>>>>>         // q = transversional differences (A/G<-->C/T)
>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0;
>>>>>>>>
>>>>>>>>
>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>
>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>                                 // Number of aligned sites
>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>
>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>          return dist;
>>>>>>>> }
>>>>>>>>
>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>
>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>     Kimura2P is already implemented here, in
>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>
>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>     before, then please do feel free to contribute it. Even if it is slow,
>>>>>>>>     I'm sure someone out there will be able to help optimise it!
>>>>>>>>
>>>>>>>>     cheers,
>>>>>>>>     Richard
>>>>>>>>
>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>     > Hi,
>>>>>>>>     >
>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>     > Kimura2P, Finkelstein etc. in BioJava?
>>>>>>>>     > I did write something on my own, but on large sequences it runs terribly
>>>>>>>>     > slow and I am not even sure if that's right.
>>>>>>>>     > --
>>>>>>>>     > Vineith Kaul
>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>     > _______________________________________________
>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>     >
>>>>>>>>
>>>>>>>>
>>>>>>>>     --
>>>>>>>>     Richard Holland
>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>     EMBL-EBI
>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Vineith Kaul
>>>>>>>> Masters Student Bioinformatics
>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----

From ayates at ebi.ac.uk  Wed Oct 24 09:49:38 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:38 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>	
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
	<93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
Message-ID: <471F4D72.80505@ebi.ac.uk>

Beat me to it :)

Andy

Mark Schreiber wrote:
> It appears it is as simple as:
> 
> Runtime.getRuntime().availableProcessors();
> 
> - Mark
> 
> On 10/24/07, Mark Schreiber <markjschreiber at gmail.com> wrote:
>> I'm not aware of a way to determine the number of CPUs within a
>> program, although possibly it is one of the environment variables
>> available from System.
>>
>> Even if it can't be determined there could be a method argument to
>> specify the number of threads to spawn.
>>
>> - Mark
>>

From holland at ebi.ac.uk  Wed Oct 24 09:53:29 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:53:29 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <471F4E59.1040703@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program, although possibly it is one of the environment variables
> available from System.

Yup, I'm not aware of one either. Actually, thinking about this, it'd be
a bad thing if BioJava grabbed both CPUs just because they're currently
available - the user might want it to only run on one, with something
else running on the second one. So attempting to guess a good
parallelisation value from the system is probably not good!

> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.

I was thinking more along the lines of a global static method in some
kind of toolkit class, so that any part of BJ which is
parallelisation-aware can take advantage of it if it is set. This also
avoids passing parameters that don't have an immediately obvious impact
on the expected output of the method.

I'd also like this global setting to control the total number of
threads, so that if the user forks a set of threads themselves and runs
a parallel-aware method in each of them, BJ will not sub-divide each
thread into more threads than the configured limit. Likewise, if the
user changes the limit whilst threads are running, surplus threads
should stop (if there are too many) or new ones should start (if there
are too few), taking care that every parallelisation request keeps at
least one thread so the job doesn't stop entirely. There must be a
toolkit for this somewhere, surely?
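
One way such a global setting might look, as a rough sketch only: a
single shared java.util.concurrent.ThreadPoolExecutor whose size can be
changed at runtime. The class name ParallelismHint and its methods are
hypothetical (nothing like this is claimed to exist in BioJava), and the
lambdas assume Java 8 or later. The default is one thread, so nothing
runs in parallel unless the user asks for it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical toolkit class for illustration only; not a BioJava class.
final class ParallelismHint {

    // Daemon threads, so idle workers never keep the JVM alive.
    private static final ThreadFactory DAEMON = r -> {
        Thread t = new Thread(r, "parallelism-hint-worker");
        t.setDaemon(true);
        return t;
    };

    // One shared pool caps the total number of worker threads. Default: serial.
    private static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>(), DAEMON);

    private ParallelismHint() {}

    // Resizes the shared pool. Surplus threads exit once they become idle.
    static synchronized void setMaxThreads(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("at least one thread is required");
        }
        if (n > POOL.getMaximumPoolSize()) {
            // Growing: raise the maximum first so core never exceeds it.
            POOL.setMaximumPoolSize(n);
            POOL.setCorePoolSize(n);
        } else {
            // Shrinking: lower the core size first for the same reason.
            POOL.setCorePoolSize(n);
            POOL.setMaximumPoolSize(n);
        }
    }

    static int getMaxThreads() {
        return POOL.getMaximumPoolSize();
    }

    // Runs a task on the shared pool and waits for its result.
    static <T> T runAndWait(Callable<T> task) {
        try {
            return POOL.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

A caller would do ParallelismHint.setMaxThreads(2) once, and every
parallel-aware method would then submit its work to the shared pool.
One caveat with this design: a task that is itself running on the pool
and then blocks in runAndWait can deadlock when all workers are busy, so
nested use would need more care than this sketch takes.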

cheers,
Richard

> - Mark
> 
> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded?  Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVMs will automagically try to use idle CPUs to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Yes, a very good point & one I was going to make beforehand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However looking at this particular code I get the feeling that this is
>>>>> about as fast as it's going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> From experience, the best way to optimize Java code is to run a
>>>>>> profiler. The one in NetBeans is quite good.
>>>>>>
>>>>>> The reason is that the HotSpot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else, which then becomes the bottleneck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and
>>>>>> pinpoint some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. In our code, replacing the 10 or so calls to
>>>>>>> String.charAt() with two calls & referencing those chars might help,
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
>>>>
>>>> You can call our code like this:
>>>>
>>>>  // import statement for biojava phylo stuff
>>>>  import org.biojavax.bio.phylo.*;
>>>>
>>>>  // ...rest of code goes here
>>>>
>>>>  // call Kimura2P
>>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>>  String seq2 = ...;
>>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> vineith kaul wrote:
>>>>>>>>>> This is what I have. Thanks a lot for the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>>>         // p = transitional differences (A<->G & T<->C);
>>>>>>>>>>         // q = transversional differences (A/G<-->C/T)
>>>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>>>
>>>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>>>                                 // Number of aligned sites
>>>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>
>>>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>>>          return dist;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>>>
>>>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>>>     Kimura2P is already implemented here, in
>>>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>>>
>>>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>>>     before, then please do feel free to contribute it. Even if it is slow,
>>>>>>>>>>     I'm sure someone out there will be able to help optimise it!
>>>>>>>>>>
>>>>>>>>>>     cheers,
>>>>>>>>>>     Richard
>>>>>>>>>>
>>>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>>>     > Hi,
>>>>>>>>>>     >
>>>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>>>     > Kimura2P, Finkelstein etc. in BioJava?
>>>>>>>>>>     > I did write something on my own, but on large sequences it runs terribly
>>>>>>>>>>     > slow and I am not even sure if that's right.
>>>>>>>>>>     > --
>>>>>>>>>>     > Vineith Kaul
>>>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>>>     > _______________________________________________
>>>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>     >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Richard Holland
>>>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>>>     EMBL-EBI
>>>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vineith Kaul
>>>>>>>>>> Masters Student Bioinformatics
>>>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>> Georgia Tech, Atlanta

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH05Y4C5LeMEKA/QRAouqAJ9TgDACIQLPeenSZcStDhkZQg/UuQCfc7sZ
cocyjnf9/T8H3uQJ+rW5m2U=
=Q6UR
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk  Wed Oct 24 09:58:01 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:58:01 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F4E59.1040703@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
	<471F4E59.1040703@ebi.ac.uk>
Message-ID: <471F4F69.3010806@ebi.ac.uk>

The executor thread pool system (java.util.concurrent) is the best way to
control this. The thread pool can be set up once & shared, whilst all
clients of the code wait for their jobs/futures to complete.
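
A minimal sketch of that idea, assuming Java 8 or later. The class
ChunkedMismatchCounter is hypothetical and not part of BioJava, and
counting mismatches is only a stand-in for the real per-site work. The
pool is created once, the input is divided into chunks, each chunk becomes
a task, and the caller waits on the futures before combining the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, NOT part of BioJava: counts positions at which two
// equal-length strings differ, splitting the work across a shared pool.
class ChunkedMismatchCounter {

    // Set up once and shared; sized from the processors the JVM reports.
    // Daemon threads, so the idle pool does not keep the JVM alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors(),
            r -> {
                Thread t = new Thread(r, "chunk-worker");
                t.setDaemon(true);
                return t;
            });

    static long countMismatches(String a, String b, int chunks) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int n = a.length();
        if (n == 0) {
            return 0L;
        }
        int parts = Math.max(1, Math.min(chunks, n));
        int size = (n + parts - 1) / parts; // ceiling division

        // Submit one task per chunk [from, to).
        List<Future<Long>> futures = new ArrayList<>();
        for (int start = 0; start < n; start += size) {
            final int from = start;
            final int to = Math.min(n, start + size);
            Callable<Long> task = () -> {
                long count = 0;
                for (int i = from; i < to; i++) {
                    if (a.charAt(i) != b.charAt(i)) {
                        count++;
                    }
                }
                return count;
            };
            futures.add(POOL.submit(task));
        }

        // Wait for every future and combine the partial results.
        long total = 0;
        try {
            for (Future<Long> f : futures) {
                total += f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk failed", e.getCause());
        }
        return total;
    }

    public static void main(String[] args) {
        // Positions 3 and 7 differ, so this prints 2.
        System.out.println(countMismatches("ACGTACGT", "ACGAACGA", 3));
    }
}
```

Because the pool is fixed in size, callers that ask for more chunks than
there are threads simply queue their extra tasks rather than starting new
threads.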

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I was thinking more along the lines of a global static method in some
> kind of toolkit class, so that any part of BJ which is
> parallelisation-aware can take advantage of it if it is set. This also
> avoids passing parameters that don't have an immediately obvious impact
> on the expected output of the method. I'd also like to have this global
> variable control the total number of threads, so that if the user forks
> a set of threads themselves and runs a parallel-aware method in each of
> them, then BJ will not attempt to sub-divide each thread into more
> threads than the limit configured by this variable. Likewise if the user
> changes the limit whilst threads are currently running, they should stop
> (if there are too many) or new ones should start (if there are too few),
> but taking care to make sure that every parallelisation request
> maintains at least one thread so the job doesn't stop entirely.... there
> must be a toolkit for this somewhere surely?
> 

From matthew.pocock at ncl.ac.uk  Tue Oct  2 22:14:01 2007
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue, 2 Oct 2007 23:14:01 +0100
Subject: [Biojava-l] Biojava Question.
In-Reply-To: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
References: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
Message-ID: <200710022314.02018.matthew.pocock@ncl.ac.uk>

This is very strange. This sort of error nearly always happens because of a 
misconfigured classpath. Could you send me:

The html of the page that causes the problem

The URL of the jars the page should be referencing

A URL that I can point my browser at that causes the problem

It is difficult to debug something like this without the program actually 
in front of me.

Matthew

On Tuesday 02 October 2007, abhi232 at cc.gatech.edu wrote:
> Respected Sir,
>
> I am sorry if I sent you a direct mail but this is a kind of emergency and
> I am not getting any substantial response from the biojava mailing
> community.
> I am a graduate student at the Georgia Institute of Technology. We are working on
> creating a Traceviewer applet for viewing sequences using the BioJava
> library.
> I am able to create the applet using netbeans and run it there.
> The error comes when I upload it on net. I am getting this particular
> error.
>
> java.lang.NoClassDefFoundError:
> org/biojava/bio/gui/sequence/SequenceRenderer at
> java.lang.Class.getDeclaredConstructors0(Native Method)
> 	at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
> 	at java.lang.Class.getConstructor0(Unknown Source)
> 	at java.lang.Class.newInstance0(Unknown Source)
> 	at java.lang.Class.newInstance(Unknown Source)
> 	at sun.applet.AppletPanel.createApplet(Unknown Source)
> 	at sun.plugin.AppletViewer.createApplet(Unknown Source)
> 	at sun.applet.AppletPanel.runLoader(Unknown Source)
> 	at sun.applet.AppletPanel.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
>
> I am getting an error only for the SequenceRenderer class. Even if I comment
> that out, it still gives me the error.
>
> I have set the classpath as well as the path variables and also I am
> giving the archive field in the applet code so as the biojava library will
> be available.
>
> Is there any particular thing required which I probably am missing?
> Please guide me on this topic.
> I would really appreciate your gesture.
> Thanks a lot in advance.


From elmh06 at yahoo.ca  Wed Oct  3 18:27:36 2007
From: elmh06 at yahoo.ca (El Mabrouk M)
Date: Wed, 3 Oct 2007 14:27:36 -0400 (EDT)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
Message-ID: <975012.12435.qm@web37310.mail.mud.yahoo.com>

Hi!  
 
I have just started to learn BioJava. I have written a small
program that writes a sequence to a FASTA file with the help of the biojavax method

RichSequence.IOTools.writeFasta(seqOut, s1, ns);

I get the error "cannot find symbol".
I'm using biojava 1.5, jdk 1.6 and netbeans.
What can be done to fix this problem?

This is what I tried:

import org.biojava.bio.seq.*;
import java.io.*;
import org.biojava.bio.symbol.SymbolList;
import org.biojavax.RichObjectFactory;
import javax.xml.stream.events.Namespace;
import org.biojavax.bio.seq.RichSequence;

public class SeqFastaF {
    public static void main(String[] args) {
        SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1");
        Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0");
        try {
            OutputStream seqOut = System.out;
            Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
            RichSequence.IOTools.writeFasta(seqOut,s1,ns); 
        } catch (IOException ex) {
            //io error
            ex.printStackTrace();
        }
    }
}

Error:
cannot find symbol
symbol  : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace)
location: class org.biojavax.bio.seq.RichSequence.IOTools




From markjschreiber at gmail.com  Wed Oct  3 23:20:31 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 4 Oct 2007 07:20:31 +0800
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID: <93b45ca50710031620m35495bfey8ec111177c6201f@mail.gmail.com>

Hi -

This is a compilation error. The BioJava writeFasta method expects a
Namespace object from the org.biojavax package, but NetBeans has guessed
that you wanted the Namespace from the javax.xml.stream.events package
and has imported that for you.

If you remove that import (javax.xml.stream.events.Namespace) and
import org.biojavax.Namespace instead, it should compile.
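
The same kind of simple-name clash can be reproduced with JDK classes
alone, which may make it easier to spot next time. This is only an
analogy, assuming Java 8 or later: the class ImportClashDemo is made up
for illustration, and java.util.List / java.awt.List stand in for the two
Namespace types.

```java
import java.util.ArrayList;
import java.util.List; // the List that join() below expects
// An IDE auto-import could just as easily have offered java.awt.List,
// an unrelated GUI class with the same simple name. With that import in
// place of the one above, this file would stop compiling.

class ImportClashDemo {

    // Declared against java.util.List, just as writeFasta is declared
    // against one particular Namespace type.
    static String join(List<String> items) {
        return String.join(",", items);
    }

    public static void main(String[] args) {
        List<String> bases = new ArrayList<>();
        bases.add("A");
        bases.add("C");
        bases.add("G");
        bases.add("T");
        System.out.println(join(bases));

        // A fully qualified name removes any doubt about which type is
        // meant, whatever the import list says.
        java.util.List<String> explicit = new java.util.ArrayList<>(bases);
        System.out.println(join(explicit));
    }
}
```

When the compiler reports a method signature that mentions an unexpected
package, checking the import list for a same-named class is a quick first
step.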

- Mark

On 10/4/07, El Mabrouk M <elmh06 at yahoo.ca> wrote:
> Hi!
>
> I have just started to learn biojava. I have written a small
> program that write a sequence in fasta file with the help of the biojavax method
>
> RichSequence.IOTools.writeFasta(seqOut, s1, ns);
> I have got the error "cannot find symbol".
> I'm using biojava 1.5, jdk 1.6 and netbeans.
> What can be done to fix this problem?
>
> This is what I tried:
>
> import org.biojava.bio.seq.*;
> import java.io.*;
> import org.biojava.bio.symbol.SymbolList;
> import org.biojavax.RichObjectFactory;
> import javax.xml.stream.events.Namespace;
> import org.biojavax.bio.seq.RichSequence;
>
> public class SeqFastaF {
>     public static void main(String[] args) {
>         SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1");
>         Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0");
>         try {
>             OutputStream seqOut = System.out;
>             Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
>             RichSequence.IOTools.writeFasta(seqOut,s1,ns);
>         } catch (IOException ex) {
>             //io error
>             ex.printStackTrace();
>         }
>     }
> }
>
> Error:
> cannot find symbol
> symbol  : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace)
> location: class org.biojavax.bio.seq.RichSequence.IOTools
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From md5 at sanger.ac.uk  Wed Oct  3 23:05:43 2007
From: md5 at sanger.ac.uk (Mutlu Dogruel)
Date: Thu, 4 Oct 2007 00:05:43 +0100 (BST)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID: <Pine.LNX.4.64.0710040001150.22143@cbi4c.internal.sanger.ac.uk>


Hi, try using import org.biojavax.Namespace instead of 
javax.xml.stream.events.Namespace;

Also, you should handle the IllegalSymbolException 
that DNATools.createDNASequence may throw.

Cheers,
mutlu

On Wed, 3 Oct 2007, El Mabrouk M wrote:

> Hi!
>
> I have just started to learn biojava. I have written a small
> program that write a sequence in fasta file with the help of the biojavax method
>
> RichSequence.IOTools.writeFasta(seqOut, s1, ns);
> I have got the error "cannot find symbol".
> I'm using biojava 1.5, jdk 1.6 and netbeans.
> What can be done to fix this problem?
>
> This is what I tried:
>
> import org.biojava.bio.seq.*;
> import java.io.*;
> import org.biojava.bio.symbol.SymbolList;
> import org.biojavax.RichObjectFactory;
> import javax.xml.stream.events.Namespace;
> import org.biojavax.bio.seq.RichSequence;
>
> public class SeqFastaF {
>    public static void main(String[] args) {
>        SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1");
>        Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0");
>        try {
>            OutputStream seqOut = System.out;
>            Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
>            RichSequence.IOTools.writeFasta(seqOut,s1,ns);
>        } catch (IOException ex) {
>            //io error
>            ex.printStackTrace();
>        }
>    }
> }
>
> Error:
> cannot find symbol
> symbol  : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace)
> location: class org.biojavax.bio.seq.RichSequence.IOTools
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From su24 at st-andrews.ac.uk  Thu Oct  4 14:43:23 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu,  4 Oct 2007 15:43:23 +0100
Subject: [Biojava-l] WriteFasta
Message-ID: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>


Dear All,

I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
trying to break up Fasta Files of whole organisms into one file per gene for
further analysis. However the writeFasta method appears to append the
characters
"??

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From holland at ebi.ac.uk  Thu Oct  4 15:23:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 04 Oct 2007 16:23:10 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
Message-ID: <4705055E.5070401@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SeqIOTools is deprecated.

Try RichSequence.IOTools.writeFasta() instead to see if that helps.

e.g.:

RichSequence.IOTools.writeFasta(
	System.out,
	seq,
	RichObjectFactory.getDefaultNamespace()
	);

where seq is either a Sequence or a SequenceIterator.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear All,
> 
> I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
> trying to break up Fasta Files of whole organisms into one file per gene for
> further analysis. However the writeFasta method appears to append the
> characters
> "??
> 
> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp
C4xPs/2ywAMfIPDmUKPCrqg=
=TwwH
-----END PGP SIGNATURE-----


From su24 at st-andrews.ac.uk  Thu Oct  4 15:23:52 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu,  4 Oct 2007 16:23:52 +0100
Subject: [Biojava-l] (no subject)
Message-ID: <1191511432.4705058825b79@webmail.st-andrews.ac.uk>


Dear All,

I'm sorry, the use of those characters seems to have truncated the previous email I
sent. To complete my question: I was wondering what might cause
this addition of random characters, and whether there is a way to stop it from
occurring.

Thanking you again

Saif
-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From su24 at st-andrews.ac.uk  Fri Oct  5 10:06:25 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri,  5 Oct 2007 11:06:25 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <4705055E.5070401@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
Message-ID: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>

Dear Richard,

I have tried the RichSequence.IOTools.writeFasta method and this method is still
adding the characters "??" to the front of each write. I am using a 
FileOutputStream and a Sequence object as inputs to the method, like so:


 Sequence seq; // read in from File
 FileOutputStream f = new FileOutputStream(fileName);

 try {
     RichSequence.IOTools.writeFasta(f,
             seq,
             RichObjectFactory.getDefaultNamespace()
             );
 }


Thanks a lot for your time

Sincerely,

Saif

Quoting Richard Holland <holland at ebi.ac.uk>:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> SeqIOTools is deprecated.
>
> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
>
> e.g.:
>
> RichSequence.IOTools.writeFasta(
> 	System.out,
> 	seq,
> 	RichObjectFactory.getDefaultNamespace()
> 	);
>
> where seq is either a Sequence or a SequenceIterator.
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
> > Dear All,
> >
> > I was writing to ask about the SeqIOTools.writeFasta() Method. I am
> currently
> > trying to break up Fasta Files of whole organisms into one file per gene
> for
> > further analysis. However the writeFasta method appears to append the
> > characters
> > "??
> >
> > ------------------------------------------------------------------
> > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
> >
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp
> C4xPs/2ywAMfIPDmUKPCrqg=
> =TwwH
> -----END PGP SIGNATURE-----
>


-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From holland at ebi.ac.uk  Fri Oct  5 10:13:36 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 11:13:36 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060E50.2070405@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Where are the input sequences coming from, i.e. what method are you
using to construct them or read them from a file?

Also, what do you mean by the 'front' of each write? Could you send me
an example of an entire FASTA file containing the problem? (It'd be best
to attach the file to an email to me personally as this list will not
accept attachments, and copying-and-pasting from a text editor to an
email client may obscure the underlying problem).

It'd be good also to see your entire code from the point the sequences
are read or created to the point where they are written out. Or, a
sample program which exhibits the same behaviour would suffice.

I suspect that the sequences themselves contain the incorrect data,
although technically this should be impossible as the sequence alphabet
should prevent it.

We recently had an issue reported here regarding BioJava not being able
to do certain sequence tasks on platforms using non-Western-European
character mappings. If your machine is running such a mapping, try it
again on a machine with an English or other Western European language
set up by default. If it works there but not on your machine, then
this'll be the same problem. (There is no solution yet, but at least
you'll know what's wrong).
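
One way to see what the stray characters really are is to look at the raw
bytes at the start of the output file. The sketch below is a diagnostic
under stated assumptions and not BioJava code: the class LeadingBytes and
its methods are invented for illustration. It prints the first bytes as
hex and reports whether they match a Unicode byte order mark, which is one
possible (unconfirmed) source of unexpected leading characters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, NOT part of BioJava.
class LeadingBytes {

    // Formats up to `count` leading bytes as space-separated hex pairs.
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(count, data.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    // Names the byte order mark at the start of the data, if there is one.
    // Only the UTF-8 and UTF-16 marks are checked, so a UTF-32
    // little-endian mark would be reported as UTF-16 little-endian.
    static String describeBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8 BOM";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16 big-endian BOM";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16 little-endian BOM";
        }
        return "no BOM";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a written FASTA file as the first argument;
        // without one, a small built-in sample is inspected instead.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3E};
        System.out.println(hexPrefix(data, 8) + " -> " + describeBom(data));
    }
}
```

A clean FASTA file should start with 3E, the byte for '>'. Anything before
that byte was added somewhere between reading the sequence and writing the
file.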

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
> 
> I have tried the RichSEquence.IOTools.writeFasta method and this method is still
> appending the characters "??" to the front of each write. I am using a 
> FileOutputStream and a Sequence object as inputs to the method. like so.
> 
> 
>  Sequence seq; // read in from File
>  FileOutputStream f =new FileOutputStream (fileName);
> 
> 
> 			   try{
> 
> 			    	RichSequence.IOTools.writeFasta(f,
> 			    	        seq,
> 			    	        RichObjectFactory.getDefaultNamespace()
> 			    	        );
> 
> 
> 			    }
> 
> 
> Thanks a lot for your time
> 
> Sincerely,
> 
> Saif
> 
> Quoting Richard Holland <holland at ebi.ac.uk>:
> 
> SeqIOTools is deprecated.
> 
> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
> 
> e.g.:
> 
> RichSequence.IOTools.writeFasta(
> 	System.out,
> 	seq,
> 	RichObjectFactory.getDefaultNamespace()
> 	);
> 
> where seq is either a Sequence or a SequenceIterator.
> 
> cheers,
> Richard
> 
> Saif Ur-Rehman wrote:
>>>> Dear All,
>>>>
>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I am
> currently
>>>> trying to break up Fasta Files of whole organisms into one file per gene
> for
>>>> further analysis. However the writeFasta method appears to append the
>>>> characters
>>>> "??
>>>>
>>>> ------------------------------------------------------------------
>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>>>>
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>

> -------------------------------------------------------------------------------
> Saif Ur-Rehman
> Research Student
> The Centre for Evolution, Genes & Genomics (CEGG)
> Dyers Brae
> School of Biology
> The University of St Andrews
> St Andrews,
> Fife
> Scotland,UK

> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBg5Q4C5LeMEKA/QRAlKlAKCKXrMfJI2W4Ir7Us5P9bj3KmEY1ACgo89L
WgUPFCLGUNSUZxO8h3Ltqlw=
=Jq7X
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Fri Oct  5 10:16:02 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 11:16:02 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060EE2.2000909@ebi.ac.uk>

Could you send us the code you're trying to run and the sequence you are
trying to write out? If you send it in a form that we can drop into an
IDE and run, that would help us a lot.

Thanks,

Andy Yates

Saif Ur-Rehman wrote:
> Dear Richard,
> 
> I have tried the RichSequence.IOTools.writeFasta method and this method is still
> adding the characters "??" to the front of each write. I am using a
> FileOutputStream and a Sequence object as inputs to the method, like so.
> 
> 
>  Sequence seq; // read in from File
>  FileOutputStream f =new FileOutputStream (fileName);
> 
> 
> 			   try{
> 
> 			    	RichSequence.IOTools.writeFasta(f,
> 			    	        seq,
> 			    	        RichObjectFactory.getDefaultNamespace()
> 			    	        );
> 
> 
> 			    }
> 
> 
> Thanks a lot for your time
> 
> Sincerely,
> 
> Saif
> 


From holland at ebi.ac.uk  Fri Oct  5 12:10:58 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 13:10:58 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191584372.4706227437594@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
	<47060E50.2070405@ebi.ac.uk>
	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
	<47061FDD.1070806@ebi.ac.uk>
	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
Message-ID: <470629D2.6020709@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Great, thanks.

The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.

This is a hex dump of the file:

00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|


The four extra bytes are hex ac ed 00 05, and they are showing as
question marks in your text editor because that's how many text editors
display unprintable characters.

Does anyone recognise these characters? There is no code in BioJava
which writes anything like this, in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?
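
One possible lead, offered as a sketch rather than a diagnosis: the bytes
ac ed 00 05 match the stream header that java.io.ObjectOutputStream writes
(ObjectStreamConstants.STREAM_MAGIC followed by STREAM_VERSION). The
snippet below uses only the JDK and is not taken from BioJava or from the
code in this thread; it reproduces that header and compares it with the
first four bytes of the dump. The names HeaderCheck, DUMP_PREFIX and
serializationHeader are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class HeaderCheck {
    // First four bytes of the hex dump above.
    static final byte[] DUMP_PREFIX = {(byte) 0xac, (byte) 0xed, 0x00, 0x05};

    // Returns the bytes an ObjectOutputStream emits before any object is written.
    static byte[] serializationHeader() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(sink);
        oos.flush();   // make sure the header has reached the sink
        oos.close();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Prints whether the serialization header equals the dump prefix.
        System.out.println(Arrays.equals(serializationHeader(), DUMP_PREFIX));
    }
}
```

If the two match on your JVM, it may be worth looking for any place where
an ObjectOutputStream gets wrapped around the output file, although
nothing in the code posted so far shows one.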

I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:

System.getProperty("file.encoding");

to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.
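
A minimal sketch of that check, using only the JDK. Note that the
property is named file.encoding; Charset.defaultCharset() is printed
alongside it as a second view of the same setting. The class name
EncodingCheck and the helper defaultCharsetName are illustrative only.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // Name of the charset the JVM uses when none is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property may differ from, or be less reliable than,
        // the charset the JVM actually resolved at startup.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

If the result is not what you expect, one commonly used way to change it
is to start the JVM with -Dfile.encoding=UTF-8, though how far that
setting is honoured can vary between JVM versions.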

We've had internationalisation problems before with BioJava. Hopefully
these will be addressed in future development, but there is no current
activity in that area due to lack of resources. In the meantime the best
workaround is to set every setting you can find to a Western European
character set/character mapping and UTF-8 file encoding, in the hope
that it will all match up nicely and work.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
> 
> The input file is just the entire set of RefSeq proteins for Arabidopsis thaliana
> and is too large for me to send as an attachment. But it can be downloaded from
> NCBI using the query "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
> 
> Cheers,
> 
> Saif
> 
> 
> 
> Quoting Richard Holland <holland at ebi.ac.uk>:
> 
> Interesting. Could you send your input file as well?
> 
> cheers,
> Richard
> 
> Saif Ur-Rehman wrote:
>>>> Dear Richard,
>>>>
>>>> The sequences are being read by SeqIOTools.readFastaProtein. The code from
>>>> read to write is as follows. Essentially the program wants to read in a
>>>> fasta file containing all the protein sequences in a given organism and
>>>> split them up into one file per protein.
>>>>
>>>>
>>>> BufferedReader br = null;
>>>> try
>>>> {
>>>>     br = new BufferedReader(new FileReader(filename));
>>>> }
>>>> catch (FileNotFoundException e1)
>>>> {
>>>>     e1.printStackTrace();
>>>> }
>>>>
>>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
>>>> while (stream.hasNext())
>>>> {
>>>>     try
>>>>     {
>>>>         Sequence seq = stream.nextSequence();
>>>>         File scriptFile1 = new File("///Users/Saif/Organisms/RunTemp/" + name
>>>>                 + "/" + seq.getName());
>>>>
>>>>         try
>>>>         {
>>>>             scriptFile1.createNewFile();
>>>>         }
>>>>         catch (IOException e1)
>>>>         {
>>>>             e1.printStackTrace();
>>>>         }
>>>>
>>>>         try
>>>>         {
>>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
>>>>             BufferedWriter out = new BufferedWriter(fstream);
>>>>
>>>>             FileOutputStream f = new FileOutputStream(scriptFile1);
>>>>
>>>>             RichSequence rs = RichSequence.Tools.enrich(seq);
>>>>
>>>>             try
>>>>             {
>>>>                 RichSequence.IOTools.writeFasta(
>>>>                         f,
>>>>                         rs,
>>>>                         RichObjectFactory.getDefaultNamespace()
>>>>                         );
>>>>             }
>>>>             catch (IOException ioe) {}
>>>>
>>>> An example of an outputted fasta file from this code is attached.
>>>>
>>>>
>>>>
>>>> Thanks a lot for your time.
>>>>
>>>> Saif
>>>>
>>>>
>>>> Quoting Richard Holland <holland at ebi.ac.uk>:
>>>>
>>>> Where are the input sequences coming from? i.e. what method are you
>>>> using to construct them or read them from a file.
>>>>
>>>> Also, what do you mean by the 'front' of each write? Could you send me
>>>> an example of an entire FASTA file containing the problem? (It'd be best
>>>> to attach the file to an email to me personally as this list will not
>>>> accept attachments, and copying-and-pasting from a text editor to an
>>>> email client may obscure the underlying problem).
>>>>
>>>> It'd be good also to see your entire code from the point the sequences
>>>> are read or created to the point where they are written out. Or, a
>>>> sample program which exhibits the same behaviour would suffice.
>>>>
>>>> I suspect that the sequences themselves contain the incorrect data,
>>>> although technically this should be impossible as the sequence alphabet
>>>> should prevent it.
>>>>
>>>> We recently had an issue reported here regarding BioJava not being able
>>>> to do certain sequence tasks on platforms using non-Western-European
>>>> character mappings. If your machine is running such a mapping, try it
>>>> again on a machine with an English or other Western European language
>>>> set up by default. If it works there but not on your machine, then
>>>> this'll be the same problem. (There is no solution yet, but at least
>>>> you'll know what's wrong).
>>>>
>>>> cheers,
>>>> Richard
>>>>


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
pxAPAybISoRQgbvQ1wyzqVg=
=MS7P
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Fri Oct  5 12:28:43 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 13:28:43 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>	<4705055E.5070401@ebi.ac.uk>	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>	<47060E50.2070405@ebi.ac.uk>	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>	<47061FDD.1070806@ebi.ac.uk>	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
	<470629D2.6020709@ebi.ac.uk>
Message-ID: <47062DFB.6040201@ebi.ac.uk>

I've done a quick search & it seems that, read as UTF-16, U+ACED is a
Hangul (Korean) syllable & the other, U+0005, is just a control
character. Something is getting confused quite badly here.
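
For anyone who wants to reproduce that lookup, here is a small JDK-only
sketch (Java 7 or later, for StandardCharsets). It decodes the four
leading bytes as big-endian UTF-16 and asks the JDK what each code unit
is; as far as the JDK's tables go, U+ACED falls in the Hangul Syllables
block and U+0005 is a control character. Whether UTF-16 is the right way
to read those bytes at all is an assumption, and the class name
PrefixAsUtf16 is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class PrefixAsUtf16 {
    // The four bytes found at the start of the output file.
    static final byte[] PREFIX = {(byte) 0xac, (byte) 0xed, 0x00, 0x05};

    // Decodes the four bytes as two big-endian UTF-16 code units.
    static String decode() {
        return new String(PREFIX, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String s = decode();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Print the code point, its Unicode block, and whether it is a control character.
            System.out.printf("U+%04X block=%s control=%b%n",
                    (int) c, Character.UnicodeBlock.of(c), Character.isISOControl(c));
        }
    }
}
```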



From su24 at st-andrews.ac.uk  Fri Oct  5 13:44:29 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri,  5 Oct 2007 14:44:29 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
	<4705055E.5070401@ebi.ac.uk>
	<1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
	<47060E50.2070405@ebi.ac.uk>
	<1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
	<47061FDD.1070806@ebi.ac.uk>
	<1191584372.4706227437594@webmail.st-andrews.ac.uk>
	<470629D2.6020709@ebi.ac.uk>
Message-ID: <1191591869.47063fbd22461@webmail.st-andrews.ac.uk>

Setting the System properties solved the problem.

Thanks a lot,

Saif



-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From sanbiogene at yahoo.co.in  Sat Oct  6 09:23:11 2007
From: sanbiogene at yahoo.co.in (sandeep telkar)
Date: Sat, 6 Oct 2007 10:23:11 +0100 (BST)
Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM
Message-ID: <121992.19693.qm@web94408.mail.in2.yahoo.com>

Dear friends,
   Sandeep here.
       I want to learn BioJava and I am a beginner. Where can I
download an .exe installer for it, like the one for JDK 6 on the
Sun website?

Please suggest something other than the following URL:
       http://biojava.org/wiki/BioJava:Download

And please tell me in which directory I have to save the
program. I am not getting any clear idea.

        Please help me.
                                       - Sandeep

Sandeep Telkar,
  M.Sc Bioinformatics.




From su24 at st-andrews.ac.uk  Sat Oct  6 18:04:28 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Sat,  6 Oct 2007 19:04:28 +0100
Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM
In-Reply-To: <121992.19693.qm@web94408.mail.in2.yahoo.com>
References: <121992.19693.qm@web94408.mail.in2.yahoo.com>
Message-ID: <1191693868.4707ce2caae97@webmail.st-andrews.ac.uk>

Hi,

You need to download the JAR files from
http://biojava.org/wiki/BioJava:Download. The file you need is
biojava-1.5.jar. If you're using an IDE like NetBeans or Eclipse, add it to the
build path as an external JAR; if you're working from the command line, add it
to your classpath. You can then import the BioJava classes and use them. Hope
that helps.
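
A quick way to confirm the JAR really is on the classpath is sketched
below. It only uses the JDK (Java 7 or later for the multi-catch); the
class name ClasspathCheck and the helper isAvailable are made up for this
example, and org.biojava.bio.seq.DNATools is used simply as a class that
BioJava 1.x is known to contain.

```java
public class ClasspathCheck {
    // True if the named class can be loaded from the current classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class that was found but could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.biojava.bio.seq.DNATools";
        System.out.println(probe + (isAvailable(probe) ? " found" : " NOT found"));
    }
}
```

On Windows this would typically be run as
java -cp "biojava-1.5.jar;." ClasspathCheck
from the directory holding both the JAR and the compiled class, assuming
that is where you saved them; the JAR can live in any directory as long
as the classpath points at it.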

Cheers,

Saif




-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk


From vineith at gmail.com  Wed Oct 10 04:44:22 2007
From: vineith at gmail.com (vineith kaul)
Date: Wed, 10 Oct 2007 00:44:22 -0400
Subject: [Biojava-l] case-sensitive sequences
Message-ID: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>

Hi,

I want to read in a sequence whose alphabet (nucleotides) is
case-sensitive. Basically I want to replace only lower-case
'a, g, t, c' with blanks. I saw a similar post earlier but couldn't
understand much of it. Can someone help me with this?

-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta


From holland at ebi.ac.uk  Wed Oct 10 08:06:16 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 10 Oct 2007 09:06:16 +0100
Subject: [Biojava-l] case-sensitive sequences
In-Reply-To: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>
References: <f2446ee40710092144q5b18b599yb42cf4a5ecba3c2c@mail.gmail.com>
Message-ID: <470C87F8.8020502@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

You can use SoftMaskedAlphabet with the BioJavaX parsers to get the
desired effect.

By default, a soft masked character is one in lower case. The code below
will detect these. If you have other search criteria you can modify the
soft masked detection criteria to match this instead. To do that, add a
second parameter to the call to SoftMaskedAlphabet.getInstance() and use
it to pass in an instance of SoftMaskedAlphabet.MaskingDetector (see the
JavaDocs to see how this should work).

Hope this helps:


// Set up a soft-masked alphabet.
SoftMaskedAlphabet sma =
	SoftMaskedAlphabet.getInstance(DNATools.getDNA());
SymbolTokenization stok = sma.getTokenization("token");

// Set up sequence parsing.
BufferedReader input = ....;
	// Get your sequences from somewhere
RichSequenceFormat format = new FastaFormat();
	// Or Genbank etc.
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.FACTORY;
	// See Javadocs for alternative factories.
Namespace ns = RichObjectFactory.getDefaultNamespace();
	// See Javadocs for alternative namespaces.

// Parse the sequences.
RichStreamReader seqsIn =
	new RichStreamReader(input, format,  stok, factory, ns);


// Find the soft-masked symbols in the sequences.
while (seqsIn.hasNext()) {
  RichSequence seq = seqsIn.nextRichSequence();

  // Iterate over symbols in sequence.
  for (Iterator i = seq.iterator(); i.hasNext(); ) {

     Symbol sym = (Symbol)i.next();

     // Is this symbol masked?
     if (sma.isMasked(sym)) {
        // Yes it is so deal with it.
        .......
     } else {
        // No it isn't, so deal with that instead.
        .......
     }
  }
}
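
If the rest of the task does not need BioJava, the blanking itself can also
be done on the raw string. The following is only a minimal sketch in plain
Java, assuming the sequence is already in memory as a String. The class and
method names are made up for illustration and are not part of the BioJava
API.

```java
// Hypothetical helper, not part of BioJava: blanks out lower-case
// (soft-masked) nucleotides in a plain Java String.
final class SoftMaskBlanker {

    /** Returns a copy of seq in which every lower-case a, c, g or t is replaced by a space. */
    static String blankLowerCaseBases(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            // Only the four lower-case nucleotide letters count as masked here;
            // everything else (upper case, N, gaps) is copied unchanged.
            boolean masked = (c == 'a' || c == 'c' || c == 'g' || c == 't');
            out.append(masked ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up example data; brackets make the blanks visible.
        System.out.println("[" + blankLowerCaseBases("ACGTacgtNNacGT") + "]");
    }
}
```

The output keeps the original length, so positions in the blanked string
still line up with positions in the input.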


cheers,
Richard

vineith kaul wrote:
> Hi,
> 
> I want to read in a sequence which has case sensitive
> alphabets(nucleotides).Basically I want to replace only small
> 'a,g,t,c' with blanks .Although I saw a similar post earlier but
> couldn't understand much.Can someone help me with this ?
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHDIf44C5LeMEKA/QRAmuNAJ426M/UgInqDG5rG6w+F+qoMdVzPQCfZo1S
nAS5v8jSFBX5WCuB5UmzczQ=
=Sicc
-----END PGP SIGNATURE-----


From vineith at gmail.com  Sun Oct 14 17:21:45 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 14 Oct 2007 13:21:45 -0400
Subject: [Biojava-l] Java to Perl
Message-ID: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>

Is there a tool that can convert complete Java code to Perl code?

-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta


From davidfeitosa at gmail.com  Sun Oct 14 17:57:47 2007
From: davidfeitosa at gmail.com (David Barbosa Feitosa)
Date: Sun, 14 Oct 2007 14:57:47 -0300
Subject: [Biojava-l] Java to Perl
In-Reply-To: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
Message-ID: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>

Vineith

I do not know of one, but if you need to execute Perl code inside Java code,
Java 6 (codename Mustang) makes it possible to execute script code inside the
Java Virtual Machine.

The default scripting engine is Rhino, for JavaScript, but since scripting is
defined by a specification, if a Perl engine exists you can plug it into the
JVM and execute your Perl code.

More info about the available engines and how to install one:

https://scripting.dev.java.net/
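
To see which engines a particular JVM actually has installed, the Java 6
scripting API (javax.script) can be asked directly. This is only a small
sketch; the class name is made up, and the list it prints depends entirely
on the JVM and on which engine jars are on the classpath (it may well be
empty, and it will only show a Perl engine if one has been installed).

```java
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: lists the script engines visible to this JVM.
final class ListScriptEngines {

    /** Returns one "engine name -> language name" line per installed engine. */
    static List<String> describeEngines() {
        List<String> lines = new ArrayList<String>();
        // ScriptEngineManager discovers engines on the classpath at construction time.
        for (ScriptEngineFactory f : new ScriptEngineManager().getEngineFactories()) {
            lines.add(f.getEngineName() + " -> " + f.getLanguageName());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeEngines()) {
            System.out.println(line);
        }
    }
}
```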

Maybe it can help you,

David.

2007/10/14, vineith kaul <vineith at gmail.com>:
>
> Is there some tool by which we can convert a complete Java Code to  a
> Perl code ?
>
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From ayates at ebi.ac.uk  Mon Oct 15 08:15:33 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Mon, 15 Oct 2007 09:15:33 +0100
Subject: [Biojava-l] Java to Perl
In-Reply-To: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
	<93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
Message-ID: <471321A5.5090600@ebi.ac.uk>

Unfortunately, to my knowledge, there is no Perl/Java scripting interface.
Apparently, for some reason, Perl is not trendy enough to warrant a port
(which is a pity).

In response to Vineith's original question, such a tool really wouldn't
work. Good Perl code is very different to good Java code. If you did get
something that worked, you'd probably end up with quite verbose and
inefficient Perl code (not to mention the problems that would arise with
Perl objects having no access modifiers, using inside-out objects,
converting 3rd-party libraries, etc.).

Two options do spring to mind if you need code available in both languages:

* Make one of the pieces of code a "black box" where you read results 
from STDOUT (works well enough calling a Java program from Perl).

* Write the common code in C

Out of these two options if you want the code replicated in a 1-1 
fashion then C is your only option. Otherwise the first idea is the 
easiest to work with.
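
The "black box" idea looks much the same from either side. As a rough
sketch of the Java side (running some other program and reading what it
prints), assuming nothing beyond the standard library: the class name is
made up, and the command in main() is only a placeholder (the JVM's own
launcher) so that the example runs anywhere. A Perl script would be started
the same way with a command such as "perl" plus a script name of your own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: treats an external program as a black box.
final class BlackBox {

    /** Runs the command, waits for it to finish, and returns everything it printed. */
    static String run(List<String> command) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // merge stderr into stdout so nothing is lost
            Process proc = pb.start();
            StringBuilder out = new StringBuilder();
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(proc.getInputStream()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    out.append(line).append('\n');
                }
            } finally {
                reader.close();
            }
            proc.waitFor(); // output is fully read, so this cannot block on a full pipe
            return out.toString();
        } catch (IOException e) {
            throw new RuntimeException("could not run " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while running " + command, e);
        }
    }

    public static void main(String[] args) {
        // Placeholder command: ask the current JVM's launcher for its version.
        String java = System.getProperty("java.home") + "/bin/java";
        System.out.print(run(Arrays.asList(java, "-version")));
    }
}
```

The caller then parses the returned text, so the two programs only need to
agree on an output format.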

As David did mention there are other scripting engines available 
(Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy 
your scripting needs whilst remaining in a Java environment (Groovy hits 
that nice sweet spot for a Java inspired scripting language).

Andy

P.S. This really isn't a Biojava question ...

David Barbosa Feitosa wrote:
> Vineith
> 
> I do not know, but if you need to execute Pearl code inside Java code, in
> Java 6, codename Mustang, is possible to execute script code inside the Java
> Virtual Machine.
> 
> The default scripting engine is Rhino, for JavaScript, but as it is a
> specification, if exists an Pearl engine, you can plug it into the JVM and
> execute your Pearl code.
> 
> Mode infoa bout the available engines and how to install one:
> 
> https://scripting.dev.java.net/
> 
> Maybe it can help you,
> 
> David.
> 
> 2007/10/14, vineith kaul <vineith at gmail.com>:
>> Is there some tool by which we can convert a complete Java Code to  a
>> Perl code ?
>>
>> --
>> Vineith Kaul
>> Masters Student Bioinformatics
>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>> Georgia Tech, Atlanta
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l


From phidias51 at gmail.com  Mon Oct 15 14:57:06 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Mon, 15 Oct 2007 07:57:06 -0700
Subject: [Biojava-l] Java to Perl
In-Reply-To: <471321A5.5090600@ebi.ac.uk>
References: <f2446ee40710141021y434112a2m466fbe0a99e40486@mail.gmail.com>
	<93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
	<471321A5.5090600@ebi.ac.uk>
Message-ID: <6e1d61f50710150757p6ba25c1ck9466baa5f8273bc2@mail.gmail.com>

The original post indicated that they wanted to go from Java to Perl.  Doing
a quick Google search yielded a lot of hits for tools going from Perl to
Java.  Just out of curiosity, was there some reason you wanted to create Perl
code from Java code?

There are a couple of projects which supposedly provide PERL-scripting
support inside Java to one extent or another.  The first is called Sleep (
http://sleep.hick.org/) which is described as being a PERL-like plugin for
the Java 6 scripting engine.  There's also a BSF plugin called BSF Perl (
http://bsfperl.sf.net) and another BSF plugin called PerlScript which is
part of ActiveState's ActivePerl distribution.

I don't have any first-hand experience with any of these, so please don't
construe anything I say as an endorsement of these technologies.  Although
none of these solutions will convert PERL code into Java or vice-versa, they
may allow you to run Perl inside a VM.

Hope this helps,

Mark

On 10/15/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>
> Unfortunately to my knowledge there is no Perl/Java scripting interface.
> Apparently for some reason Perl is not trendy enough to warrant a port
> (which is a pity).
>
> In response to Vineith's original question such a tool really wouldn't
> work. Good Perl code is very different to good Java code. If you did get
> something that would work you'd probably end up with quite verbose &
> in-efficent Perl code (not to mention the problems that would arise with
> Perl objects having no access modifiers, using inside-out objects,
> converting 3rd party libraries etc).
>
> Two options do spring to mind if you need code available in both
> languages:
>
> * Make one of the pieces of code a "black box" where you read results
> from STDOUT (works well enough calling a Java program from Perl).
>
> * Write the commmon code in C
>
> Out of these two options if you want the code replicated in a 1-1
> fashion then C is your only option. Otherwise the first idea is the
> easiest to work with.
>
> As David did mention there are other scripting engines available
> (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy
> your scripting needs whilst remaining in a Java environment (Groovy hits
> that nice sweet spot for a Java inspired scripting language).
>
> Andy
>
> P.S. This really isn't a Biojava question ...
>
> David Barbosa Feitosa wrote:
> > Vineith
> >
> > I do not know, but if you need to execute Pearl code inside Java code,
> in
> > Java 6, codename Mustang, is possible to execute script code inside the
> Java
> > Virtual Machine.
> >
> > The default scripting engine is Rhino, for JavaScript, but as it is a
> > specification, if exists an Pearl engine, you can plug it into the JVM
> and
> > execute your Pearl code.
> >
> > Mode infoa bout the available engines and how to install one:
> >
> > https://scripting.dev.java.net/
> >
> > Maybe it can help you,
> >
> > David.
> >
> > 2007/10/14, vineith kaul <vineith at gmail.com>:
> >> Is there some tool by which we can convert a complete Java Code to  a
> >> Perl code ?
> >>
> >> --
> >> Vineith Kaul
> >> Masters Student Bioinformatics
> >> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >> Georgia Tech, Atlanta
> >> _______________________________________________
> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


From vineith at gmail.com  Sun Oct 21 16:30:48 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 21 Oct 2007 12:30:48 -0400
Subject: [Biojava-l] Evolutionary distances
Message-ID: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>

Hi,

Are there functions in BioJava to calculate evolutionary pairwise distances
such as Kimura2P, Finkelstein, etc.?
I did write something on my own, but on large sequences it runs terribly
slowly and I am not even sure that it's right.
-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta


From holland at ebi.ac.uk  Mon Oct 22 12:06:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 22 Oct 2007 13:06:57 +0100 (BST)
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
Message-ID: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>

You should take a look at the latest 1.5 release, in the
org.biojavax.bio.phylo packages. This code is the beginnings of some
phylogenetics code that will perform tasks as you describe. The future
plan is to extend this code to cover a wider range of use cases. Kimura2P
is already implemented here, in
org.biojavax.bio.phylo.MultipleHitCorrection.

If you can't find code that will do what you want, but have written some
before, then please do feel free to contribute it. Even if it is slow, I'm
sure someone out there will be able to help optimise it!

cheers,
Richard

On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> Hi,
>
> Are there functions to calculate evolutionary pairwise distances like
> Kimura2P,Finkelstein etc in Biojava
> I did write smthng on my own but on large sequences it runs terribly
> slow and I am not even sure if thats right.
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>


-- 
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK


From vineith at gmail.com  Tue Oct 23 06:59:29 2007
From: vineith at gmail.com (vineith kaul)
Date: Tue, 23 Oct 2007 02:59:29 -0400
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
Message-ID: <f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>

This is what I have. Thanks a lot for the help.


//Method to calculate the Kimura 2 parameter distance
public static double K2P(String sequence1,String sequence2){
        // p = transitional differences (A<->G & T<->C);
        // q = transversional differences (A/G<-->C/T)
        long p=0,q=0,numberOfAlignedSites=0;


        char[] seq1array=sequence1.toCharArray();
        char[] seq2array=sequence2.toCharArray();

        for(int i=0;i<seq1array.length;i++){
                                // Number of aligned sites
                if(((seq1array[i]=='a') ||
(seq1array[i]=='A')||(seq1array[i]=='g') ||
(seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
(seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
(seq2array[i]=='A')||(seq2array[i]=='c') ||
(seq2array[i]=='C')||(seq2array[i]=='t') ||
(seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {

                        numberOfAlignedSites++;
                }

                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                        p++;
                }
                else
                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                                q++;
                        }
                else
                if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                                q++;
                        }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='c') || (seq2array[i]=='C'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
((seq2array[i]=='t') || (seq2array[i]=='T'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='a') || (seq2array[i]=='A'))) {
                                        q++;
                                }
                else
                if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
((seq2array[i]=='g') || (seq2array[i]=='G'))) {
                                        q++;
                                }


        }

         double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
(((double)q)/numberOfAlignedSites);
         double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
         System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
         double dist = (-0.5 * Math.log(P)) - (0.25 * Math.log(Q));
         return dist;
}


On 10/22/07, Richard Holland <holland at ebi.ac.uk> wrote:
>
> You should take a look at the latest 1.5 release, in the
> org.biojavax.bio.phylo packages. This code is the beginnings of some
> phylogenetics code that will perform tasks as you describe. The future
> plan is to extend this code to cover a wider range of use cases. Kimura2P
> is already implemented here, in
> org.biojavax.bio.phylo.MultipleHitCorrection.
>
> If you can't find code that will do what you want, but have written some
> before, then please do feel free to contribute it. Even if it is slow, I'm
> sure someone out there will be able to help optimise it!
>
> cheers,
> Richard
>
> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> > Hi,
> >
> > Are there functions to calculate evolutionary pairwise distances like
> > Kimura2P,Finkelstein etc in Biojava
> > I did write smthng on my own but on large sequences it runs terribly
> > slow and I am not even sure if thats right.
> > --
> > Vineith Kaul
> > Masters Student Bioinformatics
> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> > Georgia Tech, Atlanta
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
>
> --
> Richard Holland
> BioMart (http://www.biomart.org/)
> EMBL-EBI
> Hinxton, Cambridgeshire CB10 1SD, UK
>
>


-- 
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta


From ozgur7 at gmail.com  Tue Oct 23 18:17:29 2007
From: ozgur7 at gmail.com (Ozgur Ozturk)
Date: Tue, 23 Oct 2007 11:17:29 -0700
Subject: [Biojava-l] problem with CookBook:Blast:Parser
Message-ID: <a662a84f0710231117y548fc2f0q471f5f1868a91777@mail.gmail.com>

Hi,
   I am receiving the following error when I use BlastParser code from the
cookbook <http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser> :

org.xml.sax.SAXException: Could not recognise the format of this file as one
supported by the framework.
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(
BlastLikeSAXParser.java:182)
    at org.arabidopsis.test.BlastParser.main(BlastParser.java:44)

I have generated the xml file using this command:
blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 >
tempresult.xml

Then pass it to BlastParser:
BlastParser tempresult.xml

Thanks for your help in advance,
-- 
Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370


From ozgur7 at gmail.com  Tue Oct 23 20:24:49 2007
From: ozgur7 at gmail.com (Ozgur Ozturk)
Date: Tue, 23 Oct 2007 13:24:49 -0700
Subject: [Biojava-l] Problem Solved Re: problem with CookBook:Blast:Parser
Message-ID: <a662a84f0710231324i63eb5bb0xfefc551f507f81ab@mail.gmail.com>

Hi,
      Another program in the demos (BioJava/biojava-1.5/demos/blastxml) could
handle my XML file.
I guess the problem is solved. Thanks.
(But if the BlastParser code from the cookbook
<http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser> is
deprecated, you may want to update it.)

Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370

On 10/23/07, Ozgur Ozturk <ozgur7 at gmail.com> wrote:
>
> Hi,
>    I am receiving the following error when I use BlastParser code from the
> cookbook <http://www.biojava.org/wiki/BioJava:CookBook:Blast:Parser> :
>
> org.xml.sax.SAXException: Could not recognise the format of this file as
> one supported by the framework.
>     at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(
> BlastLikeSAXParser.java:182)
>     at org.arabidopsis.test.BlastParser.main(BlastParser.java:44)
>
> I have generated the xml file using this command:
> blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 >
> tempresult.xml
>
> Then pass it to BlastParser:
> BlastParser tempresult.xml
>
> Thanks for your help in advance,
> --
> Best regards,
> Ozgur (Oscar) Ozturk,
> http://www.cse.ohio-state.edu/~ozturk/
> Mobile Phone: (614) 805-4370


From holland at ebi.ac.uk  Wed Oct 24 07:52:24 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 08:52:24 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
Message-ID: <471EF9B8.7020609@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks.

Your code is similar to the code we have in
org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
see if it is identical, but it probably is.

You can call our code like this:

 // import statement for biojava phylo stuff
 import org.biojavax.bio.phylo.*;

 // ...rest of code goes here

 // call Kimura2P
 String seq1 = ...; // Get seq1 and seq2 from somewhere
 String seq2 = ...;
 double result = MultipleHitCorrection.Kimura2P(seq1, seq2);

Note that our implementation expects sequence strings to be in upper
case, so you'll need to make sure your data is upper case or has been
converted to upper case before calling our method.
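
The conversion is a one-liner, sketched here with made-up example data and
a hypothetical helper class. Passing an explicit locale is a precaution of
mine so the result does not depend on the machine's default locale; the
BioJava call is left as a comment because it needs biojava-1.5.jar on the
classpath.

```java
import java.util.Locale;

// Hypothetical helper class: upper-cases sequence strings before
// they are handed to a case-sensitive method.
final class UpperCaseBeforeK2P {

    /** Upper-cases a sequence string independently of the JVM's default locale. */
    static String normalise(String seq) {
        return seq.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String seq1 = normalise("acgtACgt"); // made-up example data
        String seq2 = normalise("acgtACga");
        System.out.println(seq1 + " " + seq2);
        // With biojava-1.5.jar on the classpath, the call would then be:
        // double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
    }
}
```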

cheers,
Richard

vineith kaul wrote:
> This is what I have .....Thanks a lot  fr the help.
> 
> 
> //Method to calculate the Kimura 2 parameter distance
> public static double K2P(String sequence1,String sequence2){
>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> 
> 
>         char[] seq1array=sequence1.toCharArray();
>         char[] seq2array=sequence2.toCharArray();
> 
>         for(int i=0;i<seq1array.length;i++){
>                                 // Number of aligned sites
>                 if(((seq1array[i]=='a') ||
> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> 
>                         numberOfAlignedSites++;
>                 }
> 
>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>                         p++;
>                 }
>                 else
>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>                         p++;
>                 }
>                 else
>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>                         p++;
>                 }
>                 else
>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>                         p++;
>                 }
>                 else
>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>                                 q++;
>                         }
>                 else
>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>                                 q++;
>                         }
>                 else
>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>                                         q++;
>                                 }
>                 else
>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>                                         q++;
>                                 }
>                 else
>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>                                         q++;
>                                 }
>                 else
>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>                                         q++;
>                                 }
>                 else
>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>                                         q++;
>                                 }
>                 else
>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>                                         q++;
>                                 }
> 
> 
> 
> 
>         }
> 
>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> (((double)q)/numberOfAlignedSites);
>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>          return dist;
> }
> 
> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> <mailto:holland at ebi.ac.uk>> wrote:
> 
>     You should take a look at the latest 1.5 release, in the
>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>     phylogenetics code that will perform tasks as you describe. The future
>     plan is to extend this code to cover a wider range of use cases.
>     Kimura2P
>     is already implemented here, in
>     org.biojavax.bio.phylo.MultipleHitCorrection.
> 
>     If you can't find code that will do what you want, but have written some
>     before, then please do feel free to contribute it. Even if it is
>     slow, I'm
>     sure someone out there will be able to help optimise it!
> 
>     cheers,
>     Richard
> 
>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>     > Hi,
>     >
>     > Are there functions to calculate evolutionary pairwise distances like
>     > Kimura2P,Finkelstein etc in Biojava
>     > I did write smthng on my own but on large sequences it runs terribly
>     > slow and I am not even sure if thats right.
>     > --
>     > Vineith Kaul
>     > Masters Student Bioinformatics
>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>     > Georgia Tech, Atlanta
>     > _______________________________________________
>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>     <mailto:Biojava-l at lists.open-bio.org>
>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>     >
> 
> 
>     --
>     Richard Holland
>     BioMart ( http://www.biomart.org/)
>     EMBL-EBI
>     Hinxton, Cambridgeshire CB10 1SD, UK
> 
> 
> 
> 
> -- 
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
4iKvsyBj2uznhhjTF9EYDFE=
=LALE
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Wed Oct 24 08:09:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 09:09:13 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EF9B8.7020609@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>		<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk>
Message-ID: <471EFDA9.1090706@ebi.ac.uk>

Our code is very similar but not identical. The original programmer
short-cut a lot of the else-if conditions by first checking whether the two
bases are equal. It can then count the transitional changes and
assume the rest are transversional.

In terms of speed, I can't see an obvious way to speed up either piece of
code. In our code, replacing the 10 or so calls to String.charAt() with
two calls and reusing those chars might help,
but in all honesty I cannot say.
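
To make the short cut concrete, here is a rough sketch of the idea. It is
not the BioJava source; the class and method names are made up, and it
reads each position with one charAt() call per sequence. Two bases that
differ are a transition when both are purines (A, G) or both are
pyrimidines (C, T), and a transversion otherwise. The final formula is the
same one used in Vineith's code.

```java
// Hypothetical sketch, not the BioJava implementation.
final class K2PSketch {

    private static boolean isPurine(char c)     { return c == 'A' || c == 'G'; }
    private static boolean isPyrimidine(char c) { return c == 'C' || c == 'T'; }
    private static boolean isBase(char c)       { return isPurine(c) || isPyrimidine(c); }

    /** Kimura 2-parameter distance over sites where both sequences have A, C, G or T. */
    static double k2p(String s1, String s2) {
        int n = Math.min(s1.length(), s2.length()); // ignore any unpaired tail
        long sites = 0, transitions = 0, transversions = 0;
        for (int i = 0; i < n; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            if (!isBase(a) || !isBase(b)) continue; // gap or ambiguity code: not an aligned site
            sites++;
            if (a == b) continue;                   // identical bases: no difference
            if (isPurine(a) == isPurine(b)) {
                transitions++;                      // A<->G or C<->T
            } else {
                transversions++;                    // every other mismatch
            }
        }
        double p = (double) transitions / sites;    // NaN if there are no aligned sites
        double q = (double) transversions / sites;
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }
}
```

One deliberate difference from the posted code: the loop stops at the
shorter of the two strings instead of indexing past the end of the second.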

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Thanks.
> 
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
> 
> You can call our code like this:
> 
>  // import statement for biojava phylo stuff
>  import org.biojavax.bio.phylo.*;
> 
>  // ...rest of code goes here
> 
>  // call Kimura2P
>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>  String seq2 = ...;
>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> 
> Note that our implementation expects sequence strings to be in upper
> case, so you'll need to make sure your data is upper case or has been
> converted to upper case before calling our method.
> 
> cheers,
> Richard
> 
> vineith kaul wrote:
>> This is what I have .....Thanks a lot  fr the help.
>>
>>
>> //Method to calculate the Kimura 2 parameter distance
>> public static double K2P(String sequence1,String sequence2){
>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>
>>
>>         char[] seq1array=sequence1.toCharArray();
>>         char[] seq2array=sequence2.toCharArray();
>>
>>         for(int i=0;i<seq1array.length;i++){
>>                                 // Number of aligned sites
>>                 if(((seq1array[i]=='a') ||
>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>
>>                         numberOfAlignedSites++;
>>                 }
>>
>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>                         p++;
>>                 }
>>                 else
>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>                         p++;
>>                 }
>>                 else
>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>                         p++;
>>                 }
>>                 else
>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>                         p++;
>>                 }
>>                 else
>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>                                 q++;
>>                         }
>>                 else
>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>                                 q++;
>>                         }
>>                 else
>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>                                         q++;
>>                                 }
>>                 else
>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>                                         q++;
>>                                 }
>>                 else
>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>                                         q++;
>>                                 }
>>                 else
>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>                                         q++;
>>                                 }
>>                 else
>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>                                         q++;
>>                                 }
>>                 else
>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>                                         q++;
>>                                 }
>>
>>
>>
>>
>>         }
>>
>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>> (((double)q)/numberOfAlignedSites);
>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>          return dist;
>> }
>>
>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>> <mailto:holland at ebi.ac.uk>> wrote:
>>
>>     You should take a look at the latest 1.5 release, in the
>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>     phylogenetics code that will perform tasks as you describe. The future
>>     plan is to extend this code to cover a wider range of use cases.
>>     Kimura2P
>>     is already implemented here, in
>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>
>>     If you can't find code that will do what you want, but have written some
>>     before, then please do feel free to contribute it. Even if it is
>>     slow, I'm
>>     sure someone out there will be able to help optimise it!
>>
>>     cheers,
>>     Richard
>>
>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>     > Hi,
>>     >
>>     > Are there functions to calculate evolutionary pairwise distances like
>>     > Kimura2P,Finkelstein etc in Biojava
>>     > I did write smthng on my own but on large sequences it runs terribly
>>     > slow and I am not even sure if thats right.
>>     > --
>>     > Vineith Kaul
>>     > Masters Student Bioinformatics
>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>     > Georgia Tech, Atlanta
>>     > _______________________________________________
>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>     <mailto:Biojava-l at lists.open-bio.org>
>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>     >
>>
>>
>>     --
>>     Richard Holland
>>     BioMart ( http://www.biomart.org/)
>>     EMBL-EBI
>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>
>>
>>
>>
>> -- 
>> Vineith Kaul
>> Masters Student Bioinformatics
>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>> Georgia Tech, Atlanta
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> 4iKvsyBj2uznhhjTF9EYDFE=
> =LALE
> -----END PGP SIGNATURE-----
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l


From markjschreiber at gmail.com  Wed Oct 24 11:59:04 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 19:59:04 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EFDA9.1090706@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
Message-ID: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>

Hi -

From experience, the best way to optimize Java code is to run a
profiler. The one in NetBeans is quite good.

The reason is that the HotSpot or JIT compilers might natively compile
the part of the code that you think is slow and actually make it
faster than something else, which then becomes the bottleneck. Using a
good profiler you can see how much time is spent in each method and
pinpoint some candidate methods for optimization. You can also see
whether there is a burden due to the creation of lots of objects.

- Mark



From ayates at ebi.ac.uk  Wed Oct 24 12:28:21 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 13:28:21 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
Message-ID: <471F3A65.50202@ebi.ac.uk>

Yes, a very good point, and one I was going to make beforehand but forgot :)

It is also worth mentioning that micro-benchmarks and profiling in Java
are notorious for giving misleading results because of VM warmup and JIT
compilation optimisations. There is a framework hosted on Java.net
somewhere which can perform VM warmups and code iterations to produce
more accurate benchmarking results, but the name escapes me at the moment.
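
To illustrate the warmup point, here is a minimal hand-rolled sketch.
It is not the Java.net framework mentioned above, and the names
(WarmupTimingSketch, meanNanos) are hypothetical. It runs the task a
number of times untimed so the JIT has a chance to compile it, then
averages the wall-clock time over the timed runs. The workload in main
is a stand-in mismatch count, not the Kimura2P code.

```java
final class WarmupTimingSketch {

    /** Runs the task untimed for warmup, then returns mean nanoseconds per timed run. */
    static double meanNanos(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // untimed: gives the JIT a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) timedRuns;
    }

    public static void main(String[] args) {
        // Build two long stand-in sequences that differ at every fourth position.
        StringBuilder sb1 = new StringBuilder();
        StringBuilder sb2 = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb1.append("ACGT");
            sb2.append("ACGA");
        }
        final String s1 = sb1.toString();
        final String s2 = sb2.toString();
        final long[] sink = new long[1]; // keeps the result live so the loop is not optimised away

        double nanos = meanNanos(() -> {
            long mismatches = 0;
            for (int i = 0; i < s1.length(); i++) {
                if (s1.charAt(i) != s2.charAt(i)) {
                    mismatches++;
                }
            }
            sink[0] = mismatches;
        }, 50, 50);

        System.out.println("mismatches=" + sink[0] + ", mean ns per run=" + nanos);
    }
}
```

Even with a warmup phase, figures from a loop like this are only a
rough guide; a profiler or a dedicated benchmarking framework is the
safer choice.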

However, looking at this particular code, I get the feeling that this is
about as fast as it's going to get without someone doing bitwise XOR
operations or writing some C code ... that's not an open invitation for
people to start recoding this in C :). At the end of the day, the key to
optimisation is to ask the question "is it fast enough already?". If it
is, then there's no point :)
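
One possible reading of the bitwise XOR idea, sketched under
assumptions: the 2-bit encoding and the names (XorBaseSketch, encode,
count) are hypothetical and do not come from BioJava. With A=00, G=01,
C=10 and T=11, the high bit separates purines from pyrimidines, so the
XOR of two codes is 0 for identical bases, 1 for a transition and 2 or
3 for a transversion. Whether this is actually faster than the chain of
comparisons would have to be measured.

```java
final class XorBaseSketch {

    /** Maps a base to a 2-bit code, or -1 for gaps and ambiguity codes. */
    static int encode(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 0; // 00, purine
            case 'G': return 1; // 01, purine
            case 'C': return 2; // 10, pyrimidine
            case 'T': return 3; // 11, pyrimidine
            default:  return -1;
        }
    }

    /** Returns {transitions, transversions, alignedSites} for two equal-length sequences. */
    static long[] count(String seq1, String seq2) {
        if (seq1.length() != seq2.length()) {
            throw new IllegalArgumentException("sequences must have the same length");
        }
        long transitions = 0, transversions = 0, sites = 0;
        for (int i = 0; i < seq1.length(); i++) {
            int a = encode(seq1.charAt(i));
            int b = encode(seq2.charAt(i));
            if (a < 0 || b < 0) {
                continue; // not an aligned site
            }
            sites++;
            int x = a ^ b;
            if (x == 1) {
                transitions++;   // only the low bit differs: A<->G or C<->T
            } else if (x > 1) {
                transversions++; // the high bit differs: purine <-> pyrimidine
            }
        }
        return new long[] {transitions, transversions, sites};
    }
}
```

The three counts can then be fed into the same distance formula as in
the quoted K2P method.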

Andy



From markjschreiber at gmail.com  Wed Oct 24 13:19:25 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:19:25 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F3A65.50202@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
Message-ID: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>

Another important consideration after optimization is whether the task
can be multithreaded. Almost all modern computers have at least 2 cores,
so if the algorithm can be parallelized you will get some performance
bonus on most machines.

Modern JVMs will automagically try to use idle CPUs to execute new
threads spawned by the programmer.
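
As a rough sketch of how that could look for pairwise distances, the
example below computes a distance matrix with one task per row on a
fixed thread pool from java.util.concurrent. The names
(ParallelDistanceSketch, distanceMatrix, pDistance) are hypothetical,
and pDistance is only a stand-in (the proportion of mismatching
positions); a Kimura2P call would take its place. Each task writes only
its own cells of the matrix, so no locking is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDistanceSketch {

    /** Stand-in distance: proportion of mismatching positions. */
    static double pDistance(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int diff = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return n == 0 ? 0.0 : (double) diff / n;
    }

    /** Computes the symmetric pairwise distance matrix, one task per row. */
    static double[][] distanceMatrix(final List<String> seqs, int threads) {
        final int n = seqs.size();
        final double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> rows = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                rows.add(pool.submit(() -> {
                    // Task "row" is the only writer of (row, j) and (j, row) for j > row.
                    for (int j = row + 1; j < n; j++) {
                        double d = pDistance(seqs.get(row), seqs.get(j));
                        matrix[row][j] = d;
                        matrix[j][row] = d;
                    }
                }));
            }
            for (Future<?> f : rows) {
                f.get(); // wait for the row; also makes its writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while computing distances", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return matrix;
    }
}
```

A pool size of Runtime.getRuntime().availableProcessors() is a
reasonable starting point. For a single pair of sequences the gain is
likely to be small; the benefit shows when many pairs are compared.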

- Mark

On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> Yes a very good point & one I was going to make before hand but forgot :)
>
> Also not to mention that micro-benchmarks/profiling in Java are
> notorious for giving false results due to VM warmup & JIT compilation
> optimisations. There is a framework hosted on Java.net somewhere which
> can perform VM warmups and code iterations to produce more accurate
> benchmarking results; but the name escapes me at the moment.
>
> However looking at this particular code I get the feeling that this is
> about as fast as its going to get without someone doing bitwise XOR
> operations or some C code ... that's not an open invitation for people
> to start recoding this in C :). At the end of the day the key to
> optimisation is to ask the question "is it fast enough already?". If it
> is then there's no point :)
>
> Andy
>
> Mark Schreiber wrote:
> > Hi -
> >
> >>From experience the best way to optimize java code is to run a
> > profiler. The one in Netbeans is quite good.
> >
> > The reason is that the hotspot or JIT compilers might natively compile
> > the part of the code that you think is slow and actually make it
> > faster than something else which becomes the bottle neck. Using a good
> > profiler you can detect how much time is spent in each method and pin
> > point some candidate methods for optimization. You can also see if
> > there is a burden due to creation of lots of objects.
> >
> > - Mark
> >
> > On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> >> Our code is very similar but not identical. The original programmer
> >> shortcutted a lot of else if conditions by considering if the two bases
> >> were equal or not. It can then calculate the transitional changes &
> >> assume the rest are transversional.
> >>
> >> In terms of speed of both pieces of code I can't see an obvious way to
> >> speed it up. Probably in our code removing the 10 or so calls to
> >> String.charAt() with a two calls & referencing those chars might help
> >> but in all honesty I cannot say.
> >>
> >> Andy
> >>
> >> Richard Holland wrote:
> >>> -----BEGIN PGP SIGNED MESSAGE-----
> >>> Hash: SHA1
> >>>
> >>> Thanks.
> >>>
> >>> Your code is similar to the code we have in
> >>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> >>> see if it is identical, but it probably is.
> >>>
> >>> You can call our code like this:
> >>>
> >>>  // import statement for biojava phylo stuff
> >>>  import org.biojavax.bio.phylo.*;
> >>>
> >>>  // ...rest of code goes here
> >>>
> >>>  // call Kimura2P
> >>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
> >>>  String seq2 = ...;
> >>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> >>>
> >>> Note that our implementation expects sequence strings to be in upper
> >>> case, so you'll need to make sure your data is upper case or has been
> >>> converted to upper case before calling our method.
> >>>
> >>> cheers,
> >>> Richard
> >>>
> >>> vineith kaul wrote:
> >>>> This is what I have .....Thanks a lot  fr the help.
> >>>>
> >>>>
> >>>> //Method to calculate the Kimura 2 parameter distance
> >>>> public static double K2P(String sequence1,String sequence2){
> >>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> >>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> >>>>
> >>>>
> >>>>         char[] seq1array=sequence1.toCharArray();
> >>>>         char[] seq2array=sequence2.toCharArray();
> >>>>
> >>>>         for(int i=0;i<seq1array.length;i++){
> >>>>                                 // Number of aligned sites
> >>>>                 if(((seq1array[i]=='a') ||
> >>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> >>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> >>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> >>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> >>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> >>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>
> >>>>                         numberOfAlignedSites++;
> >>>>                 }
> >>>>
> >>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>                         p++;
> >>>>                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>                         p++;
> >>>>                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>                         p++;
> >>>>                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>                         p++;
> >>>>                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>                                 q++;
> >>>>                         }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>                                 q++;
> >>>>                         }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>                 else
> >>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>>>                                         q++;
> >>>>                                 }
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>         }
> >>>>
> >>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> >>>> (((double)q)/numberOfAlignedSites);
> >>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> >>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> >>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> >>>>          return dist;
> >>>> }
> >>>>
> >>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> >>>> <mailto:holland at ebi.ac.uk>> wrote:
> >>>>
> >>>>     You should take a look at the latest 1.5 release, in the
> >>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> >>>>     phylogenetics code that will perform tasks as you describe. The future
> >>>>     plan is to extend this code to cover a wider range of use cases.
> >>>>     Kimura2P
> >>>>     is already implemented here, in
> >>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
> >>>>
> >>>>     If you can't find code that will do what you want, but have written some
> >>>>     before, then please do feel free to contribute it. Even if it is
> >>>>     slow, I'm
> >>>>     sure someone out there will be able to help optimise it!
> >>>>
> >>>>     cheers,
> >>>>     Richard
> >>>>
> >>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> >>>>     > Hi,
> >>>>     >
> >>>>     > Are there functions to calculate evolutionary pairwise distances like
> >>>>     > Kimura2P,Finkelstein etc in Biojava
> >>>>     > I did write smthng on my own but on large sequences it runs terribly
> >>>>     > slow and I am not even sure if thats right.
> >>>>     > --
> >>>>     > Vineith Kaul
> >>>>     > Masters Student Bioinformatics
> >>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>>>     > Georgia Tech, Atlanta
> >>>>     > _______________________________________________
> >>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> >>>>     <mailto:Biojava-l at lists.open-bio.org>
> >>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>     >
> >>>>
> >>>>
> >>>>     --
> >>>>     Richard Holland
> >>>>     BioMart ( http://www.biomart.org/)
> >>>>     EMBL-EBI
> >>>>     Hinxton, Cambridgeshire CB10 1SD, UK
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Vineith Kaul
> >>>> Masters Student Bioinformatics
> >>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>>> Georgia Tech, Atlanta
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v1.4.2.2 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> >>> 4iKvsyBj2uznhhjTF9EYDFE=
> >>> =LALE
> >>> -----END PGP SIGNATURE-----
> >>> _______________________________________________
> >>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >> _______________________________________________
> >> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
>


From holland at ebi.ac.uk  Wed Oct 24 13:33:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:33:53 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
Message-ID: <471F49C1.9070901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.

But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.
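
As a rough illustration of the chunking idea above, something like the
following might work. This is only a sketch and not the BioJava code: the
class ParallelK2P and its methods are hypothetical names made up for the
example, the input is assumed to be upper case and already aligned, and
the number of threads is simply passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of BioJava: counts K2P site classes in N chunks.
class ParallelK2P {

    // True for an unambiguous nucleotide (assumes upper-case input).
    static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    // Returns {alignedSites, transitions, transversions} for positions [from, to).
    static long[] countRange(String s1, String s2, int from, int to) {
        long sites = 0, transitions = 0, transversions = 0;
        for (int i = from; i < to; i++) {
            char a = s1.charAt(i), b = s2.charAt(i);
            if (!isBase(a) || !isBase(b)) continue;   // gap or ambiguity code
            sites++;
            if (a == b) continue;                     // no difference at this site
            if (isPurine(a) == isPurine(b)) transitions++; else transversions++;
        }
        return new long[] {sites, transitions, transversions};
    }

    // Splits the alignment into 'threads' chunks and sums the per-chunk counts.
    static long[] count(final String s1, final String s2, int threads) {
        if (threads < 1) throw new IllegalArgumentException("threads must be >= 1");
        final int n = Math.min(s1.length(), s2.length());
        final int chunk = (n + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> parts = new ArrayList<Future<long[]>>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(n, t * chunk);
                final int to = Math.min(n, from + chunk);
                Callable<long[]> task = new Callable<long[]>() {
                    public long[] call() { return countRange(s1, s2, from, to); }
                };
                parts.add(pool.submit(task));
            }
            long[] total = new long[3];
            for (Future<long[]> part : parts) {
                long[] c = part.get();
                for (int k = 0; k < 3; k++) total[k] += c[k];
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Final calculation on the combined counts, same formula as the posted K2P method.
    static double distance(String s1, String s2, int threads) {
        long[] c = count(s1, s2, threads);
        double p = (double) c[1] / c[0];   // proportion of transitions
        double q = (double) c[2] / c[0];   // proportion of transversions
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }
}
```

Only the counting is split up; the logarithms are taken once on the
combined totals, so the result should not depend on how many chunks are
used. For short sequences the cost of starting the threads will probably
outweigh any gain.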

cheers,
Richard

Mark Schreiber wrote:
> Another important consideration after optimization is can the task be
> multithreaded?  Almost all modern computers have at least 2 cores. So
> if the algorithm can be parallelized you will get some performance
> bonus on most machines.
> 
> Modern JVM's will automagically try to use idle CPU's to execute new
> threads spawned by the programmer.
> 
> - Mark
> 
> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Yes a very good point & one I was going to make before hand but forgot :)
>>
>> Also not to mention that micro-benchmarks/profiling in Java are
>> notorious for giving false results due to VM warmup & JIT compilation
>> optimisations. There is a framework hosted on Java.net somewhere which
>> can perform VM warmups and code iterations to produce more accurate
>> benchmarking results; but the name escapes me at the moment.
>>
>> However looking at this particular code I get the feeling that this is
>> about as fast as its going to get without someone doing bitwise XOR
>> operations or some C code ... that's not an open invitation for people
>> to start recoding this in C :). At the end of the day the key to
>> optimisation is to ask the question "is it fast enough already?". If it
>> is then there's no point :)
>>
>> Andy
>>
>> Mark Schreiber wrote:
>>> Hi -
>>>
>>> From experience the best way to optimize java code is to run a
>>> profiler. The one in Netbeans is quite good.
>>>
>>> The reason is that the hotspot or JIT compilers might natively compile
>>> the part of the code that you think is slow and actually make it
>>> faster than something else which becomes the bottle neck. Using a good
>>> profiler you can detect how much time is spent in each method and pin
>>> point some candidate methods for optimization. You can also see if
>>> there is a burden due to creation of lots of objects.
>>>
>>> - Mark
>>>
>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>> Our code is very similar but not identical. The original programmer
>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>> were equal or not. It can then calculate the transitional changes &
>>>> assume the rest are transversional.
>>>>
>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>> speed it up. Probably in our code removing the 10 or so calls to
>>>> String.charAt() with a two calls & referencing those chars might help
>>>> but in all honesty I cannot say.
>>>>
>>>> Andy
>>>>
>>>> Richard Holland wrote:
> Thanks.
> 
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
> 
> You can call our code like this:
> 
>  // import statement for biojava phylo stuff
>  import org.biojavax.bio.phylo.*;
> 
>  // ...rest of code goes here
> 
>  // call Kimura2P
>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>  String seq2 = ...;
>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> 
> Note that our implementation expects sequence strings to be in upper
> case, so you'll need to make sure your data is upper case or has been
> converted to upper case before calling our method.
> 
> cheers,
> Richard
> 
> vineith kaul wrote:
>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>
>>>>>>>
>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>
>>>>>>>
>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>
>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>                                 // Number of aligned sites
>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>
>>>>>>>                         numberOfAlignedSites++;
>>>>>>>                 }
>>>>>>>
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                         p++;
>>>>>>>                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                 q++;
>>>>>>>                         }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>                 else
>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>                                         q++;
>>>>>>>                                 }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>          return dist;
>>>>>>> }
>>>>>>>
>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>
>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>     Kimura2P
>>>>>>>     is already implemented here, in
>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>
>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>     slow, I'm
>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>
>>>>>>>     cheers,
>>>>>>>     Richard
>>>>>>>
>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>     > Hi,
>>>>>>>     >
>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>     > --
>>>>>>>     > Vineith Kaul
>>>>>>>     > Masters Student Bioinformatics
>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>     > _______________________________________________
>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>     >
>>>>>>>
>>>>>>>
>>>>>>>     --
>>>>>>>     Richard Holland
>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>     EMBL-EBI
>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vineith Kaul
>>>>>>> Masters Student Bioinformatics
>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> Georgia Tech, Atlanta
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----


From markjschreiber at gmail.com  Wed Oct 24 13:41:16 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:41:16 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
Message-ID: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>

I'm not aware of a way to determine the number of CPUs from within a
program, although possibly it is one of the environment variables
available from System.

Even if it can't be determined there could be a method argument to
specify the number of threads to spawn.

- Mark

On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----
>


From markjschreiber at gmail.com  Wed Oct 24 13:48:00 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:48:00 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>

It appears it is as simple as:

Runtime.getRuntime().availableProcessors();
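
A minimal sketch of how that call could be combined with the optional
user hint Richard suggested. The class name ThreadHint and the property
name "biojava.threads" are assumptions made up for this example; as far
as I know BioJava defines no such setting.

```java
// Hypothetical helper, not part of BioJava.
class ThreadHint {

    // Assumed property name, chosen for illustration only.
    static final String PROPERTY = "biojava.threads";

    // Returns the user's hint if it is a positive integer,
    // otherwise the number of processors the JVM reports.
    static int threadCount() {
        int cores = Runtime.getRuntime().availableProcessors();
        String hint = System.getProperty(PROPERTY);
        if (hint != null) {
            try {
                int n = Integer.parseInt(hint.trim());
                if (n > 0) {
                    return n;
                }
            } catch (NumberFormatException e) {
                // Malformed hint: fall through to the detected value.
            }
        }
        return cores;
    }
}
```

A user could then start the JVM with -Dbiojava.threads=2 to cap the
thread count, and leave the property unset to use whatever the runtime
detects.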

- Mark

On 10/24/07, Mark Schreiber <markjschreiber at gmail.com> wrote:
> I'm not aware of a way to determine the number of CPU's within a
> program although possibly it is one the the environment variables
> available from System.
>
> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.
>
> - Mark
>
> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This particular code could easily be parallelised - given N threads, you
> > can simply divide the input into N chunks and get each thread to process
> > 1/Nth of the input. You then combine the output of each thread to do the
> > final calculation.
> >
> > But, it'd be bad practice to always fork a predetermined N threads for a
> > given task. It'd be much better to somehow be able to ask 'how parallel
> > can I make this?' at runtime by checking system resources, or maybe get
> > the parallel-savvy user to set an optional BioJava-wide parallelisation
> > hint. N could then be determined and the task divided appropriately.
> >
> > cheers,
> > Richard
> >
> > Mark Schreiber wrote:
> > > Another important consideration after optimization is can the task be
> > > multithreaded?  Almost all modern computers have at least 2 cores. So
> > > if the algorithm can be parallelized you will get some performance
> > > bonus on most machines.
> > >
> > > Modern JVM's will automagically try to use idle CPU's to execute new
> > > threads spawned by the programmer.
> > >
> > > - Mark
> > >
> > > On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> > >> Yes a very good point & one I was going to make before hand but forgot :)
> > >>
> > >> Also not to mention that micro-benchmarks/profiling in Java are
> > >> notorious for giving false results due to VM warmup & JIT compilation
> > >> optimisations. There is a framework hosted on Java.net somewhere which
> > >> can perform VM warmups and code iterations to produce more accurate
> > >> benchmarking results; but the name escapes me at the moment.
> > >>
> > >> However looking at this particular code I get the feeling that this is
> > >> about as fast as its going to get without someone doing bitwise XOR
> > >> operations or some C code ... that's not an open invitation for people
> > >> to start recoding this in C :). At the end of the day the key to
> > >> optimisation is to ask the question "is it fast enough already?". If it
> > >> is then there's no point :)
> > >>
> > >> Andy
> > >>
> > >> Mark Schreiber wrote:
> > >>> Hi -
> > >>>
> > >>> From experience the best way to optimize java code is to run a
> > >>> profiler. The one in Netbeans is quite good.
> > >>>
> > >>> The reason is that the hotspot or JIT compilers might natively compile
> > >>> the part of the code that you think is slow and actually make it
> > >>> faster than something else which becomes the bottle neck. Using a good
> > >>> profiler you can detect how much time is spent in each method and pin
> > >>> point some candidate methods for optimization. You can also see if
> > >>> there is a burden due to creation of lots of objects.
> > >>>
> > >>> - Mark
> > >>>
> > >>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
> > >>>> Our code is very similar but not identical. The original programmer
> > >>>> shortcutted a lot of else if conditions by considering if the two bases
> > >>>> were equal or not. It can then calculate the transitional changes &
> > >>>> assume the rest are transversional.
> > >>>>
> > >>>> In terms of speed of both pieces of code I can't see an obvious way to
> > >>>> speed it up. Probably in our code replacing the 10 or so calls to
> > >>>> String.charAt() with two calls & referencing those chars might help
> > >>>> but in all honesty I cannot say.
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>> Richard Holland wrote:
> > > Thanks.
> > >
> > > Your code is similar to the code we have in
> > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > > see if it is identical, but it probably is.
> > >
> > > You can call our code like this:
> > >
> > >  // import statement for biojava phylo stuff
> > >  import org.biojavax.bio.phylo.*;
> > >
> > >  // ...rest of code goes here
> > >
> > >  // call Kimura2P
> > >  String seq1 = ...; // Get seq1 and seq2 from somewhere
> > >  String seq2 = ...;
> > >  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> > >
> > > Note that our implementation expects sequence strings to be in upper
> > > case, so you'll need to make sure your data is upper case or has been
> > > converted to upper case before calling our method.
> > >
> > > cheers,
> > > Richard
> > >
> > > vineith kaul wrote:
> > >>>>>>> This is what I have .....Thanks a lot  fr the help.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> //Method to calculate the Kimura 2 parameter distance
> > >>>>>>> public static double K2P(String sequence1,String sequence2){
> > >>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
> > >>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>         char[] seq1array=sequence1.toCharArray();
> > >>>>>>>         char[] seq2array=sequence2.toCharArray();
> > >>>>>>>
> > >>>>>>>         for(int i=0;i<seq1array.length;i++){
> > >>>>>>>                                 // Number of aligned sites
> > >>>>>>>                 if(((seq1array[i]=='a') ||
> > >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
> > >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> > >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> > >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
> > >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
> > >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>
> > >>>>>>>                         numberOfAlignedSites++;
> > >>>>>>>                 }
> > >>>>>>>
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                         p++;
> > >>>>>>>                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                                 q++;
> > >>>>>>>                         }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                                 q++;
> > >>>>>>>                         }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>                 else
> > >>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>                                         q++;
> > >>>>>>>                                 }
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>         }
> > >>>>>>>
> > >>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> > >>>>>>> (((double)q)/numberOfAlignedSites);
> > >>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> > >>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> > >>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> > >>>>>>>          return dist;
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
> > >>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
> > >>>>>>>
> > >>>>>>>     You should take a look at the latest 1.5 release, in the
> > >>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> > >>>>>>>     phylogenetics code that will perform tasks as you describe. The future
> > >>>>>>>     plan is to extend this code to cover a wider range of use cases.
> > >>>>>>>     Kimura2P
> > >>>>>>>     is already implemented here, in
> > >>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
> > >>>>>>>
> > >>>>>>>     If you can't find code that will do what you want, but have written some
> > >>>>>>>     before, then please do feel free to contribute it. Even if it is
> > >>>>>>>     slow, I'm
> > >>>>>>>     sure someone out there will be able to help optimise it!
> > >>>>>>>
> > >>>>>>>     cheers,
> > >>>>>>>     Richard
> > >>>>>>>
> > >>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> > >>>>>>>     > Hi,
> > >>>>>>>     >
> > >>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
> > >>>>>>>     > Kimura2P,Finkelstein etc in Biojava
> > >>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
> > >>>>>>>     > slow and I am not even sure if thats right.
> > >>>>>>>     > --
> > >>>>>>>     > Vineith Kaul
> > >>>>>>>     > Masters Student Bioinformatics
> > >>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> > >>>>>>>     > Georgia Tech, Atlanta
> > >>>>>>>     > _______________________________________________
> > >>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
> > >>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
> > >>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >>>>>>>     >
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>     --
> > >>>>>>>     Richard Holland
> > >>>>>>>     BioMart ( http://www.biomart.org/)
> > >>>>>>>     EMBL-EBI
> > >>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Vineith Kaul
> > >>>>>>> Masters Student Bioinformatics
> > >>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> > >>>>>>> Georgia Tech, Atlanta
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> > IEyRleSs1+AziCvfhcES8wI=
> > =uLDm
> > -----END PGP SIGNATURE-----
> >
>


From ayates at ebi.ac.uk  Wed Oct 24 13:49:22 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:22 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
Message-ID: <471F4D62.3030900@ebi.ac.uk>

Of course parallelisation all depends on the task not being limited by 
something else like memory, IO or database (which of course this 
wouldn't be). There's also the scenario where thread startup takes 
longer than running the code in serial :). Not to mention Java 
concurrency isn't an easy thing to write correctly.

I'd prefer the model promoted in Java 5 where you have pools of threads &
pass in instances of Callable (a successor to Runnable whose call() method
can return a result or throw an exception; the pool hands you back a
Future for each one). You pass in a list of these callables, wait for
them all to finish & grab the results from the Futures. You can have as
many callables as you like & the thread pool will process them as & when
a thread becomes free. Combine this with looking at the reported number
of processors/cores on the machine & say that's the default size of the
pool (assuming you're making it parallel because you're flat-lining a
processor).

Say:

int processorCount = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(processorCount);

(Both classes live in java.util.concurrent.) You get the idea :). Of course
someone may not want to parallelise a job (I quite like having dual cores
as a runaway process can take out one but I
can still run top & kill the thing).
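
To make the pool-of-callables idea concrete, here is a rough sketch of how
the chunking Richard described might look for a Kimura 2-parameter
distance. It is not the MultipleHitCorrection code; the class and method
names (ParallelK2P, count, distance) are invented for this example, it
compares only up to the length of the shorter string, and it returns NaN
when there are no aligned sites.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelK2P {

    static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    /** True for A<->G or C<->T; expects upper-case bases. */
    static boolean isTransition(char a, char b) {
        return (a == 'A' && b == 'G') || (a == 'G' && b == 'A')
            || (a == 'C' && b == 'T') || (a == 'T' && b == 'C');
    }

    /** Counts {aligned sites, transitions, transversions} in [from, to). */
    static long[] count(String s1, String s2, int from, int to) {
        long[] c = new long[3];
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            if (!isBase(a) || !isBase(b)) continue;  // gap or ambiguity code
            c[0]++;
            if (a == b) continue;                    // identical bases
            if (isTransition(a, b)) c[1]++; else c[2]++;
        }
        return c;
    }

    static double distance(final String s1, final String s2, int threads)
            throws Exception {
        final int length = Math.min(s1.length(), s2.length());
        int chunk = Math.max(1, (length + threads - 1) / threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One Callable per chunk of the alignment.
            List<Callable<long[]>> tasks = new ArrayList<Callable<long[]>>();
            for (int start = 0; start < length; start += chunk) {
                final int from = start;
                final int to = Math.min(length, start + chunk);
                tasks.add(new Callable<long[]>() {
                    public long[] call() {
                        return count(s1, s2, from, to);
                    }
                });
            }
            // invokeAll blocks until every task has finished.
            long n = 0, p = 0, q = 0;
            for (Future<long[]> f : pool.invokeAll(tasks)) {
                long[] c = f.get();
                n += c[0];
                p += c[1];
                q += c[2];
            }
            double P = (double) p / n;  // proportion of transitions
            double Q = (double) q / n;  // proportion of transversions
            return -0.5 * Math.log(1.0 - 2.0 * P - Q)
                 - 0.25 * Math.log(1.0 - 2.0 * Q);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(distance("ACGTACGTAC", "ACGTACGTAT", threads));
    }
}
```

Each task only returns three counters, so combining the results is cheap;
whether the pool actually beats the serial loop will depend on sequence
length, as thread startup is not free.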

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>> Another important consideration after optimization is can the task be
>> multithreaded?  Almost all modern computers have at least 2 cores. So
>> if the algorithm can be parallelized you will get some performance
>> bonus on most machines.
>>
>> Modern JVM's will automagically try to use idle CPU's to execute new
>> threads spawned by the programmer.
>>
>> - Mark
>>
>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>> Yes a very good point & one I was going to make before hand but forgot :)
>>>
>>> Also not to mention that micro-benchmarks/profiling in Java are
>>> notorious for giving false results due to VM warmup & JIT compilation
>>> optimisations. There is a framework hosted on Java.net somewhere which
>>> can perform VM warmups and code iterations to produce more accurate
>>> benchmarking results; but the name escapes me at the moment.
>>>
>>> However looking at this particular code I get the feeling that this is
>>> about as fast as its going to get without someone doing bitwise XOR
>>> operations or some C code ... that's not an open invitation for people
>>> to start recoding this in C :). At the end of the day the key to
>>> optimisation is to ask the question "is it fast enough already?". If it
>>> is then there's no point :)
>>>
>>> Andy
>>>
>>> Mark Schreiber wrote:
>>>> Hi -
>>>>
>>>> From experience the best way to optimize java code is to run a
>>>> profiler. The one in Netbeans is quite good.
>>>>
>>>> The reason is that the hotspot or JIT compilers might natively compile
>>>> the part of the code that you think is slow and actually make it
>>>> faster than something else which becomes the bottle neck. Using a good
>>>> profiler you can detect how much time is spent in each method and pin
>>>> point some candidate methods for optimization. You can also see if
>>>> there is a burden due to creation of lots of objects.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Our code is very similar but not identical. The original programmer
>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>> were equal or not. It can then calculate the transitional changes &
>>>>> assume the rest are transversional.
>>>>>
>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>> speed it up. Probably in our code replacing the 10 or so calls to
>>>>> String.charAt() with two calls & referencing those chars might help
>>>>> but in all honesty I cannot say.
>>>>>
>>>>> Andy
>>>>>
>>>>> Richard Holland wrote:
>> Thanks.
>>
>> Your code is similar to the code we have in
>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>> see if it is identical, but it probably is.
>>
>> You can call our code like this:
>>
>>  // import statement for biojava phylo stuff
>>  import org.biojavax.bio.phylo.*;
>>
>>  // ...rest of code goes here
>>
>>  // call Kimura2P
>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>  String seq2 = ...;
>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>
>> Note that our implementation expects sequence strings to be in upper
>> case, so you'll need to make sure your data is upper case or has been
>> converted to upper case before calling our method.
>>
>> cheers,
>> Richard
>>
>> vineith kaul wrote:
>>>>>>>> This is what I have .....Thanks a lot  fr the help.
>>>>>>>>
>>>>>>>>
>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0; // P= transitional
>>>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
>>>>>>>>
>>>>>>>>
>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>
>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>                                 // Number of aligned sites
>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>
>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                         p++;
>>>>>>>>                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                 q++;
>>>>>>>>                         }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>                 else
>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>                                         q++;
>>>>>>>>                                 }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>          return dist;
>>>>>>>> }
>>>>>>>>
>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>
>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>     org.biojavax.bio.phylo packages. This code is the beginnings of some
>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>     Kimura2P
>>>>>>>>     is already implemented here, in
>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>
>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>     before, then please do feel free to contribute it. Even if it is
>>>>>>>>     slow, I'm
>>>>>>>>     sure someone out there will be able to help optimise it!
>>>>>>>>
>>>>>>>>     cheers,
>>>>>>>>     Richard
>>>>>>>>
>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>     > Hi,
>>>>>>>>     >
>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>     > Kimura2P,Finkelstein etc in Biojava
>>>>>>>>     > I did write smthng on my own but on large sequences it runs terribly
>>>>>>>>     > slow and I am not even sure if thats right.
>>>>>>>>     > --
>>>>>>>>     > Vineith Kaul
>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>     > _______________________________________________
>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>     >
>>>>>>>>
>>>>>>>>
>>>>>>>>     --
>>>>>>>>     Richard Holland
>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>     EMBL-EBI
>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Vineith Kaul
>>>>>>>> Masters Student Bioinformatics
>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> 
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Wed Oct 24 13:49:38 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:38 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>	
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
	<93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
Message-ID: <471F4D72.80505@ebi.ac.uk>

Beat me to it :)

Andy

Mark Schreiber wrote:
> It appears it is as simple as:
> 
> Runtime.getRuntime().availableProcessors();
> 
> - Mark
> 
> On 10/24/07, Mark Schreiber <markjschreiber at gmail.com> wrote:
>> I'm not aware of a way to determine the number of CPU's within a
>> program although possibly it is one of the environment variables
>> available from System.
>>
>> Even if it can't be determined there could be a method argument to
>> specify the number of threads to spawn.
>>
>> - Mark
>>
>> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> This particular code could easily be parallelised - given N threads, you
>>> can simply divide the input into N chunks and get each thread to process
>>> 1/Nth of the input. You then combine the output of each thread to do the
>>> final calculation.
>>>
>>> But, it'd be bad practice to always fork a predetermined N threads for a
>>> given task. It'd be much better to somehow be able to ask 'how parallel
>>> can I make this?' at runtime by checking system resources, or maybe get
>>> the parallel-savvy user to set an optional BioJava-wide parallelisation
>>> hint. N could then be determined and the task divided appropriately.
>>>
>>> cheers,
>>> Richard
>>>
>>> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded?  Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVM's will automagically try to use idle CPU's to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Yes a very good point & one I was going to make before hand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However looking at this particular code I get the feeling that this is
>>>>> about as fast as its going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> From experience the best way to optimize java code is to run a
>>>>>> profiler. The one in Netbeans is quite good.
>>>>>>
>>>>>> The reason is that the hotspot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else which becomes the bottle neck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and pin
>>>>>> point some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. Probably in our code replacing the 10 or so calls to
>>>>>>> String.charAt() with two calls & referencing those chars might help
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
>>>>
>>>> You can call our code like this:
>>>>
>>>>  // import statement for biojava phylo stuff
>>>>  import org.biojavax.bio.phylo.*;
>>>>
>>>>  // ...rest of code goes here
>>>>
>>>>  // call Kimura2P
>>>>  String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>>  String seq2 = ...;
>>>>  double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> vineith kaul wrote:
>>>>>>>>>> This is what I have. Thanks a lot for the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>>>         // p = transitional differences (A<->G & T<->C);
>>>>>>>>>>         // q = transversional differences (A/G<-->C/T)
>>>>>>>>>>         long p=0,q=0,numberOfAlignedSites=0;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         char[] seq1array=sequence1.toCharArray();
>>>>>>>>>>         char[] seq2array=sequence2.toCharArray();
>>>>>>>>>>
>>>>>>>>>>         for(int i=0;i<seq1array.length;i++){
>>>>>>>>>>                                 // Number of aligned sites
>>>>>>>>>>                 if(((seq1array[i]=='a') ||
>>>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>
>>>>>>>>>>                         numberOfAlignedSites++;
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                         p++;
>>>>>>>>>>                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                 q++;
>>>>>>>>>>                         }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>                 else
>>>>>>>>>>                 if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>                                         q++;
>>>>>>>>>>                                 }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>          double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
>>>>>>>>>> (((double)q)/numberOfAlignedSites);
>>>>>>>>>>          double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>>>>>>>>>>          System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>>>>>>>>>>          double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>>>>>>>>>>          return dist;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> On 10/22/07, *Richard Holland* <holland at ebi.ac.uk
>>>>>>>>>> <mailto:holland at ebi.ac.uk>> wrote:
>>>>>>>>>>
>>>>>>>>>>     You should take a look at the latest 1.5 release, in the
>>>>>>>>>>     org.biojavax.bio.phylo package. This code is the beginnings of some
>>>>>>>>>>     phylogenetics code that will perform tasks as you describe. The future
>>>>>>>>>>     plan is to extend this code to cover a wider range of use cases.
>>>>>>>>>>     Kimura2P is already implemented here, in
>>>>>>>>>>     org.biojavax.bio.phylo.MultipleHitCorrection.
>>>>>>>>>>
>>>>>>>>>>     If you can't find code that will do what you want, but have written some
>>>>>>>>>>     before, then please do feel free to contribute it. Even if it is slow,
>>>>>>>>>>     I'm sure someone out there will be able to help optimise it!
>>>>>>>>>>
>>>>>>>>>>     cheers,
>>>>>>>>>>     Richard
>>>>>>>>>>
>>>>>>>>>>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>>>>>>>>>>     > Hi,
>>>>>>>>>>     >
>>>>>>>>>>     > Are there functions to calculate evolutionary pairwise distances like
>>>>>>>>>>     > Kimura2P, Finkelstein etc. in BioJava?
>>>>>>>>>>     > I did write something on my own, but on large sequences it runs terribly
>>>>>>>>>>     > slowly and I am not even sure if that's right.
>>>>>>>>>>     > --
>>>>>>>>>>     > Vineith Kaul
>>>>>>>>>>     > Masters Student Bioinformatics
>>>>>>>>>>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>>     > Georgia Tech, Atlanta
>>>>>>>>>>     > _______________________________________________
>>>>>>>>>>     > Biojava-l mailing list  -   Biojava-l at lists.open-bio.org
>>>>>>>>>>     <mailto:Biojava-l at lists.open-bio.org>
>>>>>>>>>>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>>>     >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Richard Holland
>>>>>>>>>>     BioMart ( http://www.biomart.org/)
>>>>>>>>>>     EMBL-EBI
>>>>>>>>>>     Hinxton, Cambridgeshire CB10 1SD, UK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vineith Kaul
>>>>>>>>>> Masters Student Bioinformatics
>>>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>>>> Georgia Tech, Atlanta
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l


From holland at ebi.ac.uk  Wed Oct 24 13:53:29 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:53:29 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <471F4E59.1040703@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program, although possibly it is one of the environment variables
> available from System.

Yup, I'm not aware of one either. Actually, thinking about this, it'd be
a bad thing if BioJava grabbed both CPUs just because they're currently
available - the user might want it to only run on one, with something
else running on the second one. So attempting to guess a good
parallelisation value from the system is probably not good!
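
For reference, the JVM does report a processor count through
Runtime.getRuntime().availableProcessors(), which has been part of the
standard library since Java 1.4. A minimal sketch of how it could serve as
an upper bound rather than as a default follows. The class name CpuCount
and the method suggestedThreads are made up for illustration and are not
BioJava API; the user still chooses the thread count, and the reported
processor count only caps it.

```java
// Sketch only: CpuCount and suggestedThreads are hypothetical names,
// not part of BioJava.
final class CpuCount {
    private CpuCount() {}

    /** Number of processors the JVM reports as available (at least 1). */
    static int available() {
        return Runtime.getRuntime().availableProcessors();
    }

    /** Caps a user-requested thread count at the reported processor count. */
    static int suggestedThreads(int requested) {
        if (requested < 1) {
            throw new IllegalArgumentException("requested must be >= 1");
        }
        return Math.min(requested, available());
    }

    public static void main(String[] args) {
        System.out.println("Processors available to the JVM: " + available());
        System.out.println("Threads to use if the user asks for 2: "
                + suggestedThreads(2));
    }
}
```

The reported value can change during the lifetime of the JVM, so code that
cares should ask again rather than cache it.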

> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.

I was thinking more along the lines of a global static method in some
kind of toolkit class, so that any part of BJ which is
parallelisation-aware can take advantage of it if it is set. This also
avoids passing parameters that don't have an immediately obvious impact
on the expected output of the method. I'd also like to have this global
variable control the total number of threads, so that if the user forks
a set of threads themselves and runs a parallel-aware method in each of
them, then BJ will not attempt to sub-divide each thread into more
threads than the limit configured by this variable. Likewise if the user
changes the limit whilst threads are currently running, they should stop
(if there are too many) or new ones should start (if there are too few),
but taking care to make sure that every parallelisation request
maintains at least one thread so the job doesn't stop entirely.... there
must be a toolkit for this somewhere surely?
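
One possible shape for such a global setting, as a rough sketch only: a
single shared java.util.concurrent.ThreadPoolExecutor whose size is the
limit. The class ParallelToolkit and its methods are hypothetical and do
not exist in BioJava; the sketch assumes Java 8 or later. Resizing a
ThreadPoolExecutor at runtime is supported by the standard library, and a
minimum of one keeps every job moving.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical toolkit class; nothing like this is confirmed to exist in BioJava.
final class ParallelToolkit {
    // One shared pool. Its size is the global thread limit and starts at 1.
    private static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>(),
            r -> {
                Thread t = new Thread(r, "parallel-toolkit-worker");
                t.setDaemon(true); // idle workers must not keep the JVM alive
                return t;
            });

    private ParallelToolkit() {}

    /** Changes the global limit; may be called while tasks are running. */
    static synchronized void setThreadLimit(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be >= 1");
        }
        // The core size may never exceed the maximum size, so the order of
        // the two calls depends on whether the pool grows or shrinks.
        if (limit > POOL.getMaximumPoolSize()) {
            POOL.setMaximumPoolSize(limit);
            POOL.setCorePoolSize(limit);
        } else {
            POOL.setCorePoolSize(limit);
            POOL.setMaximumPoolSize(limit);
        }
    }

    /** Current global limit on worker threads. */
    static synchronized int getThreadLimit() {
        return POOL.getMaximumPoolSize();
    }

    /** Shared pool that parallelisation-aware code would submit work to. */
    static ThreadPoolExecutor pool() {
        return POOL;
    }
}
```

When the limit is lowered, surplus workers finish their current task and
are retired once idle, rather than being stopped mid-task. One caveat with
any single bounded pool: a task that blocks waiting on sub-tasks submitted
to the same pool can deadlock, so nested parallel calls would need care.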

cheers,
Richard

> - Mark
> 
> On 10/24/07, Richard Holland <holland at ebi.ac.uk> wrote:
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
> 
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
> 
> cheers,
> Richard
> 
> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded?  Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVMs will automagically try to use idle CPUs to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates <ayates at ebi.ac.uk> wrote:
>>>>> Yes, a very good point & one I was going to make beforehand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However, looking at this particular code I get the feeling that this is
>>>>> about as fast as it's going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH05Y4C5LeMEKA/QRAouqAJ9TgDACIQLPeenSZcStDhkZQg/UuQCfc7sZ
cocyjnf9/T8H3uQJ+rW5m2U=
=Q6UR
-----END PGP SIGNATURE-----


From ayates at ebi.ac.uk  Wed Oct 24 13:58:01 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:58:01 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F4E59.1040703@ebi.ac.uk>
References: <f2446ee40710210930w264d9fe2r16db455b1592c965@mail.gmail.com>	
	<48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>	
	<f2446ee40710222359r5c9c5c64j805158bc221599ca@mail.gmail.com>	
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>	
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>	
	<471F3A65.50202@ebi.ac.uk>	
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>	
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
	<471F4E59.1040703@ebi.ac.uk>
Message-ID: <471F4F69.3010806@ebi.ac.uk>

The executor thread pool system (java.util.concurrent) is the best way to
control this. The thread pool can be set up once & handed out, whilst all
clients of the code wait for their jobs/futures to complete.
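
A rough sketch of how that could look for the distance code discussed in
this thread, splitting the alignment into chunks as Richard suggested and
combining the per-chunk counts at the end. The class ParallelK2P is made
up for illustration and is not the BioJava implementation. It assumes
Java 8 or later, compares only up to the shorter of the two sequences, and
treats any character other than A, C, G or T as a non-aligned site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class for illustration; not part of BioJava.
final class ParallelK2P {
    private ParallelK2P() {}

    /** Returns {transitions, transversions, alignedSites} for [from, to). */
    static long[] count(String s1, String s2, int from, int to) {
        long p = 0, q = 0, n = 0;
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            if ("ACGT".indexOf(a) < 0 || "ACGT".indexOf(b) < 0) {
                continue; // gap or ambiguity code: not an aligned site
            }
            n++;
            if (a == b) {
                continue; // identical bases: no difference to classify
            }
            boolean aPurine = (a == 'A' || a == 'G');
            boolean bPurine = (b == 'A' || b == 'G');
            if (aPurine == bPurine) {
                p++; // purine<->purine or pyrimidine<->pyrimidine
            } else {
                q++; // purine<->pyrimidine
            }
        }
        return new long[] {p, q, n};
    }

    /** Kimura 2-parameter distance, counting in up to 'chunks' pool tasks. */
    static double distance(String s1, String s2, ExecutorService pool, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        int len = Math.min(s1.length(), s2.length());
        int size = Math.max(1, (len + chunks - 1) / chunks);
        List<Future<long[]>> parts = new ArrayList<>();
        for (int start = 0; start < len; start += size) {
            final int from = start;
            final int to = Math.min(len, start + size);
            parts.add(pool.submit(() -> count(s1, s2, from, to)));
        }
        long p = 0, q = 0, n = 0;
        try {
            for (Future<long[]> part : parts) {
                long[] c = part.get(); // wait for this chunk to finish
                p += c[0];
                q += c[1];
                n += c[2];
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        }
        double P = (double) p / n; // proportion of transitions
        double Q = (double) q / n; // proportion of transversions
        return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
    }

    /** Convenience form that creates and shuts down its own fixed pool. */
    static double distance(String s1, String s2, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            return distance(s1, s2, pool, threads);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ParallelK2P.distance(seq1, seq2, 2) would count on two threads. If
no aligned sites are found the result is NaN, so callers may want to check
for that. Whether the threading overhead pays off for a loop this cheap is
exactly the sort of thing a profiler run should decide.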

Richard Holland wrote:
> I was thinking more along the lines of a global static method in some
> kind of toolkit class, so that any part of BJ which is
> parallelisation-aware can take advantage of it if it is set. This also
> avoids passing parameters that don't have an immediately obvious impact
> on the expected output of the method. I'd also like to have this global
> variable control the total number of threads, so that if the user forks
> a set of threads themselves and runs a parallel-aware method in each of
> them, then BJ will not attempt to sub-divide each thread into more
> threads than the limit configured by this variable. Likewise if the user
> changes the limit whilst threads are currently running, they should stop
> (if there are too many) or new ones should start (if there are too few),
> but taking care to make sure that every parallelisation request
> maintains at least one thread so the job doesn't stop entirely.... there
> must be a toolkit for this somewhere surely?
>