From matthew.pocock at ncl.ac.uk Tue Oct 2 18:14:01 2007
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue, 2 Oct 2007 23:14:01 +0100
Subject: [Biojava-l] Biojava Question.
In-Reply-To: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
References: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
Message-ID: <200710022314.02018.matthew.pocock@ncl.ac.uk>

This is very strange. This sort of error nearly always happens because of a
misconfigured classpath. Could you send me:

The HTML of the page that causes the problem
The URL of the jars the page should be referencing
A URL that I can point my browser at that causes the problem

It is difficult to debug something like this without the program actually in
front of me.

Matthew

On Tuesday 02 October 2007, abhi232 at cc.gatech.edu wrote:
> Respected Sir,
>
> I am sorry if I sent you a direct mail but this is a kind of emergency and
> I am not getting any substantial response from the biojava mailing
> community.
> I am a graduate student at Georgia Institute of Technology. We are working on
> creating a Teaceviewer applet for viewing the Sequence using biojava
> library.
> I am able to create the applet using netbeans and run it there.
> The error comes when I upload it on net. I am getting this particular
> error.
>
> java.lang.NoClassDefFoundError: org/biojava/bio/gui/sequence/SequenceRenderer
>       at java.lang.Class.getDeclaredConstructors0(Native Method)
>       at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
>       at java.lang.Class.getConstructor0(Unknown Source)
>       at java.lang.Class.newInstance0(Unknown Source)
>       at java.lang.Class.newInstance(Unknown Source)
>       at sun.applet.AppletPanel.createApplet(Unknown Source)
>       at sun.plugin.AppletViewer.createApplet(Unknown Source)
>       at sun.applet.AppletPanel.runLoader(Unknown Source)
>       at sun.applet.AppletPanel.run(Unknown Source)
>       at java.lang.Thread.run(Unknown Source)
>
> I am getting an error only for the SequenceRenderer class. Even if I comment
> that out, it still gives me the error.
>
> I have set the classpath as well as the path variables, and I am also
> giving the archive field in the applet code so that the biojava library will
> be available.
>
> Is there any particular thing required which I probably am missing?
> Please guide me on this topic.
> I would really appreciate your gesture.
> Thanks a lot in advance.

From elmh06 at yahoo.ca Wed Oct 3 14:27:36 2007
From: elmh06 at yahoo.ca (El Mabrouk M)
Date: Wed, 3 Oct 2007 14:27:36 -0400 (EDT)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
Message-ID: <975012.12435.qm@web37310.mail.mud.yahoo.com>

Hi!

I have just started to learn biojava. I have written a small program that
writes a sequence to a FASTA file with the help of the biojavax method

RichSequence.IOTools.writeFasta(seqOut, s1, ns);

I have got the error "cannot find symbol".
I'm using biojava 1.5, jdk 1.6 and netbeans.
What can be done to fix this problem?
This is what I tried:

import org.biojava.bio.seq.*;
import java.io.*;
import org.biojava.bio.symbol.SymbolList;
import org.biojavax.RichObjectFactory;
import javax.xml.stream.events.Namespace;
import org.biojavax.bio.seq.RichSequence;

public class SeqFastaF {
    public static void main(String[] args) {
        SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1");
        Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0");
        try {
            OutputStream seqOut = System.out;
            Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace();
            RichSequence.IOTools.writeFasta(seqOut,s1,ns);
        } catch (IOException ex) {
            //io error
            ex.printStackTrace();
        }
    }
}

Error:
cannot find symbol
symbol  : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace)
location: class org.biojavax.bio.seq.RichSequence.IOTools

From markjschreiber at gmail.com Wed Oct 3 19:20:31 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 4 Oct 2007 07:20:31 +0800
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID: <93b45ca50710031620m35495bfey8ec111177c6201f@mail.gmail.com>

Hi -

This is a compilation error. It occurs because the biojava write method
expects a Namespace object from the biojavax package, but netbeans has
guessed that you wanted a Namespace object from the javax.xml.stream.events
package and has imported that for you.

If you remove that import (javax.xml.stream.events.Namespace) and then
import the biojavax Namespace object, it should compile.

- Mark

On 10/4/07, El Mabrouk M wrote:
> Hi!
>
> I have just started to learn biojava.
I have written a small > program that write a sequence in fasta file with the help of the biojavax method > > RichSequence.IOTools.writeFasta(seqOut, s1, ns); > I have got the error "cannot find symbol". > I'm using biojava 1.5, jdk 1.6 and netbeans. > What can be done to fix this problem? > > This is what I tried: > > import org.biojava.bio.seq.*; > import java.io.*; > import org.biojava.bio.symbol.SymbolList; > import org.biojavax.RichObjectFactory; > import javax.xml.stream.events.Namespace; > import org.biojavax.bio.seq.RichSequence; > > public class SeqFastaF { > public static void main(String[] args) { > SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1"); > Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0"); > try { > OutputStream seqOut = System.out; > Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace(); > RichSequence.IOTools.writeFasta(seqOut,s1,ns); > } catch (IOException ex) { > //io error > ex.printStackTrace(); > } > } > } > > Error: > cannot find symbol > symbol : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace) > location: class org.biojavax.bio.seq.RichSequence.IOTools > > > > --------------------------------- > Be smarter than spam. See how smart SpamGuard is at giving junk email the boot with the All-new Yahoo! 
Mail
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From md5 at sanger.ac.uk Wed Oct 3 19:05:43 2007
From: md5 at sanger.ac.uk (Mutlu Dogruel)
Date: Thu, 4 Oct 2007 00:05:43 +0100 (BST)
Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method
In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com>
References: <975012.12435.qm@web37310.mail.mud.yahoo.com>
Message-ID:

Hi,

try using

import org.biojavax.Namespace;

instead of

import javax.xml.stream.events.Namespace;

Also, you should handle the IllegalSymbolException that
DNATools.createDNASequence may throw.

Cheers,
mutlu

On Wed, 3 Oct 2007, El Mabrouk M wrote:

> Hi!
>
> I have just started to learn biojava. I have written a small
> program that write a sequence in fasta file with the help of the biojavax method
>
> RichSequence.IOTools.writeFasta(seqOut, s1, ns);
> I have got the error "cannot find symbol".
> I'm using biojava 1.5, jdk 1.6 and netbeans.
> What can be done to fix this problem?
> > This is what I tried: > > import org.biojava.bio.seq.*; > import java.io.*; > import org.biojava.bio.symbol.SymbolList; > import org.biojavax.RichObjectFactory; > import javax.xml.stream.events.Namespace; > import org.biojavax.bio.seq.RichSequence; > > public class SeqFastaF { > public static void main(String[] args) { > SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1"); > Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0"); > try { > OutputStream seqOut = System.out; > Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace(); > RichSequence.IOTools.writeFasta(seqOut,s1,ns); > } catch (IOException ex) { > //io error > ex.printStackTrace(); > } > } > } > > Error: > cannot find symbol > symbol : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace) > location: class org.biojavax.bio.seq.RichSequence.IOTools > > > > --------------------------------- > Be smarter than spam. See how smart SpamGuard is at giving junk email the boot with the All-new Yahoo! Mail > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From su24 at st-andrews.ac.uk Thu Oct 4 10:43:23 2007 From: su24 at st-andrews.ac.uk (Saif Ur-Rehman) Date: Thu, 4 Oct 2007 15:43:23 +0100 Subject: [Biojava-l] WriteFasta Message-ID: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> Dear All, I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently trying to break up Fasta Files of whole organisms into one file per gene for further analysis. However the writeFasta method appears to append the characters "?? 
------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From holland at ebi.ac.uk Thu Oct 4 11:23:10 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Thu, 04 Oct 2007 16:23:10 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
Message-ID: <4705055E.5070401@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SeqIOTools is deprecated.

Try RichSequence.IOTools.writeFasta() instead to see if that helps.

e.g.:

  RichSequence.IOTools.writeFasta(
     System.out,
     seq,
     RichObjectFactory.getDefaultNamespace()
  );

where seq is either a Sequence or a SequenceIterator.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear All,
>
> I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
> trying to break up Fasta Files of whole organisms into one file per gene for
> further analysis. However the writeFasta method appears to append the
> characters
> "??
>
> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp
C4xPs/2ywAMfIPDmUKPCrqg=
=TwwH
-----END PGP SIGNATURE-----

From su24 at st-andrews.ac.uk Thu Oct 4 11:23:52 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu, 4 Oct 2007 16:23:52 +0100
Subject: [Biojava-l] (no subject)
Message-ID: <1191511432.4705058825b79@webmail.st-andrews.ac.uk>

Dear All,

I'm sorry the use of the characters seems to have truncated the previous
email I sent.
To complete my question I was just wondering as to possible causes for this
addition of random characters and if there was a way to stop it from occurring.

Thanking you again

Saif

-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From su24 at st-andrews.ac.uk Fri Oct 5 06:06:25 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri, 5 Oct 2007 11:06:25 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <4705055E.5070401@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk>
Message-ID: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>

Dear Richard,

I have tried the RichSequence.IOTools.writeFasta method and this method is
still appending the characters "??" to the front of each write. I am using a
FileOutputStream and a Sequence object as inputs to the method, like so:

Sequence seq; // read in from File
FileOutputStream f = new FileOutputStream(fileName);

try {
    RichSequence.IOTools.writeFasta(f,
                                    seq,
                                    RichObjectFactory.getDefaultNamespace()
    );
}

Thanks a lot for your time

Sincerely,

Saif

Quoting Richard Holland :

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> SeqIOTools is deprecated.
>
> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
>
> e.g.:
>
>   RichSequence.IOTools.writeFasta(
>      System.out,
>      seq,
>      RichObjectFactory.getDefaultNamespace()
>   );
>
> where seq is either a Sequence or a SequenceIterator.
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
> > Dear All,
> >
> > I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
> > trying to break up Fasta Files of whole organisms into one file per gene for
> > further analysis.
However the writeFasta method appears to append the > > characters > > "?? > > > > ------------------------------------------------------------------ > > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp > C4xPs/2ywAMfIPDmUKPCrqg= > =TwwH > -----END PGP SIGNATURE----- > ------------------------------------------------------------------------------- Saif Ur-Rehman Research Student The Centre for Evolution, Genes & Genomics (CEGG) Dyers Brae School of Biology The University of St Andrews St Andrews, Fife Scotland,UK ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From holland at ebi.ac.uk Fri Oct 5 06:13:36 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Fri, 05 Oct 2007 11:13:36 +0100 Subject: [Biojava-l] WriteFasta In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk> <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> Message-ID: <47060E50.2070405@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Where are the input sequences coming from? i.e. what method are you using to construct them or read them from a file. Also, what do you mean by the 'front' of each write? Could you send me an example of an entire FASTA file containing the problem? (It'd be best to attach the file to an email to me personally as this list will not accept attachments, and copying-and-pasting from a text editor to an email client may obscure the underlying problem). 
It'd be good also to see your entire code from the point the sequences
are read or created to the point where they are written out. Or, a
sample program which exhibits the same behaviour would suffice.

I suspect that the sequences themselves contain the incorrect data,
although technically this should be impossible as the sequence alphabet
should prevent it.

We recently had an issue reported here regarding BioJava not being able
to do certain sequence tasks on platforms using non-Western-European
character mappings. If your machine is running such a mapping, try it
again on a machine with an English or other Western European language
set up by default. If it works there but not on your machine, then
this'll be the same problem. (There is no solution yet, but at least
you'll know what's wrong).

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
>
> I have tried the RichSEquence.IOTools.writeFasta method and this method is still
> appending the characters "??" to the front of each write. I am using a
> FileOutputStream and a Sequence object as inputs to the method. like so.
>
>
> Sequence seq; // read in from File
> FileOutputStream f =new FileOutputStream (fileName);
>
>
> try{
>
> RichSequence.IOTools.writeFasta(f,
>                                 seq,
>                                 RichObjectFactory.getDefaultNamespace()
> );
>
>
> }
>
>
> Thanks a lot for your time
>
> Sincerely,
>
> Saif
>
> Quoting Richard Holland :
>
> SeqIOTools is deprecated.
>
> Try RichSequence.IOTools.writeFasta() instead to see if that helps.
>
> e.g.:
>
>   RichSequence.IOTools.writeFasta(
>      System.out,
>      seq,
>      RichObjectFactory.getDefaultNamespace()
>   );
>
> where seq is either a Sequence or a SequenceIterator.
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
>>>> Dear All,
>>>>
>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently
>>>> trying to break up Fasta Files of whole organisms into one file per gene for
>>>> further analysis. However the writeFasta method appears to append the
>>>> characters
>>>> "??
>>>> >>>> ------------------------------------------------------------------ >>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk >>>> >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >> > ------------------------------------------------------------------------------- > Saif Ur-Rehman > Research Student > The Centre for Evolution, Genes & Genomics (CEGG) > Dyers Brae > School of Biology > The University of St Andrews > St Andrews, > Fife > Scotland,UK > ------------------------------------------------------------------ > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHBg5Q4C5LeMEKA/QRAlKlAKCKXrMfJI2W4Ir7Us5P9bj3KmEY1ACgo89L WgUPFCLGUNSUZxO8h3Ltqlw= =Jq7X -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Fri Oct 5 06:16:02 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Fri, 05 Oct 2007 11:16:02 +0100 Subject: [Biojava-l] WriteFasta In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk> <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> Message-ID: <47060EE2.2000909@ebi.ac.uk> Is it possible for you to send us the code which you're trying to run & the sequence you are trying to write out. If it is sent to us in a manner we can drop it into an IDE & run that would help us a lot. Thanks, Andy Yates Saif Ur-Rehman wrote: > Dear Richard, > > I have tried the RichSEquence.IOTools.writeFasta method and this method is still > appending the characters "??" to the front of each write. I am using a > FileOutputStream and a Sequence object as inputs to the method. like so. 
_______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From holland at ebi.ac.uk Fri Oct 5 08:10:58 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 13:10:58 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191584372.4706227437594@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk> <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> <47060E50.2070405@ebi.ac.uk> <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk> <47061FDD.1070806@ebi.ac.uk> <1191584372.4706227437594@webmail.st-andrews.ac.uk>
Message-ID: <470629D2.6020709@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Great, thanks.

The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.

This is a hex dump of the file:

00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|

The four extra characters are hex #ac #ed #00
#05 and these are showing as question marks in your text editor because
that's how text editors handle unprintable characters.

Does anyone recognise these characters? There is no code in BioJava
which writes anything like this; in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?

I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:

  System.getProperty("file.encoding");

to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.

We've had internationalisation problems before with BioJava. Hopefully
these will be addressed in future development, but there is no current
activity in that area due to lack of resources. In the meantime the best
workaround is to set every setting you can find to a Western European
character set/character mapping and UTF-8 file encoding, in the hope
that it will all match up nicely and work.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
>
> The input file is just the entire set of RefSeq proteins for Arabidopsis thaliana
> and is too large for me to send as an attachment. But it can be downloaded from
> NCBI using the query "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
>
> Cheers,
>
> Saif
>
>
>
> Quoting Richard Holland :
>
> Interesting. Could you send your input file as well?
>
> cheers,
> Richard
>
> Saif Ur-Rehman wrote:
>>>> Dear Richard,
>>>>
>>>> The sequences are being read by SeqIO.readFasta. The code from read to write is
>>>> as follows.
Essentially the program wants to read in a fasta file containing
>>>> all the protein sequences in a given organism and split them up into one file
>>>> per protein.
>>>>
>>>>
>>>> BufferedReader br=null;
>>>> try
>>>> {
>>>>     br = new BufferedReader(new FileReader(filename));
>>>> }
>>>> catch (FileNotFoundException e1)
>>>> {
>>>>
>>>>     e1.printStackTrace();
>>>> }
>>>>
>>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br);
>>>> while (stream.hasNext())
>>>> {
>>>>     try
>>>>     {
>>>>         Sequence seq = stream.nextSequence();
>>>>         File scriptFile1= new File("///Users/Saif/Organisms/RunTemp/"+name
>>>>             +"/"+seq.getName());
>>>>
>>>>         try
>>>>         {
>>>>             scriptFile1.createNewFile();
>>>>         }
>>>>         catch (IOException e1)
>>>>         {
>>>>
>>>>             e1.printStackTrace();
>>>>         }
>>>>
>>>>         try
>>>>         {
>>>>             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
>>>>             BufferedWriter out = new BufferedWriter(fstream);
>>>>
>>>>             FileOutputStream f =new FileOutputStream (scriptFile1);
>>>>
>>>>             RichSequence rs=RichSequence.Tools.enrich(seq);
>>>>
>>>>
>>>>             try{
>>>>
>>>>
>>>>                 RichSequence.IOTools.writeFasta(
>>>>                     f,
>>>>                     rs,
>>>>                     RichObjectFactory.getDefaultNamespace()
>>>>                 );
>>>>
>>>>
>>>>             }
>>>>
>>>>             catch (IOException ioe){}
>>>>
>>>> An example of an outputted fasta file from this code is attached.
>>>>
>>>>
>>>>
>>>> Thanks a lot for your time.
>>>>
>>>> Saif
>>>>
>>>>
>>>> Quoting Richard Holland :
>>>>
>>>> Where are the input sequences coming from? i.e. what method are you
>>>> using to construct them or read them from a file.
>>>>
>>>> Also, what do you mean by the 'front' of each write? Could you send me
>>>> an example of an entire FASTA file containing the problem? (It'd be best
>>>> to attach the file to an email to me personally as this list will not
>>>> accept attachments, and copying-and-pasting from a text editor to an
>>>> email client may obscure the underlying problem).
-------------------------------------------------------------------------------
> Saif Ur-Rehman
> Research Student
> The Centre for Evolution, Genes & Genomics (CEGG)
> Dyers Brae
> School of Biology
> The University of St Andrews
> St Andrews,
> Fife
> Scotland,UK
> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
pxAPAybISoRQgbvQ1wyzqVg=
=MS7P
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Fri Oct 5 08:28:43 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 13:28:43 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk> <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> <47060E50.2070405@ebi.ac.uk> <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk> <47061FDD.1070806@ebi.ac.uk> <1191584372.4706227437594@webmail.st-andrews.ac.uk> <470629D2.6020709@ebi.ac.uk>
Message-ID: <47062DFB.6040201@ebi.ac.uk>

I've done a quick search & it seems as if U+ACED is a Korean (Hangul)
character & the other is just a blank. Something is getting confused quite
badly here.

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Great, thanks.
>
> The initial analysis shows that the text file generated contains four
> extra characters at the beginning of the file, and is using '\n' as the
> line separator.
>
> This is a hex dump of the file:
>
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first '>' symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8.
If it is, then chances > are that this is the problem. > > We've had internationalisation problems before with BioJava. Hopefully > these will be addressed in future development, but there is no current > activity in that area due to lack of resources. In the meantime the best > workaround is to set every setting you can find to a Western European > character set/character mapping and UTF-8 file encoding, in the hope > that it will all match up nicely and work. > > cheers, > Richard > > From su24 at st-andrews.ac.uk Fri Oct 5 09:44:29 2007 From: su24 at st-andrews.ac.uk (Saif Ur-Rehman) Date: Fri, 5 Oct 2007 14:44:29 +0100 Subject: [Biojava-l] WriteFasta In-Reply-To: <470629D2.6020709@ebi.ac.uk> References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk> <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk> <47060E50.2070405@ebi.ac.uk> <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk> <47061FDD.1070806@ebi.ac.uk> <1191584372.4706227437594@webmail.st-andrews.ac.uk> <470629D2.6020709@ebi.ac.uk> Message-ID: <1191591869.47063fbd22461@webmail.st-andrews.ac.uk> Setting the System properties solved the problem. Thanks a lot, Saif Quoting Richard Holland : > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Great, thanks. > > The initial analysis shows that the text file generated contains four > extra characters at the beginning of the file, and is using '\n' as the > line separator. 
>
> This is a hex dump of the file:
>
> 00000000  ac ed 00 05 3e 67 69 7c  31 38 33 39 38 33 39 30  |....>gi|18398390|
> 00000010  7c 6c 63 6c 7c 4e 50 5f  35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
> 00000020  7c 4e 50 5f 35 36 35 34  31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
> 00000030  77 6e 20 70 72 6f 74 65  69 6e 20 5b 41 72 61 62  |wn protein [Arab|
> 00000040  69 64 6f 70 73 69 73 20  74 68 61 6c 69 61 6e 61  |idopsis thaliana|
> 00000050  5d 0a 4d 53 4c 52 49 4b  4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
> 00000060  45 4c 4b 51 41 4c 44 41  44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
> 00000070  45 52 45 4d 51 53 59 49  58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
> 00000080  58 58 58 58 58 57 4b 41  45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
> 00000090  49 41 52 51 45 41 52 4c  4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
> 000000a0  4b 45 0a 4b 53 56 4c 4d  47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
> 000000b0  51 44 47 41 4c 45 49 54  56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
> 000000c0  4c 52 46 53 4b 41 4b 4b  0a                       |LRFSKAKK.|
>
>
> The four extra characters are hex #ac #ed #00 #05 and these are showing
> as question marks in your text editor because that's how text editors
> handle unprintable characters.
>
> Does anyone recognise these characters? There is no code in BioJava
> which writes anything like this, in fact there is no output code at all
> before the initial write of the first '>' symbol in the file. Something
> tells me that these symbols are being inserted by the VM or the OS
> somewhere under the hood, possibly due to internationalisation?
>
> I strongly suspect this is an internationalisation problem. It seems
> probable that Java has been set up on your system to use a language or
> character encoding that causes Java by default to write these extra
> characters at the start of files to indicate the encoding. Check the
> output of:
>
> System.getProperty("file.encoding");
>
> to see if it is using something other than UTF-8.
If it is, then chances > are that this is the problem. > > We've had internationalisation problems before with BioJava. Hopefully > these will be addressed in future development, but there is no current > activity in that area due to lack of resources. In the meantime the best > workaround is to set every setting you can find to a Western European > character set/character mapping and UTF-8 file encoding, in the hope > that it will all match up nicely and work. > > cheers, > Richard > > Saif Ur-Rehman wrote: > > Dear Richard, > > > > The input file is just the entire set of RefSeq proteins for Arabdopsis > thaliana > > and is too large for me to send as an attachment. But it can be downloaded > from > > NCBI using the query "Arabdopsis thaliana [orgn] srcdb_refseq[prop]". > > > > Cheers, > > > > Saif > > > > > > > > Quoting Richard Holland : > > > > Interesting. Could you send your input file as well? > > > > cheers, > > Richard > > > > Saif Ur-Rehman wrote: > >>>> Dear Richard, > >>>> > >>>> The sequences are being read by SeqIO.readFasta. The code from read to > > write is > >>>> as follows. Essentially the program wants to read in a fasta file > > containing > >>>> all the protein sequences in a given organism and split them up into one > > file > >>>> per protein. 
> >>>> > >>>> > >>>> BufferedReader br=null; > >>>> try > >>>> { > >>>> br = new BufferedReader(new FileReader(filename)); > >>>> } > >>>> catch (FileNotFoundException e1) > >>>> { > >>>> > >>>> e1.printStackTrace(); > >>>> } > >>>> > >>>> SequenceIterator stream = SeqIOTools.readFastaProtein(br); > >>>> while (stream.hasNext()) > >>>> { > >>>> try > >>>> { > >>>> Sequence seq = stream.nextSequence(); > >>>> File scriptFile1= new > > File("///Users/Saif/Organisms/RunTemp/"+name > >>>> +"/"+seq.getName()); > >>>> > >>>> try > >>>> { > >>>> scriptFile1.createNewFile(); > >>>> } > >>>> catch (IOException e1) > >>>> { > >>>> > >>>> e1.printStackTrace(); > >>>> } > >>>> > >>>> try > >>>> { > >>>> FileWriter fstream = new > > FileWriter(scriptFile1.getAbsolutePath()); > >>>> BufferedWriter out = new BufferedWriter(fstream); > >>>> > >>>> FileOutputStream f =new FileOutputStream (scriptFile1); > >>>> > >>>> RichSequence rs=RichSequence.Tools.enrich(seq); > >>>> > >>>> > >>>> try{ > >>>> > >>>> > >>>> RichSequence.IOTools.writeFasta( > >>>> f, > >>>> rs, > >>>> RichObjectFactory.getDefaultNamespace() > >>>> ); > >>>> > >>>> > >>>> } > >>>> > >>>> catch (IOException ioe){} > >>>> > >>>> An example of an outputted fasta file from this code is attached. > >>>> > >>>> > >>>> > >>>> Thanks a lot for your time. > >>>> > >>>> Saif > >>>> > >>>> > >>>> Quoting Richard Holland : > >>>> > >>>> Where are the input sequences coming from? i.e. what method are you > >>>> using to construct them or read them from a file. > >>>> > >>>> Also, what do you mean by the 'front' of each write? Could you send me > >>>> an example of an entire FASTA file containing the problem? (It'd be best > >>>> to attach the file to an email to me personally as this list will not > >>>> accept attachments, and copying-and-pasting from a text editor to an > >>>> email client may obscure the underlying problem). 
> >>>> > >>>> It'd be good also to see your entire code from the point the sequences > >>>> are read or created to the point where they are written out. Or, a > >>>> sample program which exhibits the same behaviour would suffice. > >>>> > >>>> I suspect that the sequences themselves contain the incorrect data, > >>>> although technically this should be impossible as the sequence alphabet > >>>> should prevent it. > >>>> > >>>> We recently had an issue reported here regarding BioJava not being able > >>>> to do certain sequence tasks on platforms using non-Western-European > >>>> character mappings. If your machine is running such a mapping, try it > >>>> again on a machine with an English or other Western European language > >>>> set up by default. If it works there but not on your machine, then > >>>> this'll be the same problem. (There is no solution yet, but at least > >>>> you'll know what's wrong). > >>>> > >>>> cheers, > >>>> Richard > >>>> > >>>> Saif Ur-Rehman wrote: > >>>>>>> Dear Richard, > >>>>>>> > >>>>>>> I have tried the RichSEquence.IOTools.writeFasta method and this > method > > is > >>>> still > >>>>>>> appending the characters "??" to the front of each write. I am using > a > >>>>>>> FileOutputStream and a Sequence object as inputs to the method. like > so. > >>>>>>> > >>>>>>> > >>>>>>> Sequence seq; // read in from File > >>>>>>> FileOutputStream f =new FileOutputStream (fileName); > >>>>>>> > >>>>>>> > >>>>>>> try{ > >>>>>>> > >>>>>>> RichSequence.IOTools.writeFasta(f, > >>>>>>> seq, > >>>>>>> RichObjectFactory.getDefaultNamespace() > >>>>>>> ); > >>>>>>> > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> > >>>>>>> Thanks a lot for your time > >>>>>>> > >>>>>>> Sincerely, > >>>>>>> > >>>>>>> Saif > >>>>>>> > >>>>>>> Quoting Richard Holland : > >>>>>>> > >>>>>>> SeqIOTools is deprecated. > >>>>>>> > >>>>>>> Try RichSequence.IOTools.writeFasta() instead to see if that helps. 
> >>>>>>> > >>>>>>> e.g.: > >>>>>>> > >>>>>>> RichSequence.IOTools.writeFasta( > >>>>>>> System.out, > >>>>>>> seq, > >>>>>>> RichObjectFactory.getDefaultNamespace() > >>>>>>> ); > >>>>>>> > >>>>>>> where seq is either a Sequence or a SequenceIterator. > >>>>>>> > >>>>>>> cheers, > >>>>>>> Richard > >>>>>>> > >>>>>>> Saif Ur-Rehman wrote: > >>>>>>>>>> Dear All, > >>>>>>>>>> > >>>>>>>>>> I was writing to ask about the SeqIOTools.writeFasta() Method. I > am > >>>>>>> currently > >>>>>>>>>> trying to break up Fasta Files of whole organisms into one file > per > > gene > >>>>>>> for > >>>>>>>>>> further analysis. However the writeFasta method appears to append > the > >>>>>>>>>> characters > >>>>>>>>>> "?? > >>>>>>>>>> > >>>>>>>>>> ------------------------------------------------------------------ > >>>>>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>>>>>> > >> > ------------------------------------------------------------------------------- > >>>>>>> Saif Ur-Rehman > >>>>>>> Research Student > >>>>>>> The Centre for Evolution, Genes & Genomics (CEGG) > >>>>>>> Dyers Brae > >>>>>>> School of Biology > >>>>>>> The University of St Andrews > >>>>>>> St Andrews, > >>>>>>> Fife > >>>>>>> Scotland,UK > >>>>>>> ------------------------------------------------------------------ > >>>>>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > >> > ------------------------------------------------------------------------------- > >>>> Saif Ur-Rehman > >>>> Research Student > >>>> The Centre for Evolution, Genes & Genomics (CEGG) > >>>> Dyers Brae > >>>> School of Biology > >>>> The University of St Andrews > >>>> St Andrews, > >>>> Fife > >>>> Scotland,UK > >>>> 
------------------------------------------------------------------ > >>>> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > >> > > > > ------------------------------------------------------------------------------- > > Saif Ur-Rehman > > Research Student > > The Centre for Evolution, Genes & Genomics (CEGG) > > Dyers Brae > > School of Biology > > The University of St Andrews > > St Andrews, > > Fife > > Scotland,UK > > > ------------------------------------------------------------------ > > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb > pxAPAybISoRQgbvQ1wyzqVg= > =MS7P > -----END PGP SIGNATURE----- > ------------------------------------------------------------------------------- Saif Ur-Rehman Research Student The Centre for Evolution, Genes & Genomics (CEGG) Dyers Brae School of Biology The University of St Andrews St Andrews, Fife Scotland,UK ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From sanbiogene at yahoo.co.in Sat Oct 6 05:23:11 2007 From: sanbiogene at yahoo.co.in (sandeep telkar) Date: Sat, 6 Oct 2007 10:23:11 +0100 (BST) Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM Message-ID: <121992.19693.qm@web94408.mail.in2.yahoo.com> Dear friends, Sandeep here... I wanna learn biojava n now i am beginner.but from where to download its exe installation file as like that of JDK6 fron sun website.... please suggest me any thing other than the following url: http://biojava.org/wiki/BioJava:Download N plese tell in which directory i have to save the program..... I am not getting any clear idea .. please help me.. - Sandeep Sandeep Telkar, M.Sc Bioinformatics. Meet people who discuss and share your passions. 
Go to http://in.promos.yahoo.com/groups From su24 at st-andrews.ac.uk Sat Oct 6 14:04:28 2007 From: su24 at st-andrews.ac.uk (Saif Ur-Rehman) Date: Sat, 6 Oct 2007 19:04:28 +0100 Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM In-Reply-To: <121992.19693.qm@web94408.mail.in2.yahoo.com> References: <121992.19693.qm@web94408.mail.in2.yahoo.com> Message-ID: <1191693868.4707ce2caae97@webmail.st-andrews.ac.uk> Hi, You need to download the Jar files from http://biojava.org/wiki/BioJava:Download. You can then use the File biojava-1.5.jar. Just include it in the buildpath as an external JAR if you're using an IDE like Netbeans or Eclipse or your class path if working from the command line. You can then import the BioJava classes and use them. Hope that helps Cheers, Saif Quoting sandeep telkar : > Dear friends, > Sandeep here... > I wanna learn biojava n now i am beginner.but > from where to download its exe installation file as > like that of JDK6 fron sun website.... > > please suggest me any thing other than the following > url: > http://biojava.org/wiki/BioJava:Download > > N plese tell in which directory i have to save the > program..... > I am not getting any clear idea .. > > please help me.. > - Sandeep > > Sandeep Telkar, > M.Sc Bioinformatics. > > > > Meet people who discuss and share your passions. 
Go to > http://in.promos.yahoo.com/groups > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > ------------------------------------------------------------------------------- Saif Ur-Rehman Research Student The Centre for Evolution, Genes & Genomics (CEGG) Dyers Brae School of Biology The University of St Andrews St Andrews, Fife Scotland,UK ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From vineith at gmail.com Wed Oct 10 00:44:22 2007 From: vineith at gmail.com (vineith kaul) Date: Wed, 10 Oct 2007 00:44:22 -0400 Subject: [Biojava-l] case-sensitive sequences Message-ID: Hi, I want to read in a sequence which has case sensitive alphabets(nucleotides).Basically I want to replace only small 'a,g,t,c' with blanks .Although I saw a similar post earlier but couldn't understand much.Can someone help me with this ? -- Vineith Kaul Masters Student Bioinformatics The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) Georgia Tech, Atlanta From holland at ebi.ac.uk Wed Oct 10 04:06:16 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 10 Oct 2007 09:06:16 +0100 Subject: [Biojava-l] case-sensitive sequences In-Reply-To: References: Message-ID: <470C87F8.8020502@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 You can use SoftMaskedAlphabet with the BioJavaX parsers to get the desired effect. By default, a soft masked character is one in lower case. The code below will detect these. If you have other search criteria you can modify the soft masked detection criteria to match this instead. To do that, add a second parameter to the call to SoftMaskedAlphabet.getInstance() and use it to pass in an instance of SoftMaskedAlphabet.MaskingDetector (see the JavaDocs to see how this should work). Hope this helps! : // Set up a soft-masked alphabet. 
SoftMaskedAlphabet sma = SoftMaskedAlphabet.getInstance(DNATools.getDNA());
SymbolTokenization stok = sma.getTokenization("token");

// Set up sequence parsing.
BufferedReader input = ....; // Get your sequences from somewhere
RichSequenceFormat format = new FastaFormat(); // Or Genbank etc.
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.FACTORY; // See Javadocs for alternative factories.
Namespace ns = RichObjectFactory.getDefaultNamespace(); // See Javadocs for alternative namespaces.

// Parse the sequences.
RichStreamReader seqsIn = new RichStreamReader(input, format, stok, factory, ns);

// Find the soft-masked symbols in the sequences.
while (seqsIn.hasNext()) {
    RichSequence seq = seqsIn.nextRichSequence();
    // Iterate over symbols in sequence.
    for (Iterator i = seq.iterator(); i.hasNext(); ) {
        Symbol sym = (Symbol)i.next();
        // Is this symbol masked?
        if (sma.isMasked(sym)) {
            // Yes it is so deal with it.
            .......
        } else {
            // No it isn't, so deal with that instead.
            .......
        }
    }
}

cheers,
Richard

vineith kaul wrote:
> Hi,
>
> I want to read in a sequence which has case sensitive
> alphabets(nucleotides).Basically I want to replace only small
> 'a,g,t,c' with blanks .Although I saw a similar post earlier but
> couldn't understand much.Can someone help me with this ?
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHDIf44C5LeMEKA/QRAmuNAJ426M/UgInqDG5rG6w+F+qoMdVzPQCfZo1S
nAS5v8jSFBX5WCuB5UmzczQ=
=Sicc
-----END PGP SIGNATURE-----

From vineith at gmail.com Sun Oct 14 13:21:45 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 14 Oct 2007 13:21:45 -0400
Subject: [Biojava-l] Java to Perl
Message-ID:

Is there some tool by which we can convert a complete Java Code to a
Perl code ?

--
Vineith Kaul
Masters Student Bioinformatics
The Parker H.
Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From davidfeitosa at gmail.com Sun Oct 14 13:57:47 2007
From: davidfeitosa at gmail.com (David Barbosa Feitosa)
Date: Sun, 14 Oct 2007 14:57:47 -0300
Subject: [Biojava-l] Java to Perl
In-Reply-To:
References:
Message-ID: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>

Vineith

I do not know, but if you need to execute Perl code inside Java code, in
Java 6 (codename Mustang) it is possible to execute script code inside the
Java Virtual Machine.

The default scripting engine is Rhino, for JavaScript, but as it is a
specification, if a Perl engine exists, you can plug it into the JVM and
execute your Perl code.

More info about the available engines and how to install one:

https://scripting.dev.java.net/

Maybe it can help you,

David.

2007/10/14, vineith kaul :
>
> Is there some tool by which we can convert a complete Java Code to a
> Perl code ?
>
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk Mon Oct 15 04:15:33 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Mon, 15 Oct 2007 09:15:33 +0100
Subject: [Biojava-l] Java to Perl
In-Reply-To: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
References: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
Message-ID: <471321A5.5090600@ebi.ac.uk>

Unfortunately, to my knowledge there is no Perl/Java scripting interface.
Apparently for some reason Perl is not trendy enough to warrant a port
(which is a pity).

In response to Vineith's original question, such a tool really wouldn't
work. Good Perl code is very different to good Java code.
If you did get something that would work you'd probably end up with quite verbose & in-efficent Perl code (not to mention the problems that would arise with Perl objects having no access modifiers, using inside-out objects, converting 3rd party libraries etc). Two options do spring to mind if you need code available in both languages: * Make one of the pieces of code a "black box" where you read results from STDOUT (works well enough calling a Java program from Perl). * Write the commmon code in C Out of these two options if you want the code replicated in a 1-1 fashion then C is your only option. Otherwise the first idea is the easiest to work with. As David did mention there are other scripting engines available (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy your scripting needs whilst remaining in a Java environment (Groovy hits that nice sweet spot for a Java inspired scripting language). Andy P.S. This really isn't a Biojava question ... David Barbosa Feitosa wrote: > Vineith > > I do not know, but if you need to execute Pearl code inside Java code, in > Java 6, codename Mustang, is possible to execute script code inside the Java > Virtual Machine. > > The default scripting engine is Rhino, for JavaScript, but as it is a > specification, if exists an Pearl engine, you can plug it into the JVM and > execute your Pearl code. > > Mode infoa bout the available engines and how to install one: > > https://scripting.dev.java.net/ > > Maybe it can help you, > > David. > > 2007/10/14, vineith kaul : >> Is there some tool by which we can convert a complete Java Code to a >> Perl code ? >> >> -- >> Vineith Kaul >> Masters Student Bioinformatics >> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB) >> Georgia Tech, Atlanta >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l From phidias51 at gmail.com Mon Oct 15 10:57:06 2007 From: phidias51 at gmail.com (Mark Fortner) Date: Mon, 15 Oct 2007 07:57:06 -0700 Subject: [Biojava-l] Java to Perl In-Reply-To: <471321A5.5090600@ebi.ac.uk> References: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com> <471321A5.5090600@ebi.ac.uk> Message-ID: <6e1d61f50710150757p6ba25c1ck9466baa5f8273bc2@mail.gmail.com> The original post indicated that they wanted to go from java to perl. Doing a quick Google search yielded a lot of hits for tools going from perl to java. Just out curiosity, was there some reason you wanted to create perl code from Java code? There are a couple of projects which supposedly provide PERL-scripting support inside Java to one extent or another. The first is called Sleep ( http://sleep.hick.org/) which is described as being a PERL-like plugin for the Java 6 scripting engine. There's also a BSF plugin called BSF Perl ( http://bsfperl.sf.net) and another BSF plugin called PerlScript which is part of ActiveState's ActivePerl distribution. I don't have any first-hand experience with any of these, so please don't construe anything I say as an endorsement of these technologies. Although none of these solutions will convert PERL code into Java or vice-versa, they may allow you to run Perl inside a VM. Hope this helps, Mark On 10/15/07, Andy Yates wrote: > > Unfortunately to my knowledge there is no Perl/Java scripting interface. > Apparently for some reason Perl is not trendy enough to warrant a port > (which is a pity). 
> > In response to Vineith's original question such a tool really wouldn't > work. Good Perl code is very different to good Java code. If you did get > something that would work you'd probably end up with quite verbose & > in-efficent Perl code (not to mention the problems that would arise with > Perl objects having no access modifiers, using inside-out objects, > converting 3rd party libraries etc). > > Two options do spring to mind if you need code available in both > languages: > > * Make one of the pieces of code a "black box" where you read results > from STDOUT (works well enough calling a Java program from Perl). > > * Write the commmon code in C > > Out of these two options if you want the code replicated in a 1-1 > fashion then C is your only option. Otherwise the first idea is the > easiest to work with. > > As David did mention there are other scripting engines available > (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy > your scripting needs whilst remaining in a Java environment (Groovy hits > that nice sweet spot for a Java inspired scripting language). > > Andy > > P.S. This really isn't a Biojava question ... > > David Barbosa Feitosa wrote: > > Vineith > > > > I do not know, but if you need to execute Pearl code inside Java code, > in > > Java 6, codename Mustang, is possible to execute script code inside the > Java > > Virtual Machine. > > > > The default scripting engine is Rhino, for JavaScript, but as it is a > > specification, if exists an Pearl engine, you can plug it into the JVM > and > > execute your Pearl code. > > > > Mode infoa bout the available engines and how to install one: > > > > https://scripting.dev.java.net/ > > > > Maybe it can help you, > > > > David. > > > > 2007/10/14, vineith kaul : > >> Is there some tool by which we can convert a complete Java Code to a > >> Perl code ? > >> > >> -- > >> Vineith Kaul > >> Masters Student Bioinformatics > >> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB) > >> Georgia Tech, Atlanta > >> _______________________________________________ > >> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biojava-l > >> > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From vineith at gmail.com Sun Oct 21 12:30:48 2007 From: vineith at gmail.com (vineith kaul) Date: Sun, 21 Oct 2007 12:30:48 -0400 Subject: [Biojava-l] Evolutionary distances Message-ID: Hi, Are there functions to calculate evolutionary pairwise distances like Kimura2P,Finkelstein etc in Biojava I did write smthng on my own but on large sequences it runs terribly slow and I am not even sure if thats right. -- Vineith Kaul Masters Student Bioinformatics The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) Georgia Tech, Atlanta From holland at ebi.ac.uk Mon Oct 22 08:06:57 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Mon, 22 Oct 2007 13:06:57 +0100 (BST) Subject: [Biojava-l] Evolutionary distances In-Reply-To: References: Message-ID: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> You should take a look at the latest 1.5 release, in the org.biojavax.bio.phylo packages. This code is the beginnings of some phylogenetics code that will perform tasks as you describe. The future plan is to extend this code to cover a wider range of use cases. Kimura2P is already implemented here, in org.biojavax.bio.phylo.MultipleHitCorrection. If you can't find code that will do what you want, but have written some before, then please do feel free to contribute it. Even if it is slow, I'm sure someone out there will be able to help optimise it! 
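
For anyone who wants to sanity-check their own numbers without the 1.5 jar to hand, here is a minimal standalone sketch of the Kimura 2-parameter distance, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences between two aligned sequences. It is only an illustration and not the code in org.biojavax.bio.phylo.MultipleHitCorrection: the class name K2PSketch and the method k2p are made up here, and skipping non-ACGT sites and returning infinity for saturated pairs are assumptions of this sketch.

```java
// Hypothetical helper, NOT part of BioJava: a single-pass Kimura 2-parameter distance.
final class K2PSketch {

    private K2PSketch() {}

    private static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    private static boolean isPyrimidine(char c) {
        return c == 'C' || c == 'T';
    }

    /** Returns the K2P distance between two aligned DNA strings of equal length. */
    static double k2p(String seq1, String seq2) {
        if (seq1.length() != seq2.length()) {
            throw new IllegalArgumentException("Sequences must be aligned to the same length");
        }
        long transitions = 0;    // A<->G or C<->T
        long transversions = 0;  // purine<->pyrimidine
        long sites = 0;          // positions where both symbols are A, C, G or T
        for (int i = 0; i < seq1.length(); i++) {
            char a = Character.toUpperCase(seq1.charAt(i));
            char b = Character.toUpperCase(seq2.charAt(i));
            boolean aOk = isPurine(a) || isPyrimidine(a);
            boolean bOk = isPurine(b) || isPyrimidine(b);
            if (!aOk || !bOk) {
                continue; // assumption: gaps and ambiguity codes are ignored
            }
            sites++;
            if (a == b) {
                continue;
            }
            if (isPurine(a) == isPurine(b)) {
                transitions++;
            } else {
                transversions++;
            }
        }
        if (sites == 0) {
            return Double.NaN; // nothing to compare
        }
        double p = (double) transitions / sites;
        double q = (double) transversions / sites;
        double w1 = 1.0 - 2.0 * p - q;
        double w2 = 1.0 - 2.0 * q;
        if (w1 <= 0.0 || w2 <= 0.0) {
            return Double.POSITIVE_INFINITY; // assumption: saturated pairs reported as infinite
        }
        double d = -0.5 * Math.log(w1) - 0.25 * Math.log(w2);
        return d == 0.0 ? 0.0 : d; // normalise -0.0 to 0.0
    }

    public static void main(String[] args) {
        System.out.println(k2p("AAAA", "GAAA"));
    }
}
```

The loop touches each aligned position once and works on primitive chars, so it should stay linear in the sequence length even for large inputs.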
cheers, Richard On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > Hi, > > Are there functions to calculate evolutionary pairwise distances like > Kimura2P,Finkelstein etc in Biojava > I did write smthng on my own but on large sequences it runs terribly > slow and I am not even sure if thats right. > -- > Vineith Kaul > Masters Student Bioinformatics > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > Georgia Tech, Atlanta > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- Richard Holland BioMart (http://www.biomart.org/) EMBL-EBI Hinxton, Cambridgeshire CB10 1SD, UK From vineith at gmail.com Tue Oct 23 02:59:29 2007 From: vineith at gmail.com (vineith kaul) Date: Tue, 23 Oct 2007 02:59:29 -0400 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> Message-ID: This is what I have .....Thanks a lot fr the help. //Method to calculate the Kimura 2 parameter distance public static double K2P(String sequence1,String sequence2){ long p=0,q=0,numberOfAlignedSites=0; // P= transitional differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) char[] seq1array=sequence1.toCharArray(); char[] seq2array=sequence2.toCharArray(); for(int i=0;i wrote: > > You should take a look at the latest 1.5 release, in the > org.biojavax.bio.phylo packages. This code is the beginnings of some > phylogenetics code that will perform tasks as you describe. The future > plan is to extend this code to cover a wider range of use cases. Kimura2P > is already implemented here, in > org.biojavax.bio.phylo.MultipleHitCorrection. > > If you can't find code that will do what you want, but have written some > before, then please do feel free to contribute it. 
Even if it is slow, I'm > sure someone out there will be able to help optimise it! > > cheers, > Richard > > On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > > Hi, > > > > Are there functions to calculate evolutionary pairwise distances like > > Kimura2P,Finkelstein etc in Biojava > > I did write smthng on my own but on large sequences it runs terribly > > slow and I am not even sure if thats right. > > -- > > Vineith Kaul > > Masters Student Bioinformatics > > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > > Georgia Tech, Atlanta > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > -- > Richard Holland > BioMart (http://www.biomart.org/) > EMBL-EBI > Hinxton, Cambridgeshire CB10 1SD, UK > > -- Vineith Kaul Masters Student Bioinformatics The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) Georgia Tech, Atlanta From ozgur7 at gmail.com Tue Oct 23 14:17:29 2007 From: ozgur7 at gmail.com (Ozgur Ozturk) Date: Tue, 23 Oct 2007 11:17:29 -0700 Subject: [Biojava-l] problem with CookBook:Blast:Parser Message-ID: Hi, I am receiving the following error when I use BlastParser code from the cookbook : org.xml.sax.SAXException: Could not recognise the format of this file as one supported by the framework. 
at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182) at org.arabidopsis.test.BlastParser.main(BlastParser.java:44) I generated the XML file using this command: blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 > tempresult.xml Then I pass it to BlastParser: BlastParser tempresult.xml Thanks for your help in advance, -- Best regards, Ozgur (Oscar) Ozturk, http://www.cse.ohio-state.edu/~ozturk/ Mobile Phone: (614) 805-4370 From ozgur7 at gmail.com Tue Oct 23 16:24:49 2007 From: ozgur7 at gmail.com (Ozgur Ozturk) Date: Tue, 23 Oct 2007 13:24:49 -0700 Subject: [Biojava-l] Problem Solved Re: problem with CookBook:Blast:Parser Message-ID: Hi, Another piece of code in the demos ( BioJava/biojava-1.5/demos/blastxml ) could handle my XML file. I guess the problem is solved. Thanks. (But if the BlastParser code from the cookbook is deprecated, you may want to update it.) Best regards, Ozgur (Oscar) Ozturk, http://www.cse.ohio-state.edu/~ozturk/ Mobile Phone: (614) 805-4370 On 10/23/07, Ozgur Ozturk wrote: > > Hi, > I am receiving the following error when I use BlastParser code from the > cookbook : > > org.xml.sax.SAXException: Could not recognise the format of this file as > one supported by the framework.
> at org.biojava.bio.program.sax.BlastLikeSAXParser.parse( > BlastLikeSAXParser.java:182) > at org.arabidopsis.test.BlastParser.main(BlastParser.java:44) > > I have generated the xml file using this command: > blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 > > tempresult.xml > > Then pass it to BlastParser: > BlastParser tempresult.xml > > Thanks for your help in advance, > -- > Best regards, > Ozgur (Oscar) Ozturk, > http://www.cse.ohio-state.edu/~ozturk/ > Mobile Phone: (614) 805-4370 From holland at ebi.ac.uk Wed Oct 24 03:52:24 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 24 Oct 2007 08:52:24 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> Message-ID: <471EF9B8.7020609@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Thanks. Your code is similar to the code we have in org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to see if it is identical, but it probably is.

You can call our code like this:

// import statement for biojava phylo stuff
import org.biojavax.bio.phylo.*;

// ...rest of code goes here

// call Kimura2P
String seq1 = ...; // Get seq1 and seq2 from somewhere
String seq2 = ...;
double result = MultipleHitCorrection.Kimura2P(seq1, seq2);

Note that our implementation expects sequence strings to be in upper case, so you'll need to make sure your data is upper case or has been converted to upper case before calling our method. cheers, Richard vineith kaul wrote: > This is what I have .....Thanks a lot fr the help.
>
> //Method to calculate the Kimura 2 parameter distance
> public static double K2P(String sequence1,String sequence2){
>     // p = transitional differences (A<->G & T<->C); q = transversional differences (A/G<-->C/T)
>     long p=0,q=0,numberOfAlignedSites=0;
>
>     char[] seq1array=sequence1.toCharArray();
>     char[] seq2array=sequence2.toCharArray();
>
>     // NOTE: the loop bound was lost in the archived copy; seq1array.length is assumed here
>     for(int i=0;i<seq1array.length;i++){
>         // Number of aligned sites
>         if(((seq1array[i]=='a') || (seq1array[i]=='A')||(seq1array[i]=='g') || (seq1array[i]=='G')
>                 ||(seq1array[i]=='c') || (seq1array[i]=='C') || (seq1array[i]=='t') || (seq1array[i]=='T'))
>             && ((seq2array[i]=='a') || (seq2array[i]=='A')||(seq2array[i]=='c') || (seq2array[i]=='C')
>                 ||(seq2array[i]=='t') || (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>             numberOfAlignedSites++;
>         }
>
>         if(((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>             p++;
>         }
>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>             p++;
>         }
>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>             p++;
>         }
>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>             p++;
>         }
>         else if(((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>             q++;
>         }
>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>             q++;
>         }
>     }
>
>     double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - (((double)q)/numberOfAlignedSites);
>     double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
>     System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
>     double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
>     return dist;
> }
>
> On 10/22/07, Richard Holland wrote: > > You should take a look at the latest 1.5 release, in the > org.biojavax.bio.phylo packages. This code is the beginnings of some > phylogenetics code that will perform tasks as you describe. The future > plan is to extend this code to cover a wider range of use cases. > Kimura2P > is already implemented here, in > org.biojavax.bio.phylo.MultipleHitCorrection. > > If you can't find code that will do what you want, but have written some > before, then please do feel free to contribute it. Even if it is > slow, I'm > sure someone out there will be able to help optimise it! > > cheers, > Richard > > On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > > Hi, > > > > Are there functions to calculate evolutionary pairwise distances like > > Kimura2P,Finkelstein etc in Biojava > > I did write smthng on my own but on large sequences it runs terribly > > slow and I am not even sure if thats right. > > -- > > Vineith Kaul > > Masters Student Bioinformatics > > The Parker H.
Petit Institute for Bioengineering and Bioscience (IBB) > > Georgia Tech, Atlanta > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > -- > Richard Holland > BioMart ( http://www.biomart.org/) > EMBL-EBI > Hinxton, Cambridgeshire CB10 1SD, UK > > > > > -- > Vineith Kaul > Masters Student Bioinformatics > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > Georgia Tech, Atlanta -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa 4iKvsyBj2uznhhjTF9EYDFE= =LALE -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Wed Oct 24 04:09:13 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 24 Oct 2007 09:09:13 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <471EF9B8.7020609@ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> Message-ID: <471EFDA9.1090706@ebi.ac.uk> Our code is very similar but not identical. The original programmer short-circuited a lot of the else-if conditions by first checking whether the two bases are equal. It can then count the transitional changes and assume the rest are transversional. In terms of speed, I can't see an obvious way to make either piece of code faster. In our code, replacing the 10 or so calls to String.charAt() with two calls and reusing those chars might help, but in all honesty I cannot say. Andy Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Thanks. > > Your code is similar to the code we have in > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > see if it is identical, but it probably is.
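The shortcut Andy describes (skip equal bases, count transitions, treat every other difference as a transversion, and read each position with only two charAt() calls) could look roughly like the sketch below. This is an illustration only, not the BioJava implementation: the class and method names (K2PSketch, k2p) are made up, and unlike the BioJava method discussed in this thread, the sketch upper-cases each base itself and skips any site where either character is not A, C, G or T.

```java
final class K2PSketch {

    // Purines are A and G; the remaining bases (C and T) are pyrimidines.
    private static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    private static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    /** Kimura 2-parameter distance between two aligned sequences of equal length. */
    static double k2p(String s1, String s2) {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("sequences must be aligned to the same length");
        }
        long transitions = 0, transversions = 0, sites = 0;
        for (int i = 0; i < s1.length(); i++) {
            // Exactly two charAt() calls per position; the chars are reused below.
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            if (!isBase(a) || !isBase(b)) {
                continue; // assumption: gaps and ambiguity codes are ignored
            }
            sites++;
            if (a == b) {
                continue; // identical bases contribute no difference
            }
            if (isPurine(a) == isPurine(b)) {
                transitions++;   // A<->G or C<->T
            } else {
                transversions++; // every other difference
            }
        }
        double p = (double) transitions / sites;   // proportion of transitions
        double q = (double) transversions / sites; // proportion of transversions
        // d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q); NaN if there are no comparable sites
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }

    public static void main(String[] args) {
        System.out.println(k2p("ACGTACGTAC", "ACGTACGTAC"));
        System.out.println(k2p("AAAAAAAAAA", "GAAAAAAAAA"));
    }
}
```

Whether this is measurably faster than the long else-if chain is an open question; as the thread says, only a profiler run on real data would tell.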
From markjschreiber at gmail.com Wed Oct 24 07:59:04 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 24 Oct 2007 19:59:04 +0800 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <471EFDA9.1090706@ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> Message-ID: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> Hi - From experience the best way to optimize java code is to run a
profiler. The one in NetBeans is quite good. The reason is that the HotSpot or JIT compilers might natively compile the part of the code that you think is slow and actually make it faster than something else, which then becomes the bottleneck. Using a good profiler you can detect how much time is spent in each method and pinpoint some candidate methods for optimization. You can also see if there is a burden due to the creation of lots of objects. - Mark On 10/24/07, Andy Yates wrote: > Our code is very similar but not identical. The original programmer > shortcutted a lot of else if conditions by considering if the two bases > were equal or not. It can then calculate the transitional changes & > assume the rest are transversional. > > In terms of speed of both pieces of code I can't see an obvious way to > speed it up. Probably in our code removing the 10 or so calls to > String.charAt() with a two calls & referencing those chars might help > but in all honesty I cannot say. > > Andy > > Richard Holland wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > Thanks. > > > > Your code is similar to the code we have in > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > > see if it is identical, but it probably is. > > > > You can call our code like this: > > > > // import statement for biojava phylo stuff > > import org.biojavax.bio.phylo.*; > > > > // ...rest of code goes here > > > > // call Kimura2P > > String seq1 = ...; // Get seq1 and seq2 from somewhere > > String seq2 = ...; > > double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > > > > Note that our implementation expects sequence strings to be in upper > > case, so you'll need to make sure your data is upper case or has been > > converted to upper case before calling our method. > > > > cheers, > > Richard > > > > vineith kaul wrote: > >> This is what I have .....Thanks a lot fr the help.
From ayates at ebi.ac.uk Wed Oct 24 08:28:21 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 24 Oct 2007 13:28:21 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> Message-ID: <471F3A65.50202@ebi.ac.uk> Yes, a very good point, and one I was going to make beforehand but forgot :) It is also worth mentioning that micro-benchmarks and profiling in Java are notorious for giving misleading results because of VM warm-up and JIT compilation optimisations. There is a framework hosted on Java.net somewhere which can perform VM warm-ups and code iterations to produce more accurate benchmarking results, but the name escapes me at the moment. However, looking at this particular code, I get the feeling that this is about as fast as it's going to get without someone doing bitwise XOR operations or some C code ... that's not an open invitation for people to start recoding this in C :). At the end of the day, the key to optimisation is to ask the question "is it fast enough already?". If it is, then there's no point :) Andy Mark Schreiber wrote: > Hi - > > From experience the best way to optimize java code is to run a > profiler. The one in Netbeans is quite good. > > The reason is that the hotspot or JIT compilers might natively compile > the part of the code that you think is slow and actually make it > faster than something else which becomes the bottle neck. Using a good > profiler you can detect how much time is spent in each method and pin > point some candidate methods for optimization. You can also see if > there is a burden due to creation of lots of objects. > > - Mark > > On 10/24/07, Andy Yates wrote: >> Our code is very similar but not identical. The original programmer >> shortcutted a lot of else if conditions by considering if the two bases >> were equal or not. It can then calculate the transitional changes & >> assume the rest are transversional. >> >> In terms of speed of both pieces of code I can't see an obvious way to >> speed it up.
From markjschreiber at gmail.com Wed Oct 24 09:19:25 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 24 Oct 2007 21:19:25 +0800 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <471F3A65.50202@ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> Message-ID: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> Another important consideration after optimization is whether the task can be multithreaded. Almost all modern computers have at least 2 cores, so if the algorithm can be parallelized you will get some performance bonus on most machines. Modern JVMs will automagically try to use idle CPUs to execute new threads spawned by the programmer. - Mark On 10/24/07, Andy Yates wrote: > Yes a very good point & one I was going to make before hand but forgot :) > > Also not to mention that micro-benchmarks/profiling in Java are > notorious for giving false results due to VM warmup & JIT compilation > optimisations. There is a framework hosted on Java.net somewhere which > can perform VM warmups and code iterations to produce more accurate > benchmarking results; but the name escapes me at the moment. > > However looking at this particular code I get the feeling that this is > about as fast as its going to get without someone doing bitwise XOR > operations or some C code ... that's not an open invitation for people > to start recoding this in C :). At the end of the day the key to > optimisation is to ask the question "is it fast enough already?". If it > is then there's no point :) > > Andy > > Mark Schreiber wrote: > > Hi - > > > > From experience the best way to optimize java code is to run a > > profiler. The one in Netbeans is quite good. > > > > The reason is that the hotspot or JIT compilers might natively compile > > the part of the code that you think is slow and actually make it > > faster than something else which becomes the bottle neck. Using a good > > profiler you can detect how much time is spent in each method and pin > > point some candidate methods for optimization. You can also see if > > there is a burden due to creation of lots of objects.
> > > > - Mark > > > > On 10/24/07, Andy Yates wrote: > >> Our code is very similar but not identical. The original programmer > >> shortcutted a lot of else if conditions by considering if the two bases > >> were equal or not. It can then calculate the transitional changes & > >> assume the rest are transversional. > >> > >> In terms of speed of both pieces of code I can't see an obvious way to > >> speed it up. Probably in our code removing the 10 or so calls to > >> String.charAt() with a two calls & referencing those chars might help > >> but in all honesty I cannot say. > >> > >> Andy > >> > >> Richard Holland wrote: > >>> -----BEGIN PGP SIGNED MESSAGE----- > >>> Hash: SHA1 > >>> > >>> Thanks. > >>> > >>> Your code is similar to the code we have in > >>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > >>> see if it is identical, but it probably is. > >>> > >>> You can call our code like this: > >>> > >>> // import statement for biojava phylo stuff > >>> import org.biojavax.bio.phylo.*; > >>> > >>> // ...rest of code goes here > >>> > >>> // call Kimura2P > >>> String seq1 = ...; // Get seq1 and seq2 from somewhere > >>> String seq2 = ...; > >>> double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > >>> > >>> Note that our implementation expects sequence strings to be in upper > >>> case, so you'll need to make sure your data is upper case or has been > >>> converted to upper case before calling our method. > >>> > >>> cheers, > >>> Richard > >>> > >>> vineith kaul wrote: > >>>> This is what I have .....Thanks a lot fr the help. 
> >>>> > >>>> > >>>> //Method to calculate the Kimura 2 parameter distance > >>>> public static double K2P(String sequence1,String sequence2){ > >>>> long p=0,q=0,numberOfAlignedSites=0; // P= transitional > >>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) > >>>> > >>>> > >>>> char[] seq1array=sequence1.toCharArray(); > >>>> char[] seq2array=sequence2.toCharArray(); > >>>> > >>>> for(int i=0;i >>>> // Number of aligned sites > >>>> if(((seq1array[i]=='a') || > >>>> (seq1array[i]=='A')||(seq1array[i]=='g') || > >>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || > >>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || > >>>> (seq2array[i]=='A')||(seq2array[i]=='c') || > >>>> (seq2array[i]=='C')||(seq2array[i]=='t') || > >>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>> > >>>> numberOfAlignedSites++; > >>>> } > >>>> > >>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>> p++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>> p++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>> p++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>> p++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='g') || 
(seq1array[i]=='G')) && > >>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>> q++; > >>>> } > >>>> else > >>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>> q++; > >>>> } > >>>> > >>>> > >>>> > >>>> > >>>> } > >>>> > >>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - > >>>> (((double)q)/numberOfAlignedSites); > >>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); > >>>> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); > >>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); > >>>> return dist; > >>>> } > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> On 10/22/07, *Richard Holland* >>>> > wrote: > >>>> > >>>> You should take a look at the latest 1.5 release, in the > >>>> org.biojavax.bio.phylo packages. This code is the beginnings of some > >>>> phylogenetics code that will perform tasks as you describe. The future > >>>> plan is to extend this code to cover a wider range of use cases. > >>>> Kimura2P > >>>> is already implemented here, in > >>>> org.biojavax.bio.phylo.MultipleHitCorrection. 
> >>>> > >>>> If you can't find code that will do what you want, but have written some > >>>> before, then please do feel free to contribute it. Even if it is > >>>> slow, I'm > >>>> sure someone out there will be able to help optimise it! > >>>> > >>>> cheers, > >>>> Richard > >>>> > >>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > >>>> > Hi, > >>>> > > >>>> > Are there functions to calculate evolutionary pairwise distances like > >>>> > Kimura2P,Finkelstein etc in Biojava > >>>> > I did write smthng on my own but on large sequences it runs terribly > >>>> > slow and I am not even sure if thats right. > >>>> > -- > >>>> > Vineith Kaul > >>>> > Masters Student Bioinformatics > >>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > >>>> > Georgia Tech, Atlanta > >>>> > _______________________________________________ > >>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>> > >>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>> > > >>>> > >>>> > >>>> -- > >>>> Richard Holland > >>>> BioMart ( http://www.biomart.org/) > >>>> EMBL-EBI > >>>> Hinxton, Cambridgeshire CB10 1SD, UK > >>>> > >>>> > >>>> > >>>> > >>>> -- > >>>> Vineith Kaul > >>>> Masters Student Bioinformatics > >>>> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
> >>>> Georgia Tech, Atlanta
> >>> -----BEGIN PGP SIGNATURE-----
> >>> Version: GnuPG v1.4.2.2 (GNU/Linux)
> >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >>>
> >>> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> >>> 4iKvsyBj2uznhhjTF9EYDFE=
> >>> =LALE
> >>> -----END PGP SIGNATURE-----
> >>> _______________________________________________
> >>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >> _______________________________________________
> >> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
>
From holland at ebi.ac.uk Wed Oct 24 09:33:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:33:53 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
Message-ID: <471F49C1.9070901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.

But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.
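
To make the chunking idea concrete, here is a rough sketch. It is not
BioJava code: the class ParallelK2P and its methods countChunk and
distance are made-up names for illustration, and it assumes Java 8 or
later plus two pre-aligned strings of equal length. Each task counts
transitions, transversions and comparable sites over its own slice; the
per-slice totals are summed and the Kimura 2-parameter formula is applied
once at the end. Taking the thread count from
Runtime.getRuntime().availableProcessors() is just one possible way of
asking 'how parallel can I make this?' at runtime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of BioJava.
public class ParallelK2P {

    // Counts {transitions, transversions, comparable sites} for positions [from, to).
    public static long[] countChunk(String s1, String s2, int from, int to) {
        long p = 0, q = 0, sites = 0;
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            // Skip gaps and ambiguity codes: only A, C, G, T are compared.
            if ("ACGT".indexOf(a) < 0 || "ACGT".indexOf(b) < 0) continue;
            sites++;
            if (a == b) continue;
            boolean aPurine = (a == 'A' || a == 'G');
            boolean bPurine = (b == 'A' || b == 'G');
            // Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if (aPurine == bPurine) p++; else q++;
        }
        return new long[] {p, q, sites};
    }

    // Kimura 2-parameter distance, with the counting split over 'threads' tasks.
    public static double distance(String s1, String s2, int threads) throws Exception {
        if (s1.length() != s2.length()) {
            throw new IllegalArgumentException("sequences must be aligned to equal length");
        }
        int len = s1.length();
        int n = Math.max(1, Math.min(threads, len));
        int chunk = (len + n - 1) / n; // ceiling division so every site is covered
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            List<Future<long[]>> parts = new ArrayList<>();
            for (int t = 0; t < n; t++) {
                final int from = Math.min(len, t * chunk);
                final int to = Math.min(len, from + chunk);
                parts.add(pool.submit(() -> countChunk(s1, s2, from, to)));
            }
            long p = 0, q = 0, sites = 0;
            for (Future<long[]> part : parts) { // combine the per-thread counts
                long[] c = part.get();
                p += c[0];
                q += c[1];
                sites += c[2];
            }
            if (sites == 0) {
                throw new IllegalArgumentException("no comparable sites");
            }
            double P = (double) p / sites; // proportion of transitions
            double Q = (double) q / sites; // proportion of transversions
            return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println(distance("ACGTACGTAC", "ACGTACGTAT", n));
    }
}
```

For short sequences the cost of starting threads will probably outweigh
any gain, so a caller might only take the parallel path above some length
threshold; a profiler would be the way to find out where that is.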
cheers, Richard Mark Schreiber wrote: > Another important consideration after optimization is can the task be > multithreaded? Almost all modern computers have at least 2 cores. So > if the algorithm can be parallelized you will get some performance > bonus on most machines. > > Modern JVM's will automagically try to use idle CPU's to execute new > threads spawned by the programmer. > > - Mark > > On 10/24/07, Andy Yates wrote: >> Yes a very good point & one I was going to make before hand but forgot :) >> >> Also not to mention that micro-benchmarks/profiling in Java are >> notorious for giving false results due to VM warmup & JIT compilation >> optimisations. There is a framework hosted on Java.net somewhere which >> can perform VM warmups and code iterations to produce more accurate >> benchmarking results; but the name escapes me at the moment. >> >> However looking at this particular code I get the feeling that this is >> about as fast as its going to get without someone doing bitwise XOR >> operations or some C code ... that's not an open invitation for people >> to start recoding this in C :). At the end of the day the key to >> optimisation is to ask the question "is it fast enough already?". If it >> is then there's no point :) >> >> Andy >> >> Mark Schreiber wrote: >>> Hi - >>> >>> >From experience the best way to optimize java code is to run a >>> profiler. The one in Netbeans is quite good. >>> >>> The reason is that the hotspot or JIT compilers might natively compile >>> the part of the code that you think is slow and actually make it >>> faster than something else which becomes the bottle neck. Using a good >>> profiler you can detect how much time is spent in each method and pin >>> point some candidate methods for optimization. You can also see if >>> there is a burden due to creation of lots of objects. >>> >>> - Mark >>> >>> On 10/24/07, Andy Yates wrote: >>>> Our code is very similar but not identical. 
The original programmer >>>> shortcutted a lot of else if conditions by considering if the two bases >>>> were equal or not. It can then calculate the transitional changes & >>>> assume the rest are transversional. >>>> >>>> In terms of speed of both pieces of code I can't see an obvious way to >>>> speed it up. Probably in our code removing the 10 or so calls to >>>> String.charAt() with a two calls & referencing those chars might help >>>> but in all honesty I cannot say. >>>> >>>> Andy >>>> >>>> Richard Holland wrote: > Thanks. > > Your code is similar to the code we have in > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > see if it is identical, but it probably is. > > You can call our code like this: > > // import statement for biojava phylo stuff > import org.biojavax.bio.phylo.*; > > // ...rest of code goes here > > // call Kimura2P > String seq1 = ...; // Get seq1 and seq2 from somewhere > String seq2 = ...; > double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > > Note that our implementation expects sequence strings to be in upper > case, so you'll need to make sure your data is upper case or has been > converted to upper case before calling our method. > > cheers, > Richard > > vineith kaul wrote: >>>>>>> This is what I have .....Thanks a lot fr the help. 
>>>>>>> >>>>>>> >>>>>>> //Method to calculate the Kimura 2 parameter distance >>>>>>> public static double K2P(String sequence1,String sequence2){ >>>>>>> long p=0,q=0,numberOfAlignedSites=0; // P= transitional >>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) >>>>>>> >>>>>>> >>>>>>> char[] seq1array=sequence1.toCharArray(); >>>>>>> char[] seq2array=sequence2.toCharArray(); >>>>>>> >>>>>>> for(int i=0;i>>>>>> // Number of aligned sites >>>>>>> if(((seq1array[i]=='a') || >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') || >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') || >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') || >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>> >>>>>>> numberOfAlignedSites++; >>>>>>> } >>>>>>> >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>> p++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >>>>>>> p++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>> p++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>> p++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>> q++; >>>>>>> } 
>>>>>>> else >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >>>>>>> q++; >>>>>>> } >>>>>>> else >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>> q++; >>>>>>> } >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> } >>>>>>> >>>>>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - >>>>>>> (((double)q)/numberOfAlignedSites); >>>>>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); >>>>>>> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); >>>>>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); >>>>>>> return dist; >>>>>>> } >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 10/22/07, *Richard Holland* >>>>>> > wrote: >>>>>>> >>>>>>> You should take a look at the latest 1.5 release, in the >>>>>>> org.biojavax.bio.phylo packages. This code is the beginnings of some >>>>>>> phylogenetics code that will perform tasks as you describe. The future >>>>>>> plan is to extend this code to cover a wider range of use cases. 
>>>>>>> Kimura2P >>>>>>> is already implemented here, in >>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection. >>>>>>> >>>>>>> If you can't find code that will do what you want, but have written some >>>>>>> before, then please do feel free to contribute it. Even if it is >>>>>>> slow, I'm >>>>>>> sure someone out there will be able to help optimise it! >>>>>>> >>>>>>> cheers, >>>>>>> Richard >>>>>>> >>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: >>>>>>> > Hi, >>>>>>> > >>>>>>> > Are there functions to calculate evolutionary pairwise distances like >>>>>>> > Kimura2P,Finkelstein etc in Biojava >>>>>>> > I did write smthng on my own but on large sequences it runs terribly >>>>>>> > slow and I am not even sure if thats right. >>>>>>> > -- >>>>>>> > Vineith Kaul >>>>>>> > Masters Student Bioinformatics >>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) >>>>>>> > Georgia Tech, Atlanta >>>>>>> > _______________________________________________ >>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>> >>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> > >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Richard Holland >>>>>>> BioMart ( http://www.biomart.org/) >>>>>>> EMBL-EBI >>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Vineith Kaul >>>>>>> Masters Student Bioinformatics >>>>>>> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>> Georgia Tech, Atlanta
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----

From markjschreiber at gmail.com Wed Oct 24 09:41:16 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:41:16 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
Message-ID: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>

I'm not aware of a way to determine the number of CPU's within a
program although possibly it is one of the environment variables
available from System.

Even if it can't be determined there could be a method argument to
specify the number of threads to spawn.

- Mark

On 10/24/07, Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task.
It'd be much better to somehow be able to ask 'how parallel > can I make this?' at runtime by checking system resources, or maybe get > the parallel-savvy user to set an optional BioJava-wide parallelisation > hint. N could then be determined and the task divided appropriately. > > cheers, > Richard > > Mark Schreiber wrote: > > Another important consideration after optimization is can the task be > > multithreaded? Almost all modern computers have at least 2 cores. So > > if the algorithm can be parallelized you will get some performance > > bonus on most machines. > > > > Modern JVM's will automagically try to use idle CPU's to execute new > > threads spawned by the programmer. > > > > - Mark > > > > On 10/24/07, Andy Yates wrote: > >> Yes a very good point & one I was going to make before hand but forgot :) > >> > >> Also not to mention that micro-benchmarks/profiling in Java are > >> notorious for giving false results due to VM warmup & JIT compilation > >> optimisations. There is a framework hosted on Java.net somewhere which > >> can perform VM warmups and code iterations to produce more accurate > >> benchmarking results; but the name escapes me at the moment. > >> > >> However looking at this particular code I get the feeling that this is > >> about as fast as its going to get without someone doing bitwise XOR > >> operations or some C code ... that's not an open invitation for people > >> to start recoding this in C :). At the end of the day the key to > >> optimisation is to ask the question "is it fast enough already?". If it > >> is then there's no point :) > >> > >> Andy > >> > >> Mark Schreiber wrote: > >>> Hi - > >>> > >>> >From experience the best way to optimize java code is to run a > >>> profiler. The one in Netbeans is quite good. > >>> > >>> The reason is that the hotspot or JIT compilers might natively compile > >>> the part of the code that you think is slow and actually make it > >>> faster than something else which becomes the bottle neck. 
Using a good > >>> profiler you can detect how much time is spent in each method and pin > >>> point some candidate methods for optimization. You can also see if > >>> there is a burden due to creation of lots of objects. > >>> > >>> - Mark > >>> > >>> On 10/24/07, Andy Yates wrote: > >>>> Our code is very similar but not identical. The original programmer > >>>> shortcutted a lot of else if conditions by considering if the two bases > >>>> were equal or not. It can then calculate the transitional changes & > >>>> assume the rest are transversional. > >>>> > >>>> In terms of speed of both pieces of code I can't see an obvious way to > >>>> speed it up. Probably in our code removing the 10 or so calls to > >>>> String.charAt() with a two calls & referencing those chars might help > >>>> but in all honesty I cannot say. > >>>> > >>>> Andy > >>>> > >>>> Richard Holland wrote: > > Thanks. > > > > Your code is similar to the code we have in > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > > see if it is identical, but it probably is. > > > > You can call our code like this: > > > > // import statement for biojava phylo stuff > > import org.biojavax.bio.phylo.*; > > > > // ...rest of code goes here > > > > // call Kimura2P > > String seq1 = ...; // Get seq1 and seq2 from somewhere > > String seq2 = ...; > > double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > > > > Note that our implementation expects sequence strings to be in upper > > case, so you'll need to make sure your data is upper case or has been > > converted to upper case before calling our method. > > > > cheers, > > Richard > > > > vineith kaul wrote: > >>>>>>> This is what I have .....Thanks a lot fr the help. 
> >>>>>>> > >>>>>>> > >>>>>>> //Method to calculate the Kimura 2 parameter distance > >>>>>>> public static double K2P(String sequence1,String sequence2){ > >>>>>>> long p=0,q=0,numberOfAlignedSites=0; // P= transitional > >>>>>>> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) > >>>>>>> > >>>>>>> > >>>>>>> char[] seq1array=sequence1.toCharArray(); > >>>>>>> char[] seq2array=sequence2.toCharArray(); > >>>>>>> > >>>>>>> for(int i=0;i >>>>>>> // Number of aligned sites > >>>>>>> if(((seq1array[i]=='a') || > >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') || > >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || > >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || > >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') || > >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') || > >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> > >>>>>>> numberOfAlignedSites++; > >>>>>>> } > >>>>>>> > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g') 
|| (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - > >>>>>>> (((double)q)/numberOfAlignedSites); > >>>>>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); > >>>>>>> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); > >>>>>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); > >>>>>>> return dist; > >>>>>>> } > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 10/22/07, *Richard Holland* >>>>>>> > wrote: > >>>>>>> > >>>>>>> You should take a look at the latest 1.5 release, in the > >>>>>>> 
org.biojavax.bio.phylo packages. This code is the beginnings of some > >>>>>>> phylogenetics code that will perform tasks as you describe. The future > >>>>>>> plan is to extend this code to cover a wider range of use cases. > >>>>>>> Kimura2P > >>>>>>> is already implemented here, in > >>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection. > >>>>>>> > >>>>>>> If you can't find code that will do what you want, but have written some > >>>>>>> before, then please do feel free to contribute it. Even if it is > >>>>>>> slow, I'm > >>>>>>> sure someone out there will be able to help optimise it! > >>>>>>> > >>>>>>> cheers, > >>>>>>> Richard > >>>>>>> > >>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > >>>>>>> > Hi, > >>>>>>> > > >>>>>>> > Are there functions to calculate evolutionary pairwise distances like > >>>>>>> > Kimura2P,Finkelstein etc in Biojava > >>>>>>> > I did write smthng on my own but on large sequences it runs terribly > >>>>>>> > slow and I am not even sure if thats right. > >>>>>>> > -- > >>>>>>> > Vineith Kaul > >>>>>>> > Masters Student Bioinformatics > >>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > >>>>>>> > Georgia Tech, Atlanta > >>>>>>> > _______________________________________________ > >>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>>> > >>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>>> > > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Richard Holland > >>>>>>> BioMart ( http://www.biomart.org/) > >>>>>>> EMBL-EBI > >>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Vineith Kaul > >>>>>>> Masters Student Bioinformatics > >>>>>>> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
> >>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>> _______________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----
>

From markjschreiber at gmail.com Wed Oct 24 09:48:00 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:48:00 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
	<471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
	<93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
	<471F3A65.50202@ebi.ac.uk>
	<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
	<471F49C1.9070901@ebi.ac.uk>
	<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>

It appears it is as simple as:

Runtime.getRuntime().availableProcessors();

- Mark

On 10/24/07, Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPU's within a
> program although possibly it is one of the environment variables
> available from System.
>
> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.
> > - Mark > > On 10/24/07, Richard Holland wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > > Hash: SHA1 > > > > This particular code could easily be parallelised - given N threads, you > > can simply divide the input into N chunks and get each thread to process > > 1/Nth of the input. You then combine the output of each thread to do the > > final calculation. > > > > But, it'd be bad practice to always fork a predetermined N threads for a > > given task. It'd be much better to somehow be able to ask 'how parallel > > can I make this?' at runtime by checking system resources, or maybe get > > the parallel-savvy user to set an optional BioJava-wide parallelisation > > hint. N could then be determined and the task divided appropriately. > > > > cheers, > > Richard > > > > Mark Schreiber wrote: > > > Another important consideration after optimization is can the task be > > > multithreaded? Almost all modern computers have at least 2 cores. So > > > if the algorithm can be parallelized you will get some performance > > > bonus on most machines. > > > > > > Modern JVM's will automagically try to use idle CPU's to execute new > > > threads spawned by the programmer. > > > > > > - Mark > > > > > > On 10/24/07, Andy Yates wrote: > > >> Yes a very good point & one I was going to make before hand but forgot :) > > >> > > >> Also not to mention that micro-benchmarks/profiling in Java are > > >> notorious for giving false results due to VM warmup & JIT compilation > > >> optimisations. There is a framework hosted on Java.net somewhere which > > >> can perform VM warmups and code iterations to produce more accurate > > >> benchmarking results; but the name escapes me at the moment. > > >> > > >> However looking at this particular code I get the feeling that this is > > >> about as fast as its going to get without someone doing bitwise XOR > > >> operations or some C code ... that's not an open invitation for people > > >> to start recoding this in C :). 
At the end of the day the key to
> > >> optimisation is to ask the question "is it fast enough already?". If it
> > >> is then there's no point :)
> > >>
> > >> Andy
> > >>
> > >> Mark Schreiber wrote:
> > >>> Hi -
> > >>>
> > >>> From experience the best way to optimize java code is to run a
> > >>> profiler. The one in Netbeans is quite good.
> > >>>
> > >>> The reason is that the hotspot or JIT compilers might natively compile
> > >>> the part of the code that you think is slow and actually make it
> > >>> faster than something else which becomes the bottle neck. Using a good
> > >>> profiler you can detect how much time is spent in each method and pin
> > >>> point some candidate methods for optimization. You can also see if
> > >>> there is a burden due to creation of lots of objects.
> > >>>
> > >>> - Mark
> > >>>
> > >>> On 10/24/07, Andy Yates wrote:
> > >>>> Our code is very similar but not identical. The original programmer
> > >>>> shortcutted a lot of else if conditions by considering if the two bases
> > >>>> were equal or not. It can then calculate the transitional changes &
> > >>>> assume the rest are transversional.
> > >>>>
> > >>>> In terms of speed of both pieces of code I can't see an obvious way to
> > >>>> speed it up. Probably in our code removing the 10 or so calls to
> > >>>> String.charAt() with a two calls & referencing those chars might help
> > >>>> but in all honesty I cannot say.
> > >>>>
> > >>>> Andy
> > >>>>
> > >>>> Richard Holland wrote:
> > > Thanks.
> > >
> > > Your code is similar to the code we have in
> > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > > see if it is identical, but it probably is.
> > >
> > > You can call our code like this:
> > >
> > >   // import statement for biojava phylo stuff
> > >   import org.biojavax.bio.phylo.*;
> > >
> > >   // ...rest of code goes here
> > >
> > >   // call Kimura2P
> > >   String seq1 = ...; // Get seq1 and seq2 from somewhere
> > >   String seq2 = ...;
> > >   double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> > >
> > > Note that our implementation expects sequence strings to be in upper
> > > case, so you'll need to make sure your data is upper case or has been
> > > converted to upper case before calling our method.
> > >
> > > cheers,
> > > Richard
> > >
> > > vineith kaul wrote:
> > >>>>>>> This is what I have .....Thanks a lot fr the help.
> > >>>>>>>
> > >>>>>>> //Method to calculate the Kimura 2 parameter distance
> > >>>>>>> public static double K2P(String sequence1,String sequence2){
> > >>>>>>>   long p=0,q=0,numberOfAlignedSites=0; // P= transitional differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> > >>>>>>>
> > >>>>>>>   char[] seq1array=sequence1.toCharArray();
> > >>>>>>>   char[] seq2array=sequence2.toCharArray();
> > >>>>>>>
> > >>>>>>>   for(int i=0;i
> > >>>>>>>     // Number of aligned sites
> > >>>>>>>     if(((seq1array[i]=='a') ||
> > >>>>>>>         (seq1array[i]=='A')||(seq1array[i]=='g') ||
> > >>>>>>>         (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> > >>>>>>>         (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> > >>>>>>>         (seq2array[i]=='A')||(seq2array[i]=='c') ||
> > >>>>>>>         (seq2array[i]=='C')||(seq2array[i]=='t') ||
> > >>>>>>>         (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>
> > >>>>>>>       numberOfAlignedSites++;
> > >>>>>>>     }
> > >>>>>>>
> > >>>>>>>     if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>>         ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>       p++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>>         ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>       p++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>>         ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>       p++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>>         ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>       p++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>>         ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> > >>>>>>>         ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>>         ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> > >>>>>>>         ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>>         ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> > >>>>>>>         ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>>         ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>     else
> > >>>>>>>     if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> > >>>>>>>         ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> > >>>>>>>       q++;
> > >>>>>>>     }
> > >>>>>>>
> > >>>>>>>   }
> > >>>>>>>
> > >>>>>>>   double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> > >>>>>>>       (((double)q)/numberOfAlignedSites);
> > >>>>>>>   double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> > >>>>>>>   System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> > >>>>>>>   double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> > >>>>>>>   return dist;
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> On 10/22/07, *Richard Holland* wrote:
> > >>>>>>>
> > >>>>>>> You should take a look at the latest 1.5 release, in the
> > >>>>>>> org.biojavax.bio.phylo packages. This code is the beginnings of some
> > >>>>>>> phylogenetics code that will perform tasks as you describe. The future
> > >>>>>>> plan is to extend this code to cover a wider range of use cases.
> > >>>>>>> Kimura2P
> > >>>>>>> is already implemented here, in
> > >>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection.
> > >>>>>>>
> > >>>>>>> If you can't find code that will do what you want, but have written some
> > >>>>>>> before, then please do feel free to contribute it. Even if it is
> > >>>>>>> slow, I'm
> > >>>>>>> sure someone out there will be able to help optimise it!
> > >>>>>>> > > >>>>>>> cheers, > > >>>>>>> Richard > > >>>>>>> > > >>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > > >>>>>>> > Hi, > > >>>>>>> > > > >>>>>>> > Are there functions to calculate evolutionary pairwise distances like > > >>>>>>> > Kimura2P,Finkelstein etc in Biojava > > >>>>>>> > I did write smthng on my own but on large sequences it runs terribly > > >>>>>>> > slow and I am not even sure if thats right. > > >>>>>>> > -- > > >>>>>>> > Vineith Kaul > > >>>>>>> > Masters Student Bioinformatics > > >>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > > >>>>>>> > Georgia Tech, Atlanta > > >>>>>>> > _______________________________________________ > > >>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > >>>>>>> > > >>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l > > >>>>>>> > > > >>>>>>> > > >>>>>>> > > >>>>>>> -- > > >>>>>>> Richard Holland > > >>>>>>> BioMart ( http://www.biomart.org/) > > >>>>>>> EMBL-EBI > > >>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> -- > > >>>>>>> Vineith Kaul > > >>>>>>> Masters Student Bioinformatics > > >>>>>>> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB) > > >>>>>>> Georgia Tech, Atlanta > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > >>>> _______________________________________________ > > >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > > >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > > >>>> > > > > -----BEGIN PGP SIGNATURE----- > > Version: GnuPG v1.4.2.2 (GNU/Linux) > > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > > > iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P > > IEyRleSs1+AziCvfhcES8wI= > > =uLDm > > -----END PGP SIGNATURE----- > > > From ayates at ebi.ac.uk Wed Oct 24 09:49:22 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 24 Oct 2007 14:49:22 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <471F49C1.9070901@ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> Message-ID: <471F4D62.3030900@ebi.ac.uk> Of course parallelisation all depends on the task not being limited by something else like memory, IO or database (which of course this wouldn't be). There's also the scenario where thread startup takes longer than running the code in serial :). Not to mention Java concurrency isn't an easy thing to write correctly. I'd prefer the model promoted in Java5 where you have pools of threads & pass in instances of Callable (which are a successor to Runnable but return Futures which return objects & exceptions). You then pass in a list of these callables & wait for them all to finish & grab the results. 
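The pool-of-Callables model described above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `PoolSketch`, the method `sumInChunks`, and the toy task (summing a range of integers) are invented here and are not part of BioJava or of the original message. It relies only on the standard `java.util.concurrent` classes (`Executors.newFixedThreadPool`, `ExecutorService.invokeAll`, `Future.get`) and uses lambda syntax, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class: sums 1..n by handing one Callable per chunk to a pool.
final class PoolSketch {

    static long sumInChunks(int n, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        // Default pool size: the number of processors the JVM reports.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int size = (n + chunks - 1) / chunks; // ceiling division
            for (int start = 1; start <= n; start += size) {
                final int lo = start;
                final int hi = Math.min(n, start + size - 1);
                tasks.add(() -> {
                    long s = 0;
                    for (int i = lo; i <= hi; i++) {
                        s += i;
                    }
                    return s;
                });
            }
            long total = 0;
            // invokeAll blocks until every submitted task has completed.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get(); // a task failure surfaces as ExecutionException
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

There can be more tasks than threads; the pool simply works through them as threads become free, which is the behaviour the message describes.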
You can have as many callables as you like & the thread pool will process them as & when a thread becomes free. Combine this with looking at the reported number of processors/cores on the machine & say that's the default size of the pool (assuming you're making it parallel because you're flat-lining a processor). Say:

int processorCount = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(processorCount);

That only creates the pool, but you get the idea :). Of course someone may not want to parallelise a job (I quite like having dual cores as a runaway process can take out one but I can still run top & kill the thing).

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
>> Another important consideration after optimization is can the task be
>> multithreaded? Almost all modern computers have at least 2 cores. So
>> if the algorithm can be parallelized you will get some performance
>> bonus on most machines.
>>
>> Modern JVM's will automagically try to use idle CPU's to execute new
>> threads spawned by the programmer.
>> >> - Mark >> >> On 10/24/07, Andy Yates wrote: >>> Yes a very good point & one I was going to make before hand but forgot :) >>> >>> Also not to mention that micro-benchmarks/profiling in Java are >>> notorious for giving false results due to VM warmup & JIT compilation >>> optimisations. There is a framework hosted on Java.net somewhere which >>> can perform VM warmups and code iterations to produce more accurate >>> benchmarking results; but the name escapes me at the moment. >>> >>> However looking at this particular code I get the feeling that this is >>> about as fast as its going to get without someone doing bitwise XOR >>> operations or some C code ... that's not an open invitation for people >>> to start recoding this in C :). At the end of the day the key to >>> optimisation is to ask the question "is it fast enough already?". If it >>> is then there's no point :) >>> >>> Andy >>> >>> Mark Schreiber wrote: >>>> Hi - >>>> >>>> >From experience the best way to optimize java code is to run a >>>> profiler. The one in Netbeans is quite good. >>>> >>>> The reason is that the hotspot or JIT compilers might natively compile >>>> the part of the code that you think is slow and actually make it >>>> faster than something else which becomes the bottle neck. Using a good >>>> profiler you can detect how much time is spent in each method and pin >>>> point some candidate methods for optimization. You can also see if >>>> there is a burden due to creation of lots of objects. >>>> >>>> - Mark >>>> >>>> On 10/24/07, Andy Yates wrote: >>>>> Our code is very similar but not identical. The original programmer >>>>> shortcutted a lot of else if conditions by considering if the two bases >>>>> were equal or not. It can then calculate the transitional changes & >>>>> assume the rest are transversional. >>>>> >>>>> In terms of speed of both pieces of code I can't see an obvious way to >>>>> speed it up. 
From ayates at ebi.ac.uk Wed Oct 24 09:49:38 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 24 Oct 2007 14:49:38 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com> <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com> Message-ID: <471F4D72.80505@ebi.ac.uk> Beat me to it :) Andy Mark Schreiber wrote: > It appears it is as simple as: > > Runtime.getRuntime().availableProcessors(); > > - Mark > > On 10/24/07, Mark Schreiber wrote: >> I'm not aware of a way to determine the number of CPU's within a >> program although possibly it is one the the environment variables >> available from System. >> >> Even if it can't be determined there could be a method argument to >> specify the number of threads to spawn.
From holland at ebi.ac.uk Wed Oct 24 09:53:29 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 24 Oct 2007 14:53:29 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk>
<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <471F4E59.1040703@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program although possibly it is one of the environment variables
> available from System.

Yup, I'm not aware of one either. Actually, thinking about this, it'd be
a bad thing if BioJava grabbed both CPUs just because they're currently
available - the user might want it to only run on one, with something
else running on the second one. So attempting to guess a good
parallelisation value from the system is probably not good!

> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.

I was thinking more along the lines of a global static method in some
kind of toolkit class, so that any part of BJ which is
parallelisation-aware can take advantage of it if it is set. This also
avoids passing parameters that don't have an immediately obvious impact
on the expected output of the method. I'd also like to have this global
variable control the total number of threads, so that if the user forks
a set of threads themselves and runs a parallel-aware method in each of
them, then BJ will not attempt to sub-divide each thread into more
threads than the limit configured by this variable. Likewise if the user
changes the limit whilst threads are currently running, they should stop
(if there are too many) or new ones should start (if there are too few),
but taking care to make sure that every parallelisation request
maintains at least one thread so the job doesn't stop entirely.... there
must be a toolkit for this somewhere surely?
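For reference: the JDK reports the number of processors available to the JVM via Runtime.getRuntime().availableProcessors(), and java.util.concurrent (Java 5 onwards) offers fixed-size thread pools. The sketch below only illustrates the "divide the input into N chunks" idea with a caller-supplied thread limit. ParallelHint and its methods are hypothetical names, not BioJava API, and for simplicity it creates a pool per call rather than enforcing the shared global limit discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not part of BioJava).
class ParallelHint {

    // Default hint: the processor count reported by the JVM, never less than 1.
    static int suggestedThreads() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Counts positions where a and b differ, splitting the work into at most
    // `threads` contiguous chunks and summing the per-chunk results.
    static long countMismatches(final String a, final String b, int threads) {
        final int n = Math.min(a.length(), b.length());
        int workers = Math.max(1, Math.min(threads, Math.max(1, n)));
        int chunk = (n + workers - 1) / workers; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> parts = new ArrayList<Future<Long>>();
        long total = 0;
        try {
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(n, w * chunk);
                final int to = Math.min(n, from + chunk);
                parts.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long count = 0;
                        for (int i = from; i < to; i++) {
                            if (a.charAt(i) != b.charAt(i)) {
                                count++;
                            }
                        }
                        return count;
                    }
                }));
            }
            // Combine the output of each chunk for the final result.
            for (Future<Long> f : parts) {
                total += f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countMismatches("ACGTACGT", "ACGAACGA", suggestedThreads()));
    }
}
```

A user who wants to leave a core free could pass suggestedThreads() - 1 (or simply 1) instead of the default, which keeps the choice with the caller.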
cheers,
Richard

> - Mark
>
> On 10/24/07, Richard Holland wrote:
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
>>>> Another important consideration after optimization is can the task be
>>>> multithreaded? Almost all modern computers have at least 2 cores. So
>>>> if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVMs will automagically try to use idle CPUs to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates wrote:
>>>>> Yes a very good point & one I was going to make beforehand but forgot :)
>>>>>
>>>>> Also not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However looking at this particular code I get the feeling that this is
>>>>> about as fast as it's going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?".
If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> From experience the best way to optimize Java code is to run a
>>>>>> profiler. The one in NetBeans is quite good.
>>>>>>
>>>>>> The reason is that the HotSpot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else which becomes the bottleneck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and
>>>>>> pinpoint some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. Probably in our code replacing the 10 or so calls to
>>>>>>> String.charAt() with two calls & referencing those chars might help
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
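The shortcut Andy describes (compare the two bases first, count transitions explicitly, treat every other mismatch as a transversion) can be sketched in plain Java as follows. This is an illustrative stand-alone version and not the BioJava MultipleHitCorrection source; the class name K2PSketch, the base-to-integer encoding and the decision to skip sites where either base is not A, C, G or T are assumptions made for the example.

```java
// Hypothetical stand-alone sketch of the Kimura 2-parameter distance.
class K2PSketch {

    // Encode a base so that purines (A, G) share the high bit 0 and
    // pyrimidines (C, T) share the high bit 1; -1 marks anything else.
    static int code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'G': return 1;
            case 'C': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    // d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the
    // proportions of transitional and transversional differences.
    // Returns NaN when no comparable sites exist or a log argument is not positive.
    static double distance(String s1, String s2) {
        int n = Math.min(s1.length(), s2.length());
        long sites = 0, transitions = 0, transversions = 0;
        for (int i = 0; i < n; i++) {
            int a = code(s1.charAt(i)); // one charAt() call per sequence per site
            int b = code(s2.charAt(i));
            if (a < 0 || b < 0) {
                continue; // gap or ambiguity symbol: not an aligned site
            }
            sites++;
            if (a == b) {
                continue; // identical bases: nothing to count
            }
            if ((a >> 1) == (b >> 1)) {
                transitions++;   // A<->G or C<->T
            } else {
                transversions++; // every other mismatch
            }
        }
        double p = (double) transitions / sites;
        double q = (double) transversions / sites;
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }

    public static void main(String[] args) {
        System.out.println(distance("AAAA", "AAAG"));
    }
}
```

Because each site needs two lookups and at most two comparisons, the long else-if chain disappears, though as noted in the thread a profiler is the only reliable way to tell whether that matters in practice.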
>>>>
>>>> You can call our code like this:
>>>>
>>>> // import statement for biojava phylo stuff
>>>> import org.biojavax.bio.phylo.*;
>>>>
>>>> // ...rest of code goes here
>>>>
>>>> // call Kimura2P
>>>> String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>> String seq2 = ...;
>>>> double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> vineith kaul wrote:
>>>>>>>>>> This is what I have ..... Thanks a lot for the help.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> //Method to calculate the Kimura 2 parameter distance
>>>>>>>>>> public static double K2P(String sequence1,String sequence2){
>>>>>>>>>> // p = transitional differences (A<->G & T<->C) ; q = transversional differences (A/G<-->C/T)
>>>>>>>>>> long p=0,q=0,numberOfAlignedSites=0;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> char[] seq1array=sequence1.toCharArray();
>>>>>>>>>> char[] seq2array=sequence2.toCharArray();
>>>>>>>>>>
>>>>>>>>>> for(int i=0;i<seq1array.length;i++){
>>>>>>>>>> // Number of aligned sites
>>>>>>>>>> if(((seq1array[i]=='a') ||
>>>>>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') ||
>>>>>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
>>>>>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
>>>>>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') ||
>>>>>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') ||
>>>>>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>>
>>>>>>>>>> numberOfAlignedSites++;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
>>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
>>>>>>>>>> p++;
>>>>>>>>>> }
>>>>>>>>>> else
>>>>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
>>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
>>>>>>>>>> p++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>>>>> p++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>>>>> p++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >>>>>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >>>>>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> else >>>>>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >>>>>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >>>>>>>>>> q++; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - >>>>>>>>>> (((double)q)/numberOfAlignedSites); >>>>>>>>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); >>>>>>>>>> 
System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); >>>>>>>>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); >>>>>>>>>> return dist; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 10/22/07, *Richard Holland* >>>>>>>>> > wrote: >>>>>>>>>> >>>>>>>>>> You should take a look at the latest 1.5 release, in the >>>>>>>>>> org.biojavax.bio.phylo packages. This code is the beginnings of some >>>>>>>>>> phylogenetics code that will perform tasks as you describe. The future >>>>>>>>>> plan is to extend this code to cover a wider range of use cases. >>>>>>>>>> Kimura2P >>>>>>>>>> is already implemented here, in >>>>>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection. >>>>>>>>>> >>>>>>>>>> If you can't find code that will do what you want, but have written some >>>>>>>>>> before, then please do feel free to contribute it. Even if it is >>>>>>>>>> slow, I'm >>>>>>>>>> sure someone out there will be able to help optimise it! >>>>>>>>>> >>>>>>>>>> cheers, >>>>>>>>>> Richard >>>>>>>>>> >>>>>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: >>>>>>>>>> > Hi, >>>>>>>>>> > >>>>>>>>>> > Are there functions to calculate evolutionary pairwise distances like >>>>>>>>>> > Kimura2P,Finkelstein etc in Biojava >>>>>>>>>> > I did write smthng on my own but on large sequences it runs terribly >>>>>>>>>> > slow and I am not even sure if thats right. 
>>>>>>>>>> > -- >>>>>>>>>> > Vineith Kaul >>>>>>>>>> > Masters Student Bioinformatics >>>>>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) >>>>>>>>>> > Georgia Tech, Atlanta >>>>>>>>>> > _______________________________________________ >>>>>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>>>>> >>>>>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Richard Holland >>>>>>>>>> BioMart ( http://www.biomart.org/) >>>>>>>>>> EMBL-EBI >>>>>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Vineith Kaul >>>>>>>>>> Masters Student Bioinformatics >>>>>>>>>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) >>>>>>>>>> Georgia Tech, Atlanta > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> _______________________________________________ >>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> >> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHH05Y4C5LeMEKA/QRAouqAJ9TgDACIQLPeenSZcStDhkZQg/UuQCfc7sZ cocyjnf9/T8H3uQJ+rW5m2U= =Q6UR -----END PGP SIGNATURE----- From ayates at ebi.ac.uk Wed Oct 24 09:58:01 2007 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 24 Oct 2007 14:58:01 +0100 Subject: [Biojava-l] Evolutionary distances In-Reply-To: <471F4E59.1040703@ebi.ac.uk> References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> 
<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com> <471F4E59.1040703@ebi.ac.uk>
Message-ID: <471F4F69.3010806@ebi.ac.uk>

The executor thread pool system is the best way to control this. The
thread pool can be set up once & called out whilst all clients of the
code will wait for their jobs/futures to complete.

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I was thinking more along the lines of a global static method in some
> kind of toolkit class, so that any part of BJ which is
> parallelisation-aware can take advantage of it if it is set. This also
> avoids passing parameters that don't have an immediately obvious impact
> on the expected output of the method. I'd also like to have this global
> variable control the total number of threads, so that if the user forks
> a set of threads themselves and runs a parallel-aware method in each of
> them, then BJ will not attempt to sub-divide each thread into more
> threads than the limit configured by this variable. Likewise if the user
> changes the limit whilst threads are currently running, they should stop
> (if there are too many) or new ones should start (if there are too few),
> but taking care to make sure that every parallelisation request
> maintains at least one thread so the job doesn't stop entirely.... there
> must be a toolkit for this somewhere surely?
>

From matthew.pocock at ncl.ac.uk Tue Oct 2 22:14:01 2007
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue, 2 Oct 2007 23:14:01 +0100
Subject: [Biojava-l] Biojava Question.
In-Reply-To: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
References: <1653.130.207.66.142.1191340893.squirrel@webmail.cc.gatech.edu>
Message-ID: <200710022314.02018.matthew.pocock@ncl.ac.uk>

This is very strange. This sort of error nearly always happens because of a
misconfigured classpath.
Could you send me:

  The HTML of the page that causes the problem
  The URL of the jars the page should be referencing
  A URL that I can point my browser at that causes the problem

It is difficult to debug something like this without the program actually in
front of me.

Matthew

On Tuesday 02 October 2007, abhi232 at cc.gatech.edu wrote:
> Respected Sir,
>
> I am sorry if I sent you a direct mail but this is a kind of emergency and
> I am not getting any substantial response from the biojava mailing
> community.
> I am a graduate student at Georgia Institute of Technology. We are working on
> creating a Teaceviewer applet for viewing the Sequence using biojava
> library.
> I am able to create the applet using netbeans and run it there.
> The error comes when I upload it on net. I am getting this particular
> error.
>
> java.lang.NoClassDefFoundError: org/biojava/bio/gui/sequence/SequenceRenderer
>     at java.lang.Class.getDeclaredConstructors0(Native Method)
>     at java.lang.Class.privateGetDeclaredConstructors(Unknown Source)
>     at java.lang.Class.getConstructor0(Unknown Source)
>     at java.lang.Class.newInstance0(Unknown Source)
>     at java.lang.Class.newInstance(Unknown Source)
>     at sun.applet.AppletPanel.createApplet(Unknown Source)
>     at sun.plugin.AppletViewer.createApplet(Unknown Source)
>     at sun.applet.AppletPanel.runLoader(Unknown Source)
>     at sun.applet.AppletPanel.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
> I am getting an error only for SequenceRenderer class. Even if I comment
> that out still it is giving me error.
>
> I have set the classpath as well as the path variables and also I am
> giving the archive field in the applet code so as the biojava library will
> be available.
>
> Is there any particular thing required which I probably am missing?
> Please guide me on this topic.
> I would really appreciate your gesture.
> Thanks a lot in advance.
From elmh06 at yahoo.ca Wed Oct 3 18:27:36 2007 From: elmh06 at yahoo.ca (El Mabrouk M) Date: Wed, 3 Oct 2007 14:27:36 -0400 (EDT) Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method Message-ID: <975012.12435.qm@web37310.mail.mud.yahoo.com> Hi! I have just started to learn biojava. I have written a small program that write a sequence in fasta file with the help of the biojavax method RichSequence.IOTools.writeFasta(seqOut, s1, ns); I have got the error "cannot find symbol". I'm using biojava 1.5, jdk 1.6 and netbeans. What can be done to fix this problem? This is what I tried: import org.biojava.bio.seq.*; import java.io.*; import org.biojava.bio.symbol.SymbolList; import org.biojavax.RichObjectFactory; import javax.xml.stream.events.Namespace; import org.biojavax.bio.seq.RichSequence; public class SeqFastaF { public static void main(String[] args) { SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1"); Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0"); try { OutputStream seqOut = System.out; Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace(); RichSequence.IOTools.writeFasta(seqOut,s1,ns); } catch (IOException ex) { //io error ex.printStackTrace(); } } } Error: cannot find symbol symbol : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace) location: class org.biojavax.bio.seq.RichSequence.IOTools --------------------------------- Be smarter than spam. See how smart SpamGuard is at giving junk email the boot with the All-new Yahoo! 
Mail From markjschreiber at gmail.com Wed Oct 3 23:20:31 2007 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 4 Oct 2007 07:20:31 +0800 Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com> References: <975012.12435.qm@web37310.mail.mud.yahoo.com> Message-ID: <93b45ca50710031620m35495bfey8ec111177c6201f@mail.gmail.com> Hi - This is a compilation error. It is caused because the biojava write method is expecting a Namespace object from the biojavax package but netbeans has guessed that you wanted a Namespace object from the javax.xml.stream.events package and has imported this for you. If you remove that import ( javax.xml.stream.events.Namespace) and then import the biojavax Namespace object it should compile. - Mark On 10/4/07, El Mabrouk M wrote: > Hi! > > I have just started to learn biojava. I have written a small > program that write a sequence in fasta file with the help of the biojavax method > > RichSequence.IOTools.writeFasta(seqOut, s1, ns); > I have got the error "cannot find symbol". > I'm using biojava 1.5, jdk 1.6 and netbeans. > What can be done to fix this problem? 
> > This is what I tried: > > import org.biojava.bio.seq.*; > import java.io.*; > import org.biojava.bio.symbol.SymbolList; > import org.biojavax.RichObjectFactory; > import javax.xml.stream.events.Namespace; > import org.biojavax.bio.seq.RichSequence; > > public class SeqFastaF { > public static void main(String[] args) { > SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1"); > Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0"); > try { > OutputStream seqOut = System.out; > Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace(); > RichSequence.IOTools.writeFasta(seqOut,s1,ns); > } catch (IOException ex) { > //io error > ex.printStackTrace(); > } > } > } > > Error: > cannot find symbol > symbol : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace) > location: class org.biojavax.bio.seq.RichSequence.IOTools > > > > --------------------------------- > Be smarter than spam. See how smart SpamGuard is at giving junk email the boot with the All-new Yahoo! Mail > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > From md5 at sanger.ac.uk Wed Oct 3 23:05:43 2007 From: md5 at sanger.ac.uk (Mutlu Dogruel) Date: Thu, 4 Oct 2007 00:05:43 +0100 (BST) Subject: [Biojava-l] Problem with RichSequence.IOTools.writeFasta method In-Reply-To: <975012.12435.qm@web37310.mail.mud.yahoo.com> References: <975012.12435.qm@web37310.mail.mud.yahoo.com> Message-ID: Hi, try using import org.biojavax.Namespace instead of javax.xml.stream.events.Namespace; Also, you should handle the illegal symbol exception that DNATools.createDNASequence may throw. Cheers, mutlu On Wed, 3 Oct 2007, El Mabrouk M wrote: > Hi! > > I have just started to learn biojava. 
I have written a small > program that write a sequence in fasta file with the help of the biojavax method > > RichSequence.IOTools.writeFasta(seqOut, s1, ns); > I have got the error "cannot find symbol". > I'm using biojava 1.5, jdk 1.6 and netbeans. > What can be done to fix this problem? > > This is what I tried: > > import org.biojava.bio.seq.*; > import java.io.*; > import org.biojava.bio.symbol.SymbolList; > import org.biojavax.RichObjectFactory; > import javax.xml.stream.events.Namespace; > import org.biojavax.bio.seq.RichSequence; > > public class SeqFastaF { > public static void main(String[] args) { > SymbolList dna0 = DNATools.createDNASequence("atgctgaacaacggcatggcaacttacggacggactacgact", "dna_1"); > Sequence s1 = DNATools.createDNASequence(dna0.seqString(), "dna_0"); > try { > OutputStream seqOut = System.out; > Namespace ns = (Namespace) RichObjectFactory.getDefaultNamespace(); > RichSequence.IOTools.writeFasta(seqOut,s1,ns); > } catch (IOException ex) { > //io error > ex.printStackTrace(); > } > } > } > > Error: > cannot find symbol > symbol : method writeFasta(java.io.OutputStream,org.biojava.bio.seq.Sequence,javax.xml.stream.events.Namespace) > location: class org.biojavax.bio.seq.RichSequence.IOTools > > > > --------------------------------- > Be smarter than spam. See how smart SpamGuard is at giving junk email the boot with the All-new Yahoo! Mail > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
From su24 at st-andrews.ac.uk Thu Oct 4 14:43:23 2007 From: su24 at st-andrews.ac.uk (Saif Ur-Rehman) Date: Thu, 4 Oct 2007 15:43:23 +0100 Subject: [Biojava-l] WriteFasta Message-ID: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> Dear All, I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently trying to break up Fasta Files of whole organisms into one file per gene for further analysis. However the writeFasta method appears to append the characters "?? ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From holland at ebi.ac.uk Thu Oct 4 15:23:10 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Thu, 04 Oct 2007 16:23:10 +0100 Subject: [Biojava-l] WriteFasta In-Reply-To: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> Message-ID: <4705055E.5070401@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 SeqIOTools is deprecated. Try RichSequence.IOTools.writeFasta() instead to see if that helps. e.g.: RichSequence.IOTools.writeFasta( System.out, seq, RichObjectFactory.getDefaultNamespace() ); where seq is either a Sequence or a SequenceIterator. cheers, Richard Saif Ur-Rehman wrote: > Dear All, > > I was writing to ask about the SeqIOTools.writeFasta() Method. I am currently > trying to break up Fasta Files of whole organisms into one file per gene for > further analysis. However the writeFasta method appears to append the > characters > "?? 
>
> ------------------------------------------------------------------
> University of St Andrews Webmail: https://webmail.st-andrews.ac.uk
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp
C4xPs/2ywAMfIPDmUKPCrqg=
=TwwH
-----END PGP SIGNATURE-----

From su24 at st-andrews.ac.uk Thu Oct 4 15:23:52 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Thu, 4 Oct 2007 16:23:52 +0100
Subject: [Biojava-l] (no subject)
Message-ID: <1191511432.4705058825b79@webmail.st-andrews.ac.uk>

Dear All,

I'm sorry, the use of the characters seems to have truncated the previous email
I sent. To complete my question, I was just wondering as to possible causes for
this addition of random characters and if there was a way to stop it from
occurring.

Thanking you again

Saif

-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK
------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From su24 at st-andrews.ac.uk Fri Oct 5 10:06:25 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri, 5 Oct 2007 11:06:25 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <4705055E.5070401@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk> <4705055E.5070401@ebi.ac.uk>
Message-ID: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>

Dear Richard,

I have tried the RichSequence.IOTools.writeFasta method and this method is still
appending the characters "??" to the front of each write.
I am using a FileOutputStream and a Sequence object as inputs to the method. like so. Sequence seq; // read in from File FileOutputStream f =new FileOutputStream (fileName); try{ RichSequence.IOTools.writeFasta(f, seq, RichObjectFactory.getDefaultNamespace() ); } Thanks a lot for your time Sincerely, Saif Quoting Richard Holland : > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > SeqIOTools is deprecated. > > Try RichSequence.IOTools.writeFasta() instead to see if that helps. > > e.g.: > > RichSequence.IOTools.writeFasta( > System.out, > seq, > RichObjectFactory.getDefaultNamespace() > ); > > where seq is either a Sequence or a SequenceIterator. > > cheers, > Richard > > Saif Ur-Rehman wrote: > > Dear All, > > > > I was writing to ask about the SeqIOTools.writeFasta() Method. I am > currently > > trying to break up Fasta Files of whole organisms into one file per gene > for > > further analysis. However the writeFasta method appears to append the > > characters > > "?? > > > > ------------------------------------------------------------------ > > University of St Andrews Webmail: https://webmail.st-andrews.ac.uk > > > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHBQVe4C5LeMEKA/QRAvBDAKCQkyH+a6TK5VpgfpSmAgfTUPrG+gCgkIJp > C4xPs/2ywAMfIPDmUKPCrqg= > =TwwH > -----END PGP SIGNATURE----- > ------------------------------------------------------------------------------- Saif Ur-Rehman Research Student The Centre for Evolution, Genes & Genomics (CEGG) Dyers Brae School of Biology The University of St Andrews St Andrews, Fife Scotland,UK ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From holland at ebi.ac.uk Fri Oct 
5 10:13:36 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 11:13:36 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
 <4705055E.5070401@ebi.ac.uk>
 <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060E50.2070405@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Where are the input sequences coming from? That is, what method are you
using to construct them or read them from a file?

Also, what do you mean by the 'front' of each write? Could you send me
an example of an entire FASTA file containing the problem? (It'd be best
to attach the file to an email to me personally, as this list will not
accept attachments, and copying and pasting from a text editor into an
email client may obscure the underlying problem.)

It'd also be good to see your entire code from the point the sequences
are read or created to the point where they are written out.
Alternatively, a sample program which exhibits the same behaviour would
suffice.

I suspect that the sequences themselves contain the incorrect data,
although technically this should be impossible as the sequence alphabet
should prevent it.

We recently had an issue reported here regarding BioJava not being able
to do certain sequence tasks on platforms using non-Western-European
character mappings. If your machine is running such a mapping, try it
again on a machine with an English or other Western European language
set up by default. If it works there but not on your machine, then
this'll be the same problem. (There is no solution yet, but at least
you'll know what's wrong.)

cheers,
Richard

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBg5Q4C5LeMEKA/QRAlKlAKCKXrMfJI2W4Ir7Us5P9bj3KmEY1ACgo89L
WgUPFCLGUNSUZxO8h3Ltqlw=
=Jq7X
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Fri Oct 5 10:16:02 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date:
Fri, 05 Oct 2007 11:16:02 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
 <4705055E.5070401@ebi.ac.uk>
 <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
Message-ID: <47060EE2.2000909@ebi.ac.uk>

Is it possible for you to send us the code which you're trying to run
and the sequence you are trying to write out? If you send it in a form
we can drop into an IDE and run, that would help us a lot.

Thanks,

Andy Yates

From holland at ebi.ac.uk Fri Oct 5 12:10:58 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Fri, 05 Oct 2007 13:10:58 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <1191584372.4706227437594@webmail.st-andrews.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
 <4705055E.5070401@ebi.ac.uk>
 <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
 <47060E50.2070405@ebi.ac.uk>
 <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
 <47061FDD.1070806@ebi.ac.uk>
 <1191584372.4706227437594@webmail.st-andrews.ac.uk>
Message-ID: <470629D2.6020709@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Great, thanks.

The initial analysis shows that the text file generated contains four
extra characters at the beginning of the file, and is using '\n' as the
line separator.
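
For anyone wanting to check their own output files for the same thing, here is a minimal sketch of one way to look at the leading bytes from Java. It is plain JDK code and not part of BioJava; the class and method names (HexHead, hexHead) are made up for illustration, and it assumes Java 7 or later for try-with-resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HexHead {

    /** Returns up to the first 'count' bytes of the file as space-separated hex. */
    public static String hexHead(Path file, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = Files.newInputStream(file)) {
            int b;
            // Stop at 'count' bytes or at end of file, whichever comes first.
            for (int i = 0; i < count && (b = in.read()) != -1; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02x", b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HexHead <file>
        System.out.println(hexHead(Paths.get(args[0]), 16));
    }
}
```

Run against a clean FASTA file, the first byte printed should be 3e, the '>' that starts the header line; anything before that is something other than FASTA text.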

This is a hex dump of the file:

00000000  ac ed 00 05 3e 67 69 7c 31 38 33 39 38 33 39 30  |....>gi|18398390|
00000010  7c 6c 63 6c 7c 4e 50 5f 35 36 35 34 31 33 2e 31  ||lcl|NP_565413.1|
00000020  7c 4e 50 5f 35 36 35 34 31 33 20 75 6e 6b 6e 6f  ||NP_565413 unkno|
00000030  77 6e 20 70 72 6f 74 65 69 6e 20 5b 41 72 61 62  |wn protein [Arab|
00000040  69 64 6f 70 73 69 73 20 74 68 61 6c 69 61 6e 61  |idopsis thaliana|
00000050  5d 0a 4d 53 4c 52 49 4b 4c 56 56 44 4b 46 56 45  |].MSLRIKLVVDKFVE|
00000060  45 4c 4b 51 41 4c 44 41 44 49 51 44 52 49 4d 4b  |ELKQALDADIQDRIMK|
00000070  45 52 45 4d 51 53 59 49 58 58 58 58 58 58 58 58  |EREMQSYIXXXXXXXX|
00000080  58 58 58 58 58 57 4b 41 45 4c 53 52 52 45 54 45  |XXXXXWKAELSRRETE|
00000090  49 41 52 51 45 41 52 4c 4b 4d 45 52 45 4e 4c 45  |IARQEARLKMERENLE|
000000a0  4b 45 0a 4b 53 56 4c 4d 47 54 41 53 4e 51 44 4e  |KE.KSVLMGTASNQDN|
000000b0  51 44 47 41 4c 45 49 54 56 53 47 45 4b 59 52 43  |QDGALEITVSGEKYRC|
000000c0  4c 52 46 53 4b 41 4b 4b 0a                       |LRFSKAKK.|

The four extra characters are hex ac ed 00 05, and these are showing
as question marks in your text editor because that's how text editors
handle unprintable characters.

Does anyone recognise these characters? There is no code in BioJava
which writes anything like this; in fact there is no output code at all
before the initial write of the first > symbol in the file. Something
tells me that these symbols are being inserted by the VM or the OS
somewhere under the hood, possibly due to internationalisation?

I strongly suspect this is an internationalisation problem. It seems
probable that Java has been set up on your system to use a language or
character encoding that causes Java by default to write these extra
characters at the start of files to indicate the encoding. Check the
output of:

  System.getProperty("file.encoding");

to see if it is using something other than UTF-8. If it is, then chances
are that this is the problem.

We've had internationalisation problems before with BioJava.
Hopefully these will be addressed in future development, but there is no
current activity in that area due to lack of resources. In the meantime
the best workaround is to set every setting you can find to a Western
European character set/character mapping and UTF-8 file encoding, in the
hope that it will all match up nicely and work.

cheers,
Richard

Saif Ur-Rehman wrote:
> Dear Richard,
>
> The input file is just the entire set of RefSeq proteins for
> Arabidopsis thaliana and is too large for me to send as an attachment.
> But it can be downloaded from NCBI using the query
> "Arabidopsis thaliana [orgn] srcdb_refseq[prop]".
>
> Cheers,
>
> Saif
>
> Quoting Richard Holland :
>
> > Interesting. Could you send your input file as well?
> >
> > cheers,
> > Richard
> >
> > Saif Ur-Rehman wrote:
> > > Dear Richard,
> > >
> > > The sequences are being read by SeqIOTools.readFastaProtein. The
> > > code from read to write is as follows. Essentially the program
> > > wants to read in a fasta file containing all the protein sequences
> > > in a given organism and split them up into one file per protein.
> > >
> > > BufferedReader br=null;
> > > try
> > > {
> > >     br = new BufferedReader(new FileReader(filename));
> > > }
> > > catch (FileNotFoundException e1)
> > > {
> > >     e1.printStackTrace();
> > > }
> > >
> > > SequenceIterator stream = SeqIOTools.readFastaProtein(br);
> > > while (stream.hasNext())
> > > {
> > >     try
> > >     {
> > >         Sequence seq = stream.nextSequence();
> > >         File scriptFile1= new File("///Users/Saif/Organisms/RunTemp/"+name
> > >                                    +"/"+seq.getName());
> > >
> > >         try
> > >         {
> > >             scriptFile1.createNewFile();
> > >         }
> > >         catch (IOException e1)
> > >         {
> > >             e1.printStackTrace();
> > >         }
> > >
> > >         try
> > >         {
> > >             FileWriter fstream = new FileWriter(scriptFile1.getAbsolutePath());
> > >             BufferedWriter out = new BufferedWriter(fstream);
> > >
> > >             FileOutputStream f =new FileOutputStream (scriptFile1);
> > >
> > >             RichSequence rs=RichSequence.Tools.enrich(seq);
> > >
> > >             try{
> > >                 RichSequence.IOTools.writeFasta(
> > >                     f,
> > >                     rs,
> > >                     RichObjectFactory.getDefaultNamespace()
> > >                 );
> > >             }
> > >             catch (IOException ioe){}
> > >
> > > An example of an outputted fasta file from this code is attached.
> > >
> > > Thanks a lot for your time.
> > >
> > > Saif

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHBinR4C5LeMEKA/QRAqs9AJ9yzLmta3jFDoKWLVTXKgrdADnswQCeNDmb
pxAPAybISoRQgbvQ1wyzqVg=
=MS7P
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Fri Oct 5 12:28:43 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 05 Oct 2007 13:28:43 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
 <4705055E.5070401@ebi.ac.uk>
 <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
 <47060E50.2070405@ebi.ac.uk>
 <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
 <47061FDD.1070806@ebi.ac.uk>
 <1191584372.4706227437594@webmail.st-andrews.ac.uk>
 <470629D2.6020709@ebi.ac.uk>
Message-ID: <47062DFB.6040201@ebi.ac.uk>

I've done a quick search and it seems as if U+ACED is a Chinese character
and the other is just a blank. Something is getting confused quite badly
here.

From su24 at st-andrews.ac.uk Fri Oct 5 13:44:29 2007
From: su24 at st-andrews.ac.uk (Saif Ur-Rehman)
Date: Fri, 5 Oct 2007 14:44:29 +0100
Subject: [Biojava-l] WriteFasta
In-Reply-To: <470629D2.6020709@ebi.ac.uk>
References: <1191509003.4704fc0b74668@webmail.st-andrews.ac.uk>
 <4705055E.5070401@ebi.ac.uk>
 <1191578785.47060ca19d74b@webmail.st-andrews.ac.uk>
 <47060E50.2070405@ebi.ac.uk>
 <1191582472.47061b0836c9f@webmail.st-andrews.ac.uk>
 <47061FDD.1070806@ebi.ac.uk>
 <1191584372.4706227437594@webmail.st-andrews.ac.uk>
 <470629D2.6020709@ebi.ac.uk>
Message-ID: <1191591869.47063fbd22461@webmail.st-andrews.ac.uk>

Setting the System properties solved the problem.

Thanks a lot,

Saif

-------------------------------------------------------------------------------
Saif Ur-Rehman
Research Student
The Centre for Evolution, Genes & Genomics (CEGG)
Dyers Brae
School of Biology
The University of St Andrews
St Andrews,
Fife
Scotland,UK

------------------------------------------------------------------
University of St Andrews Webmail: https://webmail.st-andrews.ac.uk

From sanbiogene at yahoo.co.in Sat Oct 6 09:23:11 2007
From: sanbiogene at yahoo.co.in (sandeep telkar)
Date: Sat, 6 Oct 2007 10:23:11 +0100 (BST)
Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM
Message-ID: <121992.19693.qm@web94408.mail.in2.yahoo.com>

Dear friends,

Sandeep here. I want to learn BioJava and I am a beginner. Where can I
download an .exe installation file for it, like the one for JDK 6 from
the Sun website?

Please suggest anything other than the following URL:
http://biojava.org/wiki/BioJava:Download

And please tell me in which directory I have to save the program. I am
not getting any clear idea.

Please help me.
- Sandeep

Sandeep Telkar,
M.Sc Bioinformatics.

Meet people who discuss and share your passions.
Go to http://in.promos.yahoo.com/groups From su24 at st-andrews.ac.uk Sat Oct 6 18:04:28 2007 From: su24 at st-andrews.ac.uk (Saif Ur-Rehman) Date: Sat, 6 Oct 2007 19:04:28 +0100 Subject: [Biojava-l] BIOJAVA INSTALLATION FOR WINDOWS PLATFORM In-Reply-To: <121992.19693.qm@web94408.mail.in2.yahoo.com> References: <121992.19693.qm@web94408.mail.in2.yahoo.com> Message-ID: <1191693868.4707ce2caae97@webmail.st-andrews.ac.uk> Hi, You need to download the Jar files from http://biojava.org/wiki/BioJava:Download. You can then use the File biojava-1.5.jar. Just include it in the buildpath as an external JAR if you're using an IDE like Netbeans or Eclipse or your class path if working from the command line. You can then import the BioJava classes and use them. Hope that helps Cheers, Saif Quoting sandeep telkar : > Dear friends, > Sandeep here... > I wanna learn biojava n now i am beginner.but > from where to download its exe installation file as > like that of JDK6 fron sun website.... > > please suggest me any thing other than the following > url: > http://biojava.org/wiki/BioJava:Download > > N plese tell in which directory i have to save the > program..... > I am not getting any clear idea .. > > please help me.. > - Sandeep > > Sandeep Telkar, > M.Sc Bioinformatics. > > > > Meet people who discuss and share your passions. 
Go to > http://in.promos.yahoo.com/groups > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > ------------------------------------------------------------------------------- Saif Ur-Rehman Research Student The Centre for Evolution, Genes & Genomics (CEGG) Dyers Brae School of Biology The University of St Andrews St Andrews, Fife Scotland,UK ------------------------------------------------------------------ University of St Andrews Webmail: https://webmail.st-andrews.ac.uk From vineith at gmail.com Wed Oct 10 04:44:22 2007 From: vineith at gmail.com (vineith kaul) Date: Wed, 10 Oct 2007 00:44:22 -0400 Subject: [Biojava-l] case-sensitive sequences Message-ID: Hi, I want to read in a sequence which has case sensitive alphabets(nucleotides).Basically I want to replace only small 'a,g,t,c' with blanks .Although I saw a similar post earlier but couldn't understand much.Can someone help me with this ? -- Vineith Kaul Masters Student Bioinformatics The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) Georgia Tech, Atlanta From holland at ebi.ac.uk Wed Oct 10 08:06:16 2007 From: holland at ebi.ac.uk (Richard Holland) Date: Wed, 10 Oct 2007 09:06:16 +0100 Subject: [Biojava-l] case-sensitive sequences In-Reply-To: References: Message-ID: <470C87F8.8020502@ebi.ac.uk> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 You can use SoftMaskedAlphabet with the BioJavaX parsers to get the desired effect. By default, a soft masked character is one in lower case. The code below will detect these. If you have other search criteria you can modify the soft masked detection criteria to match this instead. To do that, add a second parameter to the call to SoftMaskedAlphabet.getInstance() and use it to pass in an instance of SoftMaskedAlphabet.MaskingDetector (see the JavaDocs to see how this should work). Hope this helps! : // Set up a soft-masked alphabet. 
SoftMaskedAlphabet sma = SoftMaskedAlphabet.getInstance(DNATools.getDNA());
SymbolTokenization stok = sma.getTokenization("token");

// Set up sequence parsing.
BufferedReader input = ....; // Get your sequences from somewhere
RichSequenceFormat format = new FastaFormat(); // Or Genbank etc.
RichSequenceBuilderFactory factory = RichSequenceBuilderFactory.FACTORY; // See Javadocs for alternative factories.
Namespace ns = RichObjectFactory.getDefaultNamespace(); // See Javadocs for alternative namespaces.

// Parse the sequences.
RichStreamReader seqsIn = new RichStreamReader(input, format, stok, factory, ns);

// Find the soft-masked symbols in the sequences.
while (seqsIn.hasNext()) {
    RichSequence seq = seqsIn.nextRichSequence();
    // Iterate over symbols in sequence.
    for (Iterator i = seq.iterator(); i.hasNext(); ) {
        Symbol sym = (Symbol)i.next();
        // Is this symbol masked?
        if (sma.isMasked(sym)) {
            // Yes it is so deal with it.
            .......
        } else {
            // No it isn't, so deal with that instead.
            .......
        }
    }
}

cheers,
Richard

vineith kaul wrote:
> Hi,
>
> I want to read in a sequence which has case sensitive
> alphabets(nucleotides).Basically I want to replace only small
> 'a,g,t,c' with blanks .Although I saw a similar post earlier but
> couldn't understand much.Can someone help me with this ?
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHDIf44C5LeMEKA/QRAmuNAJ426M/UgInqDG5rG6w+F+qoMdVzPQCfZo1S
nAS5v8jSFBX5WCuB5UmzczQ=
=Sicc
-----END PGP SIGNATURE-----

From vineith at gmail.com Sun Oct 14 17:21:45 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 14 Oct 2007 13:21:45 -0400
Subject: [Biojava-l] Java to Perl
Message-ID:

Is there some tool by which we can convert a complete Java Code to a Perl code ?

--
Vineith Kaul
Masters Student Bioinformatics
The Parker H.
Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From davidfeitosa at gmail.com Sun Oct 14 17:57:47 2007
From: davidfeitosa at gmail.com (David Barbosa Feitosa)
Date: Sun, 14 Oct 2007 14:57:47 -0300
Subject: [Biojava-l] Java to Perl
In-Reply-To:
References:
Message-ID: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>

Vineith,

I do not know of one, but if you need to execute Perl code inside Java code: in Java 6 (codename Mustang) it is possible to execute script code inside the Java Virtual Machine.

The default scripting engine is Rhino, for JavaScript, but since the scripting support is a specification, if a Perl engine exists you can plug it into the JVM and execute your Perl code.

More info about the available engines and how to install one:

https://scripting.dev.java.net/

Maybe it can help you,

David.

2007/10/14, vineith kaul :
>
> Is there some tool by which we can convert a complete Java Code to a
> Perl code ?
>
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk Mon Oct 15 08:15:33 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Mon, 15 Oct 2007 09:15:33 +0100
Subject: [Biojava-l] Java to Perl
In-Reply-To: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
References: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com>
Message-ID: <471321A5.5090600@ebi.ac.uk>

Unfortunately, to my knowledge there is no Perl/Java scripting interface. Apparently, for some reason Perl is not trendy enough to warrant a port (which is a pity).

In response to Vineith's original question, such a tool really wouldn't work. Good Perl code is very different to good Java code.
If you did get something that would work, you'd probably end up with quite verbose & inefficient Perl code (not to mention the problems that would arise with Perl objects having no access modifiers, using inside-out objects, converting 3rd party libraries etc).

Two options do spring to mind if you need code available in both languages:

* Make one of the pieces of code a "black box" where you read results from STDOUT (works well enough calling a Java program from Perl).

* Write the common code in C

Out of these two options, if you want the code replicated in a 1-1 fashion then C is your only option. Otherwise the first idea is the easiest to work with.

As David did mention, there are other scripting engines available (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy your scripting needs whilst remaining in a Java environment (Groovy hits that nice sweet spot for a Java-inspired scripting language).

Andy

P.S. This really isn't a Biojava question ...

David Barbosa Feitosa wrote:
> Vineith
>
> I do not know, but if you need to execute Perl code inside Java code, in
> Java 6, codename Mustang, it is possible to execute script code inside the Java
> Virtual Machine.
>
> The default scripting engine is Rhino, for JavaScript, but as it is a
> specification, if a Perl engine exists, you can plug it into the JVM and
> execute your Perl code.
>
> More info about the available engines and how to install one:
>
> https://scripting.dev.java.net/
>
> Maybe it can help you,
>
> David.
>
> 2007/10/14, vineith kaul :
>> Is there some tool by which we can convert a complete Java Code to a
>> Perl code ?
>>
>> --
>> Vineith Kaul
>> Masters Student Bioinformatics
>> The Parker H.
Petit Institute for Bioengineering and Bioscience (IBB)
>> Georgia Tech, Atlanta
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From phidias51 at gmail.com Mon Oct 15 14:57:06 2007
From: phidias51 at gmail.com (Mark Fortner)
Date: Mon, 15 Oct 2007 07:57:06 -0700
Subject: [Biojava-l] Java to Perl
In-Reply-To: <471321A5.5090600@ebi.ac.uk>
References: <93dd9ad00710141057r37b4d80bufe0efe5125cb0393@mail.gmail.com> <471321A5.5090600@ebi.ac.uk>
Message-ID: <6e1d61f50710150757p6ba25c1ck9466baa5f8273bc2@mail.gmail.com>

The original post indicated that they wanted to go from Java to Perl. A quick Google search yielded a lot of hits for tools going from Perl to Java. Just out of curiosity, was there some reason you wanted to create Perl code from Java code?

There are a couple of projects which supposedly provide Perl scripting support inside Java to one extent or another. The first is called Sleep (http://sleep.hick.org/), which is described as being a Perl-like plugin for the Java 6 scripting engine. There's also a BSF plugin called BSF Perl (http://bsfperl.sf.net) and another BSF plugin called PerlScript, which is part of ActiveState's ActivePerl distribution.

I don't have any first-hand experience with any of these, so please don't construe anything I say as an endorsement of these technologies. Although none of these solutions will convert Perl code into Java or vice versa, they may allow you to run Perl inside a VM.

Hope this helps,

Mark

On 10/15/07, Andy Yates wrote:
>
> Unfortunately to my knowledge there is no Perl/Java scripting interface.
> Apparently for some reason Perl is not trendy enough to warrant a port
> (which is a pity).
> > In response to Vineith's original question such a tool really wouldn't > work. Good Perl code is very different to good Java code. If you did get > something that would work you'd probably end up with quite verbose & > in-efficent Perl code (not to mention the problems that would arise with > Perl objects having no access modifiers, using inside-out objects, > converting 3rd party libraries etc). > > Two options do spring to mind if you need code available in both > languages: > > * Make one of the pieces of code a "black box" where you read results > from STDOUT (works well enough calling a Java program from Perl). > > * Write the commmon code in C > > Out of these two options if you want the code replicated in a 1-1 > fashion then C is your only option. Otherwise the first idea is the > easiest to work with. > > As David did mention there are other scripting engines available > (Jython, Groovy, JRuby & Rhino all spring to mind) which might satisfy > your scripting needs whilst remaining in a Java environment (Groovy hits > that nice sweet spot for a Java inspired scripting language). > > Andy > > P.S. This really isn't a Biojava question ... > > David Barbosa Feitosa wrote: > > Vineith > > > > I do not know, but if you need to execute Pearl code inside Java code, > in > > Java 6, codename Mustang, is possible to execute script code inside the > Java > > Virtual Machine. > > > > The default scripting engine is Rhino, for JavaScript, but as it is a > > specification, if exists an Pearl engine, you can plug it into the JVM > and > > execute your Pearl code. > > > > Mode infoa bout the available engines and how to install one: > > > > https://scripting.dev.java.net/ > > > > Maybe it can help you, > > > > David. > > > > 2007/10/14, vineith kaul : > >> Is there some tool by which we can convert a complete Java Code to a > >> Perl code ? > >> > >> -- > >> Vineith Kaul > >> Masters Student Bioinformatics > >> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
> >> Georgia Tech, Atlanta
> >> _______________________________________________
> >> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From vineith at gmail.com Sun Oct 21 16:30:48 2007
From: vineith at gmail.com (vineith kaul)
Date: Sun, 21 Oct 2007 12:30:48 -0400
Subject: [Biojava-l] Evolutionary distances
Message-ID:

Hi,

Are there functions in Biojava to calculate evolutionary pairwise distances like Kimura2P, Finkelstein etc.? I did write something on my own, but on large sequences it runs terribly slowly and I am not even sure that it is right.

--
Vineith Kaul
Masters Student Bioinformatics
The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
Georgia Tech, Atlanta

From holland at ebi.ac.uk Mon Oct 22 12:06:57 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Mon, 22 Oct 2007 13:06:57 +0100 (BST)
Subject: [Biojava-l] Evolutionary distances
In-Reply-To:
References:
Message-ID: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>

You should take a look at the latest 1.5 release, in the org.biojavax.bio.phylo packages. This code is the beginnings of some phylogenetics code that will perform tasks such as the one you describe. The future plan is to extend this code to cover a wider range of use cases. Kimura2P is already implemented here, in org.biojavax.bio.phylo.MultipleHitCorrection.

If you can't find code that will do what you want, but have written some before, then please do feel free to contribute it. Even if it is slow, I'm sure someone out there will be able to help optimise it!
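
For anyone following the thread, here is a minimal, self-contained sketch of the Kimura two-parameter calculation being discussed. It is not the BioJava implementation: the class and method names (K2PSketch, k2p) are made up for illustration, and it assumes two pre-aligned DNA strings, comparing only as many columns as the shorter one has and skipping any column that is not A, C, G or T in both.

```java
import java.util.Locale;

// Illustrative only: K2PSketch and k2p are hypothetical names, not BioJava API.
class K2PSketch {

    // True for the four unambiguous DNA bases (upper case).
    static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    // Purines are A and G; pyrimidines are C and T.
    static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    // Kimura 2-parameter distance between two aligned sequences.
    // Returns NaN when no column can be compared; the result is also
    // NaN or infinite when the sequences are too divergent for the model.
    static double k2p(String seq1, String seq2) {
        String a = seq1.toUpperCase(Locale.ROOT);
        String b = seq2.toUpperCase(Locale.ROOT);
        int n = Math.min(a.length(), b.length());

        long transitions = 0;    // A<->G or C<->T
        long transversions = 0;  // purine <-> pyrimidine
        long sites = 0;          // columns with a plain base in both sequences

        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (!isBase(x) || !isBase(y)) {
                continue; // gap or ambiguity code: not counted
            }
            sites++;
            if (x == y) {
                continue;
            }
            if (isPurine(x) == isPurine(y)) {
                transitions++;
            } else {
                transversions++;
            }
        }

        if (sites == 0) {
            return Double.NaN;
        }
        double p = (double) transitions / sites;
        double q = (double) transversions / sites;
        return -0.5 * Math.log(1.0 - 2.0 * p - q) - 0.25 * Math.log(1.0 - 2.0 * q);
    }

    public static void main(String[] args) {
        // One transition in ten aligned sites.
        System.out.println(k2p("AAAAAAAAAA", "GAAAAAAAAA"));
    }
}
```

Checking for equality first and then comparing purine/pyrimidine class keeps the loop to a handful of comparisons per column; whether that makes a measurable difference on large inputs would need to be measured.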
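
cheers,
Richard

On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> Hi,
>
> Are there functions to calculate evolutionary pairwise distances like
> Kimura2P,Finkelstein etc in Biojava
> I did write smthng on my own but on large sequences it runs terribly
> slow and I am not even sure if thats right.
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Richard Holland
BioMart (http://www.biomart.org/)
EMBL-EBI
Hinxton, Cambridgeshire CB10 1SD, UK

From vineith at gmail.com Tue Oct 23 06:59:29 2007
From: vineith at gmail.com (vineith kaul)
Date: Tue, 23 Oct 2007 02:59:29 -0400
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
Message-ID:

This is what I have. Thanks a lot for the help.

// Method to calculate the Kimura 2-parameter distance
public static double K2P(String sequence1, String sequence2) {
    // p = transitional differences (A<->G & T<->C); q = transversional differences (A/G<-->C/T)
    long p = 0, q = 0, numberOfAlignedSites = 0;

    char[] seq1array = sequence1.toCharArray();
    char[] seq2array = sequence2.toCharArray();

    // NOTE: the loop bound was lost in the list archive; seq1array.length is assumed here.
    for (int i = 0; i < seq1array.length; i++) {
        // Number of aligned sites
        if (((seq1array[i]=='a') || (seq1array[i]=='A') || (seq1array[i]=='g') || (seq1array[i]=='G')
                || (seq1array[i]=='c') || (seq1array[i]=='C') || (seq1array[i]=='t') || (seq1array[i]=='T'))
                && ((seq2array[i]=='a') || (seq2array[i]=='A') || (seq2array[i]=='c') || (seq2array[i]=='C')
                || (seq2array[i]=='t') || (seq2array[i]=='T') || (seq2array[i]=='g') || (seq2array[i]=='G'))) {
            numberOfAlignedSites++;
        }

        if (((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
            p++;
        } else if (((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
            p++;
        } else if (((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
            p++;
        } else if (((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
            p++;
        } else if (((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
            q++;
        } else if (((seq1array[i]=='a') || (seq1array[i]=='A')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
            q++;
        } else if (((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
            q++;
        } else if (((seq1array[i]=='g') || (seq1array[i]=='G')) && ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
            q++;
        } else if (((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
            q++;
        } else if (((seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
            q++;
        } else if (((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
            q++;
        } else if (((seq1array[i]=='c') || (seq1array[i]=='C')) && ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
            q++;
        }
    }

    double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - (((double)q)/numberOfAlignedSites);
    double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
    System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
    double dist = (-0.5 * Math.log(P)) - (0.25 * Math.log(Q));
    return dist;
}

On 10/22/07, Richard Holland wrote:
>
> You should take a look at the latest 1.5 release, in the
> org.biojavax.bio.phylo packages. This code is the beginnings of some
> phylogenetics code that will perform tasks as you describe. The future
> plan is to extend this code to cover a wider range of use cases. Kimura2P
> is already implemented here, in
> org.biojavax.bio.phylo.MultipleHitCorrection.
>
> If you can't find code that will do what you want, but have written some
> before, then please do feel free to contribute it.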
Even if it is slow, I'm > sure someone out there will be able to help optimise it! > > cheers, > Richard > > On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > > Hi, > > > > Are there functions to calculate evolutionary pairwise distances like > > Kimura2P,Finkelstein etc in Biojava > > I did write smthng on my own but on large sequences it runs terribly > > slow and I am not even sure if thats right. > > -- > > Vineith Kaul > > Masters Student Bioinformatics > > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > > Georgia Tech, Atlanta > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > > -- > Richard Holland > BioMart (http://www.biomart.org/) > EMBL-EBI > Hinxton, Cambridgeshire CB10 1SD, UK > > -- Vineith Kaul Masters Student Bioinformatics The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) Georgia Tech, Atlanta From ozgur7 at gmail.com Tue Oct 23 18:17:29 2007 From: ozgur7 at gmail.com (Ozgur Ozturk) Date: Tue, 23 Oct 2007 11:17:29 -0700 Subject: [Biojava-l] problem with CookBook:Blast:Parser Message-ID: Hi, I am receiving the following error when I use BlastParser code from the cookbook : org.xml.sax.SAXException: Could not recognise the format of this file as one supported by the framework. 
    at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:182)
    at org.arabidopsis.test.BlastParser.main(BlastParser.java:44)

I generated the XML file using this command:

blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 > tempresult.xml

Then I pass it to BlastParser:

BlastParser tempresult.xml

Thanks in advance for your help,
--
Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370

From ozgur7 at gmail.com Tue Oct 23 20:24:49 2007
From: ozgur7 at gmail.com (Ozgur Ozturk)
Date: Tue, 23 Oct 2007 13:24:49 -0700
Subject: [Biojava-l] Problem Solved Re: problem with CookBook:Blast:Parser
Message-ID:

Hi,

Another piece of code in the demos (BioJava/biojava-1.5/demos/blastxml) could handle my XML file, so I guess the problem is solved. Thanks. (But if the BlastParser code from the cookbook is deprecated, you may want to update it.)

Best regards,
Ozgur (Oscar) Ozturk,
http://www.cse.ohio-state.edu/~ozturk/
Mobile Phone: (614) 805-4370

On 10/23/07, Ozgur Ozturk wrote:
>
> Hi,
> I am receiving the following error when I use BlastParser code from the
> cookbook :
>
> org.xml.sax.SAXException: Could not recognise the format of this file as
> one supported by the framework.
> at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(
> BlastLikeSAXParser.java:182)
> at org.arabidopsis.test.BlastParser.main(BlastParser.java:44)
>
> I have generated the xml file using this command:
> blast-2.2.17/bin/blastall -p blastp -d brAll -i tester -b 300 -m 7 >
> tempresult.xml
>
> Then pass it to BlastParser:
> BlastParser tempresult.xml
>
> Thanks for your help in advance,
> --
> Best regards,
> Ozgur (Oscar) Ozturk,
> http://www.cse.ohio-state.edu/~ozturk/
> Mobile Phone: (614) 805-4370

From holland at ebi.ac.uk Wed Oct 24 07:52:24 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 08:52:24 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To:
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
Message-ID: <471EF9B8.7020609@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks.

Your code is similar to the code we have in org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to see if it is identical, but it probably is.

You can call our code like this:

// import statement for biojava phylo stuff
import org.biojavax.bio.phylo.*;

// ...rest of code goes here

// call Kimura2P
String seq1 = ...; // Get seq1 and seq2 from somewhere
String seq2 = ...;
double result = MultipleHitCorrection.Kimura2P(seq1, seq2);

Note that our implementation expects sequence strings to be in upper case, so you'll need to make sure your data is upper case or has been converted to upper case before calling our method.

cheers,
Richard

vineith kaul wrote:
> This is what I have .....Thanks a lot fr the help.
> > > //Method to calculate the Kimura 2 parameter distance > public static double K2P(String sequence1,String sequence2){ > long p=0,q=0,numberOfAlignedSites=0; // P= transitional > differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) > > > char[] seq1array=sequence1.toCharArray(); > char[] seq2array=sequence2.toCharArray(); > > for(int i=0;i // Number of aligned sites > if(((seq1array[i]=='a') || > (seq1array[i]=='A')||(seq1array[i]=='g') || > (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || > (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || > (seq2array[i]=='A')||(seq2array[i]=='c') || > (seq2array[i]=='C')||(seq2array[i]=='t') || > (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { > > numberOfAlignedSites++; > } > > if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > p++; > } > else > if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > p++; > } > else > if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > p++; > } > else > if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > p++; > } > else > if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > q++; > } > else > if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > q++; > } > else > if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > q++; > } > else > if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > q++; > } > else > if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > q++; > } > else > if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > 
q++; > } > else > if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > q++; > } > else > if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > q++; > } > > > > > } > > double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - > (((double)q)/numberOfAlignedSites); > double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); > System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); > double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); > return dist; > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 10/22/07, *Richard Holland* > wrote: > > You should take a look at the latest 1.5 release, in the > org.biojavax.bio.phylo packages. This code is the beginnings of some > phylogenetics code that will perform tasks as you describe. The future > plan is to extend this code to cover a wider range of use cases. > Kimura2P > is already implemented here, in > org.biojavax.bio.phylo.MultipleHitCorrection. > > If you can't find code that will do what you want, but have written some > before, then please do feel free to contribute it. Even if it is > slow, I'm > sure someone out there will be able to help optimise it! > > cheers, > Richard > > On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > > Hi, > > > > Are there functions to calculate evolutionary pairwise distances like > > Kimura2P,Finkelstein etc in Biojava > > I did write smthng on my own but on large sequences it runs terribly > > slow and I am not even sure if thats right. > > -- > > Vineith Kaul > > Masters Student Bioinformatics > > The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
> > Georgia Tech, Atlanta
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
>
> --
> Richard Holland
> BioMart ( http://www.biomart.org/)
> EMBL-EBI
> Hinxton, Cambridgeshire CB10 1SD, UK
>
> --
> Vineith Kaul
> Masters Student Bioinformatics
> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> Georgia Tech, Atlanta
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
4iKvsyBj2uznhhjTF9EYDFE=
=LALE
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Wed Oct 24 08:09:13 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 09:09:13 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EF9B8.7020609@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk>
Message-ID: <471EFDA9.1090706@ebi.ac.uk>

Our code is very similar but not identical. The original programmer short-cut a lot of the else-if conditions by first checking whether the two bases are equal. It can then count the transitional changes & assume the rest are transversional.

In terms of speed, I can't see an obvious way to make either piece of code faster. In our code, replacing the 10 or so calls to String.charAt() with two calls & reusing those chars might help, but in all honesty I cannot say.

Andy

Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Thanks.
>
> Your code is similar to the code we have in
> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> see if it is identical, but it probably is.
> > You can call our code like this: > > // import statement for biojava phylo stuff > import org.biojavax.bio.phylo.*; > > // ...rest of code goes here > > // call Kimura2P > String seq1 = ...; // Get seq1 and seq2 from somewhere > String seq2 = ...; > double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > > Note that our implementation expects sequence strings to be in upper > case, so you'll need to make sure your data is upper case or has been > converted to upper case before calling our method. > > cheers, > Richard > > vineith kaul wrote: >> This is what I have .....Thanks a lot fr the help. >> >> >> //Method to calculate the Kimura 2 parameter distance >> public static double K2P(String sequence1,String sequence2){ >> long p=0,q=0,numberOfAlignedSites=0; // P= transitional >> differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) >> >> >> char[] seq1array=sequence1.toCharArray(); >> char[] seq2array=sequence2.toCharArray(); >> >> for(int i=0;i> // Number of aligned sites >> if(((seq1array[i]=='a') || >> (seq1array[i]=='A')||(seq1array[i]=='g') || >> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || >> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || >> (seq2array[i]=='A')||(seq2array[i]=='c') || >> (seq2array[i]=='C')||(seq2array[i]=='t') || >> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { >> >> numberOfAlignedSites++; >> } >> >> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >> p++; >> } >> else >> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >> p++; >> } >> else >> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >> p++; >> } >> else >> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >> p++; >> } >> else >> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && 
>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >> q++; >> } >> else >> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >> q++; >> } >> else >> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { >> q++; >> } >> else >> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && >> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { >> q++; >> } >> else >> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >> q++; >> } >> else >> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >> q++; >> } >> else >> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { >> q++; >> } >> else >> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && >> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { >> q++; >> } >> >> >> >> >> } >> >> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - >> (((double)q)/numberOfAlignedSites); >> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); >> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); >> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); >> return dist; >> } >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On 10/22/07, *Richard Holland* > > wrote: >> >> You should take a look at the latest 1.5 release, in the >> org.biojavax.bio.phylo packages. This code is the beginnings of some >> phylogenetics code that will perform tasks as you describe. The future >> plan is to extend this code to cover a wider range of use cases. >> Kimura2P >> is already implemented here, in >> org.biojavax.bio.phylo.MultipleHitCorrection. >> >> If you can't find code that will do what you want, but have written some >> before, then please do feel free to contribute it. 
Even if it is
>> slow, I'm
>> sure someone out there will be able to help optimise it!
>>
>> cheers,
>> Richard
>>
>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
>> > Hi,
>> >
>> > Are there functions to calculate evolutionary pairwise distances like
>> > Kimura2P,Finkelstein etc in Biojava
>> > I did write smthng on my own but on large sequences it runs terribly
>> > slow and I am not even sure if thats right.
>> > --
>> > Vineith Kaul
>> > Masters Student Bioinformatics
>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>> > Georgia Tech, Atlanta
>> > _______________________________________________
>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>> >
>>
>> --
>> Richard Holland
>> BioMart ( http://www.biomart.org/)
>> EMBL-EBI
>> Hinxton, Cambridgeshire CB10 1SD, UK
>>
>> --
>> Vineith Kaul
>> Masters Student Bioinformatics
>> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
>> Georgia Tech, Atlanta
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> 4iKvsyBj2uznhhjTF9EYDFE=
> =LALE
> -----END PGP SIGNATURE-----
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

From markjschreiber at gmail.com Wed Oct 24 11:59:04 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 19:59:04 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471EFDA9.1090706@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
Message-ID: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>

Hi -

From experience the best way to optimize java code is to run a
profiler. The one in NetBeans is quite good. The reason is that the HotSpot or JIT compilers might natively compile the part of the code that you think is slow and actually make it faster than something else, which then becomes the bottleneck.

Using a good profiler you can detect how much time is spent in each method and pinpoint some candidate methods for optimization. You can also see whether there is a burden due to the creation of lots of objects.

- Mark

On 10/24/07, Andy Yates wrote:
> Our code is very similar but not identical. The original programmer
> shortcutted a lot of else if conditions by considering if the two bases
> were equal or not. It can then calculate the transitional changes &
> assume the rest are transversional.
>
> In terms of speed of both pieces of code I can't see an obvious way to
> speed it up. Probably in our code removing the 10 or so calls to
> String.charAt() with a two calls & referencing those chars might help
> but in all honesty I cannot say.
>
> Andy
>
> Richard Holland wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Thanks.
> >
> > Your code is similar to the code we have in
> > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
> > see if it is identical, but it probably is.
> >
> > You can call our code like this:
> >
> > // import statement for biojava phylo stuff
> > import org.biojavax.bio.phylo.*;
> >
> > // ...rest of code goes here
> >
> > // call Kimura2P
> > String seq1 = ...; // Get seq1 and seq2 from somewhere
> > String seq2 = ...;
> > double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
> >
> > Note that our implementation expects sequence strings to be in upper
> > case, so you'll need to make sure your data is upper case or has been
> > converted to upper case before calling our method.
> >
> > cheers,
> > Richard
> >
> > vineith kaul wrote:
> >> This is what I have .....Thanks a lot fr the help.
> >>
> >> //Method to calculate the Kimura 2 parameter distance
> >> public static double K2P(String sequence1,String sequence2){
> >>     long p=0,q=0,numberOfAlignedSites=0; // P= transitional differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T)
> >>
> >>     char[] seq1array=sequence1.toCharArray();
> >>     char[] seq2array=sequence2.toCharArray();
> >>
> >>     for(int i=0;i<seq1array.length;i++){
> >>         // Number of aligned sites
> >>         if(((seq1array[i]=='a') || (seq1array[i]=='A')||(seq1array[i]=='g') ||
> >>             (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') ||
> >>             (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') ||
> >>             (seq2array[i]=='A')||(seq2array[i]=='c') ||
> >>             (seq2array[i]=='C')||(seq2array[i]=='t') ||
> >>             (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>             numberOfAlignedSites++;
> >>         }
> >>
> >>         if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>             ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>             p++;
> >>         }
> >>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>             ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>             p++;
> >>         }
> >>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>             ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>             p++;
> >>         }
> >>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>             ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>             p++;
> >>         }
> >>         else if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>             ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='a') || (seq1array[i]=='A')) &&
> >>             ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>             ((seq2array[i]=='c') || (seq2array[i]=='C'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='g') || (seq1array[i]=='G')) &&
> >>             ((seq2array[i]=='t') || (seq2array[i]=='T'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>             ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='t') || (seq1array[i]=='T')) &&
> >>             ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>             ((seq2array[i]=='a') || (seq2array[i]=='A'))) {
> >>             q++;
> >>         }
> >>         else if(((seq1array[i]=='c') || (seq1array[i]=='C')) &&
> >>             ((seq2array[i]=='g') || (seq2array[i]=='G'))) {
> >>             q++;
> >>         }
> >>     }
> >>
> >>     double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) -
> >>         (((double)q)/numberOfAlignedSites);
> >>     double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites);
> >>     System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t");
> >>     double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q));
> >>     return dist;
> >> }
> >>
> >> On 10/22/07, *Richard Holland* wrote:
> >>
> >>     You should take a look at the latest 1.5 release, in the
> >>     org.biojavax.bio.phylo packages. This code is the beginnings of some
> >>     phylogenetics code that will perform tasks as you describe. The future
> >>     plan is to extend this code to cover a wider range of use cases.
> >>     Kimura2P is already implemented here, in
> >>     org.biojavax.bio.phylo.MultipleHitCorrection.
> >>
> >>     If you can't find code that will do what you want, but have written some
> >>     before, then please do feel free to contribute it. Even if it is
> >>     slow, I'm sure someone out there will be able to help optimise it!
> >>
> >>     cheers,
> >>     Richard
> >>
> >>     On Sun, October 21, 2007 5:30 pm, vineith kaul wrote:
> >>     > Hi,
> >>     >
> >>     > Are there functions to calculate evolutionary pairwise distances like
> >>     > Kimura2P,Finkelstein etc in Biojava
> >>     > I did write smthng on my own but on large sequences it runs terribly
> >>     > slow and I am not even sure if thats right.
> >>     > --
> >>     > Vineith Kaul
> >>     > Masters Student Bioinformatics
> >>     > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >>     > Georgia Tech, Atlanta
> >>     > _______________________________________________
> >>     > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>     > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>     >
> >>
> >>     --
> >>     Richard Holland
> >>     BioMart ( http://www.biomart.org/)
> >>     EMBL-EBI
> >>     Hinxton, Cambridgeshire CB10 1SD, UK
> >>
> >> --
> >> Vineith Kaul
> >> Masters Student Bioinformatics
> >> The Parker H. Petit Institute for Bioengineering and Bioscience (IBB)
> >> Georgia Tech, Atlanta
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.2.2 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> >
> > iD8DBQFHHvm34C5LeMEKA/QRAlc3AJ9GAMML/z5+BBl12PA2a/Zyz/CHDQCdFWKa
> > 4iKvsyBj2uznhhjTF9EYDFE=
> > =LALE
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk Wed Oct 24 12:28:21 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 13:28:21 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
 <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
 <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
Message-ID: <471F3A65.50202@ebi.ac.uk>

Yes, a very good point & one I was going to make beforehand but forgot :)

Also, not to mention that micro-benchmarks/profiling in Java are
notorious for giving false results due to VM warmup & JIT compilation
optimisations. There is a framework hosted on Java.net somewhere which
can perform VM warmups and code iterations to produce more accurate
benchmarking results, but the name escapes me at the moment.

However, looking at this particular code I get the feeling that this is
about as fast as it's going to get without someone doing bitwise XOR
operations or some C code ... that's not an open invitation for people
to start recoding this in C :). At the end of the day the key to
optimisation is to ask the question "is it fast enough already?". If it
is then there's no point :)

Andy

Mark Schreiber wrote:
> Hi -
>
> From experience the best way to optimize java code is to run a
> profiler. The one in Netbeans is quite good.
>
> The reason is that the hotspot or JIT compilers might natively compile
> the part of the code that you think is slow and actually make it
> faster than something else which becomes the bottle neck. Using a good
> profiler you can detect how much time is spent in each method and pin
> point some candidate methods for optimization. You can also see if
> there is a burden due to creation of lots of objects.
>
> - Mark

From markjschreiber at gmail.com Wed Oct 24 13:19:25 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:19:25 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F3A65.50202@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
 <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
 <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
 <471F3A65.50202@ebi.ac.uk>
Message-ID: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>

Another important consideration after optimization is whether the task
can be multithreaded. Almost all modern computers have at least 2 cores,
so if the algorithm can be parallelized you will get some performance
bonus on most machines.

Modern JVMs will automagically try to use idle CPUs to execute new
threads spawned by the programmer.

- Mark

On 10/24/07, Andy Yates wrote:
> Yes a very good point & one I was going to make before hand but forgot :)
>
> Also not to mention that micro-benchmarks/profiling in Java are
> notorious for giving false results due to VM warmup & JIT compilation
> optimisations. There is a framework hosted on Java.net somewhere which
> can perform VM warmups and code iterations to produce more accurate
> benchmarking results; but the name escapes me at the moment.
>
> However looking at this particular code I get the feeling that this is
> about as fast as its going to get without someone doing bitwise XOR
> operations or some C code ... that's not an open invitation for people
> to start recoding this in C :). At the end of the day the key to
> optimisation is to ask the question "is it fast enough already?". If it
> is then there's no point :)
>
> Andy

From holland at ebi.ac.uk Wed Oct 24 13:33:53 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:33:53 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
 <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
 <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
 <471F3A65.50202@ebi.ac.uk>
 <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
Message-ID: <471F49C1.9070901@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This particular code could easily be parallelised - given N threads, you
can simply divide the input into N chunks and get each thread to process
1/Nth of the input. You then combine the output of each thread to do the
final calculation.

But, it'd be bad practice to always fork a predetermined N threads for a
given task. It'd be much better to somehow be able to ask 'how parallel
can I make this?' at runtime by checking system resources, or maybe get
the parallel-savvy user to set an optional BioJava-wide parallelisation
hint. N could then be determined and the task divided appropriately.
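
A minimal sketch of the chunking idea above, offered as an illustration
only and not as BioJava code. The class ParallelK2P and its methods are
hypothetical names. It assumes the two sequences are already aligned,
skips any site that is not A/C/G/T in both sequences, and takes N from
Runtime.getRuntime().availableProcessors(), which is the standard JDK call
for the number of processors available to the JVM. Each task counts one
chunk, and the counts are summed before the final K2P calculation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of BioJava.
final class ParallelK2P {

    // Returns {transitions, transversions, alignedSites} for positions [from, to).
    static long[] countChunk(String s1, String s2, int from, int to) {
        long p = 0, q = 0, sites = 0;
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            // Assumption: gaps and ambiguity codes are not counted as aligned sites.
            if ("ACGT".indexOf(a) < 0 || "ACGT".indexOf(b) < 0) continue;
            sites++;
            if (a == b) continue;
            boolean aPurine = (a == 'A' || a == 'G');
            boolean bPurine = (b == 'A' || b == 'G');
            // Purine<->purine or pyrimidine<->pyrimidine is a transition.
            if (aPurine == bPurine) p++; else q++;
        }
        return new long[] {p, q, sites};
    }

    // Splits the input into n chunks, counts each chunk in its own task,
    // then combines the counts. Returns NaN if no site is comparable.
    static double distance(final String s1, final String s2, int threads) throws Exception {
        final int len = Math.min(s1.length(), s2.length());
        final int n = Math.max(1, Math.min(threads, len));
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            List<Future<long[]>> parts = new ArrayList<Future<long[]>>();
            for (int t = 0; t < n; t++) {
                final int from = (int) ((long) len * t / n);
                final int to = (int) ((long) len * (t + 1) / n);
                parts.add(pool.submit(new Callable<long[]>() {
                    public long[] call() {
                        return countChunk(s1, s2, from, to);
                    }
                }));
            }
            long p = 0, q = 0, sites = 0;
            for (Future<long[]> f : parts) {
                long[] c = f.get();
                p += c[0];
                q += c[1];
                sites += c[2];
            }
            double P = (double) p / sites; // proportion of transitions
            double Q = (double) q / sites; // proportion of transversions
            return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(distance("ACGTACGTAC", "ACGTACGTAT", cores));
    }
}
```

For short sequences the cost of starting threads will likely outweigh any
gain, so a caller-supplied thread count of 1 is a reasonable default.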

cheers,
Richard

Mark Schreiber wrote:
> Another important consideration after optimization is can the task be
> multithreaded? Almost all modern computers have at least 2 cores. So
> if the algorithm can be parallelized you will get some performance
> bonus on most machines.
>
> Modern JVM's will automagically try to use idle CPU's to execute new
> threads spawned by the programmer.
>
> - Mark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
IEyRleSs1+AziCvfhcES8wI=
=uLDm
-----END PGP SIGNATURE-----

From markjschreiber at gmail.com Wed Oct 24 13:41:16 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:41:16 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk>
 <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk>
 <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com>
 <471F3A65.50202@ebi.ac.uk>
 <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com>
 <471F49C1.9070901@ebi.ac.uk>
Message-ID: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>

I'm not aware of a way to determine the number of CPUs within a
program, although possibly it is one of the environment variables
available from System. Even if it can't be determined there could be a
method argument to specify the number of threads to spawn.

- Mark

On 10/24/07, Richard Holland wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
> > Another important consideration after optimization is can the task be
> > multithreaded? Almost all modern computers have at least 2 cores. So
> > if the algorithm can be parallelized you will get some performance
> > bonus on most machines.
> >
> > Modern JVM's will automagically try to use idle CPU's to execute new
> > threads spawned by the programmer.
> >
> > - Mark
> >
> > On 10/24/07, Andy Yates wrote:
> >> Yes a very good point & one I was going to make before hand but forgot :)
> >>
> >> Also not to mention that micro-benchmarks/profiling in Java are
> >> notorious for giving false results due to VM warmup & JIT compilation
> >> optimisations. There is a framework hosted on Java.net somewhere which
> >> can perform VM warmups and code iterations to produce more accurate
> >> benchmarking results; but the name escapes me at the moment.
> >>
> >> However looking at this particular code I get the feeling that this is
> >> about as fast as its going to get without someone doing bitwise XOR
> >> operations or some C code ... that's not an open invitation for people
> >> to start recoding this in C :). At the end of the day the key to
> >> optimisation is to ask the question "is it fast enough already?". If it
> >> is then there's no point :)
> >>
> >> Andy
> >>
> >> Mark Schreiber wrote:
> >>> Hi -
> >>>
> >>> From experience the best way to optimize java code is to run a
> >>> profiler. The one in Netbeans is quite good.
> >>>
> >>> The reason is that the hotspot or JIT compilers might natively compile
> >>> the part of the code that you think is slow and actually make it
> >>> faster than something else which becomes the bottle neck.
Using a good > >>> profiler you can detect how much time is spent in each method and pin > >>> point some candidate methods for optimization. You can also see if > >>> there is a burden due to creation of lots of objects. > >>> > >>> - Mark > >>> > >>> On 10/24/07, Andy Yates wrote: > >>>> Our code is very similar but not identical. The original programmer > >>>> shortcutted a lot of else if conditions by considering if the two bases > >>>> were equal or not. It can then calculate the transitional changes & > >>>> assume the rest are transversional. > >>>> > >>>> In terms of speed of both pieces of code I can't see an obvious way to > >>>> speed it up. Probably in our code removing the 10 or so calls to > >>>> String.charAt() with a two calls & referencing those chars might help > >>>> but in all honesty I cannot say. > >>>> > >>>> Andy > >>>> > >>>> Richard Holland wrote: > > Thanks. > > > > Your code is similar to the code we have in > > org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to > > see if it is identical, but it probably is. > > > > You can call our code like this: > > > > // import statement for biojava phylo stuff > > import org.biojavax.bio.phylo.*; > > > > // ...rest of code goes here > > > > // call Kimura2P > > String seq1 = ...; // Get seq1 and seq2 from somewhere > > String seq2 = ...; > > double result = MultipleHitCorrection.Kimura2P(seq1, seq2); > > > > Note that our implementation expects sequence strings to be in upper > > case, so you'll need to make sure your data is upper case or has been > > converted to upper case before calling our method. > > > > cheers, > > Richard > > > > vineith kaul wrote: > >>>>>>> This is what I have .....Thanks a lot fr the help. 
> >>>>>>> > >>>>>>> > >>>>>>> //Method to calculate the Kimura 2 parameter distance > >>>>>>> public static double K2P(String sequence1,String sequence2){ > >>>>>>> long p=0,q=0,numberOfAlignedSites=0; // P= transitional > >>>>>>> // differences (A<->G & T<->C) ; Q= transversional differences (A/G<-->C/T) > >>>>>>> > >>>>>>> > >>>>>>> char[] seq1array=sequence1.toCharArray(); > >>>>>>> char[] seq2array=sequence2.toCharArray(); > >>>>>>> > >>>>>>> for(int i=0;i<seq1array.length;i++){ > >>>>>>> // Number of aligned sites > >>>>>>> if(((seq1array[i]=='a') || > >>>>>>> (seq1array[i]=='A')||(seq1array[i]=='g') || > >>>>>>> (seq1array[i]=='G')||(seq1array[i]=='c') || (seq1array[i]=='C') || > >>>>>>> (seq1array[i]=='t') || (seq1array[i]=='T')) && ((seq2array[i]=='a') || > >>>>>>> (seq2array[i]=='A')||(seq2array[i]=='c') || > >>>>>>> (seq2array[i]=='C')||(seq2array[i]=='t') || > >>>>>>> (seq2array[i]=='T')||(seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> > >>>>>>> numberOfAlignedSites++; > >>>>>>> } > >>>>>>> > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> p++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='a') || (seq1array[i]=='A')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g')
|| (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='c') || (seq2array[i]=='C'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='g') || (seq1array[i]=='G')) && > >>>>>>> ((seq2array[i]=='t') || (seq2array[i]=='T'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='t') || (seq1array[i]=='T')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='a') || (seq2array[i]=='A'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> else > >>>>>>> if(((seq1array[i]=='c') || (seq1array[i]=='C')) && > >>>>>>> ((seq2array[i]=='g') || (seq2array[i]=='G'))) { > >>>>>>> q++; > >>>>>>> } > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> double P = 1.0 - (2.0 * ((double)p)/numberOfAlignedSites) - > >>>>>>> (((double)q)/numberOfAlignedSites); > >>>>>>> double Q = 1.0 - (2.0 * ((double)q)/numberOfAlignedSites); > >>>>>>> System.out.print(numberOfAlignedSites+"\t"+p+"\t"+q+"\t"); > >>>>>>> double dist = (-0.5 * Math.log(P)) - ( 0.25 * Math.log(Q)); > >>>>>>> return dist; > >>>>>>> } > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On 10/22/07, *Richard Holland* >>>>>>> > wrote: > >>>>>>> > >>>>>>> You should take a look at the latest 1.5 release, in the > >>>>>>> 
org.biojavax.bio.phylo packages. This code is the beginnings of some > >>>>>>> phylogenetics code that will perform tasks as you describe. The future > >>>>>>> plan is to extend this code to cover a wider range of use cases. > >>>>>>> Kimura2P > >>>>>>> is already implemented here, in > >>>>>>> org.biojavax.bio.phylo.MultipleHitCorrection. > >>>>>>> > >>>>>>> If you can't find code that will do what you want, but have written some > >>>>>>> before, then please do feel free to contribute it. Even if it is > >>>>>>> slow, I'm > >>>>>>> sure someone out there will be able to help optimise it! > >>>>>>> > >>>>>>> cheers, > >>>>>>> Richard > >>>>>>> > >>>>>>> On Sun, October 21, 2007 5:30 pm, vineith kaul wrote: > >>>>>>> > Hi, > >>>>>>> > > >>>>>>> > Are there functions to calculate evolutionary pairwise distances like > >>>>>>> > Kimura2P,Finkelstein etc in Biojava > >>>>>>> > I did write smthng on my own but on large sequences it runs terribly > >>>>>>> > slow and I am not even sure if thats right. > >>>>>>> > -- > >>>>>>> > Vineith Kaul > >>>>>>> > Masters Student Bioinformatics > >>>>>>> > The Parker H. Petit Institute for Bioengineering and Bioscience (IBB) > >>>>>>> > Georgia Tech, Atlanta > >>>>>>> > _______________________________________________ > >>>>>>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>>> > >>>>>>> > http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>>> > > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Richard Holland > >>>>>>> BioMart ( http://www.biomart.org/) > >>>>>>> EMBL-EBI > >>>>>>> Hinxton, Cambridgeshire CB10 1SD, UK > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Vineith Kaul > >>>>>>> Masters Student Bioinformatics > >>>>>>> The Parker H. 
Petit Institute for Bioengineering and Bioscience (IBB)
> >>>>>>> Georgia Tech, Atlanta
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>> _______________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHH0nB4C5LeMEKA/QRAjW9AJwPcaHByaQhAQRPLJOZt4gQRF0TbgCeIa6P
> IEyRleSs1+AziCvfhcES8wI=
> =uLDm
> -----END PGP SIGNATURE-----
>

From markjschreiber at gmail.com Wed Oct 24 13:48:00 2007
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 24 Oct 2007 21:48:00 +0800
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>

It appears it is as simple as:

Runtime.getRuntime().availableProcessors();

- Mark

On 10/24/07, Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program, although possibly it is one of the environment variables
> available from System.
>
> Even if it can't be determined, there could be a method argument to
> specify the number of threads to spawn.
From ayates at ebi.ac.uk Wed Oct 24 13:49:22 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:22 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F49C1.9070901@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk>
Message-ID: <471F4D62.3030900@ebi.ac.uk>

Of course, parallelisation only helps if the task isn't limited by
something else such as memory, IO or a database (which this one wouldn't
be). There's also the scenario where thread startup takes longer than
running the code in serial :). Not to mention that Java concurrency isn't
an easy thing to write correctly.

I'd prefer the model promoted in Java 5, where you have pools of threads &
pass in instances of Callable (a successor to Runnable whose call() can
return a result or throw an exception; the pool hands you back a Future
through which you collect either). You then pass in a list of these
callables, wait for them all to finish & grab the results.
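A minimal sketch of the pooled-Callable model just described, applied to the Kimura 2-parameter counting discussed in this thread. The class name ParallelK2P and its methods are hypothetical and are not part of BioJava. It assumes two pre-aligned sequences of equal length, treats any character other than A, C, G or T (in either case) as an unaligned site, and uses Executors.newFixedThreadPool from java.util.concurrent to build the pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not a BioJava class): K2P distance with the site counting split across a thread pool. */
final class ParallelK2P {

    private static boolean isBase(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    private static boolean isPurine(char c) {
        return c == 'A' || c == 'G';
    }

    /** Returns {aligned sites, transitions, transversions} for positions from (inclusive) to to (exclusive). */
    static long[] countRange(String s1, String s2, int from, int to) {
        long sites = 0, p = 0, q = 0;
        for (int i = from; i < to; i++) {
            char a = Character.toUpperCase(s1.charAt(i));
            char b = Character.toUpperCase(s2.charAt(i));
            if (!isBase(a) || !isBase(b)) {
                continue; // gaps and ambiguity codes are not counted as aligned sites
            }
            sites++;
            if (a == b) {
                continue; // identical bases contribute to neither count
            }
            if (isPurine(a) == isPurine(b)) {
                p++; // transition: purine<->purine or pyrimidine<->pyrimidine
            } else {
                q++; // transversion: purine<->pyrimidine
            }
        }
        return new long[] {sites, p, q};
    }

    /** Splits the alignment into roughly equal chunks, counts each chunk in the pool, then combines the counts. */
    static double distance(String s1, String s2, int threads) throws Exception {
        final int n = Math.min(s1.length(), s2.length());
        final int chunk = Math.max(1, (n + threads - 1) / threads);

        List<Callable<long[]>> tasks = new ArrayList<>();
        for (int start = 0; start < n; start += chunk) {
            final int from = start;
            final int to = Math.min(n, start + chunk);
            tasks.add(() -> countRange(s1, s2, from, to));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long sites = 0, p = 0, q = 0;
        try {
            // invokeAll blocks until every task has completed
            for (Future<long[]> f : pool.invokeAll(tasks)) {
                long[] c = f.get();
                sites += c[0];
                p += c[1];
                q += c[2];
            }
        } finally {
            pool.shutdown();
        }

        double P = (double) p / sites; // proportion of transitions
        double Q = (double) q / sites; // proportion of transversions
        return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
    }

    public static void main(String[] args) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(distance("ACGTACGTAC", "ACGTACGTAT", threads));
    }
}
```

Because each task only returns three integer counts, combining the per-chunk results is a plain sum, so the answer does not depend on how many threads are used. For short sequences the pool start-up cost will likely outweigh any gain, which is the caveat raised above.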
You can have as many callables as you like & the thread pool will process
them as & when a thread becomes free. Combine this with looking at the
reported number of processors/cores on the machine & say that's the
default size of the pool (assuming you're making it parallel because
you're flat-lining a processor). Say:

  int processorCount = Runtime.getRuntime().availableProcessors();
  ExecutorService pool = Executors.newFixedThreadPool(processorCount);

This code might be wrong (well, the creating-the-thread-pool bit) but you
get the idea :).

Of course someone may not want to parallelise a job (I quite like having
dual cores, as a runaway process can take out one but I can still run top
& kill the thing).

Andy

Richard Holland wrote:
> This particular code could easily be parallelised - given N threads, you
> can simply divide the input into N chunks and get each thread to process
> 1/Nth of the input. You then combine the output of each thread to do the
> final calculation.
>
> But, it'd be bad practice to always fork a predetermined N threads for a
> given task. It'd be much better to somehow be able to ask 'how parallel
> can I make this?' at runtime by checking system resources, or maybe get
> the parallel-savvy user to set an optional BioJava-wide parallelisation
> hint. N could then be determined and the task divided appropriately.
>
> cheers,
> Richard
Petit Institute for Bioengineering and Bioscience (IBB)
>>>>>>>> Georgia Tech, Atlanta

From ayates at ebi.ac.uk Wed Oct 24 13:49:38 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:49:38 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com> <93b45ca50710240648w30625ccu85ffe0a972bc2bf2@mail.gmail.com>
Message-ID: <471F4D72.80505@ebi.ac.uk>

Beat me to it :)

Andy

Mark Schreiber wrote:
> It appears it is as simple as:
>
> Runtime.getRuntime().availableProcessors();
>
> - Mark
>
> On 10/24/07, Mark Schreiber wrote:
>> I'm not aware of a way to determine the number of CPUs within a
>> program, although possibly it is one of the environment variables
>> available from System.
>>
>> Even if it can't be determined there could be a method argument to
>> specify the number of threads to spawn.
>>
>> - Mark
>>
>> On 10/24/07, Richard Holland wrote:
>>> This particular code could easily be parallelised - given N threads, you
>>> can simply divide the input into N chunks and get each thread to process
>>> 1/Nth of the input. You then combine the output of each thread to do the
>>> final calculation.
>>>
>>> But, it'd be bad practice to always fork a predetermined N threads for a
>>> given task. It'd be much better to somehow be able to ask 'how parallel
>>> can I make this?' at runtime by checking system resources, or maybe get
>>> the parallel-savvy user to set an optional BioJava-wide parallelisation
>>> hint. N could then be determined and the task divided appropriately.
>>>
>>> cheers,
>>> Richard
>>>
>>> Mark Schreiber wrote:
>>>> Another important consideration after optimization is whether the task
>>>> can be multithreaded. Almost all modern computers have at least 2 cores.
>>>> So if the algorithm can be parallelized you will get some performance
>>>> bonus on most machines.
>>>>
>>>> Modern JVMs will automagically try to use idle CPUs to execute new
>>>> threads spawned by the programmer.
>>>>
>>>> - Mark
>>>>
>>>> On 10/24/07, Andy Yates wrote:
>>>>> Yes, a very good point & one I was going to make beforehand but forgot :)
>>>>>
>>>>> Also, not to mention that micro-benchmarks/profiling in Java are
>>>>> notorious for giving false results due to VM warmup & JIT compilation
>>>>> optimisations. There is a framework hosted on Java.net somewhere which
>>>>> can perform VM warmups and code iterations to produce more accurate
>>>>> benchmarking results; but the name escapes me at the moment.
>>>>>
>>>>> However, looking at this particular code I get the feeling that this is
>>>>> about as fast as it's going to get without someone doing bitwise XOR
>>>>> operations or some C code ... that's not an open invitation for people
>>>>> to start recoding this in C :). At the end of the day the key to
>>>>> optimisation is to ask the question "is it fast enough already?". If it
>>>>> is then there's no point :)
>>>>>
>>>>> Andy
>>>>>
>>>>> Mark Schreiber wrote:
>>>>>> Hi -
>>>>>>
>>>>>> From experience the best way to optimize Java code is to run a
>>>>>> profiler. The one in NetBeans is quite good.
>>>>>>
>>>>>> The reason is that the HotSpot or JIT compilers might natively compile
>>>>>> the part of the code that you think is slow and actually make it
>>>>>> faster than something else which becomes the bottleneck. Using a good
>>>>>> profiler you can detect how much time is spent in each method and
>>>>>> pinpoint some candidate methods for optimization. You can also see if
>>>>>> there is a burden due to creation of lots of objects.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On 10/24/07, Andy Yates wrote:
>>>>>>> Our code is very similar but not identical. The original programmer
>>>>>>> shortcutted a lot of else if conditions by considering if the two bases
>>>>>>> were equal or not. It can then calculate the transitional changes &
>>>>>>> assume the rest are transversional.
>>>>>>>
>>>>>>> In terms of speed of both pieces of code I can't see an obvious way to
>>>>>>> speed it up. Probably in our code replacing the 10 or so calls to
>>>>>>> String.charAt() with two calls & referencing those chars might help
>>>>>>> but in all honesty I cannot say.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> Richard Holland wrote:
>>>> Thanks.
>>>>
>>>> Your code is similar to the code we have in
>>>> org.biojavax.bio.phylo.MultipleHitCorrection. I haven't checked it to
>>>> see if it is identical, but it probably is.
>>>>
>>>> You can call our code like this:
>>>>
>>>> // import statement for biojava phylo stuff
>>>> import org.biojavax.bio.phylo.*;
>>>>
>>>> // ...rest of code goes here
>>>>
>>>> // call Kimura2P
>>>> String seq1 = ...; // Get seq1 and seq2 from somewhere
>>>> String seq2 = ...;
>>>> double result = MultipleHitCorrection.Kimura2P(seq1, seq2);
>>>>
>>>> Note that our implementation expects sequence strings to be in upper
>>>> case, so you'll need to make sure your data is upper case or has been
>>>> converted to upper case before calling our method.
>>>>
>>>> cheers,
>>>> Richard

From holland at ebi.ac.uk Wed Oct 24 13:53:29 2007
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 24 Oct 2007 14:53:29 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk>
<93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk> <93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com>
Message-ID: <471F4E59.1040703@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Schreiber wrote:
> I'm not aware of a way to determine the number of CPUs within a
> program, although possibly it is one of the environment variables
> available from System.

Yup, I'm not aware of one either. Actually, thinking about this, it'd be
a bad thing if BioJava grabbed both CPUs just because they're currently
available - the user might want it to only run on one, with something
else running on the second one. So attempting to guess a good
parallelisation value from the system is probably not good!

> Even if it can't be determined there could be a method argument to
> specify the number of threads to spawn.

I was thinking more along the lines of a global static method in some
kind of toolkit class, so that any part of BJ which is
parallelisation-aware can take advantage of it if it is set. This also
avoids passing parameters that don't have an immediately obvious impact
on the expected output of the method.

I'd also like to have this global variable control the total number of
threads, so that if the user forks a set of threads themselves and runs a
parallel-aware method in each of them, then BJ will not attempt to
sub-divide each thread into more threads than the limit configured by
this variable. Likewise, if the user changes the limit whilst threads are
currently running, they should stop (if there are too many) or new ones
should start (if there are too few), but taking care to make sure that
every parallelisation request maintains at least one thread so the job
doesn't stop entirely.... there must be a toolkit for this somewhere
surely?
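A minimal sketch of the "global static" idea described above, assuming a hypothetical toolkit class called ParallelHint (the name and methods are made up for illustration; nothing like it is claimed to exist in BioJava). It defaults to one thread, so nothing runs in parallel unless the user asks for it, and threadsFor() never returns less than one so a job cannot stall:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical library-wide parallelisation hint; not part of BioJava. */
final class ParallelHint {
    // Default of 1: library code stays single-threaded unless the user opts in.
    private static final AtomicInteger MAX_THREADS = new AtomicInteger(1);

    private ParallelHint() {}

    /** Sets the library-wide thread limit; values below 1 are rejected. */
    static void setMaxThreads(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one thread");
        }
        MAX_THREADS.set(n);
    }

    /** Returns the currently configured limit. */
    static int getMaxThreads() {
        return MAX_THREADS.get();
    }

    /**
     * Threads a task may use: never more than the limit or the number of
     * work units, and never fewer than one.
     */
    static int threadsFor(int workUnits) {
        return Math.max(1, Math.min(MAX_THREADS.get(), workUnits));
    }
}
```

This only covers the static limit. Stopping or starting threads that are already running when the limit changes is not handled here.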
cheers,
Richard
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHH05Y4C5LeMEKA/QRAouqAJ9TgDACIQLPeenSZcStDhkZQg/UuQCfc7sZ
cocyjnf9/T8H3uQJ+rW5m2U=
=Q6UR
-----END PGP SIGNATURE-----

From ayates at ebi.ac.uk Wed Oct 24 13:58:01 2007
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 24 Oct 2007 14:58:01 +0100
Subject: [Biojava-l] Evolutionary distances
In-Reply-To: <471F4E59.1040703@ebi.ac.uk>
References: <48206.80.42.116.113.1193054817.squirrel@webmail.ebi.ac.uk> <471EF9B8.7020609@ebi.ac.uk> <471EFDA9.1090706@ebi.ac.uk> <93b45ca50710240459q52977d67xbdbab62eb37c3da1@mail.gmail.com> <471F3A65.50202@ebi.ac.uk> <93b45ca50710240619n43ecf163hb514fac9faa4bda4@mail.gmail.com> <471F49C1.9070901@ebi.ac.uk>
<93b45ca50710240641i1923a313i558c14406bf18873@mail.gmail.com> <471F4E59.1040703@ebi.ac.uk>
Message-ID: <471F4F69.3010806@ebi.ac.uk>

The executor thread pool system is the best way to control this. The
thread pool can be set up once and shared; all clients of the code then
wait for their jobs/futures to complete.

Richard Holland wrote:
> I was thinking more along the lines of a global static method in some
> kind of toolkit class, so that any part of BJ which is
> parallelisation-aware can take advantage of it if it is set. This also
> avoids passing parameters that don't have an immediately obvious impact
> on the expected output of the method. I'd also like to have this global
> variable control the total number of threads, so that if the user forks
> a set of threads themselves and runs a parallel-aware method in each of
> them, then BJ will not attempt to sub-divide each thread into more
> threads than the limit configured by this variable. Likewise if the user
> changes the limit whilst threads are currently running, they should stop
> (if there are too many) or new ones should start (if there are too few),
> but taking care to make sure that every parallelisation request
> maintains at least one thread so the job doesn't stop entirely.... there
> must be a toolkit for this somewhere surely?
>
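As a rough illustration of the executor approach, assuming Java 8 or later: the sketch below sets up one shared pool, splits the two aligned strings into chunks, sums the per-chunk counts, and then applies the same formula as the K2P code quoted earlier in the thread. The class name ParallelK2P, the pool size and the daemon-thread choice are assumptions made for the example, not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example class; not part of BioJava. */
final class ParallelK2P {
    // One pool, created once and shared by every caller. Daemon threads
    // so that an idle pool does not keep the JVM alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors(),
            r -> { Thread t = new Thread(r); t.setDaemon(true); return t; });

    private ParallelK2P() {}

    // A=0, G=1 (purines); C=2, T=3 (pyrimidines); -1 for anything else.
    private static int code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'G': return 1;
            case 'C': return 2;
            case 'T': return 3;
            default:  return -1;
        }
    }

    /** Returns {aligned sites, transitions, transversions} for [from, to). */
    static long[] count(String a, String b, int from, int to) {
        long n = 0, p = 0, q = 0;
        for (int i = from; i < to; i++) {
            int x = code(a.charAt(i));
            int y = code(b.charAt(i));
            if (x < 0 || y < 0) continue;   // gap or ambiguity: not an aligned site
            n++;
            if (x == y) continue;           // identical bases
            if ((x & 2) == (y & 2)) p++;    // same chemical class: transition
            else q++;                       // different class: transversion
        }
        return new long[] {n, p, q};
    }

    /** Kimura 2-parameter distance, counted in the given number of chunks. */
    static double distance(String a, String b, int chunks) {
        int len = Math.min(a.length(), b.length());
        int parts = Math.max(1, Math.min(chunks, Math.max(1, len)));
        List<Future<long[]>> futures = new ArrayList<>();
        for (int k = 0; k < parts; k++) {
            final int from = (int) ((long) len * k / parts);
            final int to = (int) ((long) len * (k + 1) / parts);
            futures.add(POOL.submit(() -> count(a, b, from, to)));
        }
        long n = 0, p = 0, q = 0;
        for (Future<long[]> f : futures) {
            try {
                long[] c = f.get();         // wait for this chunk to finish
                n += c[0]; p += c[1]; q += c[2];
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while counting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("counting task failed", e);
            }
        }
        double P = (double) p / n;          // proportion of transitions
        double Q = (double) q / n;          // proportion of transversions
        return -0.5 * Math.log(1.0 - 2.0 * P - Q) - 0.25 * Math.log(1.0 - 2.0 * Q);
    }
}
```

Calling distance() from inside a task that is itself running on this fixed pool could leave every worker waiting on futures, so nested use would need a separate pool or a different design.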