From asandro1501 at gmail.com Fri Oct 1 12:52:50 2010
From: asandro1501 at gmail.com (Alex Silva)
Date: Fri, 1 Oct 2010 13:52:50 -0300
Subject: [Biojava-l] Help files genbank
Message-ID:

Hi

I am asking again for help reading files in GenBank format; I need to
analyze the headers. I have not been able to use any so far because I am a
beginner in Java. Does anyone have some code that they have used for this?

-- 
Alex Silva
G.R.A. Sistemas Corporativos
msn: gra.sistemas at hotmail.com
55-9165-7378

From holland at eaglegenomics.com Fri Oct 1 12:56:09 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 1 Oct 2010 17:56:09 +0100
Subject: [Biojava-l] Help files genbank
In-Reply-To:
References:
Message-ID:

This is a good starting point: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_and_writing_files.

On 1 Oct 2010, at 17:52, Alex Silva wrote:

> Hi
>
> I am asking again for help reading files in GenBank format; I need to
> analyze the headers. I have not been able to use any so far because I am a
> beginner in Java. Does anyone have some code that they have used for this?
>
> --
> Alex Silva
> G.R.A. Sistemas Corporativos
> msn: gra.sistemas at hotmail.com
> 55-9165-7378
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From pjotr.public23 at thebird.nl Sat Oct 2 05:15:06 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Sat, 2 Oct 2010 11:15:06 +0200
Subject: [Biojava-l] BioJava <-> R
Message-ID: <20101002091506.GA17702@thebird.nl>

Does anyone here have real experience using the JRI? Who would be
interested in, and has some exposure to, invoking R from Java through a
native interface in bioinformatics?

Pj.

From hlapp at drycafe.net Sat Oct 2 21:26:49 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 2 Oct 2010 21:26:49 -0400
Subject: [Biojava-l] BioJava <-> R
In-Reply-To: <20101002091506.GA17702@thebird.nl>
References: <20101002091506.GA17702@thebird.nl>
Message-ID: <74DF3E4D-FC22-4719-9E6B-08248B14D4AA@drycafe.net>

We use this in the Mesquite<->R bridge. I haven't worked much on the Java
to R side, but it seems to work well.

http://mesquiteproject.org/packages/Mesquite.R/

-hilmar

On Oct 2, 2010, at 5:15 AM, Pjotr Prins wrote:

> Does anyone here have real experience using the JRI? Who would be
> interested in, and has some exposure to, invoking R from Java through a
> native interface in bioinformatics?
>
> Pj.

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From andrew.mcsweeny at rockets.utoledo.edu Tue Oct 12 17:41:07 2010
From: andrew.mcsweeny at rockets.utoledo.edu (McSweeny, Andrew J)
Date: Tue, 12 Oct 2010 21:41:07 +0000
Subject: [Biojava-l] How to share code while protecting copyrights?
Message-ID: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>

Hi,

I am working on a project which simulates sexual reproduction in a
population of digital organisms. Their genome is just a contig from hg18.
It's pretty interesting and I can talk more about it in the future....

Anyway, how can I share my code for this project without having to worry
that someone else will use it to publish a paper before my group does?

I'm certain nobody in the open source community would do that, but how do I
convince my group that opening our project to BioJava is a good idea?

-Andrew

From andreas at sdsc.edu Wed Oct 13 02:02:34 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 12 Oct 2010 23:02:34 -0700
Subject: [Biojava-l] biojava 3.0 release plan
Message-ID:

Hi,

BioJava 3 has matured massively in SVN this year, and it is time to
prepare a first release. I propose the following release plan. See also two
other topics for discussion below.

Release Plan 3.0

* Alpha release build(s)
Over the next few days I will start providing a first alpha release build.
This will be followed by semi-regular follow-up alpha builds (depending on
SVN activity).

- Over the next few weeks any missing features should be committed to SVN.
  Refactoring of code can still be done during this time.
- Add and update documentation in the wiki
- Module maintainers: check compile warnings for your modules in automated
  builds. Make sure no compile warnings are being displayed.

* Beta release build(s)
The first beta release is scheduled for the weekend of Nov 21st.

- From this point on only minor changes (bug fixes) should be added to the
  code base
- Module maintainers: check and update javadoc for your modules

* Release 3.0
The 3.0 release is scheduled for Dec 12th.

There are two things we should still discuss:

* backwards compatibility:
The current "core" module contains tons of legacy 1.7 code. Shall I go ahead
and delete this module?

* documentation:
The wiki contains tons of documentation for 1.7 which will not be useful for
3.0. As a procedure for cleaning this up and avoiding confusion, I suggest
moving all 1.7-related documentation into a special section of the wiki. All
top-level links to documentation should point to 3.0. Any other suggestions?

Andreas

From markjschreiber at gmail.com Wed Oct 13 05:26:04 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:26:04 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
Message-ID:

Hi -

My understanding of copyright is that it is yours as soon as you assert that
it is your creation. You can simply add a copyright statement to each file
containing the code (in the header for example). The reality is that
defending copyright is your responsibility. If someone violates it, you have
to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it can be
used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick the
one of these that best matches your needs.
BioJava code is LGPL, so if you want your code to go into the BioJava code
base you will need to make your code LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you, then
you can always make a complaint to the journal editors. If it is a reputable
journal, and you have reasonable proof, the editor should take some action
such as forcing a retraction.

You can also make a distribution agreement saying that if someone uses this
code they agree not to publish without first consulting you. If you want to
make it really watertight, get a lawyer and explain specifically what you
want to share and what you want to protect or prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms. Their genome is just a contig from hg18.
> It's pretty interesting and I can talk more about it in the future....
>
> Anyway, how can I share my code for this project without having to worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew

From markjschreiber at gmail.com Wed Oct 13 05:28:05 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:28:05 +0200
Subject: [Biojava-l] [Biojava-dev] biojava 3.0 release plan
In-Reply-To:
References:
Message-ID:

Hi Andreas -

Excellent work from the team this year.

I would recommend removing as much legacy code as possible and removing
(preferably rewriting) the legacy documentation.
I think it would be better to have no docs than out-of-date docs.

- Mark

On Wed, Oct 13, 2010 at 8:02 AM, Andreas Prlic wrote:

> Hi,
>
> BioJava 3 has matured massively in SVN this year, and it is time to
> prepare a first release. I propose the following release plan. See also two
> other topics for discussion below.
>
> Release Plan 3.0
>
> * Alpha release build(s)
> Over the next few days I will start providing a first alpha release build.
> This will be followed by semi-regular follow-up alpha builds (depending on
> SVN activity).
>
> - Over the next few weeks any missing features should be committed to SVN.
>   Refactoring of code can still be done during this time.
> - Add and update documentation in the wiki
> - Module maintainers: check compile warnings for your modules in automated
>   builds. Make sure no compile warnings are being displayed.
>
> * Beta release build(s)
> The first beta release is scheduled for the weekend of Nov 21st.
>
> - From this point on only minor changes (bug fixes) should be added to the
>   code base
> - Module maintainers: check and update javadoc for your modules
>
> * Release 3.0
> The 3.0 release is scheduled for Dec 12th.
>
> There are two things we should still discuss:
>
> * backwards compatibility:
> The current "core" module contains tons of legacy 1.7 code. Shall I go
> ahead and delete this module?
>
> * documentation:
> The wiki contains tons of documentation for 1.7 which will not be useful
> for 3.0. As a procedure for cleaning this up and avoiding confusion, I
> suggest moving all 1.7-related documentation into a special section of the
> wiki. All top-level links to documentation should point to 3.0. Any other
> suggestions?
>
>
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From paolo.romano at istge.it Wed Oct 13 06:17:27 2010
From: paolo.romano at istge.it (Paolo Romano)
Date: Wed, 13 Oct 2010 12:17:27 +0200
Subject: [Biojava-l] NETTAB 2010 Biological Wikis: Call for posters and participation
Message-ID: <201010131018.o9DAHTjq009877@clus2.istge.it>

Apologies for duplicates
====

Joint NETTAB 2010 and BBCC 2010 workshop
Biological Wikis
November 29 - December 1, 2010
Congress Center, University of Naples "Federico II", Naples, Italy
http://www.nettab.org/2010/

The joint NETTAB and BBCC 2010 workshop on "Biological Wikis" promises to be
a great meeting for all researchers involved in the exploitation of wikis in
biology.

Come and discuss your ideas and doubts with such scientists as Alex Bateman,
Alexander Pico, Andrew Su, Dan Bolser, Robert Hoffmann, Thomas Kelder, Mike
Cariaso, Adam Godzik, Luca Toldo and many others who, we hope, will join the
workshop.

It's a great chance to follow smart tutorials and lectures on WikiPathways,
WikiGenes, Semantic Wiki, PDBWiki, Gene Wiki and the proficient use of
Wikipedia. See the list of keynote speakers and tutorials at
http://www.nettab.org/2010/progr.html .

There is still time to submit abstracts for posters and software
demonstrations, until October 17, 2010! The complete Call is available
on-line at http://www.nettab.org/2010/call.html .

Registration is open at http://www.nettab.org/2010/rform.html . Register by
October 29, 2010 to take advantage of early registration fees. A reduction
of 20 euro applies to all fees for members of ISCB and other societies and
networks. Further reductions are foreseen for PhD students.

Further information is available at http://www.nettab.org/2010/ .

Looking forward to seeing you soon in Naples.

Paolo Romano

Paolo Romano (paolo.romano at istge.it)
Bioinformatics
National Cancer Research Institute (IST)
Largo Rosanna Benzi, 10, I-16132, Genova, Italy
Tel: +39-010-5737-288 Fax: +39-010-5737-295
Skype: p.romano
Web: http://www.nettab.org/promano/

From pjotr.public23 at thebird.nl Wed Oct 13 07:15:41 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:15:41 +0200
Subject: [Biojava-l] BioJava translation
Message-ID: <20101013111541.GA512@thebird.nl>

I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
rather slow. In fact, the biopython equivalent in native Python is
twice as fast. EMBOSS is orders of magnitude faster again. I am using
something like

  rna = RNATools.createRNA(nucleotides);
  aa = RNATools.translate(rna);

Embarrassingly, even the R version is faster in the GeneR module, as
it uses a C module.

I have a feeling this has to do with typed object creation at every
level, whereas Python and others use plain character Strings.

Any plans for speeding this up on the JVM?

Pj.

From pjotr.public23 at thebird.nl Wed Oct 13 07:40:37 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:40:37 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013114037.GA1166@thebird.nl>

Great! You mean BJ3 translation should work? Do you have a short
example of use?

Pj.

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

From holland at eaglegenomics.com Wed Oct 13 07:27:05 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:27:05 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>

BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

On 13 Oct 2010, at 12:15, Pjotr Prins wrote:

> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is orders of magnitude faster again. I am using
> something like
>
>   rna = RNATools.createRNA(nucleotides);
>   aa = RNATools.translate(rna);
>
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module.
>
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others use plain character Strings.
>
> Any plans for speeding this up on the JVM?
>
> Pj.

-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From holland at eaglegenomics.com Wed Oct 13 07:42:21 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 12:42:21 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl>
Message-ID: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>

Afraid I'm a bit out of touch but someone else on this list should be able to help. Andy or Andreas maybe?

On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
>
> Pj.
>
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From pjotr.public23 at thebird.nl Wed Oct 13 07:48:07 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 13:48:07 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com>
Message-ID: <20101013114807.GA1569@thebird.nl>

On Wed, Oct 13, 2010 at 12:42:21PM +0100, Richard Holland wrote:
> Afraid I'm a bit out of touch but someone else on this list should
> be able to help. Andy or Andreas maybe?

It is not on the wiki yet, and I must admit I get lost in the source
tree. Any short example will do, translating from an ntseq (String) to
aaseq (String).

Pj.

From ayates at ebi.ac.uk Wed Oct 13 07:50:25 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 12:50:25 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013114037.GA1166@thebird.nl>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl>
Message-ID: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>

At the moment the translation test cases are the best documentation:

http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java

This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.

Translation supports a lot of different modes, with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available.

Andy

On 13 Oct 2010, at 12:40, Pjotr Prins wrote:

> Great! You mean BJ3 translation should work? Do you have a short
> example of use?
>
> Pj.
>
> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From koen.bruynseels at cropdesign.com Wed Oct 13 08:16:00 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Wed, 13 Oct 2010 14:16:00 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 10/12/2010 and will not return until
10/14/2010.

I will respond to your message when I return.

From andreas at sdsc.edu Wed Oct 13 11:42:44 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 08:42:44 -0700
Subject: [Biojava-l] BioJava translation
In-Reply-To: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
Message-ID:

Hi Andy,

Any chance you could add some wiki documentation for this as well? Would be
great....

Andreas

On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates wrote:

> At the moment the translation test cases are the best documentation:
>
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
>
> This hopefully will give you a good idea about how to go about it. I was
> managing over 1000 translations per second of BRCA2 going from mRNA to
> peptide with checks. YMMV but I hope this is a lot faster than what you're
> currently seeing.
>
> Translation supports a lot of different modes, with TranscriptionEngine
> being the place to configure this. The Javadoc should be good enough to help
> you through the different modes available.
>
> Andy
>
>
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
>
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations,
> >> making the whole thing much faster.
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From ayates at ebi.ac.uk Wed Oct 13 11:46:58 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 16:46:58 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To:
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk>
Message-ID: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>

I will try my best to.

Andy

On 13 Oct 2010, at 16:42, Andreas Prlic wrote:

>
> Hi Andy,
>
> Any chance you could add some wiki documentation for this as well? Would be great....
>
> Andreas
>
>
> On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates wrote:
> At the moment the translation test cases are the best documentation:
>
> http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java
>
> This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing.
>
> Translation supports a lot of different modes, with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available.
>
> Andy
>
>
> On 13 Oct 2010, at 12:40, Pjotr Prins wrote:
>
> > Great! You mean BJ3 translation should work? Do you have a short
> > example of use?
> >
> > Pj.
> >
> > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From pjotr.public23 at thebird.nl Wed Oct 13 11:58:44 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 17:58:44 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk>
Message-ID: <20101013155844.GA2918@thebird.nl>

On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
> I will try my best to.

Make sure to add that the sequence should be uppercase. Took me a while to
crack that, as I only got a null pointer exception.

Pj.

From holland at eaglegenomics.com Wed Oct 13 12:02:24 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 13 Oct 2010 17:02:24 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013155844.GA2918@thebird.nl>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> <20101013155844.GA2918@thebird.nl>
Message-ID: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>

whuh??? Shouldn't we be coding to cater for all case mixtures?!

On 13 Oct 2010, at 16:58, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>> I will try my best to.
>
> Make sure to add that the sequence should be uppercase. Took me a while to
> crack that, as I only got a null pointer exception.
>
> Pj.

-- 
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From ayates at ebi.ac.uk Wed Oct 13 12:11:40 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:11:40 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> <20101013155844.GA2918@thebird.nl> <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com>
Message-ID: <7740A206-98A0-4FBC-9CF8-B1AC0DE7D859@ebi.ac.uk>

I thought we were as well. I can investigate.

On 13 Oct 2010, at 17:02, Richard Holland wrote:

> whuh??? Shouldn't we be coding to cater for all case mixtures?!
>
>
> On 13 Oct 2010, at 16:58, Pjotr Prins wrote:
>
>> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote:
>>> I will try my best to.
>>
>> Make sure to add that the sequence should be uppercase. Took me a while to
>> crack that, as I only got a null pointer exception.
>>
>> Pj.
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From pjotr.public23 at thebird.nl Wed Oct 13 12:13:36 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:13:36 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
Message-ID: <20101013161336.GA3184@thebird.nl>

On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.

Good news, BJ3 is a lot faster! The previous version took 2 minutes
for the C. elegans genome (33 Mb); the BJ3 version takes 27 sec on my
modest Thinkpad X61 laptop. After parsing the FASTA and turning it
into an upper case string, the actual translation takes 16 sec.

Only the C implementations are faster.

Here is the relevant Scala code:

  import bio._
  import java.io._
  import org.biojava3.core.sequence._
  import org.biojava3.core.sequence.transcription.TranscriptionEngine
  import org.biojava3.core.sequence.io.IUPACParser

  // fetching infile from command line...

  IUPACParser.getInstance().getTable(1); // not sure we need this
  IUPACParser.getInstance().getTable("UNIVERSAL");
  val engine = TranscriptionEngine.getDefault()
  val f = new FastaReader(infile)
  f.foreach {
    res =>
      val (id,tag,dna) = res
      println(List(">",id).mkString)
      val dna2 = new DNASequence(dna.mkString.toUpperCase)
      val rna = dna2.getRNASequence(engine)
      println(rna.getProteinSequence(engine))
    }
  }

prints:

>B0222.10
MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
>B0222.11
MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
(...)

Pj.

From ayates at ebi.ac.uk Wed Oct 13 12:25:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 17:25:41 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013161336.GA3184@thebird.nl>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl>
Message-ID:

That's great news, and it should be even faster once we get rid of the requirement to upper case, since you're having to parse the same sequence twice.

I wonder what the C version does to make itself even faster.

Andy

On 13 Oct 2010, at 17:13, Pjotr Prins wrote:

> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote:
>> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster.
>
> Good news, BJ3 is a lot faster! The previous version took 2 minutes
> for the C. elegans genome (33 Mb); the BJ3 version takes 27 sec on my
> modest Thinkpad X61 laptop. After parsing the FASTA and turning it
> into an upper case string, the actual translation takes 16 sec.
>
> Only the C implementations are faster.
>
> Here is the relevant Scala code:
>
>   import bio._
>   import java.io._
>   import org.biojava3.core.sequence._
>   import org.biojava3.core.sequence.transcription.TranscriptionEngine
>   import org.biojava3.core.sequence.io.IUPACParser
>
>   // fetching infile from command line...
>
>   IUPACParser.getInstance().getTable(1); // not sure we need this
>   IUPACParser.getInstance().getTable("UNIVERSAL");
>   val engine = TranscriptionEngine.getDefault()
>   val f = new FastaReader(infile)
>   f.foreach {
>     res =>
>       val (id,tag,dna) = res
>       println(List(">",id).mkString)
>       val dna2 = new DNASequence(dna.mkString.toUpperCase)
>       val rna = dna2.getRNASequence(engine)
>       println(rna.getProteinSequence(engine))
>     }
>   }
>
> prints:
>
> >B0222.10
> MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
> >B0222.11
> MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
> (...)
>
> Pj.
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From pjotr.public23 at thebird.nl Wed Oct 13 12:34:23 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 18:34:23 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: 
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	
Message-ID: <20101013163423.GA3849@thebird.nl>

On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote:
> That's great news and should be even faster once we get rid of the
> requirement to upper case since you're having to parse the same
> sequence twice.
>
> I wonder what the C version does to make itself even faster

The EMBOSS implementation is fastest by a mile - takes less than 3
seconds. But the code is, uhm, hard to read.

I think table lookups will win in C, whatever you try. But it may be an
interesting exercise if we can get close. Note I am perhaps not using
the fastest JVM.

  java version "1.6.0_20"
  Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
  Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode)

Pj.
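
The table-lookup idea mentioned above can be sketched in plain Java as
follows. This is only a minimal sketch under stated assumptions, not the
EMBOSS or BioJava implementation: the class name CodonLookupSketch and
its translate method are hypothetical, the table is the standard genetic
code in TCAG order, and any codon containing an unrecognised character
is reported as 'X'. Trailing bases that do not fill a codon are ignored.

```java
// Hypothetical illustration class; not part of BioJava or EMBOSS.
final class CodonLookupSketch {

    // Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3.
    private static final String AMINO =
        "FFLLSSSSYY**CC*W" +
        "LLLLPPPPHHQQRRRR" +
        "IIIMTTTTNNKKSSRR" +
        "VVVVAAAADDEEGGGG";

    // Maps an ASCII character to its base index, or -1 when it is not a base.
    private static final byte[] BASE = new byte[128];

    static {
        java.util.Arrays.fill(BASE, (byte) -1);
        String bases = "TCAG";
        for (int i = 0; i < bases.length(); i++) {
            char upper = bases.charAt(i);
            BASE[upper] = (byte) i;
            BASE[Character.toLowerCase(upper)] = (byte) i;
        }
        // Treat RNA input like DNA: U behaves as T.
        BASE['U'] = 0;
        BASE['u'] = 0;
    }

    private static int index(char c) {
        return c < 128 ? BASE[c] : -1;
    }

    // Translates frame 1 of the input with two array lookups per base;
    // no intermediate sequence objects and no upper-casing pass are needed.
    static String translate(String nucleotides) {
        int codons = nucleotides.length() / 3;
        char[] protein = new char[codons];
        for (int i = 0; i < codons; i++) {
            int b1 = index(nucleotides.charAt(3 * i));
            int b2 = index(nucleotides.charAt(3 * i + 1));
            int b3 = index(nucleotides.charAt(3 * i + 2));
            protein[i] = (b1 < 0 || b2 < 0 || b3 < 0)
                ? 'X'
                : AMINO.charAt(16 * b1 + 4 * b2 + b3);
        }
        return new String(protein);
    }
}
```

Calling CodonLookupSketch.translate on a DNA string returns the
one-letter protein string with '*' for stop codons. Whether this gets
close to the C timings quoted in this thread is untested here.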

From willishf at ufl.edu Wed Oct 13 13:16:01 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 13:16:01 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013163423.GA3849@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	
	<20101013163423.GA3849@thebird.nl>
Message-ID: 

BioJava3 has an additional validation layer and object creation going
from DNA sequence to RNA sequence and then using the appropriate
translation rules to return a protein sequence. It could easily be twice
as fast if you went directly from DNA sequence to ProteinSequence, which
would put it at 8 seconds. We are going to carry a performance penalty
for setting everything up as a proper object versus doing a simple
String to String translation.

From pjotr.public23 at thebird.nl Wed Oct 13 14:17:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 20:17:12 +0200
Subject: [Biojava-l] BioJava translation
In-Reply-To: 
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	
	<20101013163423.GA3849@thebird.nl>
	
Message-ID: <20101013181712.GA4482@thebird.nl>

I think it is a good idea. From a purist point of view you may object
(it is not biological), but most libraries do exactly that.

If direct translation gets it down to 8sec, we may well halve that
with further tweaking.

Pj.

From darnells at dnastar.com Wed Oct 13 14:21:52 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Wed, 13 Oct 2010 13:21:52 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: 
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
Message-ID: 

Andrew,

Forgive me for being pessimistic, but I do not believe you can
publicly distribute your code without running the risk of being
scooped. Mark's suggestions are very good; however, the safest route
would be to withhold distribution of your code until your work is
published (or at the very least accepted).

Also, I would suggest this argument for convincing your group to use
BioJava (disclaimer - I am not a lawyer).

Under the LGPL, you are not obligated to release your source code if:

(1) you create a "work based on the library" (e.g. direct modifications
or additions to the licensed work) but do not distribute it, and
(2) you create a "work that uses the library" by dynamically linking
your work to the licensed work (see distribution clause #5 of the LGPL:
http://www.gnu.org/licenses/lgpl-2.1.html)

If you follow choice #2, you can license and distribute your work under
terms of your group's choosing (open or closed, submit it to the BioJava
developers for inclusion or not) while gaining the benefit of reusing
BioJava.

~Steve

-----Original Message-----
From: biojava-l-bounces at lists.open-bio.org
[mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark
Schreiber
Sent: Wednesday, October 13, 2010 4:26 AM
To: McSweeny, Andrew J
Cc: biojava-l at biojava.org
Subject: Re: [Biojava-l] How to share code while protecting copyrights?

Hi -

My understanding of copyright is that it is yours as soon as you assert
that it is your creation. You can simply add a copyright statement to
each file containing the code (in the header for example). The reality
is that defending copyright is your responsibility. If someone violates
it, you have to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it
can be used. Examples include GPL, LGPL, BSD, Apache License etc. You
can pick one of these that best matches your needs. BioJava code is
LGPL so if you want your code to go into the BioJava code base you will
need to make your code LGPL.

It's always a good idea to add @author tags to Java code to ensure
appropriate attribution.

Finally, if someone steals your code and publishes results before you
then you can always make a complaint to the journal editors. If it is a
reputable journal, and you have reasonable proof the editor should take
some action such as forcing a retraction. You can also make a
distribution agreement saying that if someone uses this code they agree
not to publish without first consulting you.

If you want to make it really water tight, get a lawyer and explain
specifically what you want to share and what you want to protect or
prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <
andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms. Their genome is just a contig from
> hg18. It's pretty interesting and I can talk more about it in the
> future....
>
> Anyways, how can I share my code for this project without having to
> worry that someone else will use it to publish a paper before my
> group does?
>
> I'm certain nobody in the open source community would do that, but
> how do I convince my group that opening our project to BioJava is a
> good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
_______________________________________________
Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas at sdsc.edu Wed Oct 13 14:48:32 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 11:48:32 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: 
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
Message-ID: 

> Forgive me for being pessimistic, but I do not believe you can
> publically distribute your code without running the risk of being
> scooped.
> Mark's suggestions are very good; however, the safest route
> would be to withhold distribution of your code until your work is
> published (or at very least accepted).
>

I think that is too conservative - if getting scooped is an issue, I
would release the code shortly before submission of the first
manuscript to a journal. That way the source code can form part of the
publication and the referees can view the code during the review
process. Many views/downloads of articles happen in the first few weeks
after publication. Having a link to the source code in the paper can be
a great advertisement for the open source project and help in
community-building.

Andreas

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas.prlic at gmail.com Wed Oct 13 15:18:12 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 13 Oct 2010 12:18:12 -0700
Subject: [Biojava-l] Questions related to biojava
In-Reply-To: 
References: 
Message-ID: 

Hi Madhu,

best to keep such mails on the mailing list, otherwise they might get
lost in my flood of emails... see my reply below.

On Wed, Oct 13, 2010 at 12:08 PM, Madhusudan Gujral wrote:

> Hi Andreas,
>
> I have couple of questions related to biojava. I would greatly
> appreciate if you could provide directions.
>
> Is the biojava version 3.0 mature?
> Is there any pom file for biojava that I can work with?
> Is there a single tool to validate a fasta file?
>

- biojava 3.0 is being prepared for release. It is not release ready
  but some of the modules are already used in some production
  environments
- not sure what you mean with this question. You can see the source
  code in SVN/git and there is also an automated build server providing
  snapshot builds that can be used for Maven installations.
- what kind of validation do you have in mind? biojava3-core can do
  FASTA parsing for you...

Andreas

From willishf at ufl.edu Wed Oct 13 15:16:39 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 15:16:39 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013181712.GA4482@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	
	<20101013163423.GA3849@thebird.nl>
	
	<20101013181712.GA4482@thebird.nl>
Message-ID: 

Pjotr

What is an extra 8 seconds among friends if you know you are going to
get the correct answer and you can change the rules if needed!!!

Are you parsing the C. elegans genome or the DNA representation of each
protein in the C. elegans genome?

If you take out the println statement that will help speed things up a
bunch. Java System.out is always slow.

I am checking on the problem with upper case. That shouldn't be an
issue.

Thanks

Scooter

From pjotr.public23 at thebird.nl Wed Oct 13 17:05:46 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 23:05:46 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: 
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
	
Message-ID: <20101013210546.GB5479@thebird.nl>

Is that idea of getting scooped realistic?

All my code is online, that is my scientific track record, next to my
papers.

Online OSS code may bring benefits when other people find bugs, or
even improve things. I don't worry about getting scooped. First it is
easy to prove it is mine, exactly because it is out in the open, and
second it takes more than plain old code to get something published in
a journal.

In the rare case an idea is so sensitive and easy to copy, you can
publish it with some part missing.

I think too much code sits on shelves gathering dust, just because
people have these worries. It is old school. We are in the business
of moving science forward - writing beautiful tools. Nothing less.

Pj.

From andreas at sdsc.edu Wed Oct 13 17:24:54 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 13 Oct 2010 14:24:54 -0700
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101013210546.GB5479@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
	
	<20101013210546.GB5479@thebird.nl>
Message-ID: 

nicely put :-)

A

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From hlapp at drycafe.net Wed Oct 13 17:44:36 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 13 Oct 2010 16:44:36 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: 
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
Message-ID: 

How and when you want to be attributed in publications, and what you
want someone else not to publish on, is an ethical matter. Licenses are
legal instruments and not suited for ethical questions or social
conventions.

Rather, this is addressed by ethical and social conventions and
requests. A good example is the Ft Lauderdale agreement, which is not a
legal instrument but an ethical request of those who peruse
immediate-release sequencing data.
If you have ethical or social requests to make of those who peruse your
code, state them explicitly in a README and in the code.

By their nature, you can't legally enforce them. However, ethical
behavior is policed - by all of us as a scientific community, not in
the courts.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at drycafe dot net :
===========================================================

From ayates at ebi.ac.uk Wed Oct 13 18:52:17 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 13 Oct 2010 23:52:17 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: 
References: <20101013111541.GA512@thebird.nl>
	<9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com>
	<20101013161336.GA3184@thebird.nl>
	
	<20101013163423.GA3849@thebird.nl>
	
	<20101013181712.GA4482@thebird.nl>
	
Message-ID: <7E59B83F-8371-4F79-AC4C-57D1A49A9398@ebi.ac.uk>

LOL well you could always parallelise it :)

I've gone & pushed a new version of the translator code to the SVN repo
so it'll filter through to the public server soon. There's an added
test case as well. The overall impact of this change seems to be about
25 translations of BRCA2 per second so it is significant; our current
limit looks to be approx. 200 per second.

I hope you find this is faster without the need to edit & parse a
Sequence String twice

Andy

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From pjotr.public23 at thebird.nl Thu Oct 14 03:00:12 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Thu, 14 Oct 2010 09:00:12 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: 
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
	
Message-ID: <20101014070012.GA7296@thebird.nl>

On Wed, Oct 13, 2010 at 04:44:36PM -0500, Hilmar Lapp wrote:
> By their nature, you can't legally enforce them. However, ethical
> behavior is policed - by all of us as a scientific community, not in
> the courts.

I know people who make it their business to pursue companies that do
not honour OSS licenses. The companies always have to retract.

Is there any precedent in science where open source software was used
to scoop research? And how did that scientist fare?

With scientists I can't see it happening. Getting caught out that way
will hurt all future prospects for an individual or group.

With this reasoning you are best off putting code in the public domain
as fast as possible.

Pj.

From hlapp at drycafe.net Thu Oct 14 10:47:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Thu, 14 Oct 2010 09:47:19 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101014070012.GA7296@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
	
	
	
	<20101014070012.GA7296@thebird.nl>
Message-ID: 

On Oct 14, 2010, at 2:00 AM, Pjotr Prins wrote:

> I know people who make it their business to pursue companies that do
> not honour OSS licenses. The companies always have to retract.

Of course. That's a legal issue. Attribution on publications, or what
someone publishes on reusing your stuff, is not a legal issue.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at drycafe dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Fri Oct 15 07:53:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Oct 2010 12:53:13 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID: 

On Wed, Oct 13, 2010 at 12:15 PM, Pjotr Prins wrote:
> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like
>
>   rna = RNATools.createRNA(nucleotides);
>   aa = RNATools.translate(rna);
>
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module.
>
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings.
>
> Any plans for speeding this up on the JVM?
>
> Pj.

Actually (assuming you are not explicitly using strings), Biopython
would also be using objects for each sequence, which does impose a
speed penalty.

Peter

From kurka at mikro.biologie.tu-muenchen.de Tue Oct 19 07:25:31 2010
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Tue, 19 Oct 2010 13:25:31 +0200
Subject: [Biojava-l] feature request - full query description from blast
	result
Message-ID: <4CBD802B.7030809@mikro.biologie.tu-muenchen.de>

Hi all,

I just read in a blast file and I want to get the full query description.
For example, when I have that query:
Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
(1208 letters)

I get as query-information locus_tag= CD0002
The rest is truncated.

In the biojava-mailinglist I found the same question
http://www.mail-archive.com/biojava-l at lists.open-bio.org/msg01022.html

And Mark suggested to make a request for improvement, but as I see it,
nothing happened. So I would like to ask, if you can change it. Or is it
changed and I don't see it.

Thanks,
Hedwig

From sb.genny at gmail.com Thu Oct 21 10:28:53 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:28:53 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To: 
References: 
Message-ID: 

Hi

I want to develop phylogenetics application in biojava..but need help to do
that..Kindly help me in developing some applications..
Thanks in anticipation

Regards,
Sobia Idrees

From sb.genny at gmail.com Thu Oct 21 10:30:35 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:30:35 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To:
References:
Message-ID:

Hi

I have developed some web-based and desktop applications using
BioJava. Can they be published in a BioJava journal?

Thanks,
Sobia Idrees

From holland at eaglegenomics.com Thu Oct 21 10:41:35 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 21 Oct 2010 15:41:35 +0100
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To:
References:
Message-ID: <97591963-F741-45C1-8E9D-231A5D05D4DA@eaglegenomics.com>

There is no such thing as a BioJava journal. You would need to submit
your paper to one of the main bioinformatics journals.

cheers,
Richard

On 21 Oct 2010, at 15:30, sobia idrees wrote:

> Hi
>
> I have developed some web based and desktop based applications using
> biojava..Can it be published in Biojava journal?
>
> Thanks,
> Sobia Idrees

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jc.lucky at laposte.net Fri Oct 22 04:11:43 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Fri, 22 Oct 2010 10:11:43 +0200 (CEST)
Subject: [Biojava-l] Retrieve Information from GenBank file
Message-ID: <31170592.35650.1287735103724.JavaMail.www@wwinf8210>

Hi

I'm trying to convert a GenBank file into an RDF file. The gene of
interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945

With the code below I can read the GenBank file, retrieve information
from it, and convert it to RDF. However, I have not succeeded in
retrieving some information such as the title, protein, or product.
According to this page
(http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it is
possible to do so. Please help me find what I am doing wrong or what
should be done to achieve my goal.

// read the GenBank file
public static RichSequenceIterator readFile(String input,
        RichSequenceBuilderFactory seqFactory, Namespace ns)
        throws IOException, NoSuchElementException, BioException {
    ns = null;
    InputStream stream = new FileInputStream(input);
    BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
    RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile, ns);
    return seqs;
}

// retrieve information and convert it to RDF format
public void writeToRDFFile(RichSequenceIterator rsi, String output)
        throws IOException, NoSuchElementException, BioException {
    // create model for the ontology
    OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
    OntClass parents;
    String URI = "http://pbr.wur.nl/#";
    while (rsi.hasNext()) {
        RichSequence seq = rsi.nextRichSequence();
        String id = seq.getName();
        parents = model.createClass(URI + id);
        Set author = seq.getRankedDocRefs(); // code to clean up Set & convert toString
        String definition = seq.getDescription(); // code to clean up String
        // add to model
        parents.addProperty(DC.description, definition);
        parents.addProperty(DC.publisher, authors);
        parents.addComment(taxonomy, "EN");
        parents.addProperty(DC.type, organism);
        // print in RDF format
        model.write(out, "RDF/XML");
        out.close();
    }
}

Thanks,
Jean-Charles Ferrières
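
A note on where the missing values usually sit, offered as a hedged sketch and not as BioJava code: in a GenBank flat file the product and protein names are stored as feature qualifiers such as /product="..." in the FEATURES table, and the title sits under REFERENCE, so none of them are part of the DEFINITION line. The plain-Java example below only illustrates that layout by scanning raw flat-file text for one qualifier key. The class name GenBankQualifiers, the method values, and the sample record are assumptions made up for illustration; this is not how BioJava itself exposes qualifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name): collects the values of one
// feature qualifier, e.g. "product", from raw GenBank flat-file text.
class GenBankQualifiers {

    static List<String> values(String flatFile, String key) {
        // A quoted qualifier looks like /key="value" and may wrap over lines.
        Pattern p = Pattern.compile("/" + Pattern.quote(key) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(flatFile);
        List<String> found = new ArrayList<String>();
        while (m.find()) {
            // Collapse the line break and indentation of wrapped values.
            found.add(m.group(1).replaceAll("\\s+", " ").trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample record, only to show the qualifier layout.
        String sample =
              "FEATURES             Location/Qualifiers\n"
            + "     CDS             1..1101\n"
            + "                     /gene=\"dnaN\"\n"
            + "                     /product=\"DNA polymerase III\n"
            + "                     subunit beta\"\n";
        System.out.println(values(sample, "product"));
        System.out.println(values(sample, "gene"));
    }
}
```

The sketch does not handle quotes escaped inside a value, and it ignores which feature a qualifier belongs to, so it is a way to check what a file contains, not a replacement for a real parser.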
From andreas at sdsc.edu Fri Oct 22 15:56:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 22 Oct 2010 12:56:49 -0700
Subject: [Biojava-l] 3.0-alpha2
Message-ID:

Hi,

In preparation for the upcoming BioJava 3 release, 3.0-alpha2 has just
been released at http://biojava.org/download/maven/

Andreas

From cfriedline at vcu.edu Sun Oct 24 10:38:46 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:38:46 -0400
Subject: [Biojava-l] Test Message
Message-ID:

Per Andreas, this is a test.

Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From cfriedline at vcu.edu Sun Oct 24 10:57:48 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:57:48 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
Message-ID:

Hello,

I am getting a weird problem with protein alignment using
NeedlemanWunsch in 1.7.1: the alignment does not span the entire
length of the proteins. I've verified that this should not happen
with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI. I'm
reluctant to switch to BioJava3 at this time, since its performance is
about 2-3x slower than 1.7.1 for the alignments, and I'm doing about
350,000 of them.

An example of this alignment error is shown here:
http://pastebin.com/mdX516R6

Notice that the alignment stops 1 amino acid short of the end in both
cases. The parameters for the alignment are: BLOSUM50, gapOpen=10,
gapExtend=2.

Thanks,
Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From andreas.draeger at uni-tuebingen.de Sun Oct 24 12:01:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Sun, 24 Oct 2010 18:01:05 +0200
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID: <4CC45841.5080604@uni-tuebingen.de>

Hi Chris,

Thank you for reporting this problem. It would be very nice if you
could also provide your source code, so that I can test what happens.
You can send the source code, the substitution matrix, and the two
example protein sequences that cause the problem directly to me. I'll
then have a look into it.

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From cfriedline at vcu.edu Sun Oct 24 14:04:25 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 14:04:25 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <4CC45841.5080604@uni-tuebingen.de>
References: <4CC45841.5080604@uni-tuebingen.de>
Message-ID:

Thanks, Andreas. I've sent you the information that you asked for.

Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From koen.bruynseels at cropdesign.com Mon Oct 25 12:15:59 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Mon, 25 Oct 2010 18:15:59 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 10/25/2010 and will not return
until 11/02/2010.

I will respond to your message when I return.

From andreas at sdsc.edu Tue Oct 26 14:42:29 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 11:42:29 -0700
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

Hi Chris,

About your comment that biojava3-alignment is slower than the 1.7
one: do you have any data on whether this comes from the IO, or is the
actual alignment calculation slower?

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From cfriedline at vcu.edu Tue Oct 26 15:21:39 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:21:39 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

Hi Andreas,

The IO should be the same, since I've used the same set of genes for
testing both. So, I'm guessing it's either the alignment calculation
or the new BioJava design contributing to the slowness.

Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From cfriedline at vcu.edu Tue Oct 26 15:29:30 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:29:30 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

That's something I'll need to go back and revisit after my deadline
passes at the end of this week. Initially, I was creating them on the
fly at the time of alignment, but it would be more efficient to store
them that way in the gene object itself. I was also passing an
InputStreamReader for the substitution matrix each time (pulling the
matrix from my jar), but storing it as a string would also be a better
option, especially since I'm threading and there are so many
alignments.

Chris

On Tue, Oct 26, 2010 at 3:23 PM, Andreas Prlic wrote:
>
> ok, how do you create the biojava3 Sequence objects? just trying to
> find out where the bottlenecks are, so we can fix them...
>
> A

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From andreas.draeger at uni-tuebingen.de Tue Oct 26 18:18:00 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Tue, 26 Oct 2010 23:18:00 +0100
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID: <4CC75398.7000301@uni-tuebingen.de>

Hi all,

By the way, I would like to mention that the bug has been fixed. It
was a problem with the way the alignment was presented to the user
afterwards, i.e., a problem in the formatting algorithm. The alignment
itself was correct, and the GappedSequences obtained after the
alignment were correct as well. The problem was that the formatter was
started with the original length of the sequences, which is usually
too short after gaps have been inserted. This is now solved and the
alignment should work fine.

Cheers
Andreas

--
Dipl.-Bioinform.
Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From dasarnow at gmail.com Tue Oct 26 23:54:43 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 20:54:43 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
Message-ID:

Hi all,
Let me first say thanks to all the BioJava community members for
delivering such a useful set of libraries, and that I'm still a newbie
when it comes to BioJava (and Java), so forgive me if my question is
too trivial.

I am doing work on lots (at least thousands) of PDB files from RCSB.
As is commonly known, these are often rife with errors which can lead
to exceptions during parsing with PDBFileParser. Because
PDBFileParser's methods contain their own try-catch blocks, exception
propagation stops there and my code proceeds blindly along regardless
of any error checking I do. I would like to catch the exceptions up
in my code where the parser is called, so that I can branch to a
continue statement and have my batch processing loops move on to the
next file.
Should I edit out the try-catch blocks and compile my own version of
the library? Or should I test the returned StructureImpl objects for
possession of the fields in question? In that case, I'm not sure
which properties will give the most general success information... and
I'd rather not have to check for /every/ property being correct.

If there is some great way to check if an exception was caught down a
series of nested method calls, please hit me over the head with it.

Thanks!

-da

From andreas at sdsc.edu Wed Oct 27 00:11:28 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 21:11:28 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Can you explain a bit more what you are doing, in particular what
errors you would like to deal with on your end? You should not need
to worry too much about exception handling. Are there any special
cases you are interested in? In this case we should support you with
a clean interface rather than exception handling from your end...

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From dasarnow at gmail.com Wed Oct 27 00:59:56 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 21:59:56 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Glad to hear it, who doesn't like support or clean interfaces? No
offense intended, by the way, with respect to PDB errors - obviously
the PDB is an indispensable resource for all protein scientists.

I am looking at many (fixed-length) pieces of protein chains and doin'
stuff with 'em. My current code has a pair of nested while loops; the
outer iterates over PDB entries (locally rsync'd copy), parsing them,
and the inner iterates over the pieces from each. When
StructureExceptions come out of my PDBFileReader object I want to
continue the outer loop, moving on to the next set of files without
executing any of the code that depends on correct StructureImpl
objects from the reader (database updates, the inner loop).
Since the reader's methods have their own try-catch blocks, a thrown
StructureException is stopped there and never reaches my own error
handling. I just need to know when those errors occur so I can skip
those proteins - I am presuming that the correct entries will outweigh
the problem ones by a significant factor and the overall data won't be
seriously impacted.

-da

From dasarnow at gmail.com Wed Oct 27 01:03:59 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 22:03:59 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I think that would be perfect... and of course I'm happy to perform
testing on whatever gets cooked up.

-da

2010/10/26 Amr Al-Hossary :
> We can add something like an exception tracing queue that can be
> checked later by the caller.
>
> would that be OK?
>
> Amr

From andreas at sdsc.edu Wed Oct 27 01:19:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 22:19:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

PDB files are better nowadays, due to remediation, however there are
still issues..
it sounds like you just want to figure out how to do the try/catch
block properly. You could do something like this:

    boolean splitFileOrganisation = true;
    AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);

    String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};

    for (String pdbID : pdbIDs) {
        try {
            Structure s = cache.getStructure(pdbID);
            if (s == null) {
                System.out.println("could not find structure " + pdbID);
                continue;
            }
            // do something with the structure - your inner loop
            System.out.println(s);
        } catch (Exception e) {
            // something crazy happened...
            System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
            e.printStackTrace();
        }
    }

On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote:
> Glad to hear it, who doesn't like support or clean interfaces? No
> offense intended, by the way, with respect to PDB errors - obviously
> the PDB is an indispensable resource for all protein scientists.
>
> I am looking at many (fixed-length) pieces of protein chains and doin'
> stuff with 'em. My current code has a pair of nested while loops; the
> outer iterates over PDB entries (locally rsync'd copy), parsing them,
> and the inner iterates over the pieces from each. When
> StructureExceptions come out of my PDBFileReader object I want to
> continue the outer loop, moving on to the next set of files without
> executing any of the code that depends on correct StructureImpl
> objects from the reader (database updates, the inner loop).
> Since the reader's methods have their own try-catch blocks, a thrown
> StructureException is stopped there and never reaches my own error
> handling. I just need to know when those errors occur so I can skip
> those proteins - I am presuming that the correct entries will outweigh
> the problem ones by a significant factor and the overall data won't be
> seriously impacted.
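
A minimal sketch of the two ideas discussed in this thread: skipping a bad entry in a batch loop, and the "exception tracing queue" suggested by Amr for a parser that catches its own exceptions. Everything in it is hypothetical and uses only the Java standard library: `LenientParser`, `parse`, `drainProblems` and `BatchParseSketch` are made-up names, not BioJava classes or methods, and the `startsWith("ATOM")` check merely stands in for real PDB parsing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for a parser that catches its own exceptions. */
class LenientParser {
    // Problems caught internally are recorded here instead of being lost.
    private final Queue<Exception> problems = new ArrayDeque<>();

    /** "Parses" one entry; never throws, but remembers what went wrong. */
    String parse(String entry) {
        try {
            if (entry == null || !entry.startsWith("ATOM")) {
                throw new IllegalArgumentException("not an ATOM record: " + entry);
            }
            return entry.trim();
        } catch (IllegalArgumentException e) {
            problems.add(e);   // swallow the error, but keep a trace for the caller
            return null;
        }
    }

    /** Returns the problems recorded since the last call and clears the queue. */
    List<Exception> drainProblems() {
        List<Exception> drained = new ArrayList<>(problems);
        problems.clear();
        return drained;
    }
}

class BatchParseSketch {
    /** Returns only the entries that parsed without any recorded problem. */
    static List<String> parseAll(List<String> entries) {
        LenientParser parser = new LenientParser();
        List<String> good = new ArrayList<>();
        for (String entry : entries) {
            String parsed = parser.parse(entry);
            if (!parser.drainProblems().isEmpty()) {
                continue;   // skip this entry and move on to the next one
            }
            good.add(parsed);
        }
        return good;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ATOM 1 CA", "garbage", "ATOM 2 CA");
        System.out.println(parseAll(input));
    }
}
```

The caller never sees an exception, yet it can still decide per entry whether to continue, which is the behaviour Daniel asks for. Whether BioJava should expose anything like this is exactly the open API question in the thread.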

From andreas at sdsc.edu Wed Oct 27 02:01:38 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 23:01:38 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Amr,

2010/10/26 Amr Al-Hossary :
> We can add something like an exception tracing queue that can be checked
> later by the caller.

Thanks for your suggestion. In terms of API, I would prefer that we
separate the user from inconsistencies in the files, and I hope we won't
need such a queue... If something is off, the code is written to ignore or
work around issues...

Andreas

> Would that be OK?
>
> Amr

From dasarnow at gmail.com Wed Oct 27 03:26:22 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 00:26:22 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I assume AtomCache is a new class in BioJava3?

I must give you my embarrassed apology... after a bunch of testing I
finally figured out that I had misunderstood where the parser's error
handling returns control, and I started going after the wrong exceptions.
It does look like, if setParseCAOnly is true, the reader throws an
exception on chains with no CAs instead of just skipping them, though the
other chains are still parsed into the structure.

-da

On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote:
> Hi Daniel,
>
> PDB files are better nowadays, due to remediation, however there are
> still issues..

From jc.lucky at laposte.net Wed Oct 27 04:11:13 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 10:11:13 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
Message-ID: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>

I tried once again with the new version of BioJava, but without succeeding.
Any idea or suggestion?

Thanks in advance
Regards,

Jean-Charles Ferrières

> Message of 22/10/10 10:11
> From: "jc.lucky"
> To: biojava-l at lists.open-bio.org
> Cc:
> Subject: [Biojava-l] Retrieve Information from GenBank file
>
> Hi
>
> I'm trying to convert a GenBank file into an RDF file. The gene of
> interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945
>
> With the code below I can read the GenBank file, and I manage to retrieve
> information and convert it to RDF format. However, I don't succeed in
> retrieving some information such as Title, protein or product. According
> to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it
> is possible to do so.
> Please help me find what I am doing wrong or what should be done to
> achieve my goal.
>
> //read the GeneBank File
> public static RichSequenceIterator readFile(String input,
>         RichSequenceBuilderFactory seqFactory,
>         Namespace ns)
>         throws IOException, NoSuchElementException, BioException
> {
>     ns = null;
>     InputStream stream = new FileInputStream(input);
>     BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
>     return seqs;
> }
>
> //Retrieve information and convert them in rdf format
> public void writeToRDFFile(RichSequenceIterator rsi, String output)
>         throws IOException, NoSuchElementException, BioException {
>     //create model for the ontology
>     OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
>     OntClass parents;
>     String URI = "http://pbr.wur.nl/#";
>
>     while(rsi.hasNext())
>     {
>         RichSequence seq = rsi.nextRichSequence();
>         String id = seq.getName();
>         parents = model.createClass(URI + id);
>         Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
>         String definition = seq.getDescription(); //code to clean up String
>         //Add to model
>         parents.addProperty(DC.description, definition);
>         parents.addProperty(DC.publisher, authors);
>         parents.addComment(taxonomy, "EN");
>         parents.addProperty(DC.type, organism);
>         //print in rdf format
>         model.write(out, "RDF/XML");
>         out.close(); }
> }
>
> Thanks,
> Jean-Charles Ferrières

From willishf at ufl.edu Wed Oct 27 06:41:06 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 27 Oct 2010 06:41:06 -0400
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID:

Jean-Charles

I have it on my list to do a GenBank parser but haven't had the time. I
can't promise anything in the next couple of weeks. Can you send some
details about what a typical use case is for your purpose? Are you trying
to get the sequence data or are you more interested in the features?

Thanks

Scooter

On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote:
> I tried once again with the new version of BioJava, but without succeeding.
> Any idea or suggestion?
>
> Thanks in advance
> Regards,
>
> Jean-Charles Ferrières
>
> > Message of 22/10/10 10:11
> > From: "jc.lucky"
> > To: biojava-l at lists.open-bio.org
> > Cc:
> > Subject: [Biojava-l] Retrieve Information from GenBank file
> >
> > Hi
> >
> > I'm trying to convert a GenBank file into an RDF file. The gene of
> > interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945
> >
> > With the code below I can read the GenBank file, and I manage to retrieve
> > information and convert it to RDF format. However, I don't succeed in
> > retrieving some information such as Title, protein or product. According
> > to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it
> > is possible to do so.
> > Please help me find what I am doing wrong or what should be done to
> > achieve my goal.

From jc.lucky at laposte.net Wed Oct 27 09:03:55 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:03:55 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To:
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>

I'm more interested in the features (regarding protein-ID, taxon, xref,
product) and in retrieving information about articles (authors, title). I
don't look at the sequence data at all.
My purpose is to read the GenBank file and retrieve that information so
that I can convert it to a semantic RDF file. I'm working on a specific
gene at the moment, but it would be interesting to extend this to any
GenBank file in the future.

Thanks,

Jean-Charles

> Message of 27/10/10 12:41
> From: "Scooter Willis"
> To: "jc.lucky"
> Cc: "biojava-l lists open-bio org"
> Subject: Re: [Biojava-l] Tr: Retrieve Information from GenBank file
>
> Jean-Charles
>
> I have it on my list to do a GenBank parser but haven't had the time. I
> can't promise anything in the next couple of weeks. Can you send some
> details about what a typical use case is for your purpose? Are you trying
> to get the sequence data or are you more interested in the features?
>
> Thanks
>
> Scooter

From holland at eaglegenomics.com Wed Oct 27 09:16:56 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 27 Oct 2010 14:16:56 +0100
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210> <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
Message-ID: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>

Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().
This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2 cheers, Richard On 27 Oct 2010, at 14:03, jc.lucky wrote: > > I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data. > My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future. > > Thanks, > > Jean-Charles > > > >> Message du 27/10/10 12:41 >> De : "Scooter Willis" >> A : "jc.lucky" >> Copie ? : "biojava-l lists open-bio org" >> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file >> >> Jean-Charles >> >> I have it on my list to do a GenBank parser but haven't had the time. I >> can't promise anything in the next couple weeks. Can you send some details >> about what a typical use case is for your purpose? Are you trying to get the >> sequence data or are you more interested in the features? >> >> Thanks >> >> Scooter >> >> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote: >> >>> >>> I tried once again with the new version of BioJava but without succeding. >>> Any idea or suggestion? >>> >>> Thanks in advance >>> Regards, >>> >>> Jean-Charles Ferri?res >>> >>> >>>> Message du 22/10/10 10:11 >>>> De : "jc.lucky" >>>> A : biojava-l at lists.open-bio.org >>>> Copie ? : >>>> Objet : [Biojava-l] Retrieve Information from GenBank file >>>> >>>> >>>> Hi >>>> >>>> I'm trying to convert a GenBank file into a rdf file. 
The gene of >>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945 >>>> >>>> With the below code I can read the GenBank file and I manage to retrieve >>> information and convert them in a rdf format. However I don't succeed in >>> retrieving some information such as Title, protein or product. According to >>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is >>> possible to do so. >>>> Please help me find what I do wrong or what should be done to achieve my >>> goal. >>>> >>>> //read the GeneBank File >>>> public static RichSequenceIterator readFile(String input, >>>> RichSequenceBuilderFactory seqFactory, >>>> Namespace ns) >>>> throws IOException, NoSuchElementException, BioException >>>> { >>>> ns = null; >>>> InputStream stream = new FileInputStream(input); >>>> BufferedReader rdfFile = new BufferedReader(new >>> InputStreamReader(stream)); >>>> RichSequenceIterator seqs = >>> RichSequence.IOTools.readGenbankDNA(rdfFile,ns); >>>> return seqs; >>>> } >>>> >>>> //Retrieve information and convert them in rdf format >>>> public void writeToRDFFile(RichSequenceIterator rsi, String output) >>>> throws IOException, NoSuchElementException, BioException { >>>> //create model for the ontology >>>> OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, >>> null); >>>> OntClass parents; >>>> String URI = "http://pbr.wur.nl/#"; >>>> >>>> while(rsi.hasNext()) >>>> { >>>> RichSequence seq = rsi.nextRichSequence(); >>>> String id = seq.getName(); >>>> parents = model.createClass(URI + id); >>>> Set author = seq.getRankedDocRefs();//code to clean up Set&convert >>> toString >>>> String definition = seq.getDescription(); //code to clean up String >>>> //Add to model >>>> parents.addProperty(DC.description, definition); >>>> parents.addProperty(DC.publisher, authors); >>>> parents.addComment(taxonomy, "EN"); >>>> parents.addProperty(DC.type, organism); >>>> //print in rdf format >>>> model.write(out, "RDF/XML"); >>>> 
out.close(); } >>>> } >>>> >>>> >>>> Thanks, >>>> Jean-Charles Ferri?res >>> _____________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > > Une messagerie gratuite, garantie ? vie et des services en plus, ?a vous tente ? > Je cr?e ma bo?te mail www.laposte.net > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jc.lucky at laposte.net Wed Oct 27 09:34:22 2010 From: jc.lucky at laposte.net (jc.lucky) Date: Wed, 27 Oct 2010 15:34:22 +0200 (CEST) Subject: [Biojava-l] Tr: Retrieve Information from GenBank file In-Reply-To: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com> References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210> <21411489.155159.1288184635185.JavaMail.www@wwinf8222> <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com> Message-ID: <6229150.91865.1288186462649.JavaMail.www@wwinf8218> Thanks for your reply and indeed as mentioned at the bottom that is what I use to try to retrieve the maximum of information. However and that is my problem the methods described do not provide the required information. For example getRankedDocRefs() provides authors and Journals but no TITLE getFeaturesSet() only provides /organism, /mol_type and /db_xref Thereby I was asking for help and suggestion fo how to fix this "problem". Best, Jean-Charles > Message du 27/10/10 15:17 > De : "Richard Holland" > A : "jc.lucky" > Copie ? 
: "Scooter Willis" , "biojava-l lists open-bio org" > Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file > > > Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein, product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs(). > > This section of the BioJavaX docs goes into great detail where every single part of the Genbank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2 > > cheers, > Richard > > On 27 Oct 2010, at 14:03, jc.lucky wrote: > > > > > I'm more interesting in the features (regqrding protein-ID, taxon, xref, product) and retrieving information about articles (authors, title). I don't look at all to the sequence data. > > My purpose is to be able to read the GenBank file to retrieve those information so that I can proceed a conversion to a semantic rdf format file. I'm working on a specific gene at the moment but it would be interesting to extend to any GenBank file in the future. > > > > Thanks, > > > > Jean-Charles > > > > > > > >> Message du 27/10/10 12:41 > >> De : "Scooter Willis" > >> A : "jc.lucky" > >> Copie ? : "biojava-l lists open-bio org" > >> Objet : Re: [Biojava-l] Tr: Retrieve Information from GenBank file > >> > >> Jean-Charles > >> > >> I have it on my list to do a GenBank parser but haven't had the time. I > >> can't promise anything in the next couple weeks. 
Can you send some details > >> about what a typical use case is for your purpose? Are you trying to get the > >> sequence data or are you more interested in the features? > >> > >> Thanks > >> > >> Scooter > >> > >> On Wed, Oct 27, 2010 at 4:11 AM, jc.lucky wrote: > >> > >>> > >>> I tried once again with the new version of BioJava but without succeding. > >>> Any idea or suggestion? > >>> > >>> Thanks in advance > >>> Regards, > >>> > >>> Jean-Charles Ferri?res > >>> > >>> > >>>> Message du 22/10/10 10:11 > >>>> De : "jc.lucky" > >>>> A : biojava-l at lists.open-bio.org > >>>> Copie ? : > >>>> Objet : [Biojava-l] Retrieve Information from GenBank file > >>>> > >>>> > >>>> Hi > >>>> > >>>> I'm trying to convert a GenBank file into a rdf file. The gene of > >>> interest can be found a t : http://www.ncbi.nlm.nih.gov/protein/284794945 > >>>> > >>>> With the below code I can read the GenBank file and I manage to retrieve > >>> information and convert them in a rdf format. However I don't succeed in > >>> retrieving some information such as Title, protein or product. According to > >>> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan)it is > >>> possible to do so. > >>>> Please help me find what I do wrong or what should be done to achieve my > >>> goal. 
> >>>>
> >>>> //read the GenBank File
> >>>> public static RichSequenceIterator readFile(String input,
> >>>>         RichSequenceBuilderFactory seqFactory,
> >>>>         Namespace ns)
> >>>>         throws IOException, NoSuchElementException, BioException
> >>>> {
> >>>>     ns = null;
> >>>>     InputStream stream = new FileInputStream(input);
> >>>>     BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
> >>>>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile, ns);
> >>>>     return seqs;
> >>>> }
> >>>>
> >>>> //Retrieve information and convert them in rdf format
> >>>> public void writeToRDFFile(RichSequenceIterator rsi, String output)
> >>>>         throws IOException, NoSuchElementException, BioException {
> >>>>     //create model for the ontology
> >>>>     OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
> >>>>     OntClass parents;
> >>>>     String URI = "http://pbr.wur.nl/#";
> >>>>
> >>>>     while(rsi.hasNext())
> >>>>     {
> >>>>         RichSequence seq = rsi.nextRichSequence();
> >>>>         String id = seq.getName();
> >>>>         parents = model.createClass(URI + id);
> >>>>         Set author = seq.getRankedDocRefs(); //code to clean up Set & convert toString
> >>>>         String definition = seq.getDescription(); //code to clean up String
> >>>>         //Add to model
> >>>>         parents.addProperty(DC.description, definition);
> >>>>         parents.addProperty(DC.publisher, authors);
> >>>>         parents.addComment(taxonomy, "EN");
> >>>>         parents.addProperty(DC.type, organism);
> >>>>         //print in rdf format
> >>>>         model.write(out, "RDF/XML");
> >>>>         out.close(); }
> >>>> }
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Jean-Charles Ferrières
> >>>> _____________________________________________
> >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > A free mailbox, guaranteed for life, with extra services: tempted?
> > Create my mailbox at www.laposte.net
>
> --
> Richard Holland, BSc MBCS
> Operations and Delivery Director, Eagle Genomics Ltd
> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/

From andreas at sdsc.edu Wed Oct 27 20:47:50 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 27 Oct 2010 17:47:50 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

> I assume AtomCache is a new class in BioJava3?

yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0

>
> I must give you my embarrassed apology...after a bunch of testing I
> finally figured out that I had misunderstood where the Parser's error
> handling returns control and started going after the wrong exceptions.
> It does look like if setParseCAOnly is true, the reader throws exceptions on
> chains with no CA's instead of just skipping them, though the other
> chains are still parsed into the structure.

This sounds like there might be a problem with CA only.. do you have
an example ID? also: are you on biojava 1.7 or 3.0 ?

Andreas

>
> -da
>
> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote:
>> Hi Daniel,
>>
>> PDB files are better nowadays, due to remediation, however there are
>> still issues..
>>
>> it sounds like you just want to figure out how to do the try/catch
>> block properly. You could do something like this:
>>
>>     boolean splitFileOrganisation = true;
>>     AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>
>>     String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};
>>
>>     for (String pdbID : pdbIDs) {
>>
>>         try {
>>             Structure s = cache.getStructure(pdbID);
>>             if (s == null) {
>>                 System.out.println("could not find structure " + pdbID);
>>                 continue;
>>             }
>>             // do something with the structure - your inner loop
>>             System.out.println(s);
>>
>>         } catch (Exception e) {
>>             // something crazy happened...
>>             System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
>>             e.printStackTrace();
>>         }
>>     }
>>
>>
>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote:
>>> Glad to hear it, who doesn't like support or clean interfaces? No
>>> offense intended, by the way, with respect to PDB errors - obviously
>>> the PDB is an indispensable resource for all protein scientists.
>>>
>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>> stuff with 'em. My current code has a pair of nested while loops; the
>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>> and the inner iterates over the pieces from each. When
>>> StructureExceptions come out of my PDBFileReader object I want to
>>> continue the outer loop, moving on to the next set of files without
>>> executing any of the code that depends on correct StructureImpl
>>> objects from the reader (database updates, the inner loop).
>>> Since the reader's methods have their own try-catch blocks, a thrown
>>> StructureException is stopped there and never reaches my own error
>>> handling.
>>> I just need to know when those errors occur so I can skip
>>> those proteins - I am presuming that the correct entries will outweigh
>>> the problem ones by a significant factor and the overall data won't be
>>> seriously impacted.
>>>
>>> -da
>>>
>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote:
>>>> Hi Daniel,
>>>>
>>>> can you explain a bit more what you are doing, in particular what
>>>> errors you would like to deal with on your end? You should not need
>>>> to worry too much about exception handling. Are there any special
>>>> cases you are interested in? In this case we should support you with
>>>> a clean interface rather than exception handling from your end...
>>>>
>>>> Andreas
>>>>
>>>>
>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote:
>>>>> Hi all,
>>>>> Let me first say thanks to all the BioJava community members for
>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>> too trivial.
>>>>>
>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>> As is commonly known, these are often rife with errors which can lead
>>>>> to exceptions during parsing with PDBFileParser. Because
>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>> of any error checking I do. I would like to catch the exceptions up
>>>>> in my code where the parser is called, so that I can branch to a
>>>>> continue statement and have my batch processing loops move on to the
>>>>> next file.
>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>> the library? Or should I test the returned StructureImpl objects for
>>>>> possession of the fields in question?
>>>>> In that case, I'm not sure
>>>>> which properties will give the most general success information...and
>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>
>>>>> If there is some great way to check if an exception was caught down a
>>>>> series of nested method calls, please hit me over the head with it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -da

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From dasarnow at gmail.com Thu Oct 28 00:05:18 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 21:05:18 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I'm using 1.7, partially because my distro had a package for it and
partially because I was initially using the online Javadoc a lot.
PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
pasted them below. Chain A exists in the PDB but is DNA; polypeptide
chain F appears to parse correctly.

-da

org.biojava.bio.structure.StructureException: could not find chain A
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: could not find chain B
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)

From andreas at sdsc.edu Thu Oct 28 13:28:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 10:28:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

I just checked, this is a bug which is already resolved in 3.0... If
it is an issue for you, you might want to upgrade... (should be very
easy, if you start using Maven ...)

Thanks,
Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From vishalthapar at gmail.com Thu Oct 28 13:40:49 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Thu, 28 Oct 2010 13:40:49 -0400
Subject: [Biojava-l] K-mers
Message-ID:

Hi All,

I had a quick question: Does BioJava have a method to generate k-mers, or to count k-mers, in a given FASTA sequence or file? Basically, I want k-mer counts for every sequence in a FASTA file. If something like this exists, it would save me the time of writing the code.

Thanks,

Vishal

From jayunit100 at gmail.com Thu Oct 28 15:43:17 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 15:43:17 -0400
Subject: [Biojava-l] biojava maven integration
Message-ID:

Hi guys, I added the following to my pom file

<dependency>
  <groupId>org.biojava</groupId>
  <artifactId>biojava</artifactId>
  <version>3.0-alpha2</version>
</dependency>

<repository>
  <id>biojava-maven-repo</id>
  <name>BioJava repository</name>
  <url>http://www.biojava.org/download/maven/</url>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
  <releases>
    <enabled>true</enabled>
  </releases>
</repository>

But to no avail. Does anyone know how to add biojava3 to the libraries in a
maven managed application?

Thanks.

From jayunit100 at gmail.com Thu Oct 28 18:51:25 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 18:51:25 -0400
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
References: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
Message-ID:

Does anybody have a Maven POM example of how to integrate BioJava into my
application? Thanks! I'm currently using BioJava 1.7, and have put it in my
own local Maven repository.

On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy wrote:

> Not 100% certain but I *think* you want to depend on biojava-core rather
> than biojava.
>
> Later,
>
> Andy
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

--
Jay Vyas
MMSB/UCHC

From dasarnow at gmail.com Thu Oct 28 19:45:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:45:05 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

It's not a big deal - after all, if you use CA only, chains with no
CA's aren't important, and the error messages aren't that long. But
I'm going to switch anyway... I'm getting the dreaded "can't read line
length in file" error while trying to check out biojava-live/trunk,
though.

-da

On Thu, Oct 28, 2010 at 10:28, Andreas Prlic wrote:
> Hi Daniel,
>
> I just checked, this is a bug which is already resolved in 3.0... If
> it is an issue for you, you might want to upgrade... (should be very
> easy, if you start using Maven ...)
>
> Thanks,
> Andreas
>
> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow wrote:
>> I'm using 1.7, partially because my distro had a package for it and
>> partially because I was initially using the online Javadoc a lot.
>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>> pasted them below. Chain A exists in the PDB but is DNA, polypeptide
>> chain F appears to parse correctly.
>> >> -da >> >> org.biojava.bio.structure.StructureException: could not find chain A >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: could not find chain B >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: did not find chain with >> chainId >A< >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >> ? ? ? 
?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: did not find chain with >> chainId >B< >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> >> >> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote: >>>> I assume AtomCache is a new class in BioJava3? >>> >>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0 >>> >>>> >>>> I must give you my embarrassed apology...after a bunch of testing I >>>> finally figured out that I had misunderstood where the Parser's error >>>> handling returns control and started going after the wrong exceptions. 
>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on >>>> chains with no CA's instead of just skipping them, though the other >>>> chains are still parsed into the structure. >>> >>> This sounds like there might be ?a problem with CA only.. do you have >>> an example ID? also: are you on biojava 1.7 or 3.0 ? >>> >>> Andreas >>> >>> >>> >>>> >>>> -da >>>> >>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote: >>>>> Hi Daniel, >>>>> >>>>> PDB files are better nowadays, due to remediation, however there are >>>>> still issues.. >>>>> >>>>> it sounds like you just want to figure out how to do the try/catch >>>>> block properly. You could do something like that: >>>>> >>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true; >>>>> ? ? ? ? ? ? ? ?AtomCache cache = new >>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation); >>>>> >>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>>>> >>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>>>> >>>>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>>>> >>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>>>> e.getMessage()); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>>>> ? ? ? ? ? ? ? ? ? ? ? ?} >>>>> ? ? ? ? ? ? ? 
?} >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No >>>>>> offense intended, by the way, with respect to PDB errors - obviously >>>>>> the PDB is an indispensable resource for all protein scientists. >>>>>> >>>>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the >>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>>>> and the inner iterates over the pieces from each. ?When >>>>>> StructureExceptions come out of my PDBFileReader object I want to >>>>>> continue the outer loop, moving on to the next set of files without >>>>>> executing any of the code that depends on correct StructureImpl >>>>>> objects from the reader (database updates, the inner loop). >>>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>>>> StructureException is stopped there and never reaches my own error >>>>>> handling. ?I just need to know when those errors occur so I can skip >>>>>> those proteins - I am presuming that the correct entries will outweigh >>>>>> the problem ones by a significant factor and the overall data wont be >>>>>> seriously impacted. >>>>>> >>>>>> -da >>>>>> >>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>>>> Hi Daniel, >>>>>>> >>>>>>> can you explain a bit more what you are doing, in particular what >>>>>>> errors you would like to deal with on your end? ?You should not need >>>>>>> to worry too much about exception handling. Are there any special >>>>>>> cases you are interested in? ?In this case we should support you with >>>>>>> a clean interface rather than exception handling from your end... 
>>>>>>> >>>>>>> Andreas >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>>>> Hi all, >>>>>>>> Let me first say thanks to all the BioJava community members for >>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>>>> too trivial. >>>>>>>> >>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>>>> As is commonly known, these are often rife with errors which can lead >>>>>>>> to exceptions during parsing with PDBFileParser. ?Because >>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>>>> in my code where the parser is called, so that I can branch to a >>>>>>>> continue statement and have my batch processing loops move on to the >>>>>>>> next file. >>>>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>>>> the library? ?Or should I test the returned StructureImpl objects for >>>>>>>> possession of the fields in question? ?In that case, I'm not sure >>>>>>>> which properties will give the most general success information...and >>>>>>>> I'd rather not have to check for /every/ property being correct. >>>>>>>> >>>>>>>> If there is some great way to check if an exception was caught down a >>>>>>>> series of nested method calls, please hit me over the head with it. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> -da >>>>>>>> _______________________________________________ >>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>> >>> >>> >>> -- >>> ----------------------------------------------------------------------- >>> Dr. 
Andreas Prlic >>> Senior Scientist, RCSB PDB Protein Data Bank >>> University of California, San Diego >>> (+1) 858.246.0526 >>> ----------------------------------------------------------------------- >>> >> > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > From dasarnow at gmail.com Thu Oct 28 19:51:25 2010 From: dasarnow at gmail.com (Daniel Asarnow) Date: Thu, 28 Oct 2010 16:51:25 -0700 Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader In-Reply-To: References: Message-ID: Ahh, I suppose that is the "problem" referred to in the wiki? I checked out successfully from the repository on github. -da On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow wrote: > It's not a big deal - after all if you use CA only, chains with no > CA's aren't important, and the error messages aren't that long. ?But > I'm going to switch anyway... > I'm getting the dreaded "can't read line length in file" error while > trying to checkout biojava-live/trunk, though. > > -da > > On Thu, Oct 28, 2010 at 10:28, Andreas Prlic wrote: >> Hi Daniel, >> >> I just checked, this is a bug which is already resolved in 3.0... If >> it is an issue for you, you might want to upgrade... (should be very >> easy, if you start using Maven ...) >> >> Thanks, >> Andreas >> >> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow wrote: >>> I'm using 1.7, partially because my distro had a package for it and >>> partially because I was initially using the online Javadoc a lot. >>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've >>> pasted them below. ?Chain A exists in the PDB but is DNA, polypeptide >>> chain F appears to parse correctly. >>> >>> -da >>> >>> org.biojava.bio.structure.StructureException: could not find chain A >>> ? ? ? 
?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: could not find chain B >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: did not find chain with >>> chainId >A< >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >>> ? ? ? 
?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: did not find chain with >>> chainId >B< >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> >>> >>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote: >>>>> I assume AtomCache is a new class in BioJava3? >>>> >>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0 >>>> >>>>> >>>>> I must give you my embarrassed apology...after a bunch of testing I >>>>> finally figured out that I had misunderstood where the Parser's error >>>>> handling returns control and started going after the wrong exceptions. >>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on >>>>> chains with no CA's instead of just skipping them, though the other >>>>> chains are still parsed into the structure. 
>>>> >>>> This sounds like there might be ?a problem with CA only.. do you have >>>> an example ID? also: are you on biojava 1.7 or 3.0 ? >>>> >>>> Andreas >>>> >>>> >>>> >>>>> >>>>> -da >>>>> >>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote: >>>>>> Hi Daniel, >>>>>> >>>>>> PDB files are better nowadays, due to remediation, however there are >>>>>> still issues.. >>>>>> >>>>>> it sounds like you just want to figure out how to do the try/catch >>>>>> block properly. You could do something like that: >>>>>> >>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true; >>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new >>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation); >>>>>> >>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>>>>> >>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>>>>> >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>>>>> >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>>>>> e.getMessage()); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} >>>>>> ? ? ? ? ? ? ? ?} >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. 
?No >>>>>>> offense intended, by the way, with respect to PDB errors - obviously >>>>>>> the PDB is an indispensable resource for all protein scientists. >>>>>>> >>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the >>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>>>>> and the inner iterates over the pieces from each. ?When >>>>>>> StructureExceptions come out of my PDBFileReader object I want to >>>>>>> continue the outer loop, moving on to the next set of files without >>>>>>> executing any of the code that depends on correct StructureImpl >>>>>>> objects from the reader (database updates, the inner loop). >>>>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>>>>> StructureException is stopped there and never reaches my own error >>>>>>> handling. ?I just need to know when those errors occur so I can skip >>>>>>> those proteins - I am presuming that the correct entries will outweigh >>>>>>> the problem ones by a significant factor and the overall data wont be >>>>>>> seriously impacted. >>>>>>> >>>>>>> -da >>>>>>> >>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>>>>> Hi Daniel, >>>>>>>> >>>>>>>> can you explain a bit more what you are doing, in particular what >>>>>>>> errors you would like to deal with on your end? ?You should not need >>>>>>>> to worry too much about exception handling. Are there any special >>>>>>>> cases you are interested in? ?In this case we should support you with >>>>>>>> a clean interface rather than exception handling from your end... 
>>>>>>>> >>>>>>>> Andreas >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>>>>> Hi all, >>>>>>>>> Let me first say thanks to all the BioJava community members for >>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>>>>> too trivial. >>>>>>>>> >>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>>>>> As is commonly known, these are often rife with errors which can lead >>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because >>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>>>>> in my code where the parser is called, so that I can branch to a >>>>>>>>> continue statement and have my batch processing loops move on to the >>>>>>>>> next file. >>>>>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for >>>>>>>>> possession of the fields in question? ?In that case, I'm not sure >>>>>>>>> which properties will give the most general success information...and >>>>>>>>> I'd rather not have to check for /every/ property being correct. >>>>>>>>> >>>>>>>>> If there is some great way to check if an exception was caught down a >>>>>>>>> series of nested method calls, please hit me over the head with it. >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> -da >>>>>>>>> _______________________________________________ >>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> ----------------------------------------------------------------------- >>>> Dr. 
Andreas Prlic >>>> Senior Scientist, RCSB PDB Protein Data Bank >>>> University of California, San Diego >>>> (+1) 858.246.0526 >>>> ----------------------------------------------------------------------- >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > From andreas at sdsc.edu Thu Oct 28 20:06:55 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 28 Oct 2010 17:06:55 -0700 Subject: [Biojava-l] biojava maven integration In-Reply-To: References: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk> Message-ID: Hi Jay, here is some UI code that is using biojava from Maven: http://github.com/biojava/RCSB_SequenceViewer/blob/master/pom.xml Andreas On Thu, Oct 28, 2010 at 3:51 PM, Jay Vyas wrote: > Does anybody have a maven POM example of how to integrate biojava into my > application ? > Thanks! > > Im currently using biojava 1.7, and have put it in my own, local maven > repository. > > > > > On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy wrote: > >> Not 100% certain but I *think* you want to depend on biojava-core rather >> than biojava. >> >> Later, >> >> Andy >> >> On 28 Oct 2010, at 20:43, Jay Vyas wrote: >> >> > Hi guys, I added the following to my pom file >> > >> > ? >> > ? ? ? ?org.biojava >> > ? ? ? ?biojava >> > ? ? ? ?3.0-alpha2 >> > ? >> > >> > >> > ? ? ? ?biojava-maven-repo >> > ? ? ? ?BioJava repository >> > ? ? ? ?http://www.biojava.org/download/maven/ >> > ? ? ? ? >> > ? ? ? ? ? ?true >> > ? ? ? ? >> > ? ? ? ? >> > ? ? ? ? ? ?true >> > ? ? ? ? >> > ? ? >> > >> > >> > But to no avail. ?Does anyone know how to add biojava3 to the libraries >> in a >> > maven managed application >? >> > >> > Thanks. 
>> > _______________________________________________
>> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>
>
> --
> Jay Vyas
> MMSB/UCHC
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas at sdsc.edu Thu Oct 28 20:08:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:08:49 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

good, I was just about to say that... ;-)

Andreas

On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki? I
> checked out successfully from the repository on github.
>
> -da

From ayates at ebi.ac.uk Fri Oct 29 04:12:09 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 09:12:09 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Vishal,

As far as I am aware there is nothing which will generate them in BioJava at the moment.
However it is possible to do it with BioJava3:

  // Type parameters below are assumed to be NucleotideCompound
  public static void main(String[] args) {
    DNASequence d = new DNASequence("ATGATC");
    System.out.println("Non-Overlap");
    nonOverlap(d);
    System.out.println("Overlap");
    overlap(d);
  }

  public static final int KMER = 3;

  //Generate triplets overlapping
  public static void overlap(Sequence<NucleotideCompound> d) {
    List<WindowedSequence<NucleotideCompound>> l =
      new ArrayList<WindowedSequence<NucleotideCompound>>();
    for(int i=1; i<=KMER; i++) {
      SequenceView<NucleotideCompound> sub = d.getSubSequence(
        i, d.getLength());
      WindowedSequence<NucleotideCompound> w =
        new WindowedSequence<NucleotideCompound>(sub, KMER);
      l.add(w);
    }

    //Will return ATG, ATC, TGA & GAT
    for(WindowedSequence<NucleotideCompound> w: l) {
      for(List<NucleotideCompound> subList: w) {
        System.out.println(subList);
      }
    }
  }

  //Generate triplet Compound lists non-overlapping
  public static void nonOverlap(Sequence<NucleotideCompound> d) {
    WindowedSequence<NucleotideCompound> w =
      new WindowedSequence<NucleotideCompound>(d, KMER);
    //Will return ATG & ATC
    for(List<NucleotideCompound> subList: w) {
      System.out.println(subList);
    }
  }

The disadvantage of all of these solutions is that they generate lists of Compounds, so k-mer generation can/will be a memory intensive operation. This does not mean it has to be, since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA).

As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).

Hope this helps,

Andy

On 28 Oct 2010, at 18:40, Vishal Thapar wrote:

> Hi All,
>
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
>
> Thanks,
>
> Vishal

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 05:12:53 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 14:42:53 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too.
Vishal, I think Andy's suggestions may be a good option to include in
BioJava 3. Would you like to add this to BioJava 3?

Thanks again.

Regards,
Jitesh Dundas

From ayates at ebi.ac.uk Fri Oct 29 05:20:36 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 10:20:36 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Okay, a couple of points here:

1). Which biojava3 module? This sounds like something for the genomic module rather than core

2). It'll need some more work. I'm not happy about using the WindowedSequenceView in its current state.
I think an alteration to avoid it making Lists would be a good idea (plus recent developments in the API as to its main use mean this is a viable change). Also it should return the overlapping ones in base order, i.e. 1->3, 2->4 not 1->3, 4->6

Comments?

Andy

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 06:00:44 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:30:44 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Dear Sir,

Is there any way to detect patterns in the recorded k-mers?

I have a large set of miRNAs (a study of mutations and patterns in
gastric cancer). I made a record of k-mers for each sequence, but the
patterns that are generated are difficult to track.

Can BioJava do this?
Regular expressions in Java may be useful here.

I would appreciate expert advice on this, or pointers to any other software that might be useful.

Thanks,
Jitesh Dundas

From jbdundas at gmail.com Fri Oct 29 06:04:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:34:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

You are right again my friend. Definitely that would hang up my machine
with the xml file parsing activity. This is about sequence alignment
and related modules.

I will look at this today and send a fix on that. Hope that you can help.

PS: what about pattern matching in sequences? Would that be interesting to have in biojava 3?

Regards,
JD

From ayates at ebi.ac.uk Fri Oct 29 06:09:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 11:09:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID: <5832FAFE-FEC3-4A7C-9469-3C334551900B@ebi.ac.uk>

One of the disadvantages of the Sequence based system is that we have no support for searching in sequences with patterns like regular expressions. It is possible to convert a Sequence into a String & then apply the expression, but that is a sub-optimal solution. Looking at the Pattern code in Java 6, it can take in a CharSequence, so one could write an adaptor to make a Sequence act as a CharSequence for the matching procedure, but really it looks like a lot of work.

As for a way of doing matching to sequence HMMER3 is awesome :)

Andy

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jnarayan81 at gmail.com Fri Oct 29 07:46:11 2010
From: jnarayan81 at gmail.com (jitendra narayan)
Date: Fri, 29 Oct 2010 17:16:11 +0530
Subject: [Biojava-l] New Biojava Logo
Message-ID:

Dear All

I have designed a new biojava logo. Please see the detail of it:
http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
I need your valuable suggestions on the wiki page: http://biojava.org/wiki/BioJava:Logo

thanks

--
Jitendra Narayan
Bioinformatist
www.bioinformaticsonline.com

From genjasp at gmail.com Fri Oct 29 09:05:57 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 29 Oct 2010 15:05:57 +0200
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

Great Logo!!! :D

--
Alessandro Cipriani
(+39) 3206009509
(+39) 3931311792
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz

From vishalthapar at gmail.com Fri Oct 29 12:27:11 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Fri, 29 Oct 2010 12:27:11 -0400
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Andy,

This is good to have. I feel that including it as a part of core may not be necessary but having it as part of the Genomic module in biojava3 will be nice. There is a project Bioinformatica
http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have, especially if one is trying to find patterns within sequences, which is what I am trying to do. It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree api or not since I haven't used java in a while and am just switching back to it. A paper on using suffix trees to generate genome wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al., software is Tallymer). It would be some work to implement this in java as a module for biojava3 but I can see that this will be helpful.
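
[Editorial sketch] The per-sequence counting described in this message can be illustrated roughly as follows. This is a minimal sketch, assuming plain Java Strings keyed by a sequence id rather than BioJava Sequence objects; the names KmerCounts, countKmers and countPerSequence are hypothetical and not part of BioJava. It uses an ordinary map, not the suffix tree approach mentioned above, so it is only meant for modest inputs.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers in plain Strings.
final class KmerCounts {

    // Count every overlapping k-mer of length k, stepping one base at a time
    // (windows 1->3, 2->4, 3->5, ...). Sequences shorter than k yield an empty map.
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        String s = sequence.toUpperCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (int i = 0; i + k <= s.length(); i++) {
            String kmer = s.substring(i, i + k);
            Integer seen = counts.get(kmer);
            counts.put(kmer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    // Count k-mers separately for each sequence, keyed by sequence id (e.g. a FASTA header).
    static Map<String, Map<String, Integer>> countPerSequence(Map<String, String> sequencesById, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sequencesById.entrySet()) {
            result.put(e.getKey(), countKmers(e.getValue(), k));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> seqs = new LinkedHashMap<>();
        seqs.put("seq1", "ATGATC");
        seqs.put("seq2", "AAAA");
        System.out.println(countPerSequence(seqs, 3));
    }
}
```

Reading the FASTA file is deliberately left out; any parser that yields an id and a sequence string per record could feed countPerSequence.
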
Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go.

That's just my two cents. What do you think?

-vishal

--
Vishal Thapar, Ph.D.
Scientific informatics Analyst
Cold Spring Harbor Lab
Quick Bldg, Lowe Lab
1 Bungtown Road
Cold Spring Harbor, NY - 11724

From phidias51 at gmail.com Fri Oct 29 12:56:45 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 09:56:45 -0700
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

It might be useful to make the K-mer storage mechanism pluggable. This would allow a developer to use anything from a simple MultiMap to a NoSQL key-value database to store K-mers. You could plug in custom map implementations to allow you to keep a count of the number of instances of particular K-mers that were found. It might also be useful to be able to do set operations on those K-mer collections. You could use it to determine which K-mers were present in a pathogen and not in a host.

http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026

Cheers,

Mark

From ayates at ebi.ac.uk Fri Oct 29 14:32:45 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:32:45 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Vishal,

There's no suffix tree impl in BioJava but if you want to give it a shot then go for it :). I'm interested in how they work but have no time to implement it. As for efficiency, give it a shot & let's see what it does.

Andy

On 29 Oct 2010, at 17:27, Vishal Thapar wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as a part of core may not be necessary but having it as part of Genomic module in biojava3 will be nice. There is a project Bioinformatica http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have specially if one is trying to find patterns within sequences which is what I am trying to do.
It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree api or not since I haven't used java in a while and am just switching back to it. A paper on using suffix trees to generate genome wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software is tallymer). It would be some work to implement this in java as a module for biojava3 but I can see that this will be helpful. Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go. > > Thats just my two cents.What do you think? > > -vishal > > On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote: > Hi Vishal, > > As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3: > > public static void main(String[] args) { > DNASequence d = new DNASequence("ATGATC"); > System.out.println("Non-Overlap"); > nonOverlap(d); > System.out.println("Overlap"); > overlap(d); > } > > public static final int KMER = 3; > > //Generate triplets overlapping > public static void overlap(Sequence d) { > List> l = > new ArrayList>(); > for(int i=1; i<=KMER; i++) { > SequenceView sub = d.getSubSequence( > i, d.getLength()); > WindowedSequence w = > new WindowedSequence(sub, KMER); > l.add(w); > } > > //Will return ATG, ATC, TGA & GAT > for(WindowedSequence w: l) { > for(List subList: w) { > System.out.println(subList); > } > } > } > > //Generate triplet Compound lists non-overlapping > public static void nonOverlap(Sequence d) { > WindowedSequence w = > new WindowedSequence(d, KMER); > //Will return ATG & ATC > for(List subList: w) { > System.out.println(subList); > } > } > > The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory 
intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA) > > As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree). > > Hope this helps, > > Andy > > On 28 Oct 2010, at 18:40, Vishal Thapar wrote: > > > Hi All, > > > > I had a quick question: Does Biojava have a method to generate k-mers or > > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer > > counts for every sequence in a fasta file. If something like this exists it > > would save me some time to write the code. > > > > Thanks, > > > > Vishal > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > > > -- > Vishal Thapar, Ph.D. > Scientific informatics Analyst > Cold Spring Harbor Lab > Quick Bldg, Lowe Lab > 1 Bungtown Road > Cold Spring Harbor, NY - 11724 > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From ayates at ebi.ac.uk Fri Oct 29 14:35:43 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Fri, 29 Oct 2010 19:35:43 +0100 Subject: [Biojava-l] K-mers In-Reply-To: References: Message-ID: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> So if it's a suffix tree that's quite a fixed data structure so the chances of developing a pluggable mechanism there would be hard. 
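For readers of the archive, the Map-backed, per-sequence counting discussed
in this thread can be sketched in plain Java. This is only an illustration
under stated assumptions: it does not use any BioJava class, the name
KmerCounter and both of its methods are made up for the example, sequences
are plain Strings, and it needs Java 8 or later for Map.merge.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, NOT part of BioJava: counts overlapping k-mers in a
// plain String and stores the counts in whatever Map the caller supplies.
final class KmerCounter {

    // Slides a window of length k along seq one base at a time and
    // increments the count of each k-mer in the supplied store.
    static Map<String, Integer> count(String seq, int k,
                                      Map<String, Integer> store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= seq.length(); i++) {
            store.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return store;
    }

    // Counts k-mers separately for each named sequence, for example each
    // record of a FASTA file that has already been read into a Map.
    static Map<String, Map<String, Integer>> countPerSequence(
            Map<String, String> sequences, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sequences.entrySet()) {
            result.put(e.getKey(),
                       count(e.getValue(), k, new TreeMap<>()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same toy sequence as the example earlier in the thread.
        System.out.println(count("ATGATC", 3, new TreeMap<>()));
    }
}
```

Because the store is just a java.util.Map, a different implementation can
be passed in without touching the counting loop, and set operations such as
"present in one sequence but not another" can be done on copies of the
keySet() of two result maps.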
I think there also has to be a limit as to what we can sensibly do. If
people want to contribute this kind of work, though, then it will all be
very well received (with the corresponding test environment/cases, of
course).

Cheers,

Andy

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jayunit100 at gmail.com Fri Oct 29 14:40:46 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 14:40:46 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Hi guys: I'm trying to break up a biojava project built on 1.7 into
biojava 3, and am having to look up some modules etc. I'm having trouble
finding the biojava3 javadocs. Unfortunately, the 'googleable' javadocs
are all from 1.7.

Where is the formal/generated javadoc info for biojava3? Is it online?

From phidias51 at gmail.com Fri Oct 29 14:48:53 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 11:48:53 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID:

I was thinking more along the lines of using something that implements the
Map interface. This would allow a developer to easily unit test the code
without having to load the data for a genome. You would also be able to
provide different implementations to suit your needs. If you wanted to use
a suffix tree as the underlying implementation, that would be OK, but you
would have other options as well.

Cheers,

Mark

From jbdundas at gmail.com Fri Oct 29 14:50:11 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 00:20:11 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID:

I agree, Andy.
These have become standard functionalities that scientists use these days.
I am all for implementing that in BioJava3. Java isn't that efficient for
such functionalities, so we will surely need more effort compared to the
same in Python/Perl.

Regards,
Jitesh Dundas

From willishf at ufl.edu Fri Oct 29 15:20:19 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 15:20:19 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Jay

I don't think we have pushed the biojava3 docs up to a place where Google
can find them. From the nightly build
http://www.biojava.org/download/maven/org/biojava/ you can find javadocs
in the jar files. Biojava3 has two parts now.
The older 1.7 modules have been refactored into standalone jar files where
possible, but it is still a very cross-dependent code base. The newer
modules labeled biojava3- are a clean break from 1.7, so depending on what
you are doing it may be easy or difficult to start using the newer
biojava3 code without lots of changes in your code.

Thanks

Scooter

From markjschreiber at gmail.com Fri Oct 29 15:25:12 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 29 Oct 2010 15:25:12 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

It might pay to put the link to the docs on the top level page. You may
need to get an Admin to change the front page.

From ayates at ebi.ac.uk Fri Oct 29 15:34:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 20:34:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>

So we've got some basic kmer work now in SVN. If you look in the class
SequenceMixin there are two static methods there for generating the two
types of k-mers. It's not developed with Map storage in mind & I'll leave
the door open there for anyone else to come in & develop it. The k-mers
are also not unique across the sequence but it's a start :)

Share & enjoy!

Andy

On 29 Oct 2010, at 19:50, jitesh dundas wrote:

> I agree Andy. These have become standard functionalities that
> scientists do these days. I am all for implementing that in BioJava3.
> Java isn't that efficient for such functionalities so we will surely
> need more effort compared to the same in Python/Perl.
>
> Regards,
> Jitesh Dundas
>
> On 10/30/10, Andy Yates wrote:
>> So if it's a suffix tree that's quite a fixed data structure so the chances
>> of developing a pluggable mechanism there would be hard. I think there also
>> has to be a limit as to what we can sensibly do.
>> If people want to contribute this kind of work though then it'll all be
>> very well received (with the corresponding test environment/cases of course).
>>
>> Cheers,
>>
>> Andy
>>
>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>
>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>>> key-value database to store K-mers. You could plug in custom map
>>> implementations to allow you to keep a count of the number of instances of
>>> particular K-mers that were found. It might also be useful to be able to do
>>> set operations on those K-mer collections. You could use it to determine
>>> which K-mers were present in a pathogen and not in a host.
>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>> card.ly:
>>>
>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> This is good to have. I feel that including it as a part of core may not be
>>>> necessary but having it as part of Genomic module in biojava3 will be nice.
>>>> There is a project Bioinformatica
>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>> which does something similar although not exactly. It counts the k-mers in a
>>>> given fasta file but it does not count k-mers for each sequence within the
>>>> file, just all within a file. This is a good feature to have especially if
>>>> one is trying to find patterns within sequences which is what I am trying to
>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>> that counts k-mer frequency for each sequence. The way to go would be to use
>>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>>> since I haven't used java in a while and am just switching back to it. A
>>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software
>>>> is tallymer). It would be some work to implement this in java as a module
>>>> for biojava3 but I can see that this will be helpful. Again, for small fasta
>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>> files, I think that might be the way to go.
>>>>
>>>> That's just my two cents. What do you think?
>>>>
>>>> -vishal
>>>>
>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>>>>
>>>>> Hi Vishal,
>>>>>
>>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>
>>>>> public static void main(String[] args) {
>>>>>   DNASequence d = new DNASequence("ATGATC");
>>>>>   System.out.println("Non-Overlap");
>>>>>   nonOverlap(d);
>>>>>   System.out.println("Overlap");
>>>>>   overlap(d);
>>>>> }
>>>>>
>>>>> public static final int KMER = 3;
>>>>>
>>>>> //Generate triplets overlapping
>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>   for(int i=1; i<=KMER; i++) {
>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>       i, d.getLength());
>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>     l.add(w);
>>>>>   }
>>>>>
>>>>>   //Will return ATG, ATC, TGA & GAT
>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>       System.out.println(subList);
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>> //Generate triplet Compound lists non-overlapping
>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>   WindowedSequence<NucleotideCompound> w =
>>>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>   //Will return ATG & ATC
>>>>>   for(List<NucleotideCompound> subList: w) {
>>>>>     System.out.println(subList);
>>>>>   }
>>>>> }
>>>>>
>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>> Compounds so kmer generation can/will be a memory intensive operation. This
>>>>> doesn't mean it has to be since sub sequences are thin wrappers around an
>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>> iterates through each window rather than stepping through delegating onto
>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>
>>>>> As for unique k-mers that's something which would require a bit more
>>>>> engineering & would be better suited to a solution built around a Trie
>>>>> (prefix tree).
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Andy
>>>>>
>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>>>> counts for every sequence in a fasta file. If something like this exists it
>>>>>> would save me some time to write the code.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Vishal
>>>>
>>>> --
>>>> *Vishal Thapar, Ph.D.*
>>>> *Scientific informatics Analyst
>>>> Cold Spring Harbor Lab
>>>> Quick Bldg, Lowe Lab
>>>> 1 Bungtown Road
>>>> Cold Spring Harbor, NY - 11724*

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 15:43:38 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 01:13:38 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID:

That is good news. Thanks for the directions Andy.

I have already started on this. Let me analyze and write the code now.

Maybe a next month deadline is not unreachable in this case.

Here we go!
JD

On 10/30/10, Andy Yates wrote:
> So we've got some basic kmer work now in SVN. If you look in the class
> SequenceMixin there are two static methods there for generating the two
> types of k-mers. It's not developed with Map storage in mind & I'll leave
> the door open there for anyone else to come in & develop it. The k-mers are
> also not unique across the sequence but it's a start :)
>
> Share & enjoy!
>
> Andy
>
> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>
>> I agree Andy. These have become standard functionalities that
>> scientists do these days. I am all for implementing that in BioJava3.
>> Java isn't that efficient for such functionalities so we will surely
>> need more effort compared to the same in Python/Perl.
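
The per-sequence k-mer counting asked for in this thread, and the set operations suggested for comparing k-mer collections, can be sketched in plain Java. This is only a minimal illustration under stated assumptions: it works on plain Strings rather than BioJava3 Sequence objects, it is not the SequenceMixin code mentioned above, and the class and method names (KmerCounts, countOverlapping, onlyInFirst) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of BioJava.
final class KmerCounts {

    /** Counts every overlapping k-mer in one sequence, stepping one base at a time. */
    static Map<String, Integer> countOverlapping(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // LinkedHashMap keeps k-mers in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            Integer seen = counts.get(kmer);
            counts.put(kmer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    /** Returns the k-mers present in the first sequence but absent from the second. */
    static Set<String> onlyInFirst(String first, String second, int k) {
        Set<String> result = new TreeSet<String>(countOverlapping(first, k).keySet());
        result.removeAll(countOverlapping(second, k).keySet());
        return result;
    }

    public static void main(String[] args) {
        // One count map per sequence, as requested for each record of a FASTA file.
        System.out.println(countOverlapping("ATGATC", 3));
        // Set difference, e.g. k-mers found in a pathogen but not in a host.
        System.out.println(onlyInFirst("ATGATC", "GATCCA", 3));
    }
}
```

Swapping the Map implementation passed around here for another store would be one way to approach the pluggable storage idea; a trie or suffix tree, as discussed in the thread, would be the better fit for large inputs.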

From jayunit100 at gmail.com Fri Oct 29 17:39:34 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 17:39:34 -0400
Subject: [Biojava-l] JavaDocs and Backwards compatibility
Message-ID:

Thanks, I am now all up to date with biojava 3.0 and it really works well.

It really would be valuable to have some public biojava java docs! This is
because, when I completely removed biojava 1.7 and replaced it with
biojava 3.0, it was somewhat tedious to refactor/find old classes under new
package names, for example:

org.biojava3.alignment.SimpleSubstitutionMatrix;
org.biojava3.alignment.template.SubstitutionMatrix;

From andreas at sdsc.edu Fri Oct 29 17:59:23 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 29 Oct 2010 14:59:23 -0700
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Ideally I would like to see the automated build system also deploy the
latest javadocs on the website. I guess I should play around with the maven
site-plugin if it can do that ... or does anybody have a recommendation for
any other plugin?

Andreas

On Fri, Oct 29, 2010 at 12:25 PM, Mark Schreiber wrote:
> It might pay to put the link to the docs on the top level page.
>
> You may need to get an Admin to change the front page.
>
> On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis wrote:
>
>> Jay
>>
>> I don't think we have pushed the biojava3 docs up to a place where google
>> can find them. From the nightly build
>> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs in
>> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
>> into standalone jar files when possible but it is still a very cross
>> dependent code base.
>> Then the newer modules labeled biojava3- are a clean
>> break from 1.7 so depending on what you are doing it may be easy/difficult
>> to start using the newer biojava3 code without lots of changes in your
>> code.
>>
>> Thanks
>>
>> Scooter
>>
>> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas wrote:
>>
>> > Hi guys : Im trying to break up a biojava project built on 1.7 into
>> > biojava 3, and am having to look up some modules etc...
>> > Im having trouble finding biojava3 javadocs ? Unfortunately, the
>> > 'googleable' java docs are all from 1.7 .....
>> >
>> > Where is the formal/generated javadoc info for biojava3 ? is it online ?

-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From simon.rayner.cn at gmail.com Fri Oct 29 19:38:13 2010
From: simon.rayner.cn at gmail.com (simon rayner)
Date: Sat, 30 Oct 2010 07:38:13 +0800
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

just a suggestion, but might beans falling out the cup suggest that biojava
is unstable? just offering feedback, i still think it looks very slick!

On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani wrote:

> Great Logo!!!
>
> :D
>
> 2010/10/29 jitendra narayan :
> > Dear All
> > I have designed a new biojava logo.
> > Please see the detail of it:
> > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > I need your valuable suggestion on wiki page:
> > http://biojava.org/wiki/BioJava:Logo
> >
> > thanks
> >
> > --
> > Jitendra Narayan
> > Bioinformatist
> > www.bioinformaticsonline.com
>
> --
> Alessandro Cipriani
> (+39) 3206009509
> (+39) 3931311792
> http://www.cipriania.it
> skype:genjasp at gmail.com
> msn:jaspzz

-- 
Simon Rayner

State Key Laboratory of Virology
Wuhan Institute of Virology
Chinese Academy of Sciences
Wuhan, Hubei 430071
P.R.China

+86 (27) 87199895 (office)
+86 18627113001 (cell)

From phidias51 at gmail.com Fri Oct 29 19:49:54 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 16:49:54 -0700
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

The first logo looks nice; however, I don't see anything in it that connects
it to biology. The second logo is too close to Oracle's logo, and I suspect
would require written permission from them in order to use it.

Cheers,

Mark

card.ly:

On Fri, Oct 29, 2010 at 4:38 PM, simon rayner wrote:

> just a suggestion, but might beans falling out the cup suggest that biojava
> is unstable? just offering feedback, i still think it looks very slick!

From willishf at ufl.edu Fri Oct 29 20:02:32 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 20:02:32 -0400
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

Jitendra

Could you morph from the coffee liquid to a DNA helix?

Scooter

On Fri, Oct 29, 2010 at 7:49 PM, Mark Fortner wrote:

> The first logo looks nice; however, I don't see anything in it that connects
> it to biology. The second logo is too close to Oracle's logo, and I suspect
> would require written permission from them in order to use it.
>
> Cheers,
>
> Mark

From ayates at ebi.ac.uk Sat Oct 30 05:20:30 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Sat, 30 Oct 2010 10:20:30 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>

You should be aware I just found a bug in the code. This has been fixed but
the bug will still be in the alpha3 release. I would recommend either
building a version yourself or, if Andreas can post up the continuous
integration server address, there will be a release tonight.

Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news. Thanks for the directions Andy.
>
> I have already started on this. Let me analyze and write the code now.
>
> Maybe a next month deadline is not unreachable in this case.
>
> Here we go!
> JD
>
> On 10/30/10, Andy Yates wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>>
>> Share & enjoy!
>>
>> Andy
>>
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>
>>> I agree Andy. These have become standard functionalities that
>>> scientists do these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionalities so we will surely
>>> need more effort compared to the same in Python/Perl.
>>>
>>> Regards,
>>> Jitesh Dundas
>>>
>>> On 10/30/10, Andy Yates wrote:
>>>> So if it's a suffix tree that's quite a fixed data structure so the chances
>>>> of developing a pluggable mechanism there would be hard. I think there also
>>>> has to be a limit as to what we can sensibly do.
>>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Vishal >>>>>>>> _______________________________________________ >>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> >>>>>>> -- >>>>>>> Andrew Yates Ensembl Genomes Engineer >>>>>>> EMBL-EBI Tel: +44-(0)1223-492538 >>>>>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >>>>>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *Vishal Thapar, Ph.D.* >>>>>> *Scientific informatics Analyst >>>>>> Cold Spring Harbor Lab >>>>>> Quick Bldg, Lowe Lab >>>>>> 1 Bungtown Road >>>>>> Cold Spring Harbor, NY - 11724* >>>>>> _______________________________________________ >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>> >>>>> _______________________________________________ >>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >>>> -- >>>> Andrew Yates Ensembl Genomes Engineer >>>> EMBL-EBI Tel: +44-(0)1223-492538 >>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>> >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> >> -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From jbdundas at gmail.com Sat Oct 30 05:40:35 2010 From: jbdundas at gmail.com (jitesh dundas) Date: Sat, 30 
Oct 2010 15:10:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
 <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
 <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID:

I got your point, Andy. Thanks.

On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates wrote:

> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or if Andreas can post up the continuous
> integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
>
> On 29 Oct 2010, at 20:43, jitesh dundas wrote:
>
> > That is good news. Thanks for the directions Andy.
> >
> > I have already started on this. Let me analyze and write the code now.
> >
> > Maybe a next month deadline is not unreachable in this case.
> >
> > Here we go!
> > JD
> >
> > On 10/30/10, Andy Yates wrote:
> >> So we've got some basic kmer work now in SVN. If you look in the class
> >> SequenceMixin there are two static methods there for generating the two
> >> types of k-mers. It's not developed with Map storage in mind & I'll leave
> >> the door open there for anyone else to come in & develop it. The k-mers
> >> are also not unique across the sequence but it's a start :)
> >>
> >> Share & enjoy!
> >>
> >> Andy
> >>
> >> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
> >>
> >>> I agree Andy. These have become standard functionalities that
> >>> scientists do these days. I am all for implementing that in BioJava3.
> >>> Java isn't that efficient for such functionalities so we will surely
> >>> need more effort compared to the same in Python/Perl.
> >>>
> >>> Regards,
> >>> Jitesh Dundas
> >>>
> >>> On 10/30/10, Andy Yates wrote:
> >>>> So if it's a suffix tree that's quite a fixed data structure so the
> >>>> chances of developing a pluggable mechanism there would be hard. I
> >>>> think there also has to be a limit as to what we can sensibly do. If
> >>>> people want to contribute this kind of work though then it'll all be
> >>>> very well received (with the corresponding test environment/cases of
> >>>> course).
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Andy
> >>>>
> >>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
> >>>>
> >>>>> It might be useful to make the K-mer storage mechanism pluggable. This
> >>>>> would allow a developer to use anything from a simple MultiMap, to a
> >>>>> NoSQL key-value database to store K-mers. You could plug in custom map
> >>>>> implementations to allow you to keep a count of the number of
> >>>>> instances of particular K-mers that were found. It might also be
> >>>>> useful to be able to do set operations on those K-mer collections.
> >>>>> You could use it to determine which K-mers were present in a pathogen
> >>>>> and not in a host.
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> >>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> card.ly:
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
> >>>>>
> >>>>>> Hi Andy,
> >>>>>>
> >>>>>> This is good to have. I feel that including it as a part of core may
> >>>>>> not be necessary but having it as part of Genomic module in biojava3
> >>>>>> will be nice. There is a project Bioinformatica
> >>>>>>
> >>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
> >>>>>>
> >>>>>> which does something similar although not exactly. It counts the
> >>>>>> k-mers in a given fasta file but it does not count k-mers for each
> >>>>>> sequence within the file, just all within a file. This is a good
> >>>>>> feature to have especially if one is trying to find patterns within
> >>>>>> sequences which is what I am trying to do. It would most certainly be
> >>>>>> helpful to have a k-mer counting algorithm that counts k-mer
> >>>>>> frequency for each sequence. The way to go would be to use suffix
> >>>>>> trees. Again I don't know if biojava has a suffix tree api or not
> >>>>>> since I haven't used java in a while and am just switching back to
> >>>>>> it. A paper on using suffix trees to generate genome wide k-mer
> >>>>>> frequencies is:
> >>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> >>>>>> software is tallymer). It would be some work to implement this in
> >>>>>> java as a module for biojava3 but I can see that this will be
> >>>>>> helpful. Again, for small fasta files, it might not be efficient to
> >>>>>> create a suffix tree but for bigger files, I think that might be the
> >>>>>> way to go.
> >>>>>>
> >>>>>> That's just my two cents. What do you think?
> >>>>>>
> >>>>>> -vishal
> >>>>>>
> >>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
> >>>>>>
> >>>>>>> Hi Vishal,
> >>>>>>>
> >>>>>>> As far as I am aware there is nothing which will generate them in
> >>>>>>> BioJava at the moment. However it is possible to do it with BioJava3:
> >>>>>>>
> >>>>>>> public static void main(String[] args) {
> >>>>>>>   DNASequence d = new DNASequence("ATGATC");
> >>>>>>>   System.out.println("Non-Overlap");
> >>>>>>>   nonOverlap(d);
> >>>>>>>   System.out.println("Overlap");
> >>>>>>>   overlap(d);
> >>>>>>> }
> >>>>>>>
> >>>>>>> public static final int KMER = 3;
> >>>>>>>
> >>>>>>> //Generate triplets overlapping
> >>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
> >>>>>>>   List<WindowedSequence<NucleotideCompound>> l =
> >>>>>>>       new ArrayList<WindowedSequence<NucleotideCompound>>();
> >>>>>>>   for(int i=1; i<=KMER; i++) {
> >>>>>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >>>>>>>         i, d.getLength());
> >>>>>>>     WindowedSequence<NucleotideCompound> w =
> >>>>>>>         new WindowedSequence<NucleotideCompound>(sub, KMER);
> >>>>>>>     l.add(w);
> >>>>>>>   }
> >>>>>>>
> >>>>>>>   //Will return ATG, ATC, TGA & GAT
> >>>>>>>   for(WindowedSequence<NucleotideCompound> w: l) {
> >>>>>>>     for(List<NucleotideCompound> subList: w) {
> >>>>>>>       System.out.println(subList);
> >>>>>>>     }
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> //Generate triplet Compound lists non-overlapping
> >>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >>>>>>>   WindowedSequence<NucleotideCompound> w =
> >>>>>>>       new WindowedSequence<NucleotideCompound>(d, KMER);
> >>>>>>>   //Will return ATG & ATC
> >>>>>>>   for(List<NucleotideCompound> subList: w) {
> >>>>>>>     System.out.println(subList);
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> The disadvantage of all of these solutions is that they generate
> >>>>>>> lists of Compounds so kmer generation can/will be a memory intensive
> >>>>>>> operation. This does not mean it has to be since sub sequences are
> >>>>>>> thin wrappers around an underlying sequence. Also the overlap
> >>>>>>> solution is non-optimal since it iterates through each window rather
> >>>>>>> than stepping through delegating onto each base in turn (hence why
> >>>>>>> we get ATG & ATC before TGA)
> >>>>>>>
> >>>>>>> As for unique k-mers that's something which would require a bit more
> >>>>>>> engineering & would be better suited to a solution built around a
> >>>>>>> Trie (prefix tree).
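
[Editorial sketch] The per-sequence k-mer counts Vishal asks for, and the "present in a pathogen and not in a host" set operation Mark suggests, can be illustrated in plain Java on String sequences. This is not BioJava API: the class name KmerCounter and both method names are hypothetical, and a HashMap stands in for the Trie or suffix tree discussed in the thread. It only shows the sliding-window idea under those assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of BioJava: counts k-mers in plain Strings.
final class KmerCounter {

    // Counts every overlapping k-mer in one sequence (window slides one base at a time).
    static Map<String, Integer> countOverlapping(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            Integer seen = counts.get(kmer);
            counts.put(kmer, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    // Returns the k-mers that occur in the first sequence but never in the second.
    static Set<String> uniqueTo(String first, String second, int k) {
        Set<String> result = new HashSet<String>(countOverlapping(first, k).keySet());
        result.removeAll(countOverlapping(second, k).keySet());
        return result;
    }
}
```

Calling countOverlapping once per FASTA record would give one count table per sequence. For genome-scale input the memory cost of one String per k-mer is probably too high, which is where the Trie or suffix tree mentioned above would come in.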
> >>>>>>> > >>>>>>> Hope this helps, > >>>>>>> > >>>>>>> Andy > >>>>>>> > >>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote: > >>>>>>> > >>>>>>>> Hi All, > >>>>>>>> > >>>>>>>> I had a quick question: Does Biojava have a method to generate > k-mers > >>>>>> or > >>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want > >>>>>> k-mer > >>>>>>>> counts for every sequence in a fasta file. If something like this > >>>>>> exists > >>>>>>> it > >>>>>>>> would save me some time to write the code. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Vishal > >>>>>>>> _______________________________________________ > >>>>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>>> > >>>>>>> -- > >>>>>>> Andrew Yates Ensembl Genomes Engineer > >>>>>>> EMBL-EBI Tel: +44-(0)1223-492538 > >>>>>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > >>>>>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> *Vishal Thapar, Ph.D.* > >>>>>> *Scientific informatics Analyst > >>>>>> Cold Spring Harbor Lab > >>>>>> Quick Bldg, Lowe Lab > >>>>>> 1 Bungtown Road > >>>>>> Cold Spring Harbor, NY - 11724* > >>>>>> _______________________________________________ > >>>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>>>> > >>>>> _______________________________________________ > >>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org > >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l > >>>> > >>>> -- > >>>> Andrew Yates Ensembl Genomes Engineer > >>>> EMBL-EBI Tel: +44-(0)1223-492538 > >>>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > >>>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Biojava-l 
mailing list - Biojava-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l

From andreas at sdsc.edu Sat Oct 30 06:50:48 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 30 Oct 2010 06:50:48 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
 <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
 <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID:

just kicked off a new build.. alpha4 should be on the servers shortly...

you don't need cruisecontrol for a release. Anybody with an ssh
account on portal.open-bio (and set up ssh keys correctly) can do

mvn release:clean release:prepare release:perform

A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates wrote:
> You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy

--
-----------------------------------------------------------------------
Dr.
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From dasarnow at gmail.com Sun Oct 31 19:56:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 31 Oct 2010 16:56:05 -0700
Subject: [Biojava-l] Superimposing structure pieces
Message-ID:

I've been trying to pull out pieces of protein chains and superimpose
them...my current code (as generic-ified code snips below) works, but
I wonder if it couldn't be faster.
Has anyone worked on similar methods? Any other advice?

Best regards everyone,
da

Getting residue CA's as Atom[]:

for (int i = 0; i < length; i++) {
    someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
}

Superimposing/aligning:

SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
Matrix rot = svds.getRotation();
Atom trans = svds.getTranslation();
for (int i = 0; i < length; i++) {
    Calc.rotate(someAtoms1[i], rot);
    Calc.shift(someAtoms1[i], trans);
}
SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);

From andreas at sdsc.edu Sun Oct 31 23:08:00 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sun, 31 Oct 2010 23:08:00 -0400
Subject: [Biojava-l] Superimposing structure pieces
In-Reply-To:
References:
Message-ID:

Hi Daniel,

a couple of thoughts when I see this:

- in case you have not seen this yet, take a look at the documentation on
structure alignment: http://biojava.org/wiki/BioJava:CookBook:PDB:align

- the direction of your rotations is wrong: the SVDSuperimposer gives
you the operations to be applied on the second atom set.

- there are some utility methods in StructureTools that might come in handy, e.g.
Atom[] ca1 = StructureTools.getAtomCAArray(structure1);
Atom[] ca2 = StructureTools.getAtomCAArray(structure2);

- any particular reason why you are working with SEQRES records?
for the superposition it might be sufficient to work with the ATOM
records only, which can give you a quicker parsing of the files, since
you can turn off the alignment of ATOM and SEQRES. Having said that,
there can be situations when you actually might want it, e.g. see
SmithWaterman3Daligner, which does a sequence based structure
alignment...

hope that helps,

Andreas

On Sun, Oct 31, 2010 at 7:56 PM, Daniel Asarnow wrote:
> I've been trying to pull out pieces of protein chains and superimpose
> them...my current code (as generic-ified code snips below) works, but
> I wonder if it couldn't be faster.
> Has anyone worked on similar methods? Any other advice?
>
> Best regards everyone,
> da
>
> Getting residue CA's as Atom[]:
>
> for (int i; i < length; i++) {
>     someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
> }
>
> Superimposing/aligning:
>
> SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
> Matrix rot = svds.getRotation();
> Atom trans = svds.getTranslation();
> for (int i = 0; i < length; i++) {
>     Calc.rotate(someAtoms1[i], rot);
>     Calc.shift(someAtoms1[i], trans);
> }
> SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From andrew.mcsweeny at rockets.utoledo.edu Tue Oct 12 21:41:07 2010
From: andrew.mcsweeny at rockets.utoledo.edu (McSweeny, Andrew J)
Date: Tue, 12 Oct 2010 21:41:07 +0000
Subject: [Biojava-l] How to share code while protecting copyrights?
Message-ID: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>

Hi,

I am working on a project which simulates sexual reproduction in a population of digital organisms. Their genome is just a contig from hg18. It's pretty interesting and I can talk more about it in the future....

Anyways, how can I share my code for this project without having to worry that someone else will use it to publish a paper before my group does?

I'm certain nobody in the open source community would do that, but how do I convince my group that opening our project to BioJava is a good idea?
-Andrew

From andreas at sdsc.edu Wed Oct 13 06:02:34 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 12 Oct 2010 23:02:34 -0700
Subject: [Biojava-l] biojava 3.0 release plan
Message-ID:

Hi,

BioJava 3 has matured massively in SVN during this year and it is time to prepare a first release. I propose the following release plan. See also two other topics for discussion below.

Release Plan 3.0

* Alpha release build(s)
Over the next few days I will start to provide a first alpha release build. This will be followed by semi-regular follow-up alpha builds (depending on SVN activity).

- During the next weeks any missing features should be committed to SVN. Refactoring of code can still be done during this time.
- Add and update documentation in wiki
- Module maintainers: check compile warnings for your modules in automated builds. Make sure no compile warnings are being displayed.

* Beta release build(s)
The first beta release is scheduled for the weekend of Nov 21st.

- From this point on only minor changes (bug fixes) should be added to the code base
- Module maintainers: check and update javadoc for your modules

* Release 3.0
The 3.0 release is scheduled for Dec 12th.

There are two things we should still discuss:

* backwards compatibility:
the current "core" module contains tons of legacy 1.7 code. Shall I go ahead and delete this module?

* documentation:
The wiki contains tons of documentation for 1.7 which will not be useful for 3.0. As a procedure for cleaning this up and avoiding confusion I suggest moving all 1.7-related documentation into a special section of the wiki. All top-level links to documentation should point to 3.0. Any other suggestions?

Andreas

From markjschreiber at gmail.com Wed Oct 13 09:26:04 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:26:04 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com>
Message-ID:

Hi -

My understanding of copyright is that it is yours as soon as you assert that it is your creation. You can simply add a copyright statement to each file containing the code (in the header for example). The reality is that defending copyright is your responsibility. If someone violates it, you have to take them to court or issue a legal letter.

You can also put an appropriate license on the code specifying how it can be used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick one of these that best matches your needs. BioJava code is LGPL so if you want your code to go into the BioJava code base you will need to make your code LGPL.

It's always a good idea to add @author tags to Java code to ensure appropriate attribution.

Finally, if someone steals your code and publishes results before you then you can always make a complaint to the journal editors. If it is a reputable journal, and you have reasonable proof, the editor should take some action such as forcing a retraction.

You can also make a distribution agreement saying that if someone uses this code they agree not to publish without first consulting you. If you want to make it really watertight, get a lawyer and explain specifically what you want to share and what you want to protect or prevent.

- Mark

On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J <andrew.mcsweeny at rockets.utoledo.edu> wrote:

> Hi,
>
> I am working on a project which simulates sexual reproduction in a
> population of digital organisms. Their genome is just a contig from hg18.
> It's pretty interesting and I can talk more about it in the future....
>
> Anyways, how can I share my code for this project without having to worry
> that someone else will use it to publish a paper before my group does?
>
> I'm certain nobody in the open source community would do that, but how do I
> convince my group that opening our project to BioJava is a good idea?
>
> -Andrew
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From markjschreiber at gmail.com Wed Oct 13 09:28:05 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 13 Oct 2010 11:28:05 +0200
Subject: [Biojava-l] [Biojava-dev] biojava 3.0 release plan
In-Reply-To:
References:
Message-ID:

Hi Andreas -

Excellent work from the team this year.

I would recommend removing as much legacy code as possible and removing (preferably rewriting) the legacy documentation. I think it would be better to have no docs than out of date docs.

- Mark

On Wed, Oct 13, 2010 at 8:02 AM, Andreas Prlic wrote:

> Hi,
>
> BioJava 3 has matured massively in SVN during this year and it is time to
> prepare a first release. I propose the following release plan. See also two
> other topics for discussion below.
>
> Release Plan 3.0
>
> * Alpha release build(s)
> during the next days I will start to provide a first alpha release build.
> This will be followed by semi-regular follow up alpha builds (depending on
> SVN activity)
>
> - During the next weeks any missing features should be committed to SVN.
> Refactoring of code can still be done during this time.
> - Add and update documentation in wiki
> - Module maintainers: check compile warnings for your modules in automated
> builds. Make sure no compile warnings are being displayed.
>
>
> * Beta release build(s)
> the first beta release is scheduled for the weekend Nov 21st.
>
> - From this point on only minor changes (bug fixes) should be added to the
> code base
> - Module maintainers: check and update javadoc for your modules
>
> * Release 3.0
> The 3.0 Release is scheduled for Dez 12th
>
>
> There are two things we should still discuss:
>
> * backwards compatibility:
> the current "core" module contains tons of legacy 1.7 code. Shall I go
> ahead and delete this module?
>
> * documentation:
> The wiki contains tons of documentation for 1.7 which will not be useful
> for 3.0. As a procedure for cleaning this up and avoiding confusion I
> suggest to move all 1.7 related docu into a special section of the wiki.
> All toplevel links to documentation should point to 3.0. Any other
> suggestions?
>
>
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>

From paolo.romano at istge.it Wed Oct 13 10:17:27 2010
From: paolo.romano at istge.it (Paolo Romano)
Date: Wed, 13 Oct 2010 12:17:27 +0200
Subject: [Biojava-l] NETTAB 2010 Biological Wikis: Call for posters and participation
Message-ID: <201010131018.o9DAHTjq009877@clus2.istge.it>

Apologies for duplicates
====

Joint NETTAB 2010 and BBCC 2010 workshop
Biological Wikis
November 29 - December 1, 2010
Congress Center, University of Naples "Federico II", Naples, Italy
http://www.nettab.org/2010/

The joint NETTAB and BBCC 2010 workshop on "Biological Wikis" promises to be a great meeting for all researchers involved in the exploitation of wikis in biology.

Come and discuss your ideas and doubts with such scientists as Alex Bateman, Alexander Pico, Andrew Su, Dan Bolser, Robert Hoffmann, Thomas Kelder, Mike Cariaso, Adam Godzik, Luca Toldo and many others who, we hope, will join the workshop.

It's a great chance to follow smart tutorials and lectures on WikiPathways, WikiGenes, Semantic Wiki, PDBWiki, Gene Wiki and a proficient use of Wikipedia.
See a list of keynote speakers and tutorials at http://www.nettab.org/2010/progr.html . There is still time to submit abstracts for posters and software demonstrations until October 17, 2010! The complete Call is available on-line at http://www.nettab.org/2010/call.html . Registration is open at http://www.nettab.org/2010/rform.html . Register by October 29, 2010 to take advantage of early registration fees. A reduction of 20 euro applies to all fees for members of ISCB and other societies and networks. More reductions are foreseen for PhD students. Further information is available at http://www.nettab.org/2010/ . Looking forward to seeing you soon in Naples. Paolo Romano Paolo Romano (paolo.romano at istge.it) Bioinformatics National Cancer Research Institute (IST) Largo Rosanna Benzi, 10, I-16132, Genova, Italy Tel: +39-010-5737-288 Fax: +39-010-5737-295 Skype: p.romano Web: http://www.nettab.org/promano/
From pjotr.public23 at thebird.nl Wed Oct 13 11:15:41 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 13:15:41 +0200 Subject: [Biojava-l] BioJava translation Message-ID: <20101013111541.GA512@thebird.nl> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is rather slow. In fact, the biopython equivalent in native Python is twice as fast. EMBOSS is again orders of magnitude faster. I am using something like

  rna = RNATools.createRNA(nucleotides);
  aa = RNATools.translate(rna);

Embarrassingly, even the R version is faster in the GeneR module, as it uses a C module. I have a feeling this has to do with typed object creation at every level, whereas Python and others use plain character Strings. Any plans for speeding this up on the JVM? Pj.
From pjotr.public23 at thebird.nl Wed Oct 13 11:40:37 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 13:40:37 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> Message-ID: <20101013114037.GA1166@thebird.nl> Great! You mean BJ3 translation should work? Do you have a short example of use? Pj. On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: > BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. From holland at eaglegenomics.com Wed Oct 13 11:27:05 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 13 Oct 2010 12:27:05 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013111541.GA512@thebird.nl> References: <20101013111541.GA512@thebird.nl> Message-ID: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. On 13 Oct 2010, at 12:15, Pjotr Prins wrote: > I am using biojava-1.7.1 nucleotide -> amino acid translation. It is > rather slow. In fact, the biopython equivalent in native Python is > twice as fast. EMBOSS is again magnitudes faster. I am using > something like > > rna = RNATools.createRNA(nucleotides); > aa = RNATools.translate(rna); > > Embarrassingly, even the R version is faster in the GeneR module, as > it uses a C module. > > I have a feeling this has to do with typed object creation at every > level, whereas Python and others uses plain character Strings. > > Any plans for speeding this up on the JVM? > > Pj. 
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From holland at eaglegenomics.com Wed Oct 13 11:42:21 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 13 Oct 2010 12:42:21 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013114037.GA1166@thebird.nl> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> Message-ID: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com> Afraid I'm a bit out of touch but someone else on this list should be able to help. Andy or Andreas maybe? On 13 Oct 2010, at 12:40, Pjotr Prins wrote: > Great! You mean BJ3 translation should work? Do you have a short > example of use? > > Pj. > > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. 
-- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From pjotr.public23 at thebird.nl Wed Oct 13 11:48:07 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 13:48:07 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <507A4D9B-ADF5-47D9-87FE-A34E9D4C519A@eaglegenomics.com> Message-ID: <20101013114807.GA1569@thebird.nl> On Wed, Oct 13, 2010 at 12:42:21PM +0100, Richard Holland wrote: > Afraid I'm a bit out of touch but someone else on this list should > be able to help. Andy or Andreas maybe? It is not on the wiki yet, and I must admit I get lost in the source tree. Any short example will do, translating from an ntseq (String) to aaseq (String). Pj. From ayates at ebi.ac.uk Wed Oct 13 11:50:25 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 13 Oct 2010 12:50:25 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013114037.GA1166@thebird.nl> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> Message-ID: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> As of the moment there are the translation test cases which is the best documentation: http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing. Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. 
The Javadoc should be good enough to help you through the different modes available Andy On 13 Oct 2010, at 12:40, Pjotr Prins wrote: > Great! You mean BJ3 translation should work? Do you have a short > example of use? > > Pj. > > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From koen.bruynseels at cropdesign.com Wed Oct 13 12:16:00 2010 From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com) Date: Wed, 13 Oct 2010 14:16:00 +0200 Subject: [Biojava-l] Koen Bruynseels is out of the office. Message-ID: I will be out of the office starting 10/12/2010 and will not return until 10/14/2010. I will respond to your message when I return. From andreas at sdsc.edu Wed Oct 13 15:42:44 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 13 Oct 2010 08:42:44 -0700 Subject: [Biojava-l] BioJava translation In-Reply-To: <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> Message-ID: Hi Andy, any chance to add some wiki documentation for this as well? Would be great.... 
Andreas On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates wrote: > As of the moment there are the translation test cases which is the best > documentation: > > > http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java > > This hopefully will give you a good idea about how to go about it. I was > managing over 1000 translations per second of BRCA2 going from mRNA to > peptide with checks. YMMV but I hope this is a lot faster than what you're > currently seeing. > > Translation supports a lot of different modes with TranscriptionEngine > being the place to configure this. The Javadoc should be good enough to help > you through the different modes available > > Andy > > > On 13 Oct 2010, at 12:40, Pjotr Prins wrote: > > > Great! You mean BJ3 translation should work? Do you have a short > > example of use? > > > > Pj. > > > > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: > >> BJ3 should be replacing most sequence operations with string operations, > making the whole thing much faster. > > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From ayates at ebi.ac.uk Wed Oct 13 15:46:58 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 13 Oct 2010 16:46:58 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> Message-ID: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> I will try my best to Andy On 13 Oct 2010, at 16:42, Andreas Prlic wrote: > > Hi Andy, > > any chance to add some wiki documentation for this as well? Would be great.... > > Andreas > > > On Wed, Oct 13, 2010 at 4:50 AM, Andy Yates wrote: > As of the moment there are the translation test cases which is the best documentation: > > http://github.com/biojava/biojava/blob/master/biojava3-core/src/test/java/org/biojava3/core/sequence/TranslationTest.java > > This hopefully will give you a good idea about how to go about it. I was managing over 1000 translations per second of BRCA2 going from mRNA to peptide with checks. YMMV but I hope this is a lot faster than what you're currently seeing. > > Translation supports a lot of different modes with TranscriptionEngine being the place to configure this. The Javadoc should be good enough to help you through the different modes available > > Andy > > > On 13 Oct 2010, at 12:40, Pjotr Prins wrote: > > > Great! You mean BJ3 translation should work? Do you have a short > > example of use? > > > > Pj. > > > > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: > >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. 
> > _______________________________________________ > > Biojava-l mailing list - Biojava-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > > > -- > ----------------------------------------------------------------------- > Dr. Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From pjotr.public23 at thebird.nl Wed Oct 13 15:58:44 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 17:58:44 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> Message-ID: <20101013155844.GA2918@thebird.nl> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote: > I will try my best to Make sure to add that the sequence should be uppercase. It took me a while to work that out, as all I got was a null pointer exception. Pj.
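One plausible way a lowercase sequence can end in a bare null pointer exception is sketched below in plain Java. This is only an illustration of the general failure mode, not a description of what BioJava 3 does internally: the class name CaseSensitiveLookup, the two-entry table and both method names are hypothetical. A lookup table keyed on uppercase codons returns null for "atg", and auto-unboxing that null into a char throws the exception with no hint about the cause.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not BioJava code. */
final class CaseSensitiveLookup {
    // Tiny stand-in table keyed on UPPERCASE codons.
    private static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("ATG", 'M');
        TABLE.put("TAA", '*');
    }

    // Naive lookup: an unknown key yields null, and unboxing null
    // into a char throws NullPointerException.
    static char naive(String codon) {
        return TABLE.get(codon);
    }

    // Defensive lookup: normalise case first and name the offending codon
    // if it is still unknown.
    static char checked(String codon) {
        Character aminoAcid = TABLE.get(codon.toUpperCase(Locale.ROOT));
        if (aminoAcid == null) {
            throw new IllegalArgumentException("Unknown codon: " + codon);
        }
        return aminoAcid;
    }

    public static void main(String[] args) {
        System.out.println(checked("atg"));
        try {
            naive("atg");
        } catch (NullPointerException e) {
            System.out.println("naive lookup failed with NullPointerException");
        }
    }
}
```

Normalising the case at the lookup boundary, or failing with a message that names the symbol, would make this kind of problem much quicker to diagnose.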
From holland at eaglegenomics.com Wed Oct 13 16:02:24 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 13 Oct 2010 17:02:24 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013155844.GA2918@thebird.nl> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> <20101013155844.GA2918@thebird.nl> Message-ID: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com> whuh??? Shouldn't we be coding to cater for all case mixtures?! On 13 Oct 2010, at 16:58, Pjotr Prins wrote: > On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote: >> I will try my best to > > Make sure to add the sequence should be uppercase. Took me a while to > crack that, as I only got a null pointer exception. > > Pj. > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From ayates at ebi.ac.uk Wed Oct 13 16:11:40 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 13 Oct 2010 17:11:40 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013114037.GA1166@thebird.nl> <5E2E80E1-0ABB-485A-B8F3-98E89DD33705@ebi.ac.uk> <3BCA5112-C87C-4833-A04A-5D6BC2F52A44@ebi.ac.uk> <20101013155844.GA2918@thebird.nl> <1FB97ED8-6A64-42B8-8680-6ED60A44B70F@eaglegenomics.com> Message-ID: <7740A206-98A0-4FBC-9CF8-B1AC0DE7D859@ebi.ac.uk> I also thought we were as well. I can investigate On 13 Oct 2010, at 17:02, Richard Holland wrote: > whuh??? 
Shouldn't we be coding to cater for all case mixtures?! > > > On 13 Oct 2010, at 16:58, Pjotr Prins wrote: > >> On Wed, Oct 13, 2010 at 04:46:58PM +0100, Andy Yates wrote: >>> I will try my best to >> >> Make sure to add the sequence should be uppercase. Took me a while to >> crack that, as I only got a null pointer exception. >> >> Pj. >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From pjotr.public23 at thebird.nl Wed Oct 13 16:13:36 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 18:13:36 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> Message-ID: <20101013161336.GA3184@thebird.nl> On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: > BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. Good news, BJ3 is a lot faster! The previous version took 2 minutes for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my modest Thinkpad X61 laptop. After parsing the Fasta and turning it into an upper case string the actual translation takes 16sec. Only the C implementations are faster. 
Here is the relevant Scala code:

  import bio._
  import java.io._
  import org.biojava3.core.sequence._
  import org.biojava3.core.sequence.transcription.TranscriptionEngine
  import org.biojava3.core.sequence.io.IUPACParser

  // fetching infile from command line...

  IUPACParser.getInstance().getTable(1); // not sure we need this
  IUPACParser.getInstance().getTable("UNIVERSAL");
  val engine = TranscriptionEngine.getDefault()
  val f = new FastaReader(infile)
  f.foreach {
    res =>
      val (id,tag,dna) = res
      println(List(">",id).mkString)
      // BJ3 currently needs uppercase input
      val dna2 = new DNASequence(dna.mkString.toUpperCase)
      val rna = dna2.getRNASequence(engine)
      println(rna.getProteinSequence(engine))
  }

prints:

  >B0222.10
  MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG
  >B0222.11
  MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS
  (...)

Pj.
From ayates at ebi.ac.uk Wed Oct 13 16:25:41 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 13 Oct 2010 17:25:41 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013161336.GA3184@thebird.nl> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> Message-ID: That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice. I wonder what the C version does to make itself even faster Andy On 13 Oct 2010, at 17:13, Pjotr Prins wrote: > On Wed, Oct 13, 2010 at 12:27:05PM +0100, Richard Holland wrote: >> BJ3 should be replacing most sequence operations with string operations, making the whole thing much faster. > > Good news, BJ3 is a lot faster!
The previous version took 2 minutes > for the C.elegans genome (33 Mb), the BJ3 version takes 27sec on my > modest Thinkpad X61 laptop. After parsing the Fasta and turning it > into an upper case string the actual translation takes 16sec. > > Only the C implementations are faster. > > Here the relevant Scala code: > > import bio._ > import java.io._ > import org.biojava3.core.sequence._ > import org.biojava3.core.sequence.transcription.TranscriptionEngine > import org.biojava3.core.sequence.io.IUPACParser > > // fetching infile from command line... > > IUPACParser.getInstance().getTable(1); // not sure we need this > IUPACParser.getInstance().getTable("UNIVERSAL"); > val engine = TranscriptionEngine.getDefault() > val f = new FastaReader(infile) > f.foreach { > res => > val (id,tag,dna) = res > println(List(">",id).mkString) > val dna2 = new DNASequence(dna.mkString.toUpperCase) > val rna = dna2.getRNASequence(engine) > println(rna.getProteinSequence(engine)) > } > } > > prints: > >> B0222.10 > MLYWNDLNTVGIVADTIWKYYADQYKRLIKEHSKIRFNPLLHAASVIRIFDIHINLFNNFVTFLLVIFLYIFLIYYVTFFVFPFGPLRVSHWMRFFKIIISRSHLG >> B0222.11 > MSRRTASKLLVFVFLCSLCFGTQRYDMPRKIDLFNDLITQSTTPASPKCQCLPPTTPSTPPNCIPYDSRLQAASLEEAIVAFPDLTITRQEKTQQSTATLNNCKTKQCRDCYKDLRSQLRKVGLLPGTIDQVFHNQRNFTTCQKYRFARQDKGVYEKKKKAKQHYDWDYVEYDEDEDDDYFWDGLFWKKKRNVLKKIVKRDVEATTAISQPPNSTAMNSTGIIGIRFPISCTTRGVTPDGLGTVSLCSTCWVWRRLPSTYYPAYLNEVVCDYADTSCLSGYASCQTGTQQLNVLRNDSGKLIPISVSAGINCECRLAVGSTLESLVLGQGISKAMPPIDTTSTKPPNLATSTTSHS > (...) > > Pj. 
> > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From pjotr.public23 at thebird.nl Wed Oct 13 16:34:23 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 18:34:23 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> Message-ID: <20101013163423.GA3849@thebird.nl> On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote: > That's great news and should be even faster once we get rid of the requirement to upper case since you're having to parse the same sequence twice. > > I wonder what the C version does to make itself even faster The EMBOSS implementation is fastest by a mile - takes less than 3 seconds. But the code is, uhm, hard to read. I think table lookups will win in C, whatever you try. But it may be an interesting exercise if we can get close. Note I am perhaps not using the fastest JVM. java version "1.6.0_20" Java(TM) SE Runtime Environment (build 1.6.0_20-b02) Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode) Pj. 
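The table-lookup idea mentioned above can be sketched in plain Java with no BioJava classes at all. This is a minimal sketch under stated assumptions, not BioJava's or EMBOSS's implementation: the class name CodonLookup and its methods are hypothetical, the 64-letter string is the standard genetic code (NCBI translation table 1) with codons ordered by the bases T, C, A, G, a trailing partial codon is silently dropped, and any codon containing a symbol other than A, C, G, T or U is reported as 'X'. Upper and lower case are both accepted, so no separate pass over the sequence is needed to uppercase it.

```java
import java.util.Arrays;

/** Hypothetical String-to-String translator using a flat 64-entry table. */
final class CodonLookup {
    // Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Maps an ASCII character to its base index (T=0, C=1, A=2, G=3), else -1.
    private static final int[] BASE_INDEX = new int[128];
    static {
        Arrays.fill(BASE_INDEX, -1);
        String bases = "TCAG";
        for (int i = 0; i < bases.length(); i++) {
            BASE_INDEX[bases.charAt(i)] = i;
            BASE_INDEX[Character.toLowerCase(bases.charAt(i))] = i;
        }
        BASE_INDEX['U'] = 0; // accept RNA input as well
        BASE_INDEX['u'] = 0;
    }

    private static int index(char c) {
        return c < 128 ? BASE_INDEX[c] : -1;
    }

    /** Translates frame 1 of the given nucleotide string. */
    static String translate(String nucleotides) {
        int codons = nucleotides.length() / 3; // partial trailing codon is ignored
        StringBuilder protein = new StringBuilder(codons);
        for (int i = 0; i < codons; i++) {
            int a = index(nucleotides.charAt(3 * i));
            int b = index(nucleotides.charAt(3 * i + 1));
            int c = index(nucleotides.charAt(3 * i + 2));
            // The bitwise OR is negative exactly when at least one index is -1.
            protein.append((a | b | c) < 0 ? 'X' : AMINO_ACIDS.charAt(16 * a + 4 * b + c));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("atgGCCuaa"));
    }
}
```

The trade-off is that this skips validation and typed sequence objects entirely: it knows nothing about alternative codon tables, ambiguity codes or start-codon rules.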
From willishf at ufl.edu Wed Oct 13 17:16:01 2010 From: willishf at ufl.edu (Scooter Willis) Date: Wed, 13 Oct 2010 13:16:01 -0400 Subject: [Biojava-l] BioJava translation In-Reply-To: <20101013163423.GA3849@thebird.nl> References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> <20101013163423.GA3849@thebird.nl> Message-ID: The Biojava3 has an additional validation layer and object creation going from DNA sequence to RNA sequence and then using the appropriate translation rules to return a protein sequence. Could be easily twice as fast if you went from DNA sequence to ProteinSequence which would put it at 8 seconds. We are going to carry a performance penalty setting everything up as a proper object versus doing a simple String to String translation. On Wed, Oct 13, 2010 at 12:34 PM, Pjotr Prins wrote: > On Wed, Oct 13, 2010 at 05:25:41PM +0100, Andy Yates wrote: > > That's great news and should be even faster once we get rid of the > requirement to upper case since you're having to parse the same sequence > twice. > > > > I wonder what the C version does to make itself even faster > > The EMBOSS implementation is fastest by a mile - takes less than 3 > seconds. But the code is, uhm, hard to read. > > I think table lookups will win in C, whatever you try. But it may be an > interesting exercise if we can get close. Note I am perhaps not using the > fastest JVM. > > java version "1.6.0_20" > Java(TM) SE Runtime Environment (build 1.6.0_20-b02) > Java HotSpot(TM) Server VM (build 16.3-b01, mixed mode) > > Pj. 
> _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > From pjotr.public23 at thebird.nl Wed Oct 13 18:17:12 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Wed, 13 Oct 2010 20:17:12 +0200 Subject: [Biojava-l] BioJava translation In-Reply-To: References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> <20101013163423.GA3849@thebird.nl> Message-ID: <20101013181712.GA4482@thebird.nl> I think it is a good idea. From a purist point of view you may object (it is not biological), but most libraries do exactly that. If direct translation gets it down to 8sec, we may well half that with further tweaking. Pj. On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote: > The Biojava3 has an additional validation layer and object creation going > from DNA sequence to RNA sequence and then using the appropriate translation > rules to return a protein sequence. Could be easily twice as fast if you > went from DNA sequence to ProteinSequence which would put it at 8 seconds. > We are going to carry a performance penalty setting everything up as a > proper object versus doing a simple String to String translation. From darnells at dnastar.com Wed Oct 13 18:21:52 2010 From: darnells at dnastar.com (Steve Darnell) Date: Wed, 13 Oct 2010 13:21:52 -0500 Subject: [Biojava-l] How to share code while protecting copyrights? In-Reply-To: References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> Message-ID: Andrew, Forgive me for being pessimistic, but I do not believe you can publically distribute your code without running the risk of being scooped. Mark's suggestions are very good; however, the safest route would be to withhold distribution of your code until your work is published (or at very least accepted). 
Also, I would suggest this argument for convincing your group to use BioJava (disclaimer - I am not a lawyer). Under the LGPL, you are not obligated to release your source code if: (1) you create a "work based on the library" (e.g. direct modifications or additions to the licensed work) but do not distribute it, and (2) you create a "work that uses the library" by dynamically linking your work to the licensed work (see distribution clause #5 of the LGPL: http://www.gnu.org/licenses/lgpl-2.1.html) If you follow choice #2, you can license and distribute your work under terms of your group's choosing (open or closed, submit it to the BioJava developers for inclusion or not) while gaining the benefit of reusing BioJava. ~Steve -----Original Message----- From: biojava-l-bounces at lists.open-bio.org [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark Schreiber Sent: Wednesday, October 13, 2010 4:26 AM To: McSweeny, Andrew J Cc: biojava-l at biojava.org Subject: Re: [Biojava-l] How to share code while protecting copyrights? Hi - My understanding of copyright is that it is yours as soon as you assert that it is your creation. You can simply add a copyright statement to each file containing the code (in the header for example). The reality is that defending copyright is your responsibility. If someone violates it, you have to take them to court or issue a legal letter. You can also put an appropriate license on the code specifying how it can be used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick one of these that best matches your needs. BioJava code is LGPL so if you want your code to go into the BioJava code base you will need to make your code LGPL. It's always a good idea to add @author tags to Java code to ensure appropriate attribution. Finally, if someone steals your code and publishes results before you then you can always make a complaint to the journal editors. 
If it is a reputable journal, and you have reasonable proof the editor should take some action such as forcing a retraction. You can also make a distribution agreement saying that if someone uses this code they agree not to publish without first consulting you. If you want to make it really water tight, get a lawyer and explain specifically what you want to share and what you want to protect or prevent. - Mark On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J < andrew.mcsweeny at rockets.utoledo.edu> wrote: > Hi, > > I am working on a project which simulates sexual reproduction in a > population of digital organisms. Their genome is just a contig from hg18. > It's pretty interesting and I can talk more about it in the future.... > > Anyways, how can I share my code for this project without having to worry > that someone else will use it to publish a paper before my group does? > > I'm certain nobody in the open source community would do that, but how do I > convince my group that opening our project to BioJava is a good idea? > > -Andrew > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > _______________________________________________ Biojava-l mailing list - Biojava-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-l From andreas at sdsc.edu Wed Oct 13 18:48:32 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 13 Oct 2010 11:48:32 -0700 Subject: [Biojava-l] How to share code while protecting copyrights? In-Reply-To: References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> Message-ID: > Forgive me for being pessimistic, but I do not believe you can > publically distribute your code without running the risk of being > scooped. 
Mark's suggestions are very good; however, the safest route > would be to withhold distribution of your code until your work is > published (or at very least accepted). > I think that is too conservative - if getting scooped is an issue, I would release the code shortly before submission of the first manuscript to a journal. That way the source code can form part of the publication and the referees can view the code during the review process. Many views/downloads of articles happen in the first few weeks after publication. Having a link to the source code in the paper can be a great advertisement for the open source project and help in community-building. Andreas > > Also, I would suggest this argument for convincing your group to use > BioJava (disclaimer - I am not a lawyer). > > Under the LGPL, you are not obligated to release your source code if: > > (1) you create a "work based on the library" (e.g. direct modifications > or additions to the licensed work) but do not distribute it, and > (2) you create a "work that uses the library" by dynamically linking > your work to the licensed work (see distribution clause #5 of the LGPL: > http://www.gnu.org/licenses/lgpl-2.1.html) > > If you follow choice #2, you can license and distribute your work under > terms of your group's choosing (open or closed, submit it to the BioJava > developers for inclusion or not) while gaining the benefit of reusing > BioJava. > > ~Steve > > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org > [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark > Schreiber > Sent: Wednesday, October 13, 2010 4:26 AM > To: McSweeny, Andrew J > Cc: biojava-l at biojava.org > Subject: Re: [Biojava-l] How to share code while protecting copyrights? > > Hi - > > My understanding of copyright is that it is yours as soon as you assert > that > it is your creation. You can simply add a copyright statement to each > file > containing the code (in the header for example). 
The reality is that > defending copyright is your responsibility. If someone violates it, you > have > to take them to court or issue a legal letter. > > You can also put an appropriate license on the code specifying how it > can be > used. Examples include GPL, LGPL, BSD, Apache License etc. You can pick > one > of these that best matches your needs. BioJava code is LGPL so if you > want > your code to go into the BioJava code base you will need to make your > code > LGPL. > > It's always a good idea to add @author tags to Java code to ensure > appropriate attribution. > > Finally, if someone steals your code and publishes results before you > then > you can always make a complaint to the journal editors. If it is a > reputable > journal, and you have reasonable proof the editor should take some > action > such as forcing a retraction. You can also make a distribution > agreement > saying that if someone uses this code they agree not to publish without > first consulting you. > > If you want to make it really water tight, get a lawyer and explain > specifically what you want to share and what you want to protect or > prevent. > > - Mark > > On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J < > andrew.mcsweeny at rockets.utoledo.edu> wrote: > > > Hi, > > > > I am working on a project which simulates sexual reproduction in a > > population of digital organisms. Their genome is just a contig from > hg18. > > It's pretty interesting and I can talk more about it in the > future.... > > > > Anyways, how can I share my code for this project without having to > worry > > that someone else will use it to publish a paper before my group does? > > > > I'm certain nobody in the open source community would do that, but how > do I > > convince my group that opening our project to BioJava is a good idea? 
> > > > -Andrew

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas.prlic at gmail.com Wed Oct 13 19:18:12 2010
From: andreas.prlic at gmail.com (Andreas Prlic)
Date: Wed, 13 Oct 2010 12:18:12 -0700
Subject: [Biojava-l] Questions related to biojava
In-Reply-To:
References:
Message-ID:

Hi Madhu,

best to keep such mails on the mailing list, otherwise they might get lost in my flood of emails... See my reply below.

On Wed, Oct 13, 2010 at 12:08 PM, Madhusudan Gujral wrote:
> Hi Andreas,
>
> I have a couple of questions related to biojava. I would greatly appreciate
> it if you could provide directions.
>
> Is the biojava version 3.0 mature?
> Is there any pom file for biojava that I can work with?
> Is there a single tool to validate a fasta file?

- biojava 3.0 is being prepared for release. It is not release-ready yet, but some of the modules are already used in production environments.
- Not sure what you mean by this question. You can see the source code in SVN/git, and there is also an automated build server providing snapshot builds that can be used for Maven installations.
- What kind of validation do you have in mind? biojava3-core can do FASTA parsing for you...
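For readers wondering what a basic FASTA check could look like, here is a minimal plain-Java sketch. It does not use BioJava, and it is not what biojava3-core does internally. The class name `FastaCheck`, the method `firstProblem`, and the rules it enforces are assumptions made for this example: the first non-blank line must be a header, every header must be followed by at least one sequence line, and sequence lines may contain only letters, `*` and `-`.

```java
import java.util.Arrays;
import java.util.List;

public class FastaCheck {

    /**
     * Returns null if the lines look like well-formed FASTA under the
     * assumed rules, otherwise a short description of the first problem.
     */
    public static String firstProblem(List<String> lines) {
        boolean seenHeader = false;   // have we seen any '>' line yet?
        boolean recordHasSeq = false; // does the current record have residues?
        int lineNo = 0;
        for (String raw : lines) {
            lineNo++;
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines are tolerated in this sketch
            }
            if (line.startsWith(">")) {
                if (seenHeader && !recordHasSeq) {
                    return "line " + lineNo + ": previous record has no sequence";
                }
                if (line.length() == 1) {
                    return "line " + lineNo + ": empty header";
                }
                seenHeader = true;
                recordHasSeq = false;
            } else {
                if (!seenHeader) {
                    return "line " + lineNo + ": sequence data before first header";
                }
                // assumed alphabet: letters plus '*' (stop) and '-' (gap)
                if (!line.matches("[A-Za-z*-]+")) {
                    return "line " + lineNo + ": unexpected character in sequence";
                }
                recordHasSeq = true;
            }
        }
        if (!seenHeader) {
            return "no records found";
        }
        if (!recordHasSeq) {
            return "last record has no sequence";
        }
        return null;
    }

    public static void main(String[] args) {
        // made-up input, for illustration only
        List<String> ok = Arrays.asList(">seq1 test", "ACGTACGT", "ACGT");
        List<String> bad = Arrays.asList(">seq1", "ACGT", ">seq2", ">seq3", "AC1T");
        System.out.println(firstProblem(ok));
        System.out.println(firstProblem(bad));
    }
}
```

A stricter check (for example, a DNA-only alphabet or unique identifiers) would only need a different character class or an extra set of seen headers; which rules matter depends on what "valid" is supposed to mean for the file in question.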
Andreas

From willishf at ufl.edu Wed Oct 13 19:16:39 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 13 Oct 2010 15:16:39 -0400
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013181712.GA4482@thebird.nl>
References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> <20101013163423.GA3849@thebird.nl> <20101013181712.GA4482@thebird.nl>
Message-ID:

Pjotr

What is an extra 8 seconds among friends if you know you are going to get the correct answer and you can change the rules if needed!!!

Are you parsing the C. elegans genome or the DNA representation of each protein in the C. elegans genome?

If you take out the println statement, that will help speed things up a bunch. Java System.out is always slow.

I am checking on the problem with upper case. That shouldn't be an issue.

Thanks

Scooter

On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins wrote:
> I think it is a good idea. From a purist point of view you may object
> (it is not biological), but most libraries do exactly that.
>
> If direct translation gets it down to 8sec, we may well half that
> with further tweaking.
>
> Pj.
>
> On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote:
> > The Biojava3 has an additional validation layer and object creation going
> > from DNA sequence to RNA sequence and then using the appropriate translation
> > rules to return a protein sequence. Could be easily twice as fast if you
> > went from DNA sequence to ProteinSequence which would put it at 8 seconds.
> > We are going to carry a performance penalty setting everything up as a
> > proper object versus doing a simple String to String translation.

From pjotr.public23 at thebird.nl Wed Oct 13 21:05:46 2010
From: pjotr.public23 at thebird.nl (Pjotr Prins)
Date: Wed, 13 Oct 2010 23:05:46 +0200
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> Message-ID: <20101013210546.GB5479@thebird.nl> Is that idea of getting scooped realistic? All my code is online, that is my scientific track record, next to my papers. Online OSS code may bring benefits when other people find bugs, or even improve things. I don't worry about getting scooped. First it is easy to prove it is mine, exactly because it is out in the open, and second it takes more than plain old code to get something published in a journal. In the rare case an idea is so sensitive and easy to copy, you can publish it with some part missing. I think too much code sits on planks gathering dust, just because people have these worries. It is old school. We are in the business of moving science forward - writing beautiful tools. Nothing less. Pj. On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote: > > Forgive me for being pessimistic, but I do not believe you can > > publically distribute your code without running the risk of being > > scooped. Mark's suggestions are very good; however, the safest route > > would be to withhold distribution of your code until your work is > > published (or at very least accepted). From andreas at sdsc.edu Wed Oct 13 21:24:54 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 13 Oct 2010 14:24:54 -0700 Subject: [Biojava-l] How to share code while protecting copyrights? In-Reply-To: <20101013210546.GB5479@thebird.nl> References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> <20101013210546.GB5479@thebird.nl> Message-ID: nicely put :-) A On Wed, Oct 13, 2010 at 2:05 PM, Pjotr Prins wrote: > Is that idea of getting scooped realistic? > > All my code is online, that is my scientific track record, next to my > papers. > > Online OSS code may bring benefits when other people find bugs, or > even improve things. I don't worry about getting scooped. 
First it is > easy to prove it is mine, exactly because it is out in the open, and > second it takes more than plain old code to get something published in > a journal. > > In the rare case an idea is so sensitive and easy to copy, you can > publish it with some part missing. > > I think too much code sits on planks gathering dust, just because > people have these worries. It is old school. We are in the business > of moving science forward - writing beautiful tools. Nothing less. > > Pj. > > On Wed, Oct 13, 2010 at 11:48:32AM -0700, Andreas Prlic wrote: > > > Forgive me for being pessimistic, but I do not believe you can > > > publically distribute your code without running the risk of being > > > scooped. Mark's suggestions are very good; however, the safest route > > > would be to withhold distribution of your code until your work is > > > published (or at very least accepted). > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From hlapp at drycafe.net Wed Oct 13 21:44:36 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Wed, 13 Oct 2010 16:44:36 -0500 Subject: [Biojava-l] How to share code while protecting copyrights? In-Reply-To: References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> Message-ID: How and when you want to be attributed in publications, and what you want someone else not to publish on, is an ethical matter. Licenses are legal instruments and not suited for ethical questions or social conventions. Rather, this is addressed by ethical and social conventions and requests. A good example is the Ft Lauderdale agreement, which is not a legal instrument but an ethical request of those who peruse immediate- release sequencing data. 
If you have ethical or social requests to make of those who peruse your code, state them explicitly in a README and in the code. By their nature, you can't legally enforce them. However, ethical behavior is policed - by all of us as a scientific community, not in the courts. -hilmar On Oct 13, 2010, at 1:21 PM, Steve Darnell wrote: > Andrew, > > Forgive me for being pessimistic, but I do not believe you can > publically distribute your code without running the risk of being > scooped. Mark's suggestions are very good; however, the safest route > would be to withhold distribution of your code until your work is > published (or at very least accepted). > > Also, I would suggest this argument for convincing your group to use > BioJava (disclaimer - I am not a lawyer). > > Under the LGPL, you are not obligated to release your source code if: > > (1) you create a "work based on the library" (e.g. direct > modifications > or additions to the licensed work) but do not distribute it, and > (2) you create a "work that uses the library" by dynamically linking > your work to the licensed work (see distribution clause #5 of the > LGPL: > http://www.gnu.org/licenses/lgpl-2.1.html) > > If you follow choice #2, you can license and distribute your work > under > terms of your group's choosing (open or closed, submit it to the > BioJava > developers for inclusion or not) while gaining the benefit of reusing > BioJava. > > ~Steve > > -----Original Message----- > From: biojava-l-bounces at lists.open-bio.org > [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Mark > Schreiber > Sent: Wednesday, October 13, 2010 4:26 AM > To: McSweeny, Andrew J > Cc: biojava-l at biojava.org > Subject: Re: [Biojava-l] How to share code while protecting > copyrights? > > Hi - > > My understanding of copyright is that it is yours as soon as you > assert > that > it is your creation. You can simply add a copyright statement to each > file > containing the code (in the header for example). 
The reality is that > defending copyright is your responsibility. If someone violates it, > you > have > to take them to court or issue a legal letter. > > You can also put an appropriate license on the code specifying how it > can be > used. Examples include GPL, LGPL, BSD, Apache License etc. You can > pick > one > of these that best matches your needs. BioJava code is LGPL so if you > want > your code to go into the BioJava code base you will need to make your > code > LGPL. > > It's always a good idea to add @author tags to Java code to ensure > appropriate attribution. > > Finally, if someone steals your code and publishes results before you > then > you can always make a complaint to the journal editors. If it is a > reputable > journal, and you have reasonable proof the editor should take some > action > such as forcing a retraction. You can also make a distribution > agreement > saying that if someone uses this code they agree not to publish > without > first consulting you. > > If you want to make it really water tight, get a lawyer and explain > specifically what you want to share and what you want to protect or > prevent. > > - Mark > > On Tue, Oct 12, 2010 at 11:41 PM, McSweeny, Andrew J < > andrew.mcsweeny at rockets.utoledo.edu> wrote: > >> Hi, >> >> I am working on a project which simulates sexual reproduction in a >> population of digital organisms. Their genome is just a contig from > hg18. >> It's pretty interesting and I can talk more about it in the > future.... >> >> Anyways, how can I share my code for this project without having to > worry >> that someone else will use it to publish a paper before my group >> does? >> >> I'm certain nobody in the open source community would do that, but >> how > do I >> convince my group that opening our project to BioJava is a good idea? 
>> >> -Andrew >> >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From ayates at ebi.ac.uk Wed Oct 13 22:52:17 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Wed, 13 Oct 2010 23:52:17 +0100 Subject: [Biojava-l] BioJava translation In-Reply-To: References: <20101013111541.GA512@thebird.nl> <9749C719-26AE-48F5-82B0-4624C0BDDA97@eaglegenomics.com> <20101013161336.GA3184@thebird.nl> <20101013163423.GA3849@thebird.nl> <20101013181712.GA4482@thebird.nl> Message-ID: <7E59B83F-8371-4F79-AC4C-57D1A49A9398@ebi.ac.uk> LOL well you could always parallelise it :) I've gone & pushed a new version of the translator code to the SVN repo so it'll filter through to the public server soon. There's an added test case as well. The overall impact of this change seems to be about 25 translations of BRCA2 per second so it is significant; our current limit looks to be approx. 200 per second. I hope you find this is faster without the need to edit & parse a Sequence String twice Andy On 13 Oct 2010, at 20:16, Scooter Willis wrote: > Pjotr > > What is an extra 8 seconds among friends if you know you are going to get the correct answer and you can change the rules if needed!!! > > Are you parsing the C.elgans genome or DNA representation of each protein in the C.elgans genome? > > If you take out the println statement that will help speed things up a bunch. 
Java System.out is always slow. > > I am checking on the problem with upper case. That shouldn't be an issue. > > Thanks > > Scooter > > > On Wed, Oct 13, 2010 at 2:17 PM, Pjotr Prins wrote: > I think it is a good idea. From a purist point of view you may object > (it is not biological), but most libraries do exactly that. > > If direct translation gets it down to 8sec, we may well half that > with further tweaking. > > Pj. > > On Wed, Oct 13, 2010 at 01:16:01PM -0400, Scooter Willis wrote: > > The Biojava3 has an additional validation layer and object creation going > > from DNA sequence to RNA sequence and then using the appropriate translation > > rules to return a protein sequence. Could be easily twice as fast if you > > went from DNA sequence to ProteinSequence which would put it at 8 seconds. > > We are going to carry a performance penalty setting everything up as a > > proper object versus doing a simple String to String translation. > > -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From pjotr.public23 at thebird.nl Thu Oct 14 07:00:12 2010 From: pjotr.public23 at thebird.nl (Pjotr Prins) Date: Thu, 14 Oct 2010 09:00:12 +0200 Subject: [Biojava-l] How to share code while protecting copyrights? In-Reply-To: References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> Message-ID: <20101014070012.GA7296@thebird.nl> On Wed, Oct 13, 2010 at 04:44:36PM -0500, Hilmar Lapp wrote: > By their nature, you can't legally enforce them. However, ethical > behavior is policed - by all of us as a scientific community, not in the > courts. I know people who make it their business to pursue companies that do not honour OSS licenses. The companies always have to retrack. Is there any precedent in science where open source software was used to scoop research? And how did that scientist fare? 
With scientists I can't see it happening. Getting caught out that way will hurt all future prospects for an individual or group. With this reasoning you are best off putting code in the public domain as fast as possible.

Pj.

From hlapp at drycafe.net Thu Oct 14 14:47:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Thu, 14 Oct 2010 09:47:19 -0500
Subject: [Biojava-l] How to share code while protecting copyrights?
In-Reply-To: <20101014070012.GA7296@thebird.nl>
References: <469B4CD3D7690A418E8F96B7BA4585F812052E03@BL2PRD0103MB054.prod.exchangelabs.com> <20101014070012.GA7296@thebird.nl>
Message-ID:

On Oct 14, 2010, at 2:00 AM, Pjotr Prins wrote:
> I know people who make it their business to pursue companies that do
> not honour OSS licenses. The companies always have to retrack.

Of course. That's a legal issue. Attribution on publications, or what someone publishes on reusing your stuff, is not a legal issue.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Fri Oct 15 11:53:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Oct 2010 12:53:13 +0100
Subject: [Biojava-l] BioJava translation
In-Reply-To: <20101013111541.GA512@thebird.nl>
References: <20101013111541.GA512@thebird.nl>
Message-ID:

On Wed, Oct 13, 2010 at 12:15 PM, Pjotr Prins wrote:
> I am using biojava-1.7.1 nucleotide -> amino acid translation. It is
> rather slow. In fact, the biopython equivalent in native Python is
> twice as fast. EMBOSS is again magnitudes faster. I am using
> something like
>
>   rna = RNATools.createRNA(nucleotides);
>   aa = RNATools.translate(rna);
>
> Embarrassingly, even the R version is faster in the GeneR module, as
> it uses a C module.
>
> I have a feeling this has to do with typed object creation at every
> level, whereas Python and others uses plain character Strings.
>
> Any plans for speeding this up on the JVM?
>
> Pj.

Actually (assuming you are not explicitly using strings), Biopython would also be using objects for each sequence, which does impose a speed penalty.

Peter

From kurka at mikro.biologie.tu-muenchen.de Tue Oct 19 11:25:31 2010
From: kurka at mikro.biologie.tu-muenchen.de (Hedwig Kurka)
Date: Tue, 19 Oct 2010 13:25:31 +0200
Subject: [Biojava-l] feature request - full query description from blast result
Message-ID: <4CBD802B.7030809@mikro.biologie.tu-muenchen.de>

Hi all,

I just read in a blast file and I want to get the full query description. For example, when I have this query:

    Query= ||/locus_tag= CD0002 ||/gene=dnaN ||/product= DNA polymerase
    III subunit beta ||/protein_id= YP_001086465.1 ||/transl_table=11
    ||/db_xref=||/organism=Clostridium difficile 630 |feature_id=2
    (1208 letters)

I get as query information:

    locus_tag= CD0002

The rest is truncated.

In the biojava mailing list I found the same question:
http://www.mail-archive.com/biojava-l@lists.open-bio.org/msg01022.html

Mark suggested filing a request for improvement, but as far as I can see, nothing happened. So I would like to ask whether you can change it. Or has it been changed and I just don't see it?

Thanks,
Hedwig

From sb.genny at gmail.com Thu Oct 21 14:28:53 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:28:53 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To:
References:
Message-ID:

Hi

I want to develop a phylogenetics application in BioJava, but I need help to do that. Kindly help me in developing some applications.
Thanks in anticipation

Regards,
Sobia Idrees

From sb.genny at gmail.com Thu Oct 21 14:30:35 2010
From: sb.genny at gmail.com (sobia idrees)
Date: Thu, 21 Oct 2010 19:30:35 +0500
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To:
References:
Message-ID:

Hi

I have developed some web based and desktop based applications using biojava..Can it be published in Biojava journal?

Thanks,
Sobia Idrees

From holland at eaglegenomics.com Thu Oct 21 14:41:35 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 21 Oct 2010 15:41:35 +0100
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 9
In-Reply-To:
References:
Message-ID: <97591963-F741-45C1-8E9D-231A5D05D4DA@eaglegenomics.com>

There is no such thing as a Biojava journal. You would need to submit your paper to one of the main bioinformatics journals.

cheers,
Richard

On 21 Oct 2010, at 15:30, sobia idrees wrote:
> Hi
>
> I have developed some web based and desktop based applications using
> biojava..Can it be published in Biojava journal?
>
> Thanks,
> Sobia Idrees

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jc.lucky at laposte.net Fri Oct 22 08:11:43 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Fri, 22 Oct 2010 10:11:43 +0200 (CEST)
Subject: [Biojava-l] Retrieve Information from GenBank file
Message-ID: <31170592.35650.1287735103724.JavaMail.www@wwinf8210>

Hi

I'm trying to convert a GenBank file into a rdf file. The gene of interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945

With the below code I can read the GenBank file and I manage to retrieve information and convert them in a rdf format. However I don't succeed in retrieving some information such as Title, protein or product.
According to this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it is possible to do so. Please help me find what I do wrong or what should be done to achieve my goal.

    //read the GenBank File
    public static RichSequenceIterator readFile(String input,
            RichSequenceBuilderFactory seqFactory, Namespace ns)
            throws IOException, NoSuchElementException, BioException {
        ns = null;
        InputStream stream = new FileInputStream(input);
        BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
        RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile, ns);
        return seqs;
    }

    //Retrieve information and convert them in rdf format
    public void writeToRDFFile(RichSequenceIterator rsi, String output)
            throws IOException, NoSuchElementException, BioException {
        //create model for the ontology
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
        OntClass parents;
        String URI = "http://pbr.wur.nl/#";
        while (rsi.hasNext()) {
            RichSequence seq = rsi.nextRichSequence();
            String id = seq.getName();
            parents = model.createClass(URI + id);
            Set author = seq.getRankedDocRefs(); //code to clean up Set & convert toString
            String definition = seq.getDescription(); //code to clean up String

            //Add to model
            parents.addProperty(DC.description, definition);
            parents.addProperty(DC.publisher, authors);
            parents.addComment(taxonomy, "EN");
            parents.addProperty(DC.type, organism);

            //print in rdf format
            model.write(out, "RDF/XML");
            out.close();
        }
    }

Thanks,
Jean-Charles Ferrières
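In a GenBank flat file the title belongs to the REFERENCE blocks and the product is a `/product` qualifier in the FEATURES table, so neither is part of the DEFINITION text. As a rough illustration of where these values sit in the file, the following is a minimal plain-Java sketch that reads them straight from the text. It does not use BioJavaX and is not a substitute for its parser. The class name `GenBankHeaderSketch`, the method `extract`, and the sample record in `main` are all made up for this example, and the layout assumptions (12-space header continuation lines, quoted qualifier values) are simplifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenBankHeaderSketch {

    /** Collects DEFINITION, TITLE and /product values from GenBank text lines. */
    public static Map<String, List<String>> extract(List<String> lines) {
        Map<String, List<String>> found = new LinkedHashMap<String, List<String>>();
        String key = null;       // field currently being continued, if any
        boolean inQuote = false; // true while inside a multi-line qualifier value
        for (String raw : lines) {
            String line = raw.trim();
            if (inQuote) {
                // continuation of a quoted qualifier value such as /product="..."
                boolean closes = line.endsWith("\"");
                extend(found, key, closes ? line.substring(0, line.length() - 1) : line);
                if (closes) {
                    inQuote = false;
                    key = null;
                }
            } else if (line.startsWith("/product=\"")) {
                String value = line.substring("/product=\"".length());
                boolean closes = value.endsWith("\"");
                if (closes) {
                    value = value.substring(0, value.length() - 1);
                }
                start(found, "product", value);
                key = closes ? null : "product";
                inQuote = !closes;
            } else if (line.startsWith("DEFINITION") || line.startsWith("TITLE")) {
                key = line.startsWith("TITLE") ? "TITLE" : "DEFINITION";
                start(found, key, line.substring(key.length()).trim());
            } else if (key != null && raw.startsWith("            ")) {
                // assumed layout: header continuation lines are indented by 12 spaces
                extend(found, key, line);
            } else {
                key = null; // any other line ends the current field
            }
        }
        return found;
    }

    // begins a new value for the given key
    private static void start(Map<String, List<String>> m, String key, String value) {
        List<String> values = m.get(key);
        if (values == null) {
            values = new ArrayList<String>();
            m.put(key, values);
        }
        values.add(value);
    }

    // appends continuation text to the most recent value of the given key
    private static void extend(Map<String, List<String>> m, String key, String more) {
        List<String> values = m.get(key);
        int last = values.size() - 1;
        values.set(last, values.get(last) + " " + more);
    }

    public static void main(String[] args) {
        // invented record, for illustration only
        List<String> sample = Arrays.asList(
            "LOCUS       ABC123   120 aa   linear   PLN 01-JAN-2010",
            "DEFINITION  hypothetical protein",
            "            [Example organism].",
            "REFERENCE   1  (residues 1 to 120)",
            "  AUTHORS   Doe,J.",
            "  TITLE     An example title",
            "  JOURNAL   Unpublished",
            "FEATURES             Location/Qualifiers",
            "     Protein         1..120",
            "                     /product=\"example",
            "                     protein\"",
            "ORIGIN");
        System.out.println(extract(sample));
    }
}
```

Values are kept in lists because a record can carry several references and several features; a caller that only wants the first title or product can take element 0 of the corresponding list.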
From andreas at sdsc.edu Fri Oct 22 19:56:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 22 Oct 2010 12:56:49 -0700
Subject: [Biojava-l] 3.0-alpha2
Message-ID:

Hi,

In preparation for the upcoming biojava 3 release, 3.0-alpha2 has just been released on http://biojava.org/download/maven/

Andreas

From cfriedline at vcu.edu Sun Oct 24 14:38:46 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:38:46 -0400
Subject: [Biojava-l] Test Message
Message-ID:

Per Andreas, this is a test.

Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From cfriedline at vcu.edu Sun Oct 24 14:57:48 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 10:57:48 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
Message-ID:

Hello,

I am getting a weird problem with protein alignment using NeedlemanWunsch in 1.7.1, in that the alignment does not span the entire length of the proteins. I've verified that this should not happen with needle (from EMBOSS), neobio, BioJava3, and NW on NCBI. I'm reluctant to switch to BioJava3 at this time, since performance is about 2-3x slower than 1.7.1 for the alignments, and I'm doing about 350,000 of them.

An example of this alignment error is shown here: http://pastebin.com/mdX516R6

Notice that the alignment stops 1 amino acid short of the end in both cases. The parameters for the alignment are: BLOSUM50, gapOpen=10, gapExtend=2.

Thanks,
Chris

--
PhD Candidate, Integrative Life Sciences
Virginia Commonwealth University
Richmond, VA

From andreas.draeger at uni-tuebingen.de Sun Oct 24 16:01:05 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Draeger)
Date: Sun, 24 Oct 2010 18:01:05 +0200
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID: <4CC45841.5080604@uni-tuebingen.de>

Hi Chris,

Thank you for reporting this problem.
It would be very nice if you could also provide your source code, so that I can test what happens. You can send the source code, the substitution matrix, and the two example protein sequences that cause the problem directly to me. I'll then have a look into it.

Cheers
Andreas

--
Dipl.-Bioinform. Andreas Dräger
Eberhard Karls University Tübingen
Center for Bioinformatics (ZBIT)
Sand 1
72076 Tübingen
Germany

Phone: +49-7071-29-70436
Fax: +49-7071-29-5091

From cfriedline at vcu.edu Sun Oct 24 18:04:25 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Sun, 24 Oct 2010 14:04:25 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To: <4CC45841.5080604@uni-tuebingen.de>
References: <4CC45841.5080604@uni-tuebingen.de>
Message-ID:

Thanks, Andreas. I've sent you the information that you asked for.

Chris

From koen.bruynseels at cropdesign.com Mon Oct 25 16:15:59 2010
From: koen.bruynseels at cropdesign.com (koen.bruynseels at cropdesign.com)
Date: Mon, 25 Oct 2010 18:15:59 +0200
Subject: [Biojava-l] Koen Bruynseels is out of the office.
Message-ID:

I will be out of the office starting 10/25/2010 and will not return until 11/02/2010.

I will respond to your message when I return.

From andreas at sdsc.edu Tue Oct 26 18:42:29 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 11:42:29 -0700
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

Hi Chris,

About your comment that biojava3-alignment is slower than the 1.7 one: do you have any data on whether this comes from the I/O, or is the actual alignment calculation slower?

Andreas

--
-----------------------------------------------------------------------
Dr.
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From cfriedline at vcu.edu Tue Oct 26 19:21:39 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:21:39 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

Hi Andreas,

The I/O should be the same, since I've used the same set of genes for testing both. So I'm guessing it's either the alignment calculation or the new BioJava design contributing to the slowness.

Chris
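One way to replace the guess with numbers is to time the two phases separately over the same batch. This is a minimal sketch in plain Java; PhaseTimer and the phase labels are hypothetical names, and the suppliers in main are stand-ins for whatever code reads the sequences and runs the alignment (no BioJava call is shown or implied). The timer is not thread-safe, so a threaded run would need one instance per thread or external synchronization.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per named phase.
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the given phase, returns its result. */
    <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for a phase, or 0 if it never ran. */
    long nanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 1000; i++) {
            // Stand-in for reading or constructing a sequence object.
            String seq = timer.time("io", () -> "MKVLAAGIVGL");
            // Stand-in for the alignment call itself.
            Integer score = timer.time("align", () -> seq.length());
        }
        System.out.println("io ns:    " + timer.nanos("io"));
        System.out.println("align ns: " + timer.nanos("align"));
    }
}
```

Running the same wrapper around both library versions would show which phase accounts for the difference.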
From cfriedline at vcu.edu Tue Oct 26 19:29:30 2010
From: cfriedline at vcu.edu (Chris Friedline)
Date: Tue, 26 Oct 2010 15:29:30 -0400
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID:

That's something I'll need to go back and revisit after my deadline passes at the end of this week. Initially, I was creating them on the fly at the time of alignment, but it would be more efficient to store them that way in the gene object itself. I was also passing an InputStreamReader for the substitution matrix each time (pulling the matrix from my jar), but storing it as a string would also be a better option, especially since I'm threading and there are so many alignments.

Chris

On Tue, Oct 26, 2010 at 3:23 PM, Andreas Prlic wrote:
> OK, how do you create the biojava3 Sequence objects? Just trying to
> find out where the bottlenecks are, so we can fix them...
>
> A
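The "store it as a string" idea can be sketched as a small cache that reads the matrix text once and hands the same string to every alignment thread. This is plain Java under stated assumptions: MatrixTextCache and readResource are hypothetical names, the resource path is whatever the matrix file is called inside the jar, and how the cached text is then passed to the aligner depends on the BioJava version in use and is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical helper: loads each named text once, safely shared between threads.
final class MatrixTextCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;

    MatrixTextCache(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Returns the cached text, invoking the loader only on the first request per name. */
    String get(String name) {
        return cache.computeIfAbsent(name, loader);
    }

    /** One possible loader: reads a classpath resource fully into a string (Java 9+). */
    static String readResource(String path) {
        try (InputStream in = MatrixTextCache.class.getResourceAsStream(path)) {
            if (in == null) {
                throw new IllegalArgumentException("resource not found: " + path);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in loader so the example runs without any resource file.
        MatrixTextCache cache = new MatrixTextCache(name -> "matrix text for " + name);
        System.out.println(cache.get("BLOSUM50"));
        System.out.println(cache.get("BLOSUM50")); // served from the cache
    }
}
```

With new MatrixTextCache(MatrixTextCache::readResource), each alignment could wrap the cached text in a fresh java.io.StringReader instead of reopening the jar entry.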
From andreas.draeger at uni-tuebingen.de Tue Oct 26 22:18:00 2010
From: andreas.draeger at uni-tuebingen.de (Andreas Dräger)
Date: Tue, 26 Oct 2010 23:18:00 +0100
Subject: [Biojava-l] Global alignment problem (bug?) in NeedlemanWunsch
In-Reply-To:
References:
Message-ID: <4CC75398.7000301@uni-tuebingen.de>

Hi all,

By the way, I would like to mention that the bug has been fixed. It was a problem with how the alignment was presented to the user afterwards, i.e., a problem in the formatting algorithm. The alignment itself was correct, and the GappedSequences obtained after the alignment were correct as well. The problem was that the formatter was started with the original length of the sequences, which is usually too short after gaps have been inserted. This is now solved and the alignment should work fine.

Cheers
Andreas
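The kind of bug described here can be illustrated with a toy formatter. This is not the BioJava formatter, only a sketch with hypothetical names (AlignmentFormatter, format, ungappedLength): the loop bound has to be the number of alignment columns after gap insertion, because a bound taken from the ungapped sequence cuts off as many trailing columns as there are gaps.

```java
// Toy illustration, not BioJava code: prints two aligned rows in fixed-width blocks.
final class AlignmentFormatter {

    /** Length of a row with gap characters ('-') removed. */
    static int ungappedLength(String gappedRow) {
        int n = 0;
        for (int i = 0; i < gappedRow.length(); i++) {
            if (gappedRow.charAt(i) != '-') n++;
        }
        return n;
    }

    /** Formats both rows block by block, iterating over all gapped columns. */
    static String format(String gappedQuery, String gappedTarget, int width) {
        if (gappedQuery.length() != gappedTarget.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        // Correct bound: the gapped length. Using ungappedLength(gappedQuery)
        // here instead would drop the last column(s) of the alignment.
        int columns = gappedQuery.length();
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < columns; start += width) {
            int end = Math.min(start + width, columns);
            sb.append(gappedQuery, start, end).append('\n');
            sb.append(gappedTarget, start, end).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String query = "AC-GT";   // four residues plus one gap: five columns
        String target = "ACTGT";
        System.out.print(format(query, target, 60));
        System.out.println("ungapped query length: " + ungappedLength(query));
    }
}
```

In this toy case the query has four residues but five columns, so a formatter bounded by the original length would stop one position short.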
From dasarnow at gmail.com Wed Oct 27 03:54:43 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 20:54:43 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
Message-ID:

Hi all,

Let me first say thanks to all the BioJava community members for delivering such a useful set of libraries, and that I'm still a newbie when it comes to BioJava (and Java), so forgive me if my question is too trivial.

I am doing work on lots (at least thousands) of PDB files from RCSB. As is commonly known, these are often rife with errors, which can lead to exceptions during parsing with PDBFileParser. Because PDBFileParser's methods contain their own try-catch blocks, exception propagation stops there and my code proceeds blindly along regardless of any error checking I do. I would like to catch the exceptions up in my code where the parser is called, so that I can branch to a continue statement and have my batch processing loops move on to the next file.

Should I edit out the try-catch blocks and compile my own version of the library? Or should I test the returned StructureImpl objects for possession of the fields in question? In that case, I'm not sure which properties will give the most general success information...and I'd rather not have to check for /every/ property being correct.

If there is some great way to check if an exception was caught down a series of nested method calls, please hit me over the head with it.

Thanks!
-da

From andreas at sdsc.edu Wed Oct 27 04:11:28 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 21:11:28 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

Can you explain a bit more what you are doing, in particular what errors you would like to deal with on your end? You should not need to worry too much about exception handling. Are there any special cases you are interested in? In this case we should support you with a clean interface rather than exception handling from your end...

Andreas
From dasarnow at gmail.com Wed Oct 27 04:59:56 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 21:59:56 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Glad to hear it; who doesn't like support or clean interfaces? No offense intended, by the way, with respect to PDB errors - obviously the PDB is an indispensable resource for all protein scientists.

I am looking at many (fixed-length) pieces of protein chains and doin' stuff with 'em. My current code has a pair of nested while loops; the outer iterates over PDB entries (a locally rsync'd copy), parsing them, and the inner iterates over the pieces from each. When StructureExceptions come out of my PDBFileReader object I want to continue the outer loop, moving on to the next set of files without executing any of the code that depends on correct StructureImpl objects from the reader (database updates, the inner loop).

Since the reader's methods have their own try-catch blocks, a thrown StructureException is stopped there and never reaches my own error handling. I just need to know when those errors occur so I can skip those proteins - I am presuming that the correct entries will outweigh the problem ones by a significant factor and the overall data won't be seriously impacted.

-da
From dasarnow at gmail.com Wed Oct 27 05:03:59 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Tue, 26 Oct 2010 22:03:59 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I think that would be perfect...and of course I'm happy to perform testing on whatever gets cooked up.

-da

2010/10/26 Amr Al-Hossary:
> We can add something like an exception tracing queue that can be checked
> later by the caller.
>
> Would that be OK?
>
> Amr
From andreas at sdsc.edu Wed Oct 27 05:19:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 22:19:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

PDB files are better nowadays, due to remediation; however, there are still issues.
It sounds like you just want to figure out how to do the try/catch block properly. You could do something like this:

    boolean splitFileOrganisation = true;
    AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);

    String[] pdbIDs = new String[]{"4hhb", "1cdg", "5pti", "1gav", "WRONGID"};

    for (String pdbID : pdbIDs) {
        try {
            Structure s = cache.getStructure(pdbID);
            if (s == null) {
                System.out.println("could not find structure " + pdbID);
                continue;
            }
            // do something with the structure - your inner loop
            System.out.println(s);
        } catch (Exception e) {
            // something crazy happened...
            System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
            e.printStackTrace();
        }
    }
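For a batch of thousands of entries it may also be useful to keep a record of what was skipped and why, rather than only printing it. The following is a library-independent sketch of that variation: BatchRunner, Loader and Report are hypothetical names, and the loader passed in stands for whatever call produces a structure (for example the cache lookup in the loop above); nothing here is BioJava API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: runs a loader over many IDs and records skipped entries.
final class BatchRunner {

    /** Stand-in for any call that loads one entry and may throw. */
    interface Loader<T> {
        T load(String id) throws Exception;
    }

    /** Successfully loaded items and the reason for each skipped ID. */
    static final class Report<T> {
        final Map<String, T> loaded = new LinkedHashMap<>();
        final Map<String, String> skipped = new LinkedHashMap<>();
    }

    static <T> Report<T> run(List<String> ids, Loader<T> loader) {
        Report<T> report = new Report<>();
        for (String id : ids) {
            T item;
            try {
                item = loader.load(id);
            } catch (Exception e) {
                // Record the failure and move on to the next entry.
                report.skipped.put(id, String.valueOf(e.getMessage()));
                continue;
            }
            if (item == null) {
                report.skipped.put(id, "not found");
                continue;
            }
            report.loaded.put(id, item);
        }
        return report;
    }

    public static void main(String[] args) {
        // Toy loader: fails for one ID, returns a string for the others.
        Report<String> report = run(Arrays.asList("4hhb", "1cdg", "WRONGID"), id -> {
            if (id.equals("WRONGID")) throw new IllegalArgumentException("no such entry");
            return id.toUpperCase();
        });
        System.out.println("loaded:  " + report.loaded.keySet());
        System.out.println("skipped: " + report.skipped);
    }
}
```

The skipped map can then be written out at the end of the run, which makes it easy to check how many entries were actually dropped.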
From andreas at sdsc.edu Wed Oct 27 06:01:38 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 26 Oct 2010 23:01:38 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Amr,

2010/10/26 Amr Al-Hossary:
> We can add something like an exception tracing queue that can be checked
> later by the caller.

Thanks for your suggestion. In terms of API, I would prefer that we shield the user from inconsistencies in the files, and I hope we won't need such a queue... If something is off, the code is written to ignore or work around the issue...

Andreas
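To make the two positions concrete, a lenient parser with a problem list might look roughly like the sketch below. It is a toy in plain Java, not a proposal for the BioJava API: LenientLineParser and problems() are hypothetical names, and parsing integers stands in for parsing PDB records. The parser works around bad input, as described above, while still letting a caller inspect afterwards what was skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy example: skips unparsable lines but remembers why each one was skipped.
final class LenientLineParser {
    private final List<String> problems = new ArrayList<>();

    /** Parses one integer per line; bad lines are skipped and recorded. */
    List<Integer> parse(List<String> lines) {
        problems.clear();
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            try {
                values.add(Integer.parseInt(lines.get(i).trim()));
            } catch (NumberFormatException e) {
                problems.add("line " + (i + 1) + ": " + e.getMessage());
            }
        }
        return values;
    }

    /** Problems recorded during the most recent parse; empty if none. */
    List<String> problems() {
        return Collections.unmodifiableList(problems);
    }

    public static void main(String[] args) {
        LenientLineParser parser = new LenientLineParser();
        List<Integer> values = parser.parse(Arrays.asList("1", "x", " 3 "));
        System.out.println("values:   " + values);
        System.out.println("problems: " + parser.problems());
    }
}
```

A caller that only wants clean data can ignore problems(); a batch job that must skip damaged entries can check whether the list is empty.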
From dasarnow at gmail.com Wed Oct 27 07:26:22 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 00:26:22 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I assume AtomCache is a new class in BioJava3? I must give you my embarrassed apology...after a bunch of testing I finally figured out that I had misunderstood where the parser's error handling returns control, and had started going after the wrong exceptions.

It does look like, if setParseCAOnly is true, the reader throws an exception on chains with no CAs instead of just skipping them, though the other chains are still parsed into the structure.

-da
>>>> I would like to catch the exceptions up
>>>> in my code where the parser is called, so that I can branch to a
>>>> continue statement and have my batch processing loops move on to the
>>>> next file.
>>>> Should I edit out the try-catch blocks and compile my own version of
>>>> the library?  Or should I test the returned StructureImpl objects for
>>>> possession of the fields in question?  In that case, I'm not sure
>>>> which properties will give the most general success information...and
>>>> I'd rather not have to check for /every/ property being correct.
>>>>
>>>> If there is some great way to check if an exception was caught down a
>>>> series of nested method calls, please hit me over the head with it.
>>>>
>>>> Thanks!
>>>>
>>>> -da

From jc.lucky at laposte.net Wed Oct 27 08:11:13 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 10:11:13 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
Message-ID: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>

I tried once again with the new version of BioJava, but without success.
Any ideas or suggestions?

Thanks in advance
Regards,

Jean-Charles Ferrières

> Message of 22/10/10 10:11
> From: "jc.lucky"
> To: biojava-l at lists.open-bio.org
> Cc:
> Subject: [Biojava-l] Retrieve Information from GenBank file
>
> Hi
>
> I'm trying to convert a GenBank file into an RDF file. The gene of
> interest can be found at: http://www.ncbi.nlm.nih.gov/protein/284794945
>
> With the code below I can read the GenBank file, and I manage to retrieve
> information and convert it to RDF. However, I have not succeeded in
> retrieving some information such as title, protein or product. According to
> this page (http://www.biojava.org/wiki/BioJava:BioJavaXDocs#GenBan) it is
> possible to do so.
> Please help me find what I am doing wrong or what should be done to
> achieve my goal.
>
> //read the GenBank file
> public static RichSequenceIterator readFile(String input,
>         RichSequenceBuilderFactory seqFactory,
>         Namespace ns)
>         throws IOException, NoSuchElementException, BioException
> {
>     ns = null;
>     InputStream stream = new FileInputStream(input);
>     BufferedReader rdfFile = new BufferedReader(new InputStreamReader(stream));
>     RichSequenceIterator seqs = RichSequence.IOTools.readGenbankDNA(rdfFile,ns);
>     return seqs;
> }
>
> //Retrieve information and convert it to rdf format
> public void writeToRDFFile(RichSequenceIterator rsi, String output)
>         throws IOException, NoSuchElementException, BioException {
>     //create model for the ontology
>     OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM, null);
>     OntClass parents;
>     String URI = "http://pbr.wur.nl/#";
>
>     while(rsi.hasNext())
>     {
>         RichSequence seq = rsi.nextRichSequence();
>         String id = seq.getName();
>         parents = model.createClass(URI + id);
>         Set author = seq.getRankedDocRefs();//code to clean up Set&convert toString
>         String definition = seq.getDescription(); //code to clean up String
>         //Add to model
>         parents.addProperty(DC.description, definition);
>         parents.addProperty(DC.publisher, authors);
>         parents.addComment(taxonomy, "EN");
>         parents.addProperty(DC.type, organism);
>         //print in rdf format
>         model.write(out, "RDF/XML");
>         out.close(); }
> }
>
> Thanks,
> Jean-Charles Ferrières

From willishf at ufl.edu Wed Oct 27 10:41:06 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Wed, 27 Oct 2010 06:41:06 -0400
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID:

Jean-Charles

I have it on my list to do a GenBank parser but haven't had the time. I
can't promise anything in the next couple of weeks. Can you send some details
about what a typical use case is for your purpose? Are you trying to get the
sequence data, or are you more interested in the features?

Thanks

Scooter

From jc.lucky at laposte.net Wed Oct 27 13:03:55 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:03:55 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To:
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210>
Message-ID: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>

I'm more interested in the features (regarding protein-ID, taxon, xref, product) and in retrieving information about articles (authors, title). I am not looking at the sequence data at all.
My purpose is to read the GenBank file and retrieve that information so that I can convert it to a semantic RDF file. I'm working on a specific gene at the moment, but it would be interesting to extend this to any GenBank file in the future.

Thanks,

Jean-Charles

From holland at eaglegenomics.com Wed Oct 27 13:16:56 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 27 Oct 2010 14:16:56 +0100
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210> <21411489.155159.1288184635185.JavaMail.www@wwinf8222>
Message-ID: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>

Have you tried (using the BioJavaX method) looking at the getRichAnnotation() method on the RichSequence that the parser returns? That is where the majority of the GenBank tags should show up in a kind of hash map. Things like protein and product are likely to be found there. Each feature (getFeatureSet() on the RichSequence object) also has its own annotation set for things that are associated with the feature rather than the main sequence. Xrefs meanwhile can be retrieved as getRankedCrossRefs() on each feature, whilst sequence-level document references (including titles, authors, etc.) are found by calling getRankedDocRefs().
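
One way to cross-check what the parser hands back is to read the same fields straight from the flat file. The following is only a minimal sketch in plain Java that does not use BioJava at all. It assumes the usual GenBank flat-file layout (keyword in the first 12 columns, indented continuation lines, feature qualifiers written as /name="value"). The class name GenBankHeaderScan, its two methods and the sample record are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of BioJava: scans GenBank flat-file text directly.
class GenBankHeaderScan {

    // Invented sample record, shortened to the fields used below.
    static final String SAMPLE =
          "LOCUS       TEST0001                  12 aa            linear   PLN 01-JAN-2010\n"
        + "DEFINITION  example protein.\n"
        + "REFERENCE   1  (residues 1 to 12)\n"
        + "  AUTHORS   Doe,J.\n"
        + "  TITLE     A made-up title that wraps\n"
        + "            onto a second line\n"
        + "  JOURNAL   Unpublished\n"
        + "FEATURES             Location/Qualifiers\n"
        + "     Protein         1..12\n"
        + "                     /product=\"example\n"
        + "                     protein\"\n"
        + "ORIGIN\n"
        + "//\n";

    // Returns the value of every line whose keyword (columns 1-12) matches,
    // joining indented continuation lines with a single space.
    static List<String> keywordValues(String flatFile, String keyword) {
        List<String> values = new ArrayList<String>();
        StringBuilder current = null;
        for (String line : flatFile.split("\\r?\\n")) {
            String head = line.length() > 12 ? line.substring(0, 12) : line;
            boolean continuation = line.length() > 12 && head.trim().isEmpty();
            if (continuation) {
                if (current != null) {
                    current.append(' ').append(line.trim());
                }
                continue;
            }
            // Any new keyword line ends the value collected so far.
            if (current != null) {
                values.add(current.toString());
                current = null;
            }
            if (head.trim().equals(keyword)) {
                String rest = line.length() > 12 ? line.substring(12).trim() : "";
                current = new StringBuilder(rest);
            }
        }
        if (current != null) {
            values.add(current.toString());
        }
        return values;
    }

    // Returns the text of every /name="value" qualifier in the feature table;
    // values wrapped over several lines are collapsed onto one line.
    static List<String> qualifierValues(String flatFile, String name) {
        List<String> values = new ArrayList<String>();
        Pattern p = Pattern.compile("/" + Pattern.quote(name) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(flatFile);
        while (m.find()) {
            values.add(m.group(1).replaceAll("\\s+", " ").trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println("TITLE:   " + keywordValues(SAMPLE, "TITLE"));
        System.out.println("product: " + qualifierValues(SAMPLE, "product"));
    }
}
```

Comparing these values with what getRankedDocRefs() and the feature annotations return should show whether a field is absent from the file itself or only from the parsed objects.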

This section of the BioJavaX docs describes in great detail where every single part of the GenBank file is stored in the RichSequence objects: http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Reading_2 and http://www.biojava.org/wiki/BioJava:BioJavaXDocs#Writing_2

cheers,
Richard

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From jc.lucky at laposte.net Wed Oct 27 13:34:22 2010
From: jc.lucky at laposte.net (jc.lucky)
Date: Wed, 27 Oct 2010 15:34:22 +0200 (CEST)
Subject: [Biojava-l] Tr: Retrieve Information from GenBank file
In-Reply-To: <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
References: <27010118.206733.1288167073872.JavaMail.www@wwinf8210> <21411489.155159.1288184635185.JavaMail.www@wwinf8222> <3326080B-30B3-445F-A661-49A51DED54EE@eaglegenomics.com>
Message-ID: <6229150.91865.1288186462649.JavaMail.www@wwinf8218>

Thanks for your reply. Indeed, as the code in my original message shows, those are the methods I use to try to retrieve as much information as possible.
However, and this is my problem, the methods described do not provide the required information. For example:
getRankedDocRefs() provides authors and journals but no TITLE
getFeatureSet() only provides /organism, /mol_type and /db_xref
That is why I was asking for help and suggestions on how to fix this "problem".

Best,

Jean-Charles

From andreas at sdsc.edu Thu Oct 28 00:47:50 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Wed, 27 Oct 2010 17:47:50 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

> I assume AtomCache is a new class in BioJava3?

Yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0

> I must give you my embarrassed apology...after a bunch of testing I
> finally figured out that I had misunderstood where the Parser's error
> handling returns control and started going after the wrong exceptions.
> It does looks like if setParseCAOnly is true, the reader excepts on
> chains with no CA's instead of just skipping them, though the other
> chains are still parsed into the structure.

This sounds like there might be a problem with CA-only parsing. Do you have
an example ID? Also, are you on BioJava 1.7 or 3.0?

Andreas

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From dasarnow at gmail.com Thu Oct 28 04:05:18 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Wed, 27 Oct 2010 21:05:18 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

I'm using 1.7, partially because my distro had a package for it and
partially because I was initially using the online Javadoc a lot.

With CA-only parsing, PDB ID 1a02 parses but gives 4 StructureExceptions;
I've pasted them below. Chain A exists in the PDB entry but is DNA;
polypeptide chain F appears to parse correctly.

-da

org.biojava.bio.structure.StructureException: could not find chain A
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: could not find chain B
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)
org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
        at fragalign.pair.getStructs(pair.java:42)
        at fragalign.Main.main(Main.java:40)

On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote:
>> I assume AtomCache is a new class in BioJava3?
>
> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>
>>
>> I must give you my embarrassed apology...after a bunch of testing I
>> finally figured out that I had misunderstood where the Parser's error
>> handling returns control and started going after the wrong exceptions.
>>  It does looks like if setParseCAOnly is true, the reader excepts on
>> chains with no CA's instead of just skipping them, though the other
>> chains are still parsed into the structure.
>
> This sounds like there might be a problem with CA only.. do you have
> an example ID? also: are you on biojava 1.7 or 3.0 ?
>
> Andreas
>
>
>
>>
>> -da
>>
>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote:
>>> Hi Daniel,
>>>
>>> PDB files are better nowadays, due to remediation, however there are
>>> still issues..
>>>
>>> it sounds like you just want to figure out how to do the try/catch
>>> block properly. You could do something like that:
>>>
>>>                 boolean splitFileOrganisation = true;
>>>                 AtomCache cache = new
>>> AtomCache("/path/to/your/installation/",splitFileOrganisation);
>>>
?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>> >>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>> >>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>> >>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>> e.getMessage()); >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>> ? ? ? ? ? ? ? ? ? ? ? ?} >>> ? ? ? ? ? ? ? ?} >>> >>> >>> >>> >>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No >>>> offense intended, by the way, with respect to PDB errors - obviously >>>> the PDB is an indispensable resource for all protein scientists. >>>> >>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>> stuff with 'em. ?My current code has a pair of nested while loops; the >>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>> and the inner iterates over the pieces from each. ?When >>>> StructureExceptions come out of my PDBFileReader object I want to >>>> continue the outer loop, moving on to the next set of files without >>>> executing any of the code that depends on correct StructureImpl >>>> objects from the reader (database updates, the inner loop). 
>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>> StructureException is stopped there and never reaches my own error >>>> handling. ?I just need to know when those errors occur so I can skip >>>> those proteins - I am presuming that the correct entries will outweigh >>>> the problem ones by a significant factor and the overall data wont be >>>> seriously impacted. >>>> >>>> -da >>>> >>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>> Hi Daniel, >>>>> >>>>> can you explain a bit more what you are doing, in particular what >>>>> errors you would like to deal with on your end? ?You should not need >>>>> to worry too much about exception handling. Are there any special >>>>> cases you are interested in? ?In this case we should support you with >>>>> a clean interface rather than exception handling from your end... >>>>> >>>>> Andreas >>>>> >>>>> >>>>> >>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>> Hi all, >>>>>> Let me first say thanks to all the BioJava community members for >>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>> too trivial. >>>>>> >>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>> As is commonly known, these are often rife with errors which can lead >>>>>> to exceptions during parsing with PDBFileParser. ?Because >>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>> in my code where the parser is called, so that I can branch to a >>>>>> continue statement and have my batch processing loops move on to the >>>>>> next file. >>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>> the library? 
Or should I test the returned StructureImpl objects for
>>>>>> possession of the fields in question?  In that case, I'm not sure
>>>>>> which properties will give the most general success information...and
>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>
>>>>>> If there is some great way to check if an exception was caught down a
>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -da
>>>>>> _______________________________________________
>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

From andreas at sdsc.edu Thu Oct 28 17:28:07 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 10:28:07 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Hi Daniel,

I just checked, this is a bug which is already resolved in 3.0... If
it is an issue for you, you might want to upgrade... (should be very
easy, if you start using Maven ...)

Thanks,
Andreas

On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow wrote:
> I'm using 1.7, partially because my distro had a package for it and
> partially because I was initially using the online Javadoc a lot.
> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
> pasted them below.  Chain A exists in the PDB but is DNA, polypeptide
> chain F appears to parse correctly.
>
> -da
>
> org.biojava.bio.structure.StructureException: could not find chain A
>         at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) > ? ? ? ?at fragalign.pair.getStructs(pair.java:42) > ? ? ? ?at fragalign.Main.main(Main.java:40) > org.biojava.bio.structure.StructureException: could not find chain B > ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) > ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) > ? ? ? ?at fragalign.pair.getStructs(pair.java:42) > ? ? ? ?at fragalign.Main.main(Main.java:40) > org.biojava.bio.structure.StructureException: did not find chain with > chainId >A< > ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) > ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) > ? ? ? 
?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) > ? ? ? ?at fragalign.pair.getStructs(pair.java:42) > ? ? ? ?at fragalign.Main.main(Main.java:40) > org.biojava.bio.structure.StructureException: did not find chain with > chainId >B< > ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) > ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) > ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) > ? ? ? ?at fragalign.pair.getStructs(pair.java:42) > ? ? ? ?at fragalign.Main.main(Main.java:40) > > > On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote: >>> I assume AtomCache is a new class in BioJava3? >> >> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0 >> >>> >>> I must give you my embarrassed apology...after a bunch of testing I >>> finally figured out that I had misunderstood where the Parser's error >>> handling returns control and started going after the wrong exceptions. >>> ?It does looks like if setParseCAOnly is true, the reader excepts on >>> chains with no CA's instead of just skipping them, though the other >>> chains are still parsed into the structure. >> >> This sounds like there might be ?a problem with CA only.. do you have >> an example ID? also: are you on biojava 1.7 or 3.0 ? 
>> >> Andreas >> >> >> >>> >>> -da >>> >>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote: >>>> Hi Daniel, >>>> >>>> PDB files are better nowadays, due to remediation, however there are >>>> still issues.. >>>> >>>> it sounds like you just want to figure out how to do the try/catch >>>> block properly. You could do something like that: >>>> >>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true; >>>> ? ? ? ? ? ? ? ?AtomCache cache = new >>>> AtomCache("/path/to/your/installation/",splitFileOrganisation); >>>> >>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>>> >>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>>> >>>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>>> >>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>>> e.getMessage()); >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>>> ? ? ? ? ? ? ? ? ? ? ? ?} >>>> ? ? ? ? ? ? ? ?} >>>> >>>> >>>> >>>> >>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No >>>>> offense intended, by the way, with respect to PDB errors - obviously >>>>> the PDB is an indispensable resource for all protein scientists. >>>>> >>>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>>> stuff with 'em. 
?My current code has a pair of nested while loops; the >>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>>> and the inner iterates over the pieces from each. ?When >>>>> StructureExceptions come out of my PDBFileReader object I want to >>>>> continue the outer loop, moving on to the next set of files without >>>>> executing any of the code that depends on correct StructureImpl >>>>> objects from the reader (database updates, the inner loop). >>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>>> StructureException is stopped there and never reaches my own error >>>>> handling. ?I just need to know when those errors occur so I can skip >>>>> those proteins - I am presuming that the correct entries will outweigh >>>>> the problem ones by a significant factor and the overall data wont be >>>>> seriously impacted. >>>>> >>>>> -da >>>>> >>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>>> Hi Daniel, >>>>>> >>>>>> can you explain a bit more what you are doing, in particular what >>>>>> errors you would like to deal with on your end? ?You should not need >>>>>> to worry too much about exception handling. Are there any special >>>>>> cases you are interested in? ?In this case we should support you with >>>>>> a clean interface rather than exception handling from your end... >>>>>> >>>>>> Andreas >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>>> Hi all, >>>>>>> Let me first say thanks to all the BioJava community members for >>>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>>> too trivial. >>>>>>> >>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>>> As is commonly known, these are often rife with errors which can lead >>>>>>> to exceptions during parsing with PDBFileParser. 
?Because >>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>>> in my code where the parser is called, so that I can branch to a >>>>>>> continue statement and have my batch processing loops move on to the >>>>>>> next file. >>>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>>> the library? ?Or should I test the returned StructureImpl objects for >>>>>>> possession of the fields in question? ?In that case, I'm not sure >>>>>>> which properties will give the most general success information...and >>>>>>> I'd rather not have to check for /every/ property being correct. >>>>>>> >>>>>>> If there is some great way to check if an exception was caught down a >>>>>>> series of nested method calls, please hit me over the head with it. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> -da >>>>>>> _______________________________________________ >>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>> >>>>>> >>>>>> >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From vishalthapar at gmail.com Thu Oct 28 17:40:49 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Thu, 28 Oct 2010 13:40:49 -0400
Subject: [Biojava-l] K-mers
Message-ID:

Hi All,

I had a quick question: Does BioJava have a method to generate k-mers or
do k-mer counting for a given FASTA sequence or file? Basically, I want
k-mer counts for every sequence in a FASTA file. If something like this
exists, it would save me the time of writing the code.

Thanks,

Vishal

From jayunit100 at gmail.com Thu Oct 28 19:43:17 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 15:43:17 -0400
Subject: [Biojava-l] biojava maven integration
Message-ID:

Hi guys, I added the following to my pom file

<dependency>
  <groupId>org.biojava</groupId>
  <artifactId>biojava</artifactId>
  <version>3.0-alpha2</version>
</dependency>

<repository>
  <id>biojava-maven-repo</id>
  <name>BioJava repository</name>
  <url>http://www.biojava.org/download/maven/</url>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
  <releases>
    <enabled>true</enabled>
  </releases>
</repository>

But to no avail. Does anyone know how to add biojava3 to the libraries in a
maven managed application?

Thanks.

From jayunit100 at gmail.com Thu Oct 28 22:51:25 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Thu, 28 Oct 2010 18:51:25 -0400
Subject: [Biojava-l] biojava maven integration
In-Reply-To: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
References: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk>
Message-ID:

Does anybody have a maven POM example of how to integrate biojava into my
application? Thanks! I'm currently using biojava 1.7, and have put it in
my own local maven repository.

On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy wrote:

> Not 100% certain but I *think* you want to depend on biojava-core rather
> than biojava.
>
> Later,
>
> Andy
>
> On 28 Oct 2010, at 20:43, Jay Vyas wrote:
>
> > Hi guys, I added the following to my pom file
> >
> > <dependency>
> >   <groupId>org.biojava</groupId>
> >   <artifactId>biojava</artifactId>
> >   <version>3.0-alpha2</version>
> > </dependency>
> >
> > <repository>
> >   <id>biojava-maven-repo</id>
> >   <name>BioJava repository</name>
> >   <url>http://www.biojava.org/download/maven/</url>
> >   <snapshots>
> >     <enabled>true</enabled>
> >   </snapshots>
> >   <releases>
> >     <enabled>true</enabled>
> >   </releases>
> > </repository>
> >
> > But to no avail. Does anyone know how to add biojava3 to the libraries in a
> > maven managed application?
> >
> > Thanks.
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>

--
Jay Vyas
MMSB/UCHC

From dasarnow at gmail.com Thu Oct 28 23:45:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:45:05 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

It's not a big deal - after all if you use CA only, chains with no
CA's aren't important, and the error messages aren't that long.  But
I'm going to switch anyway...
I'm getting the dreaded "can't read line length in file" error while
trying to checkout biojava-live/trunk, though.

-da

On Thu, Oct 28, 2010 at 10:28, Andreas Prlic wrote:
> Hi Daniel,
>
> I just checked, this is a bug which is already resolved in 3.0... If
> it is an issue for you, you might want to upgrade... (should be very
> easy, if you start using Maven ...)
>
> Thanks,
> Andreas
>
> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow wrote:
>> I'm using 1.7, partially because my distro had a package for it and
>> partially because I was initially using the online Javadoc a lot.
>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>> pasted them below.  Chain A exists in the PDB but is DNA, polypeptide
>> chain F appears to parse correctly.
>> >> -da >> >> org.biojava.bio.structure.StructureException: could not find chain A >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: could not find chain B >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: did not find chain with >> chainId >A< >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >> ? ? ? 
?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> org.biojava.bio.structure.StructureException: did not find chain with >> chainId >B< >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >> ? ? ? ?at fragalign.Main.main(Main.java:40) >> >> >> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote: >>>> I assume AtomCache is a new class in BioJava3? >>> >>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0 >>> >>>> >>>> I must give you my embarrassed apology...after a bunch of testing I >>>> finally figured out that I had misunderstood where the Parser's error >>>> handling returns control and started going after the wrong exceptions. 
>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on >>>> chains with no CA's instead of just skipping them, though the other >>>> chains are still parsed into the structure. >>> >>> This sounds like there might be ?a problem with CA only.. do you have >>> an example ID? also: are you on biojava 1.7 or 3.0 ? >>> >>> Andreas >>> >>> >>> >>>> >>>> -da >>>> >>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote: >>>>> Hi Daniel, >>>>> >>>>> PDB files are better nowadays, due to remediation, however there are >>>>> still issues.. >>>>> >>>>> it sounds like you just want to figure out how to do the try/catch >>>>> block properly. You could do something like that: >>>>> >>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true; >>>>> ? ? ? ? ? ? ? ?AtomCache cache = new >>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation); >>>>> >>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>>>> >>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>>>> >>>>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>>>> >>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>>>> e.getMessage()); >>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>>>> ? ? ? ? ? ? ? ? ? ? ? ?} >>>>> ? ? ? ? ? ? ? 
?} >>>>> >>>>> >>>>> >>>>> >>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>>>> Glad to hear it, who doesn't like support or clean interfaces?. ?No >>>>>> offense intended, by the way, with respect to PDB errors - obviously >>>>>> the PDB is an indispensable resource for all protein scientists. >>>>>> >>>>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the >>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>>>> and the inner iterates over the pieces from each. ?When >>>>>> StructureExceptions come out of my PDBFileReader object I want to >>>>>> continue the outer loop, moving on to the next set of files without >>>>>> executing any of the code that depends on correct StructureImpl >>>>>> objects from the reader (database updates, the inner loop). >>>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>>>> StructureException is stopped there and never reaches my own error >>>>>> handling. ?I just need to know when those errors occur so I can skip >>>>>> those proteins - I am presuming that the correct entries will outweigh >>>>>> the problem ones by a significant factor and the overall data wont be >>>>>> seriously impacted. >>>>>> >>>>>> -da >>>>>> >>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>>>> Hi Daniel, >>>>>>> >>>>>>> can you explain a bit more what you are doing, in particular what >>>>>>> errors you would like to deal with on your end? ?You should not need >>>>>>> to worry too much about exception handling. Are there any special >>>>>>> cases you are interested in? ?In this case we should support you with >>>>>>> a clean interface rather than exception handling from your end... 
>>>>>>> >>>>>>> Andreas >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>>>> Hi all, >>>>>>>> Let me first say thanks to all the BioJava community members for >>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>>>> too trivial. >>>>>>>> >>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>>>> As is commonly known, these are often rife with errors which can lead >>>>>>>> to exceptions during parsing with PDBFileParser. ?Because >>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>>>> in my code where the parser is called, so that I can branch to a >>>>>>>> continue statement and have my batch processing loops move on to the >>>>>>>> next file. >>>>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>>>> the library? ?Or should I test the returned StructureImpl objects for >>>>>>>> possession of the fields in question? ?In that case, I'm not sure >>>>>>>> which properties will give the most general success information...and >>>>>>>> I'd rather not have to check for /every/ property being correct. >>>>>>>> >>>>>>>> If there is some great way to check if an exception was caught down a >>>>>>>> series of nested method calls, please hit me over the head with it. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> -da >>>>>>>> _______________________________________________ >>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>> >>> >>> >>> -- >>> ----------------------------------------------------------------------- >>> Dr. 
Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>
>
>
>
> --
> -----------------------------------------------------------------------
> Dr. Andreas Prlic
> Senior Scientist, RCSB PDB Protein Data Bank
> University of California, San Diego
> (+1) 858.246.0526
> -----------------------------------------------------------------------
>

From dasarnow at gmail.com Thu Oct 28 23:51:25 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Thu, 28 Oct 2010 16:51:25 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

Ahh, I suppose that is the "problem" referred to in the wiki? I
checked out successfully from the repository on github.

-da

On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow wrote:
> It's not a big deal - after all if you use CA only, chains with no
> CA's aren't important, and the error messages aren't that long.  But
> I'm going to switch anyway...
> I'm getting the dreaded "can't read line length in file" error while
> trying to checkout biojava-live/trunk, though.
>
> -da
>
> On Thu, Oct 28, 2010 at 10:28, Andreas Prlic wrote:
>> Hi Daniel,
>>
>> I just checked, this is a bug which is already resolved in 3.0... If
>> it is an issue for you, you might want to upgrade... (should be very
>> easy, if you start using Maven ...)
>>
>> Thanks,
>> Andreas
>>
>> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow wrote:
>>> I'm using 1.7, partially because my distro had a package for it and
>>> partially because I was initially using the online Javadoc a lot.
>>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>>> pasted them below.  Chain A exists in the PDB but is DNA, polypeptide
>>> chain F appears to parse correctly.
>>>
>>> -da
>>>
>>> org.biojava.bio.structure.StructureException: could not find chain A
?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: could not find chain B >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: did not find chain with >>> chainId >A< >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >>> ? ? ? 
?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> org.biojava.bio.structure.StructureException: did not find chain with >>> chainId >B< >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541) >>> ? ? ? ?at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963) >>> ? ? ? ?at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452) >>> ? ? ? ?at fragalign.pair.getStructs(pair.java:42) >>> ? ? ? ?at fragalign.Main.main(Main.java:40) >>> >>> >>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic wrote: >>>>> I assume AtomCache is a new class in BioJava3? >>>> >>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0 >>>> >>>>> >>>>> I must give you my embarrassed apology...after a bunch of testing I >>>>> finally figured out that I had misunderstood where the Parser's error >>>>> handling returns control and started going after the wrong exceptions. >>>>> ?It does looks like if setParseCAOnly is true, the reader excepts on >>>>> chains with no CA's instead of just skipping them, though the other >>>>> chains are still parsed into the structure. 
>>>> >>>> This sounds like there might be ?a problem with CA only.. do you have >>>> an example ID? also: are you on biojava 1.7 or 3.0 ? >>>> >>>> Andreas >>>> >>>> >>>> >>>>> >>>>> -da >>>>> >>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic wrote: >>>>>> Hi Daniel, >>>>>> >>>>>> PDB files are better nowadays, due to remediation, however there are >>>>>> still issues.. >>>>>> >>>>>> it sounds like you just want to figure out how to do the try/catch >>>>>> block properly. You could do something like that: >>>>>> >>>>>> ? ? ? ? ? ? ? ?boolean splitFileOrganisation = true; >>>>>> ? ? ? ? ? ? ? ?AtomCache cache = new >>>>>> AtomCache("/path/to/your/installation/",splitFileOrganisation); >>>>>> >>>>>> ? ? ? ? ? ? ? ?String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" }; >>>>>> >>>>>> ? ? ? ? ? ? ? ?for (String pdbID : pdbIDs){ >>>>>> >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?try { >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Structure s = cache.getStructure(pdbID); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?if ( s == null) { >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println("could not find structure " + pdbID); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?continue; >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?} >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// do something with the structure - your inner loop >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.out.println(s); >>>>>> >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} catch (Exception e){ >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// something crazy happened... >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?System.err.println("Can't load structure " + pdbID + " reason: " + >>>>>> e.getMessage()); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?e.printStackTrace(); >>>>>> ? ? ? ? ? ? ? ? ? ? ? ?} >>>>>> ? ? ? ? ? ? ? ?} >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow wrote: >>>>>>> Glad to hear it, who doesn't like support or clean interfaces?. 
?No >>>>>>> offense intended, by the way, with respect to PDB errors - obviously >>>>>>> the PDB is an indispensable resource for all protein scientists. >>>>>>> >>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin' >>>>>>> stuff with 'em. ?My current code has a pair of nested while loops; the >>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them >>>>>>> and the inner iterates over the pieces from each. ?When >>>>>>> StructureExceptions come out of my PDBFileReader object I want to >>>>>>> continue the outer loop, moving on to the next set of files without >>>>>>> executing any of the code that depends on correct StructureImpl >>>>>>> objects from the reader (database updates, the inner loop). >>>>>>> Since the reader's methods have their own try-catch blocks, a thrown >>>>>>> StructureException is stopped there and never reaches my own error >>>>>>> handling. ?I just need to know when those errors occur so I can skip >>>>>>> those proteins - I am presuming that the correct entries will outweigh >>>>>>> the problem ones by a significant factor and the overall data wont be >>>>>>> seriously impacted. >>>>>>> >>>>>>> -da >>>>>>> >>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic wrote: >>>>>>>> Hi Daniel, >>>>>>>> >>>>>>>> can you explain a bit more what you are doing, in particular what >>>>>>>> errors you would like to deal with on your end? ?You should not need >>>>>>>> to worry too much about exception handling. Are there any special >>>>>>>> cases you are interested in? ?In this case we should support you with >>>>>>>> a clean interface rather than exception handling from your end... 
>>>>>>>> >>>>>>>> Andreas >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow wrote: >>>>>>>>> Hi all, >>>>>>>>> Let me first say thanks to all the BioJava community members for >>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie >>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is >>>>>>>>> too trivial. >>>>>>>>> >>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB. >>>>>>>>> As is commonly known, these are often rife with errors which can lead >>>>>>>>> to exceptions during parsing with PDBFileParser. ?Because >>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception >>>>>>>>> propagation stops there and my code proceeds blindly along regardless >>>>>>>>> of any error checking I do. ?I would like to catch the exceptions up >>>>>>>>> in my code where the parser is called, so that I can branch to a >>>>>>>>> continue statement and have my batch processing loops move on to the >>>>>>>>> next file. >>>>>>>>> Should I edit out the try-catch blocks and compile my own version of >>>>>>>>> the library? ?Or should I test the returned StructureImpl objects for >>>>>>>>> possession of the fields in question? ?In that case, I'm not sure >>>>>>>>> which properties will give the most general success information...and >>>>>>>>> I'd rather not have to check for /every/ property being correct. >>>>>>>>> >>>>>>>>> If there is some great way to check if an exception was caught down a >>>>>>>>> series of nested method calls, please hit me over the head with it. >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> >>>>>>>>> -da >>>>>>>>> _______________________________________________ >>>>>>>>> Biojava-l mailing list ?- ?Biojava-l at lists.open-bio.org >>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> ----------------------------------------------------------------------- >>>> Dr. 
Andreas Prlic >>>> Senior Scientist, RCSB PDB Protein Data Bank >>>> University of California, San Diego >>>> (+1) 858.246.0526 >>>> ----------------------------------------------------------------------- >>>> >>> >> >> >> >> -- >> ----------------------------------------------------------------------- >> Dr. Andreas Prlic >> Senior Scientist, RCSB PDB Protein Data Bank >> University of California, San Diego >> (+1) 858.246.0526 >> ----------------------------------------------------------------------- >> > From andreas at sdsc.edu Fri Oct 29 00:06:55 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 28 Oct 2010 17:06:55 -0700 Subject: [Biojava-l] biojava maven integration In-Reply-To: References: <5B102E94-8CD1-4516-906D-B9EF62516CF4@exseed.ed.ac.uk> Message-ID: Hi Jay, here is some UI code that is using biojava from Maven: http://github.com/biojava/RCSB_SequenceViewer/blob/master/pom.xml Andreas On Thu, Oct 28, 2010 at 3:51 PM, Jay Vyas wrote: > Does anybody have a maven POM example of how to integrate biojava into my > application ? > Thanks! > > Im currently using biojava 1.7, and have put it in my own, local maven > repository. > > > > > On Thu, Oct 28, 2010 at 3:56 PM, LAW Andy wrote: > >> Not 100% certain but I *think* you want to depend on biojava-core rather >> than biojava. >> >> Later, >> >> Andy >> >> On 28 Oct 2010, at 20:43, Jay Vyas wrote: >> >> > Hi guys, I added the following to my pom file >> > >> > ? >> > ? ? ? ?org.biojava >> > ? ? ? ?biojava >> > ? ? ? ?3.0-alpha2 >> > ? >> > >> > >> > ? ? ? ?biojava-maven-repo >> > ? ? ? ?BioJava repository >> > ? ? ? ?http://www.biojava.org/download/maven/ >> > ? ? ? ? >> > ? ? ? ? ? ?true >> > ? ? ? ? >> > ? ? ? ? >> > ? ? ? ? ? ?true >> > ? ? ? ? >> > ? ? >> > >> > >> > But to no avail. ?Does anyone know how to add biojava3 to the libraries >> in a >> > maven managed application >? >> > >> > Thanks. 
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>
> --
> Jay Vyas
> MMSB/UCHC

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From andreas at sdsc.edu Fri Oct 29 00:08:49 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Thu, 28 Oct 2010 17:08:49 -0700
Subject: [Biojava-l] Bad PDB files and batch processing with PDBFileReader
In-Reply-To:
References:
Message-ID:

good, I was just about to say that... ;-)

Andreas

On Thu, Oct 28, 2010 at 4:51 PM, Daniel Asarnow wrote:
> Ahh, I suppose that is the "problem" referred to in the wiki?  I
> checked out successfully from the repository on github.
>
> -da
>
> On Thu, Oct 28, 2010 at 16:45, Daniel Asarnow wrote:
>> It's not a big deal - after all if you use CA only, chains with no
>> CA's aren't important, and the error messages aren't that long.  But
>> I'm going to switch anyway...
>> I'm getting the dreaded "can't read line length in file" error while
>> trying to checkout biojava-live/trunk, though.
>>
>> -da

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From ayates at ebi.ac.uk Fri Oct 29 08:12:09 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 09:12:09 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Vishal,

As far as I am aware there is nothing which will generate them in BioJava at
the moment.
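For the per-sequence counts asked about in this thread, a library-free version can be sketched in plain Java. This is only an illustration under stated assumptions, not BioJava API: the class name KmerCounter and the method countKmers are made up for the example, and it assumes each sequence has already been read into a String (FASTA parsing is left out).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
public class KmerCounter {

    // Counts every overlapping k-mer (window moved one base at a time).
    // Keys appear in the order the k-mers are first seen in the sequence.
    public static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        String s = sequence.toUpperCase(Locale.ROOT);
        // A sequence shorter than k has no k-mers, so the loop never runs.
        for (int i = 0; i + k <= s.length(); i++) {
            String kmer = s.substring(i, i + k);
            Integer old = counts.get(kmer);
            counts.put(kmer, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same toy sequence as used elsewhere in this thread.
        System.out.println(countKmers("ATGATC", 3));
    }
}
```

Calling countKmers once per FASTA record gives one count table per sequence. Each window is copied into its own String, so this is a simple approach for moderate input sizes rather than a memory-efficient one.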
However it is possible to do it with BioJava3:

  public static void main(String[] args) {
    DNASequence d = new DNASequence("ATGATC");
    System.out.println("Non-Overlap");
    nonOverlap(d);
    System.out.println("Overlap");
    overlap(d);
  }

  public static final int KMER = 3;

  //Generate triplets overlapping
  public static void overlap(Sequence<NucleotideCompound> d) {
    List<WindowedSequence<NucleotideCompound>> l =
            new ArrayList<WindowedSequence<NucleotideCompound>>();
    for(int i=1; i<=KMER; i++) {
      SequenceView<NucleotideCompound> sub = d.getSubSequence(
              i, d.getLength());
      WindowedSequence<NucleotideCompound> w =
              new WindowedSequence<NucleotideCompound>(sub, KMER);
      l.add(w);
    }

    //Will return ATG, ATC, TGA & GAT
    for(WindowedSequence<NucleotideCompound> w: l) {
      for(List<NucleotideCompound> subList: w) {
        System.out.println(subList);
      }
    }
  }

  //Generate triplet Compound lists non-overlapping
  public static void nonOverlap(Sequence<NucleotideCompound> d) {
    WindowedSequence<NucleotideCompound> w =
            new WindowedSequence<NucleotideCompound>(d, KMER);
    //Will return ATG & ATC
    for(List<NucleotideCompound> subList: w) {
      System.out.println(subList);
    }
  }

The disadvantage of all of these solutions is that they generate lists of
Compounds so kmer generation can/will be a memory intensive operation. This
does not mean it has to be, since sub sequences are thin wrappers around an
underlying sequence. Also the overlap solution is non-optimal since it
iterates through each window rather than stepping through delegating onto
each base in turn (hence why we get ATG & ATC before TGA).

As for unique k-mers that's something which would require a bit more
engineering & would be better suited to a solution built around a Trie
(prefix tree).

Hope this helps,

Andy

On 28 Oct 2010, at 18:40, Vishal Thapar wrote:

> Hi All,
>
> I had a quick question: Does Biojava have a method to generate k-mers or
> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> counts for every sequence in a fasta file. If something like this exists it
> would save me some time to write the code.
>
> Thanks,
>
> Vishal

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 09:12:53 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 14:42:53 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Dear Friends,

Thanks to Vishal & Andy for this. I actually needed this code too.
Vishal, I think Andy's suggestions may be a good option to include in
BioJava 3. Would you like to add this to BioJava 3?

Thanks again.

Regards,
Jitesh Dundas

From ayates at ebi.ac.uk Fri Oct 29 09:20:36 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 10:20:36 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Okay, a couple of points here:

1). Which biojava3 module? This sounds like something for the genomic module rather than core

2). It'll need some more work. I'm not happy about using the WindowedSequenceView in its current state.
I think an alteration to avoid it making Lists would be a good idea (plus recent developments in the API as to its main use means this is a viable change). Also it should return the overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6 Comments? Andy On 29 Oct 2010, at 10:12, jitesh dundas wrote: > Dear Friends, > > Thanks to Vishal & Andy for this. I actually needed this code too.. > Vishal, I think Andy's suggestions may be a good option to include in > BioJava 3. Would you like to add this to the BioJava 3. > > Thanks again. > > Regards, > Jitesh Dundas > > On 10/29/10, Andy Yates wrote: >> Hi Vishal, >> >> As far as I am aware there is nothing which will generate them in BioJava at >> the moment. However it is possible to do it with BioJava3: >> >> public static void main(String[] args) { >> DNASequence d = new DNASequence("ATGATC"); >> System.out.println("Non-Overlap"); >> nonOverlap(d); >> System.out.println("Overlap"); >> overlap(d); >> } >> >> public static final int KMER = 3; >> >> //Generate triplets overlapping >> public static void overlap(Sequence d) { >> List> l = >> new ArrayList>(); >> for(int i=1; i<=KMER; i++) { >> SequenceView sub = d.getSubSequence( >> i, d.getLength()); >> WindowedSequence w = >> new WindowedSequence(sub, KMER); >> l.add(w); >> } >> >> //Will return ATG, ATC, TGA & GAT >> for(WindowedSequence w: l) { >> for(List subList: w) { >> System.out.println(subList); >> } >> } >> } >> >> //Generate triplet Compound lists non-overlapping >> public static void nonOverlap(Sequence d) { >> WindowedSequence w = >> new WindowedSequence(d, KMER); >> //Will return ATG & ATC >> for(List subList: w) { >> System.out.println(subList); >> } >> } >> >> The disadvantage of all of these solutions is that they generate lists of >> Compounds so kmer generation can/will be a memory intensive operation. This >> does mean it has to be since sub sequences are thin wrappers around an >> underlying sequence. 
Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>>
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>>
>> Hope this helps,
>>
>> Andy
>>
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>
>>> Hi All,
>>>
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>>
>>> Thanks,
>>>
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 10:00:44 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:30:44 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Dear Sir,

Is there any way to detect patterns in the recorded k-mers?

I have a large set of miRNAs (a study of mutations and patterns for gastric cancer). I made a record of k-mers for each sequence but the patterns that are generated are difficult to track.

Can BioJava do this?
Regular Expressions in Java may be useful here.

I would appreciate expert advice on this, and on any other software that might be useful.

Thanks,
Jitesh Dundas

On 10/29/10, jitesh dundas wrote:
> Dear Friends,
>
> Thanks to Vishal & Andy for this. I actually needed this code too..
> Vishal, I think Andy's suggestions may be a good option to include in
> BioJava 3. Would you like to add this to the BioJava 3.
>
> Thanks again.
>
> Regards,
> Jitesh Dundas
>
> On 10/29/10, Andy Yates wrote:
>> Hi Vishal,
>>
>> As far as I am aware there is nothing which will generate them in BioJava
>> at
>> the moment. However it is possible to do it with BioJava3:
>>
>> public static void main(String[] args) {
>>   DNASequence d = new DNASequence("ATGATC");
>>   System.out.println("Non-Overlap");
>>   nonOverlap(d);
>>   System.out.println("Overlap");
>>   overlap(d);
>> }
>>
>> public static final int KMER = 3;
>>
>> //Generate triplets overlapping
>> public static void overlap(Sequence<NucleotideCompound> d) {
>>   List<WindowedSequence<NucleotideCompound>> l =
>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>   for(int i=1; i<=KMER; i++) {
>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>       i, d.getLength());
>>     WindowedSequence<NucleotideCompound> w =
>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>     l.add(w);
>>   }
>>
>>   //Will return ATG, ATC, TGA & GAT
>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>     for(List<NucleotideCompound> subList: w) {
>>       System.out.println(subList);
>>     }
>>   }
>> }
>>
>> //Generate triplet Compound lists non-overlapping
>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>   WindowedSequence<NucleotideCompound> w =
>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>   //Will return ATG & ATC
>>   for(List<NucleotideCompound> subList: w) {
>>     System.out.println(subList);
>>   }
>> }
>>
>> The disadvantage of all of these solutions is that they generate lists of
>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>> does mean it has to be since sub sequences are thin wrappers around an
>> underlying sequence.
Also the overlap solution is non-optimal since it
>> iterates through each window rather than stepping through delegating onto
>> each base in turn (hence why we get ATG & ATC before TGA)
>>
>> As for unique k-mers that's something which would require a bit more
>> engineering & would be better suited to a solution built around a Trie
>> (prefix tree).
>>
>> Hope this helps,
>>
>> Andy
>>
>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>
>>> Hi All,
>>>
>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>> counts for every sequence in a fasta file. If something like this exists
>>> it
>>> would save me some time to write the code.
>>>
>>> Thanks,
>>>
>>> Vishal
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>
>>
>>
>>
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>

From jbdundas at gmail.com Fri Oct 29 10:04:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Fri, 29 Oct 2010 15:34:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

You are right again, my friend. Definitely that would hang up my machine with the XML file parsing activity. This is about sequence alignment and related modules.

I will look at this today and send a fix on that. Hope that you can help.

PS: What about pattern matching in sequences? Would that be interesting to have in BioJava 3?

Regards,
JD

On 10/29/10, Andy Yates wrote:
> Okay couple of points here:
>
> 1). Which biojava3 module?
This sounds like something for the genomic module
> rather than core
>
> 2). It'll need some more work. I'm not happy about using the
> WindowedSequenceView in its current state. I think an alteration to avoid it
> making Lists would be a good idea (plus recent developments in the API as to
> its main use means this is a viable change). Also it should return the
> overlapping ones in base order i.e. 1->3, 2->4 not 1->3, 4->6
>
> Comments?
>
> Andy
>
> On 29 Oct 2010, at 10:12, jitesh dundas wrote:
>
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment. However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these
solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file. If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

From ayates at ebi.ac.uk Fri Oct 29 10:09:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 11:09:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:
 <5832FAFE-FEC3-4A7C-9469-3C334551900B@ebi.ac.uk>

One of the disadvantages of the Sequence-based system is that we have no support for searching in sequences with patterns like regular expressions. Whilst it's possible to convert a Sequence into a String & then apply the expression, that is a sub-optimal solution. Looking at the Pattern code in Java 6, it can take in a CharSequence, so one could write an adaptor to make a Sequence act as a CharSequence for the matching procedure, but really it looks like a lot of work.

As for a way of matching against sequences, HMMER3 is awesome :)

Andy

On 29 Oct 2010, at 11:00, jitesh dundas wrote:

> Dear Sir,
>
> Is there any way to detect patterns in the recorded k-mers .
>
> I have a large set of miRNAs (study for mutations and patgerns for
> gastric cancer).I made a record of k-mers for each sequence but the
> patterns that are generated are difficult to track.
>
> Can BioJava do this point. Regular Expressions in Java maybe useful here..
>
> Request expert advise in this.Any other s/w that might be useful.
>
> Thanks,
> Jitesh Dundas
>
> On 10/29/10, jitesh dundas wrote:
>> Dear Friends,
>>
>> Thanks to Vishal & Andy for this. I actually needed this code too..
>> Vishal, I think Andy's suggestions may be a good option to include in
>> BioJava 3. Would you like to add this to the BioJava 3.
>>
>> Thanks again.
>>
>> Regards,
>> Jitesh Dundas
>>
>> On 10/29/10, Andy Yates wrote:
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at
>>> the moment.
However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence. Also the overlap solution is non-optimal since it
>>> iterates through each window rather than stepping through delegating onto
>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>
>>> As for unique k-mers that's something which would require a bit more
>>> engineering & would be better suited to a solution built around a Trie
>>> (prefix tree).
>>>
>>> Hope this helps,
>>>
>>> Andy
>>>
>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>> counts for every sequence in a fasta file.
If something like this exists
>>>> it
>>>> would save me some time to write the code.
>>>>
>>>> Thanks,
>>>>
>>>> Vishal
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>> --
>>> Andrew Yates                   Ensembl Genomes Engineer
>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jnarayan81 at gmail.com Fri Oct 29 11:46:11 2010
From: jnarayan81 at gmail.com (jitendra narayan)
Date: Fri, 29 Oct 2010 17:16:11 +0530
Subject: [Biojava-l] New Biojava Logo
Message-ID:

Dear All,

I have designed a new BioJava logo. Please see the details here:
http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg

I would value your suggestions on the wiki page:
http://biojava.org/wiki/BioJava:Logo

Thanks

--
Jitendra Narayan
Bioinformatist
www.bioinformaticsonline.com

From genjasp at gmail.com Fri Oct 29 13:05:57 2010
From: genjasp at gmail.com (Alessandro Cipriani)
Date: Fri, 29 Oct 2010 15:05:57 +0200
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

Great Logo!!! :D

2010/10/29 jitendra narayan :
> Dear All
> I have designed a n new biojava logo.
Please see the detail of it:
> http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> I need your valuable
> suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
>
>
> thanks
>
> --
> Jitendra Narayan
> Bioinformatist
> www.bioinformaticsonline.com
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

--
Alessandro Cipriani
(+39) 3206009509
(+39) 3931311792
http://www.cipriania.it
skype:genjasp at gmail.com
msn:jaspzz

From vishalthapar at gmail.com Fri Oct 29 16:27:11 2010
From: vishalthapar at gmail.com (Vishal Thapar)
Date: Fri, 29 Oct 2010 12:27:11 -0400
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Andy,

This is good to have. I feel that including it as a part of core may not be necessary, but having it as part of the Genomic module in biojava3 will be nice. There is a project Bioinformatica http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have, especially if one is trying to find patterns within sequences, which is what I am trying to do. It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree API or not since I haven't used Java in a while and am just switching back to it. A paper on using suffix trees to generate genome-wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.; the software is Tallymer). It would be some work to implement this in Java as a module for biojava3 but I can see that this will be helpful.
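A minimal sketch of the per-sequence k-mer counting asked for above, under loud assumptions: it works on plain Java strings rather than the BioJava3 Sequence API, it uses a sorted map instead of a suffix tree, and `KmerCounter`, `countKmers` and `countPerRecord` are hypothetical names that do not exist in BioJava. The FASTA handling is deliberately naive (header lines start with `>`, everything else is sequence).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class KmerCounter {

    /** Counts overlapping k-mers in one sequence, stepping one base at a time. */
    public static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // TreeMap keeps the k-mers in alphabetical order for stable output.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        String s = sequence.toUpperCase();
        for (int i = 0; i + k <= s.length(); i++) {
            String kmer = s.substring(i, i + k);
            Integer old = counts.get(kmer);
            counts.put(kmer, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Naive FASTA reader: maps each header (without '>') to that record's k-mer counts. */
    public static Map<String, Map<String, Integer>> countPerRecord(String fasta, int k) {
        Map<String, Map<String, Integer>> result =
            new LinkedHashMap<String, Map<String, Integer>>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String raw : fasta.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    result.put(header, countKmers(seq.toString(), k));
                }
                header = line.substring(1).trim();
                seq.setLength(0);
            } else {
                seq.append(line);
            }
        }
        if (header != null) {
            result.put(header, countKmers(seq.toString(), k));
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nATGATC\n>seq2\nAAAA\n";
        // prints {seq1={ATC=1, ATG=1, GAT=1, TGA=1}, seq2={AAA=2}}
        System.out.println(countPerRecord(fasta, 3));
    }
}
```

A map like this is probably adequate for small k and modest FASTA files; for genome-scale input the memory cost of one String per distinct k-mer is the reason index structures such as the one in the cited paper get suggested.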
Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go.

That's just my two cents. What do you think?

-vishal

On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:

> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava
> at the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>   DNASequence d = new DNASequence("ATGATC");
>   System.out.println("Non-Overlap");
>   nonOverlap(d);
>   System.out.println("Overlap");
>   overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>   List<WindowedSequence<NucleotideCompound>> l =
>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>   for(int i=1; i<=KMER; i++) {
>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>       i, d.getLength());
>     WindowedSequence<NucleotideCompound> w =
>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>     l.add(w);
>   }
>
>   //Will return ATG, ATC, TGA & GAT
>   for(WindowedSequence<NucleotideCompound> w: l) {
>     for(List<NucleotideCompound> subList: w) {
>       System.out.println(subList);
>     }
>   }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>   WindowedSequence<NucleotideCompound> w =
>     new WindowedSequence<NucleotideCompound>(d, KMER);
>   //Will return ATG & ATC
>   for(List<NucleotideCompound> subList: w) {
>     System.out.println(subList);
>   }
> }
>
> The disadvantage of all of these solutions is that they generate lists of
> Compounds so kmer generation can/will be a memory intensive operation. This
> does mean it has to be since sub sequences are thin wrappers around an
> underlying sequence. Also the overlap solution is non-optimal since it
> iterates through each window rather than stepping through delegating onto
> each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more
> engineering & would be better suited to a solution built around a Trie
> (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists
> it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>

--
*Vishal Thapar, Ph.D.*
*Scientific informatics Analyst
Cold Spring Harbor Lab
Quick Bldg, Lowe Lab
1 Bungtown Road
Cold Spring Harbor, NY - 11724*

From phidias51 at gmail.com Fri Oct 29 16:56:45 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 09:56:45 -0700
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

It might be useful to make the K-mer storage mechanism pluggable. This would allow a developer to use anything from a simple MultiMap to a NoSQL key-value database to store K-mers. You could plug in custom map implementations to allow you to keep a count of the number of instances of particular K-mers that were found. It might also be useful to be able to do set operations on those K-mer collections. You could use it to determine which K-mers were present in a pathogen and not in a host.
http://www.ncbi.nlm.nih.gov/pubmed/20428334
http://www.ncbi.nlm.nih.gov/pubmed/16403026

Cheers,

Mark

card.ly:

On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as a part of core may not be
> necessary but having it as part of Genomic module in biojava3 will be nice.
> There is a project Bioinformatica
>
> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which
> does something similar although not exactly. It counts the k-mers in a
> given fasta file but it does not count k-mers for each sequence within the
> file, just all within a file. This is a good feature to have specially if
> one is trying to find patterns within sequences which is what I am trying
> to
> do. It would most certainly be helpful to have a k-mer counting algorithm
> that counts k-mer frequency for each sequence. The way to go would be to
> use
> suffix trees. Again I don't know if biojava has a suffix tree api or not
> since I haven't used java in a while and am just switching back to it. A
> paper on using suffix trees to generate genome wide k-mer frequencies is:
> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
> software
> is tallymer). It would be some work to implement this in java as a module
> for biojava3 but I can see that this will be helpful. Again, for small
> fasta
> files, it might not be efficient to create a suffix tree but for bigger
> files, I think that might be the way to go.
>
> Thats just my two cents.What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>
> > Hi Vishal,
> >
> > As far as I am aware there is nothing which will generate them in BioJava
> > at the moment.
However it is possible to do it with BioJava3:
> >
> > public static void main(String[] args) {
> >   DNASequence d = new DNASequence("ATGATC");
> >   System.out.println("Non-Overlap");
> >   nonOverlap(d);
> >   System.out.println("Overlap");
> >   overlap(d);
> > }
> >
> > public static final int KMER = 3;
> >
> > //Generate triplets overlapping
> > public static void overlap(Sequence<NucleotideCompound> d) {
> >   List<WindowedSequence<NucleotideCompound>> l =
> >     new ArrayList<WindowedSequence<NucleotideCompound>>();
> >   for(int i=1; i<=KMER; i++) {
> >     SequenceView<NucleotideCompound> sub = d.getSubSequence(
> >       i, d.getLength());
> >     WindowedSequence<NucleotideCompound> w =
> >       new WindowedSequence<NucleotideCompound>(sub, KMER);
> >     l.add(w);
> >   }
> >
> >   //Will return ATG, ATC, TGA & GAT
> >   for(WindowedSequence<NucleotideCompound> w: l) {
> >     for(List<NucleotideCompound> subList: w) {
> >       System.out.println(subList);
> >     }
> >   }
> > }
> >
> > //Generate triplet Compound lists non-overlapping
> > public static void nonOverlap(Sequence<NucleotideCompound> d) {
> >   WindowedSequence<NucleotideCompound> w =
> >     new WindowedSequence<NucleotideCompound>(d, KMER);
> >   //Will return ATG & ATC
> >   for(List<NucleotideCompound> subList: w) {
> >     System.out.println(subList);
> >   }
> > }
> >
> > The disadvantage of all of these solutions is that they generate lists of
> > Compounds so kmer generation can/will be a memory intensive operation.
> This
> > does mean it has to be since sub sequences are thin wrappers around an
> > underlying sequence. Also the overlap solution is non-optimal since it
> > iterates through each window rather than stepping through delegating onto
> > each base in turn (hence why we get ATG & ATC before TGA)
> >
> > As for unique k-mers that's something which would require a bit more
> > engineering & would be better suited to a solution built around a Trie
> > (prefix tree).
> >
> > Hope this helps,
> >
> > Andy
> >
> > On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
> >
> > > Hi All,
> > >
> > > I had a quick question: Does Biojava have a method to generate k-mers
> or
> > > K-mer counting in a given Fasta Sequence / File? Basically, I want
> k-mer
> > > counts for every sequence in a fasta file.
If something like this
> exists
> > it
> > > would save me some time to write the code.
> > >
> > > Thanks,
> > >
> > > Vishal
> > > _______________________________________________
> > > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biojava-l
> >
> > --
> > Andrew Yates                   Ensembl Genomes Engineer
> > EMBL-EBI                       Tel: +44-(0)1223-492538
> > Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> > Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> >
> >
> >
> >
> >
>
>
> --
> *Vishal Thapar, Ph.D.*
> *Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724*
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>

From ayates at ebi.ac.uk Fri Oct 29 18:32:45 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:32:45 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID:

Hi Vishal,

There's no suffix tree impl in BioJava but if you want to give it a shot then go for it :). I'm interested in how they work but have no time to implement it. As for efficiency, give it a shot & let's see what it does.

Andy

On 29 Oct 2010, at 17:27, Vishal Thapar wrote:

> Hi Andy,
>
> This is good to have. I feel that including it as a part of core may not be necessary but having it as part of Genomic module in biojava3 will be nice. There is a project Bioinformatica http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which does something similar although not exactly. It counts the k-mers in a given fasta file but it does not count k-mers for each sequence within the file, just all within a file. This is a good feature to have specially if one is trying to find patterns within sequences which is what I am trying to do.
It would most certainly be helpful to have a k-mer counting algorithm that counts k-mer frequency for each sequence. The way to go would be to use suffix trees. Again I don't know if biojava has a suffix tree api or not since I haven't used java in a while and am just switching back to it. A paper on using suffix trees to generate genome wide k-mer frequencies is: http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software is tallymer). It would be some work to implement this in java as a module for biojava3 but I can see that this will be helpful. Again, for small fasta files, it might not be efficient to create a suffix tree but for bigger files, I think that might be the way to go.
>
> Thats just my two cents.What do you think?
>
> -vishal
>
> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
> Hi Vishal,
>
> As far as I am aware there is nothing which will generate them in BioJava at the moment. However it is possible to do it with BioJava3:
>
> public static void main(String[] args) {
>   DNASequence d = new DNASequence("ATGATC");
>   System.out.println("Non-Overlap");
>   nonOverlap(d);
>   System.out.println("Overlap");
>   overlap(d);
> }
>
> public static final int KMER = 3;
>
> //Generate triplets overlapping
> public static void overlap(Sequence<NucleotideCompound> d) {
>   List<WindowedSequence<NucleotideCompound>> l =
>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>   for(int i=1; i<=KMER; i++) {
>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>       i, d.getLength());
>     WindowedSequence<NucleotideCompound> w =
>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>     l.add(w);
>   }
>
>   //Will return ATG, ATC, TGA & GAT
>   for(WindowedSequence<NucleotideCompound> w: l) {
>     for(List<NucleotideCompound> subList: w) {
>       System.out.println(subList);
>     }
>   }
> }
>
> //Generate triplet Compound lists non-overlapping
> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>   WindowedSequence<NucleotideCompound> w =
>     new WindowedSequence<NucleotideCompound>(d, KMER);
>   //Will return ATG & ATC
>   for(List<NucleotideCompound> subList: w) {
>     System.out.println(subList);
>   }
> }
>
> The disadvantage of all of these solutions is that they generate lists of Compounds so kmer generation can/will be a memory
intensive operation. This does mean it has to be since sub sequences are thin wrappers around an underlying sequence. Also the overlap solution is non-optimal since it iterates through each window rather than stepping through delegating onto each base in turn (hence why we get ATG & ATC before TGA)
>
> As for unique k-mers that's something which would require a bit more engineering & would be better suited to a solution built around a Trie (prefix tree).
>
> Hope this helps,
>
> Andy
>
> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>
> > Hi All,
> >
> > I had a quick question: Does Biojava have a method to generate k-mers or
> > K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
> > counts for every sequence in a fasta file. If something like this exists it
> > would save me some time to write the code.
> >
> > Thanks,
> >
> > Vishal
> > _______________________________________________
> > Biojava-l mailing list - Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>
>
>
> --
> Vishal Thapar, Ph.D.
> Scientific informatics Analyst
> Cold Spring Harbor Lab
> Quick Bldg, Lowe Lab
> 1 Bungtown Road
> Cold Spring Harbor, NY - 11724
>

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From ayates at ebi.ac.uk Fri Oct 29 18:35:43 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 19:35:43 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References:
Message-ID: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>

So if it's a suffix tree, that's quite a fixed data structure, so developing a pluggable mechanism there would be hard.
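For the simpler map-backed case raised earlier in the thread (not the suffix tree case, where the point above still stands), a pluggable store could be sketched roughly as follows. Everything here is an assumption made for illustration: `KmerStore`, `InMemoryKmerStore`, `index` and `onlyInFirst` are hypothetical names, not BioJava classes, and sequences are plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PluggableKmerDemo {

    /** Hypothetical storage abstraction; a database-backed class could implement it as well. */
    public interface KmerStore {
        void increment(String kmer);
        int count(String kmer);
        Set<String> kmers();
    }

    /** Simplest implementation: counts held in an in-memory HashMap. */
    public static class InMemoryKmerStore implements KmerStore {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void increment(String kmer) {
            Integer old = counts.get(kmer);
            counts.put(kmer, old == null ? 1 : old + 1);
        }

        public int count(String kmer) {
            Integer c = counts.get(kmer);
            return c == null ? 0 : c;
        }

        public Set<String> kmers() {
            // Return a sorted copy so callers cannot modify the store.
            return new TreeSet<String>(counts.keySet());
        }
    }

    /** Feeds every overlapping k-mer of the sequence into whichever store is supplied. */
    public static void index(String sequence, int k, KmerStore store) {
        for (int i = 0; i + k <= sequence.length(); i++) {
            store.increment(sequence.substring(i, i + k));
        }
    }

    /** Set difference: k-mers present in the first store but absent from the second. */
    public static Set<String> onlyInFirst(KmerStore first, KmerStore second) {
        Set<String> result = new TreeSet<String>(first.kmers());
        result.removeAll(second.kmers());
        return result;
    }

    public static void main(String[] args) {
        KmerStore pathogen = new InMemoryKmerStore();
        KmerStore host = new InMemoryKmerStore();
        index("ATGATC", 3, pathogen);
        index("GATCCA", 3, host);
        // prints [ATG, TGA]
        System.out.println(onlyInFirst(pathogen, host));
    }
}
```

The indexing code only sees the interface, which is what would let a key-value database replace the HashMap; whether that abstraction is worth its cost is exactly the question of scope being discussed here.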
I think there also has to be a limit as to what we can sensibly do. If people want to contribute this kind of work, though, then it will all be very well received (with the corresponding test environment/cases of course).

Cheers,

Andy

On 29 Oct 2010, at 17:56, Mark Fortner wrote:

> It might be useful to make the K-mer storage mechanism pluggable. This
> would allow a developer to use anything from a simple MultiMap, to a NoSQL
> key-value database to store K-mers. You could plugin custom map
> implementations to allow you to keep a count of the number of instances of
> particular K-mers that were found. It might also be useful to be able to do
> set operations on those K-mer collections. You could use it to determine
> which K-mers were present in a pathogen and not in a host.
> http://www.ncbi.nlm.nih.gov/pubmed/20428334
> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>
> Cheers,
>
> Mark
>
> card.ly:
>
>
> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
>
>> Hi Andy,
>>
>> This is good to have. I feel that including it as a part of core may not be
>> necessary but having it as part of Genomic module in biojava3 will be nice.
>> There is a project Bioinformatica
>>
>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence which
>> does something similar although not exactly. It counts the k-mers in a
>> given fasta file but it does not count k-mers for each sequence within the
>> file, just all within a file. This is a good feature to have specially if
>> one is trying to find patterns within sequences which is what I am trying
>> to
>> do. It would most certainly be helpful to have a k-mer counting algorithm
>> that counts k-mer frequency for each sequence. The way to go would be to
>> use
>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>> since I haven't used java in a while and am just switching back to it.
A
>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al,
>> software
>> is tallymer). It would be some work to implement this in java as a module
>> for biojava3 but I can see that this will be helpful. Again, for small
>> fasta
>> files, it might not be efficient to create a suffix tree but for bigger
>> files, I think that might be the way to go.
>>
>> Thats just my two cents.What do you think?
>>
>> -vishal
>>
>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>>
>>> Hi Vishal,
>>>
>>> As far as I am aware there is nothing which will generate them in BioJava
>>> at the moment. However it is possible to do it with BioJava3:
>>>
>>> public static void main(String[] args) {
>>>   DNASequence d = new DNASequence("ATGATC");
>>>   System.out.println("Non-Overlap");
>>>   nonOverlap(d);
>>>   System.out.println("Overlap");
>>>   overlap(d);
>>> }
>>>
>>> public static final int KMER = 3;
>>>
>>> //Generate triplets overlapping
>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>   List<WindowedSequence<NucleotideCompound>> l =
>>>     new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>   for(int i=1; i<=KMER; i++) {
>>>     SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>       i, d.getLength());
>>>     WindowedSequence<NucleotideCompound> w =
>>>       new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>     l.add(w);
>>>   }
>>>
>>>   //Will return ATG, ATC, TGA & GAT
>>>   for(WindowedSequence<NucleotideCompound> w: l) {
>>>     for(List<NucleotideCompound> subList: w) {
>>>       System.out.println(subList);
>>>     }
>>>   }
>>> }
>>>
>>> //Generate triplet Compound lists non-overlapping
>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>   WindowedSequence<NucleotideCompound> w =
>>>     new WindowedSequence<NucleotideCompound>(d, KMER);
>>>   //Will return ATG & ATC
>>>   for(List<NucleotideCompound> subList: w) {
>>>     System.out.println(subList);
>>>   }
>>> }
>>>
>>> The disadvantage of all of these solutions is that they generate lists of
>>> Compounds so kmer generation can/will be a memory intensive operation.
>> This
>>> does mean it has to be since sub sequences are thin wrappers around an
>>> underlying sequence.
Also the overlap solution is non-optimal since it >>> iterates through each window rather than stepping through delegating onto >>> each base in turn (hence why we get ATG & ATC before TGA) >>> >>> As for unique k-mers that's something which would require a bit more >>> engineering & would be better suited to a solution built around a Trie >>> (prefix tree). >>> >>> Hope this helps, >>> >>> Andy >>> >>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote: >>> >>>> Hi All, >>>> >>>> I had a quick question: Does Biojava have a method to generate k-mers >> or >>>> K-mer counting in a given Fasta Sequence / File? Basically, I want >> k-mer >>>> counts for every sequence in a fasta file. If something like this >> exists >>> it >>>> would save me some time to write the code. >>>> >>>> Thanks, >>>> >>>> Vishal >>>> _______________________________________________ >>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-l >>> >>> -- >>> Andrew Yates Ensembl Genomes Engineer >>> EMBL-EBI Tel: +44-(0)1223-492538 >>> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >>> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >>> >>> >>> >>> >>> >> >> >> -- >> *Vishal Thapar, Ph.D.* >> *Scientific informatics Analyst >> Cold Spring Harbor Lab >> Quick Bldg, Lowe Lab >> 1 Bungtown Road >> Cold Spring Harbor, NY - 11724* >> _______________________________________________ >> Biojava-l mailing list - Biojava-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-l >> > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From jayunit100 at gmail.com Fri Oct 29 18:40:46 2010 From: jayunit100 at gmail.com (Jay Vyas) Date: 
Fri, 29 Oct 2010 14:40:46 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Hi guys: I'm trying to migrate a biojava project built on 1.7 to biojava 3, and am having to look up some modules, etc. I'm having trouble finding the biojava3 javadocs. Unfortunately, the 'googleable' javadocs are all from 1.7.

Where is the formal/generated javadoc info for biojava3? Is it online?

From phidias51 at gmail.com Fri Oct 29 18:48:53 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 11:48:53 -0700
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID:

I was thinking more along the lines of using something that implements the Map interface. This would allow a developer to easily unit test the code without having to load the data for a genome. You would also be able to provide different implementations to suit your needs. If you wanted to use a suffix tree as the underlying implementation, that would be OK, but you would have other options as well.

Cheers,

Mark

card.ly:
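
As a rough illustration of the Map-backed k-mer storage idea raised in this thread, here is a minimal sketch in plain Java. It does not use any BioJava classes: the class name KmerCounter and its count method are hypothetical, and sequences are assumed to be plain strings. The caller passes in whichever Map implementation it wants, which is what would make the storage pluggable.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava: counts overlapping k-mers
// in a single sequence and stores them in a caller-supplied Map.
final class KmerCounter {

    private KmerCounter() {
    }

    /**
     * Slides a window of length k over the sequence one base at a time
     * and increments the count for each k-mer in the given store.
     * The store is returned so calls can be chained.
     */
    public static Map<String, Integer> count(String sequence, int k,
                                             Map<String, Integer> store) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            Integer seen = store.get(kmer);
            store.put(kmer, seen == null ? 1 : seen + 1);
        }
        return store;
    }

    public static void main(String[] args) {
        // A TreeMap keeps the k-mers sorted; a HashMap or any other
        // Map implementation could be passed in instead.
        Map<String, Integer> counts =
                count("ATGATC", 3, new TreeMap<String, Integer>());
        System.out.println(counts); // prints {ATC=1, ATG=1, GAT=1, TGA=1}
    }
}
```

For per-sequence counts over a FASTA file, one option would be to call count once per record with a fresh Map and key the results by sequence identifier. Using small in-memory maps like this also keeps unit tests independent of genome-sized data.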

From jbdundas at gmail.com Fri Oct 29 18:50:11 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 00:20:11 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID:

I agree Andy.
These have become standard functionalities that scientists use these days. I am all for implementing that in BioJava3. Java isn't that efficient for such functionalities, so we will surely need more effort compared to the same in Python/Perl.

Regards,
Jitesh Dundas

From willishf at ufl.edu Fri Oct 29 19:20:19 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 15:20:19 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Jay

I don't think we have pushed the biojava3 docs up to a place where Google can find them. From the nightly build http://www.biojava.org/download/maven/org/biojava/ you can find javadocs in the jar files. Biojava3 has two parts now.
The older 1.7 modules have been refactored into standalone jar files where possible, but it is still a very cross-dependent code base. Then the newer modules labeled biojava3- are a clean break from 1.7, so depending on what you are doing it may be easy or difficult to start using the newer biojava3 code without lots of changes in your code.

Thanks

Scooter

From markjschreiber at gmail.com Fri Oct 29 19:25:12 2010
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 29 Oct 2010 15:25:12 -0400
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

It might pay to put the link to the docs on the top level page.

You may need to get an Admin to change the front page.

From ayates at ebi.ac.uk Fri Oct 29 19:34:11 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Fri, 29 Oct 2010 20:34:11 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk>
Message-ID: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>

So we've got some basic kmer work now in SVN. If you look in the class SequenceMixin there are two static methods there for generating the two types of k-mers. It's not developed with Map storage in mind & I'll leave the door open there for anyone else to come in & develop it. The k-mers are also not unique across the sequence but it's a start :)

Share & enjoy!

Andy

--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/

From jbdundas at gmail.com Fri Oct 29 19:43:38 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 01:13:38 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID:

That is good news. Thanks for the directions, Andy. I have already started on this. Let me analyze and write the code now. Maybe a next-month deadline is not unreachable in this case.

Here we go!

JD
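
For the unique k-mer question, the thread suggests a solution built around a Trie (prefix tree). The following is a minimal sketch of that idea in plain Java, not the SequenceMixin code mentioned in the thread and not a BioJava API: the class KmerTrie and all of its methods are hypothetical, and sequences are assumed to be plain strings. Each k-mer is stored as a path of k nodes, so repeated k-mers share nodes and only bump a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie of fixed-length k-mers; not part of BioJava.
final class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<Character, Node>();
        int count; // greater than zero only on nodes that end a stored k-mer
    }

    private final Node root = new Node();
    private final int k;

    public KmerTrie(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Adds every overlapping k-mer of the sequence, stepping one base at a time. */
    public void addAll(String sequence) {
        for (int start = 0; start + k <= sequence.length(); start++) {
            Node node = root;
            for (int i = start; i < start + k; i++) {
                char base = sequence.charAt(i);
                Node next = node.children.get(base);
                if (next == null) {
                    next = new Node();
                    node.children.put(base, next);
                }
                node = next;
            }
            node.count++;
        }
    }

    /** Returns how often the k-mer was added, or 0 if it was never seen. */
    public int count(String kmer) {
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    /** Lists each distinct k-mer once, in sorted order. */
    public List<String> uniqueKmers() {
        List<String> out = new ArrayList<String>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.count > 0) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addAll("ATGATGC");
        System.out.println(trie.uniqueKmers()); // prints [ATG, GAT, TGA, TGC]
        System.out.println(trie.count("ATG"));  // prints 2
    }
}
```

The sketch walks the sequence directly rather than building a window object per k-mer, which is one way to avoid the memory cost of materialising Compound lists discussed above. A suffix tree, as used by Tallymer, would be a different and more involved structure.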
From jayunit100 at gmail.com Fri Oct 29 21:39:34 2010
From: jayunit100 at gmail.com (Jay Vyas)
Date: Fri, 29 Oct 2010 17:39:34 -0400
Subject: [Biojava-l] JavaDocs and Backwards compatibility
Message-ID:

Thanks, I am now all up to date with biojava 3.0 and it really works well.

It would be really valuable to have some public biojava javadocs! When I completely removed biojava 1.7 and replaced it with biojava 3.0, it was somewhat tedious to refactor and find old classes under their new package names, for example:

org.biojava3.alignment.SimpleSubstitutionMatrix;
org.biojava3.alignment.template.SubstitutionMatrix;

From andreas at sdsc.edu Fri Oct 29 21:59:23 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Fri, 29 Oct 2010 14:59:23 -0700
Subject: [Biojava-l] Biojava-l Digest, Vol 93, Issue 21
In-Reply-To:
References:
Message-ID:

Ideally I would like to see the automated build system also deploy the latest javadocs on the website. I guess I should play around with the maven site-plugin to see if it can do that ... or does anybody have a recommendation for any other plugin?

Andreas

On Fri, Oct 29, 2010 at 12:25 PM, Mark Schreiber wrote:
> It might pay to put the link to the docs on the top level page.
>
> You may need to get an Admin to change the front page.
>
> On Fri, Oct 29, 2010 at 3:20 PM, Scooter Willis wrote:
>
>> Jay
>>
>> I don't think we have pushed the biojava3 docs up to a place where google
>> can find them. From the nightly build
>> http://www.biojava.org/download/maven/org/biojava/ you can find javadocs in
>> the jar files. Biojava3 has two parts now. The older 1.7 modules refactored
>> into standalone jar files when possible but it is still a very cross
>> dependent code base.
>> Then the newer modules labeled biojava3- are a clean
>> break from 1.7 so depending on what you are doing it may be easy/difficult
>> to start using the newer biojava3 code without lots of changes in your
>> code.
>>
>> Thanks
>>
>> Scooter
>>
>> On Fri, Oct 29, 2010 at 2:40 PM, Jay Vyas wrote:
>>
>> > Hi guys: I'm trying to break up a biojava project built on 1.7 into biojava
>> > 3, and am having to look up some modules etc...
>> > I'm having trouble finding biojava3 javadocs. Unfortunately, the
>> > 'googleable' java docs are all from 1.7 .....
>> >
>> > Where is the formal/generated javadoc info for biojava3? Is it online?

--
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From simon.rayner.cn at gmail.com Fri Oct 29 23:38:13 2010
From: simon.rayner.cn at gmail.com (simon rayner)
Date: Sat, 30 Oct 2010 07:38:13 +0800
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

just a suggestion, but might beans falling out the cup suggest that biojava is unstable? just offering feedback, i still think it looks very slick!

On Fri, Oct 29, 2010 at 9:05 PM, Alessandro Cipriani wrote:

> Great Logo!!!
>
> :D
>
> 2010/10/29 jitendra narayan :
> > Dear All
> > I have designed a new biojava logo.
> > Please see the detail of it:
> > http://biojava.org/wiki/File:Biojava_logo_jitendra.jpg
> > I need your valuable
> > suggestion on wiki page: http://biojava.org/wiki/BioJava:Logo
> >
> > thanks
> >
> > --
> > Jitendra Narayan
> > Bioinformatist
> > www.bioinformaticsonline.com
>
> --
> Alessandro Cipriani
> (+39) 3206009509
> (+39) 3931311792
> http://www.cipriania.it
> skype:genjasp at gmail.com
> msn:jaspzz

--
Simon Rayner

State Key Laboratory of Virology
Wuhan Institute of Virology
Chinese Academy of Sciences
Wuhan, Hubei 430071
P.R.China

+86 (27) 87199895 (office)
+86 18627113001 (cell)

From phidias51 at gmail.com Fri Oct 29 23:49:54 2010
From: phidias51 at gmail.com (Mark Fortner)
Date: Fri, 29 Oct 2010 16:49:54 -0700
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

The first logo looks nice; however, I don't see anything in it that connects it to biology. The second logo is too close to Oracle's logo, and I suspect it would require written permission from them in order to use it.

Cheers,

Mark

On Fri, Oct 29, 2010 at 4:38 PM, simon rayner wrote:

> just a suggestion, but might beans falling out the cup suggest that biojava
> is unstable? just offering feedback, i still think it looks very slick!

From willishf at ufl.edu Sat Oct 30 00:02:32 2010
From: willishf at ufl.edu (Scooter Willis)
Date: Fri, 29 Oct 2010 20:02:32 -0400
Subject: [Biojava-l] New Biojava Logo
In-Reply-To:
References:
Message-ID:

Jitendra

Could you morph from the coffee liquid to a DNA helix?

Scooter

On Fri, Oct 29, 2010 at 7:49 PM, Mark Fortner wrote:

> The first logo looks nice; however, I don't see anything in it that connects
> it to biology. The second logo is too close to Oracle's logo, and I suspect
> would require written permission from them in order to use it.
>
> Cheers,
>
> Mark

From ayates at ebi.ac.uk Sat Oct 30 09:20:30 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Sat, 30 Oct 2010 10:20:30 +0100
Subject: [Biojava-l] K-mers
In-Reply-To:
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk>
Message-ID: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>

You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or, if Andreas can post up the continuous integration server address, there will be a release tonight.

Just goes to show you should always do more testing than you think :).

Andy

On 29 Oct 2010, at 20:43, jitesh dundas wrote:

> That is good news. Thanks for the directions Andy.
>
> I have already started on this. Let me analyze and write the code now.
>
> Maybe a next month deadline is not unreachable in this case.
>
> Here we go!
> JD
>
> On 10/30/10, Andy Yates wrote:
>> So we've got some basic kmer work now in SVN. If you look in the class
>> SequenceMixin there are two static methods there for generating the two
>> types of k-mers. It's not developed with Map storage in mind & I'll leave
>> the door open there for anyone else to come in & develop it. The k-mers are
>> also not unique across the sequence but it's a start :)
>>
>> Share & enjoy!
>>
>> Andy
>>
>> On 29 Oct 2010, at 19:50, jitesh dundas wrote:
>>
>>> I agree Andy. These have become standard functionalities that
>>> scientists do these days. I am all for implementing that in BioJava3.
>>> Java isn't that efficient for such functionalities so we will surely
>>> need more effort compared to the same in Python/Perl.
>>>
>>> Regards,
>>> Jitesh Dundas
>>>
>>> On 10/30/10, Andy Yates wrote:
>>>> So if it's a suffix tree that's quite a fixed data structure so the chances
>>>> of developing a pluggable mechanism there would be hard. I think there also
>>>> has to be a limit as to what we can sensibly do.
>>>> If people want to
>>>> contribute this kind of work though then it'd all be very well received
>>>> (with the corresponding test environment/cases of course).
>>>>
>>>> Cheers,
>>>>
>>>> Andy
>>>>
>>>> On 29 Oct 2010, at 17:56, Mark Fortner wrote:
>>>>
>>>>> It might be useful to make the K-mer storage mechanism pluggable. This
>>>>> would allow a developer to use anything from a simple MultiMap, to a NoSQL
>>>>> key-value database to store K-mers. You could plug in custom map
>>>>> implementations to allow you to keep a count of the number of instances of
>>>>> particular K-mers that were found. It might also be useful to be able to do
>>>>> set operations on those K-mer collections. You could use it to determine
>>>>> which K-mers were present in a pathogen and not in a host.
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mark
>>>>>
>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> This is good to have. I feel that including it as a part of core may not be
>>>>>> necessary but having it as part of Genomic module in biojava3 will be nice.
>>>>>> There is a project Bioinformatica
>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>> which does something similar although not exactly. It counts the k-mers in a
>>>>>> given fasta file but it does not count k-mers for each sequence within the
>>>>>> file, just all within a file. This is a good feature to have, especially if
>>>>>> one is trying to find patterns within sequences, which is what I am trying to
>>>>>> do. It would most certainly be helpful to have a k-mer counting algorithm
>>>>>> that counts k-mer frequency for each sequence. The way to go would be to use
>>>>>> suffix trees. Again I don't know if biojava has a suffix tree api or not
>>>>>> since I haven't used java in a while and am just switching back to it. A
>>>>>> paper on using suffix trees to generate genome wide k-mer frequencies is:
>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (kurtz et al, software
>>>>>> is tallymer). It would be some work to implement this in java as a module
>>>>>> for biojava3 but I can see that this will be helpful. Again, for small fasta
>>>>>> files, it might not be efficient to create a suffix tree but for bigger
>>>>>> files, I think that might be the way to go.
>>>>>>
>>>>>> That's just my two cents. What do you think?
>>>>>>
>>>>>> -vishal
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>>>>>>
>>>>>>> Hi Vishal,
>>>>>>>
>>>>>>> As far as I am aware there is nothing which will generate them in BioJava
>>>>>>> at the moment. However it is possible to do it with BioJava3:
>>>>>>>
>>>>>>> public static void main(String[] args) {
>>>>>>>     DNASequence d = new DNASequence("ATGATC");
>>>>>>>     System.out.println("Non-Overlap");
>>>>>>>     nonOverlap(d);
>>>>>>>     System.out.println("Overlap");
>>>>>>>     overlap(d);
>>>>>>> }
>>>>>>>
>>>>>>> public static final int KMER = 3;
>>>>>>>
>>>>>>> //Generate triplets overlapping
>>>>>>> //(generic type parameters were lost in the archive; NucleotideCompound is assumed)
>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>     List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>     for(int i=1; i<=KMER; i++) {
>>>>>>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>             i, d.getLength());
>>>>>>>         WindowedSequence<NucleotideCompound> w =
>>>>>>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>         l.add(w);
>>>>>>>     }
>>>>>>>
>>>>>>>     //Will return ATG, ATC, TGA & GAT
>>>>>>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>         for(List<NucleotideCompound> subList: w) {
>>>>>>>             System.out.println(subList);
>>>>>>>         }
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>     //Will return ATG & ATC
>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>         System.out.println(subList);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> The disadvantage of all of these solutions is that they generate lists of
>>>>>>> Compounds so kmer generation can/will be a memory intensive operation. This
>>>>>>> doesn't mean it has to be, since sub sequences are thin wrappers around an
>>>>>>> underlying sequence. Also the overlap solution is non-optimal since it
>>>>>>> iterates through each window rather than stepping through delegating onto
>>>>>>> each base in turn (hence why we get ATG & ATC before TGA)
>>>>>>>
>>>>>>> As for unique k-mers that's something which would require a bit more
>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>> (prefix tree).
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I had a quick question: Does Biojava have a method to generate k-mers or
>>>>>>>> K-mer counting in a given Fasta Sequence / File? Basically, I want k-mer
>>>>>>>> counts for every sequence in a fasta file. If something like this exists it
>>>>>>>> would save me some time to write the code.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Vishal
>>>>>>
>>>>>> --
>>>>>> *Vishal Thapar, Ph.D.*
>>>>>> *Scientific informatics Analyst
>>>>>> Cold Spring Harbor Lab
>>>>>> Quick Bldg, Lowe Lab
>>>>>> 1 Bungtown Road
>>>>>> Cold Spring Harbor, NY - 11724*

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From jbdundas at gmail.com Sat Oct 30 09:40:35 2010
From: jbdundas at gmail.com (jitesh dundas)
Date: Sat, 30 Oct 2010 15:10:35 +0530
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk> <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID:

I got your point Andy. Thanks.

On Sat, Oct 30, 2010 at 2:50 PM, Andy Yates wrote:

> You should be aware I just found a bug in the code. This has been fixed but
> the bug will still be in the alpha3 release. I would recommend either
> building a version yourself or if Andreas can post up the continuous
> integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
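On the remark in this thread that unique k-mers would be better suited to a trie (prefix tree): the sketch below is one possible reading of that idea in plain Java, taking "unique" to mean "each distinct k-mer listed once". It is not the BioJava implementation; `KmerTrie`, `addAll`, `distinctKmers` and `count` are hypothetical names, and the input is assumed to be a plain String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not BioJava API: a trie holding the distinct k-mers of a sequence. */
final class KmerTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        int count; // occurrences of the k-mer ending here (stays 0 for inner nodes)
    }

    private final Node root = new Node();
    private final int k;

    KmerTrie(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Inserts every overlapping k-mer of the sequence; shared prefixes share nodes. */
    void addAll(String sequence) {
        for (int i = 0; i + k <= sequence.length(); i++) {
            Node node = root;
            for (int j = i; j < i + k; j++) {
                node = node.children.computeIfAbsent(sequence.charAt(j), c -> new Node());
            }
            node.count++;
        }
    }

    /** Returns how often the given k-mer was inserted, or 0 if it was never seen. */
    int count(String kmer) {
        if (kmer.length() != k) {
            return 0;
        }
        Node node = root;
        for (int i = 0; i < kmer.length() && node != null; i++) {
            node = node.children.get(kmer.charAt(i));
        }
        return node == null ? 0 : node.count;
    }

    /** Returns each distinct k-mer once, in alphabetical order. */
    List<String> distinctKmers() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (prefix.length() == k) {
            out.add(prefix.toString());
            return;
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        KmerTrie trie = new KmerTrie(3);
        trie.addAll("ATGATG");
        System.out.println(trie.distinctKmers()); // prints [ATG, GAT, TGA]
    }
}
```

Whether a trie actually saves memory over a hash-based set in Java depends on k and on how the nodes are represented; the per-node TreeMap here is chosen for readability and sorted output only.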
From andreas at sdsc.edu Sat Oct 30 10:50:48 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 30 Oct 2010 06:50:48 -0400
Subject: [Biojava-l] K-mers
In-Reply-To: <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
References: <06F6B7C8-5467-4F3C-A4AB-13878185F758@ebi.ac.uk> <23A62127-787F-4B63-AC50-D799EF40144D@ebi.ac.uk> <1DEA8143-0B65-41FD-BD2E-500531F1D6A1@ebi.ac.uk>
Message-ID:

just kicked off a new build.. alpha4 should be on the servers shortly...

you don't need cruisecontrol for a release. Anybody with an ssh account on portal.open-bio (and set up ssh keys correctly) can do

mvn release:clean release:prepare release:perform

A

On Sat, Oct 30, 2010 at 5:20 AM, Andy Yates wrote:
> You should be aware I just found a bug in the code. This has been fixed but the bug will still be in the alpha3 release. I would recommend either building a version yourself or if Andreas can post up the continuous integration server address there will be a release tonight.
>
> Just goes to show you should always do more testing than you think :).
>
> Andy
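The set-operation idea raised in this thread (finding k-mers present in a pathogen but not in a host) can be sketched with the standard Java collections. This is only an illustration under simple assumptions: `KmerSets`, `kmers` and `onlyInFirst` are hypothetical names, the two toy sequences are made up, and both k-mer sets are assumed to fit in memory.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of BioJava: set operations on k-mer collections. */
final class KmerSets {

    /** Returns the distinct overlapping k-mers of a sequence, sorted alphabetically. */
    static Set<String> kmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        Set<String> result = new TreeSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    /** Returns the k-mers found in the first sequence but not in the second (set difference). */
    static Set<String> onlyInFirst(String first, String second, int k) {
        Set<String> result = kmers(first, k);
        result.removeAll(kmers(second, k));
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins for a "pathogen" and a "host" sequence.
        System.out.println(onlyInFirst("ATGATC", "TTGATG", 3)); // prints [ATC]
    }
}
```

Swapping the in-memory TreeSet for another Set implementation is where a pluggable storage mechanism, as suggested in the thread, would come in; `retainAll` would give the intersection in the same way.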
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/20428334
>>>>>> http://www.ncbi.nlm.nih.gov/pubmed/16403026
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> card.ly:
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 29, 2010 at 9:27 AM, Vishal Thapar wrote:
>>>>>>
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> This is good to have. I feel that including it as part of core may not
>>>>>>> be necessary, but having it as part of the Genomic module in biojava3
>>>>>>> would be nice. There is a project, Bioinformatica,
>>>>>>>
>>>>>>> http://code.google.com/p/bioformatica/source/browse/#svn/trunk/src/bioformatica/sequence
>>>>>>>
>>>>>>> which does something similar, although not exactly. It counts the
>>>>>>> k-mers in a given FASTA file, but only across the whole file, not for
>>>>>>> each sequence within it. Per-sequence counting is a good feature to
>>>>>>> have, especially if one is trying to find patterns within sequences,
>>>>>>> which is what I am trying to do. It would most certainly be helpful to
>>>>>>> have a k-mer counting algorithm that counts k-mer frequency for each
>>>>>>> sequence. The way to go would be to use suffix trees. Again, I don't
>>>>>>> know if biojava has a suffix tree API or not, since I haven't used
>>>>>>> Java in a while and am just switching back to it. A paper on using
>>>>>>> suffix trees to generate genome-wide k-mer frequencies is:
>>>>>>> http://www.biomedcentral.com/1471-2164/9/517/abstract (Kurtz et al.;
>>>>>>> the software is Tallymer). It would be some work to implement this in
>>>>>>> Java as a module for biojava3, but I can see that it would be helpful.
>>>>>>> Again, for small FASTA files it might not be efficient to create a
>>>>>>> suffix tree, but for bigger files I think that might be the way to go.
>>>>>>>
>>>>>>> That's just my two cents. What do you think?
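The per-sequence counting Vishal asks for can be sketched in plain Java. This is only an illustration under stated assumptions: it is not BioJava API, the class and method names (KmerCounter, countKmers, countPerSequence) are hypothetical, the sequences are assumed to be already loaded as name/string pairs, and it uses an ordinary sorted map where the thread suggests a suffix tree, so it suits small inputs rather than genome-scale files.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of BioJava.
final class KmerCounter {

    /** Counts overlapping k-mers (window moved one base at a time) in one sequence. */
    static Map<String, Integer> countKmers(String sequence, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // TreeMap keeps the k-mers sorted so the output is deterministic.
        Map<String, Integer> counts = new TreeMap<>();
        String s = sequence.toUpperCase(); // treat "atg" and "ATG" as the same k-mer
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts k-mers separately for every named sequence (e.g. each FASTA record). */
    static Map<String, Map<String, Integer>> countPerSequence(
            Map<String, String> records, int k) {
        Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> record : records.entrySet()) {
            result.put(record.getKey(), countKmers(record.getValue(), k));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("seq1", "ATGATC");
        records.put("seq2", "AAAA");
        // prints {seq1={ATC=1, ATG=1, GAT=1, TGA=1}, seq2={AAA=2}}
        System.out.println(countPerSequence(records, 3));
    }
}
```

Keeping one map per record is what separates this from a whole-file count; the inner map could be swapped for another Map implementation if a different storage back end is wanted.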
>>>>>>>
>>>>>>> -vishal
>>>>>>>
>>>>>>> On Fri, Oct 29, 2010 at 4:12 AM, Andy Yates wrote:
>>>>>>>
>>>>>>>> Hi Vishal,
>>>>>>>>
>>>>>>>> As far as I am aware there is nothing which will generate them in
>>>>>>>> BioJava at the moment. However it is possible to do it with BioJava3:
>>>>>>>>
>>>>>>>> public static void main(String[] args) {
>>>>>>>>     DNASequence d = new DNASequence("ATGATC");
>>>>>>>>     System.out.println("Non-Overlap");
>>>>>>>>     nonOverlap(d);
>>>>>>>>     System.out.println("Overlap");
>>>>>>>>     overlap(d);
>>>>>>>> }
>>>>>>>>
>>>>>>>> public static final int KMER = 3;
>>>>>>>>
>>>>>>>> //Generate triplets overlapping
>>>>>>>> public static void overlap(Sequence<NucleotideCompound> d) {
>>>>>>>>     List<WindowedSequence<NucleotideCompound>> l =
>>>>>>>>         new ArrayList<WindowedSequence<NucleotideCompound>>();
>>>>>>>>     for(int i=1; i<=KMER; i++) {
>>>>>>>>         SequenceView<NucleotideCompound> sub = d.getSubSequence(
>>>>>>>>             i, d.getLength());
>>>>>>>>         WindowedSequence<NucleotideCompound> w =
>>>>>>>>             new WindowedSequence<NucleotideCompound>(sub, KMER);
>>>>>>>>         l.add(w);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     //Will return ATG, ATC, TGA & GAT
>>>>>>>>     for(WindowedSequence<NucleotideCompound> w: l) {
>>>>>>>>         for(List<NucleotideCompound> subList: w) {
>>>>>>>>             System.out.println(subList);
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> //Generate triplet Compound lists non-overlapping
>>>>>>>> public static void nonOverlap(Sequence<NucleotideCompound> d) {
>>>>>>>>     WindowedSequence<NucleotideCompound> w =
>>>>>>>>         new WindowedSequence<NucleotideCompound>(d, KMER);
>>>>>>>>     //Will return ATG & ATC
>>>>>>>>     for(List<NucleotideCompound> subList: w) {
>>>>>>>>         System.out.println(subList);
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> The disadvantage of all of these solutions is that they generate lists
>>>>>>>> of Compounds, so k-mer generation can/will be a memory-intensive
>>>>>>>> operation. This does not mean it has to be, since sub sequences are
>>>>>>>> thin wrappers around an underlying sequence.
Also the overlap solution is non-optimal since it
>>>>>>>> iterates through each window rather than stepping through, delegating
>>>>>>>> onto each base in turn (hence why we get ATG & ATC before TGA).
>>>>>>>>
>>>>>>>> As for unique k-mers, that's something which would require a bit more
>>>>>>>> engineering & would be better suited to a solution built around a Trie
>>>>>>>> (prefix tree).
>>>>>>>>
>>>>>>>> Hope this helps,
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 28 Oct 2010, at 18:40, Vishal Thapar wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I had a quick question: Does BioJava have a method to generate k-mers
>>>>>>>>> or do k-mer counting for a given FASTA sequence/file? Basically, I
>>>>>>>>> want k-mer counts for every sequence in a FASTA file. If something
>>>>>>>>> like this exists it would save me some time writing the code.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Vishal
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Yates                   Ensembl Genomes Engineer
>>>>>>>> EMBL-EBI                       Tel: +44-(0)1223-492538
>>>>>>>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>>>>>>>> Cambridge CB10 1SD, UK
http://www.ensemblgenomes.org/
>>>>>>>
>>>>>>> --
>>>>>>> *Vishal Thapar, Ph.D.*
>>>>>>> *Scientific informatics Analyst
>>>>>>> Cold Spring Harbor Lab
>>>>>>> Quick Bldg, Lowe Lab
>>>>>>> 1 Bungtown Road
>>>>>>> Cold Spring Harbor, NY - 11724*
>

--
-----------------------------------------------------------------------
Dr.
Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------

From dasarnow at gmail.com  Sun Oct 31 23:56:05 2010
From: dasarnow at gmail.com (Daniel Asarnow)
Date: Sun, 31 Oct 2010 16:56:05 -0700
Subject: [Biojava-l] Superimposing structure pieces
Message-ID:

I've been trying to pull out pieces of protein chains and superimpose
them. My current code (generic-ified snippets below) works, but I wonder
if it couldn't be faster. Has anyone worked on similar methods? Any other
advice?

Best regards everyone,
da

Getting residue CA atoms as Atom[]:

    for (int i = 0; i < length; i++) {
        someAtoms[i] = someChain.getSeqResGroup(start + i).getAtom("CA");
    }

Superimposing/aligning:

    SVDSuperimposer svds = new SVDSuperimposer(someAtoms1, someAtoms2);
    Matrix rot = svds.getRotation();
    Atom trans = svds.getTranslation();
    // apply the fitted rotation, then the translation, to each atom of set 1
    for (int i = 0; i < length; i++) {
        Calc.rotate(someAtoms1[i], rot);
        Calc.shift(someAtoms1[i], trans);
    }
    SVDSuperimposer.getRmsd(someAtoms1, someAtoms2);
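
To make explicit what the rotate/shift loop and the final RMSD call in the snippets above compute, here is a minimal plain-Java sketch on bare coordinate arrays. It is an illustration only, not BioJava code: the class and method names (RigidTransformDemo, transform, rmsd) are hypothetical, the rotation and translation are supplied by hand rather than fitted by SVD, and it assumes the column-vector convention x' = R x + t, which may differ from the convention a given library uses.

```java
// Hypothetical illustration, not part of BioJava.
final class RigidTransformDemo {

    /** Returns R * p + t for every point p (column-vector convention, assumed here). */
    static double[][] transform(double[][] points, double[][] r, double[] t) {
        double[][] out = new double[points.length][3];
        for (int i = 0; i < points.length; i++) {
            for (int row = 0; row < 3; row++) {
                out[i][row] = r[row][0] * points[i][0]
                            + r[row][1] * points[i][1]
                            + r[row][2] * points[i][2]
                            + t[row];
            }
        }
        return out;
    }

    /** Root-mean-square deviation between two index-matched coordinate sets. */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("coordinate sets must be non-empty and equal in size");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int d = 0; d < 3; d++) {
                double diff = a[i][d] - b[i][d];
                sum += diff * diff;
            }
        }
        return Math.sqrt(sum / a.length);
    }

    public static void main(String[] args) {
        double[][] fixed = {{0, 0, 0}, {1, 0, 0}, {0, 1, 0}};
        // "moving" is "fixed" rotated +90 degrees about z, then shifted by (1, 2, 3)
        double[][] moving = {{1, 2, 3}, {1, 3, 3}, {0, 2, 3}};
        // the inverse transform: rotate -90 degrees about z, then shift by -R * (1, 2, 3)
        double[][] r = {{0, 1, 0}, {-1, 0, 0}, {0, 0, 1}};
        double[] t = {-2, 1, -3};

        System.out.println("RMSD before: " + rmsd(moving, fixed));
        System.out.println("RMSD after:  " + rmsd(transform(moving, r, t), fixed));
    }
}
```

In this toy case the RMSD drops from the square root of 14 (about 3.74) to zero, because the supplied transform is the exact inverse of the one used to build the moving set. Working on plain arrays like this only shows the arithmetic; whether it would be faster than the per-atom calls in a real pipeline would need measuring.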