From mark.schreiber at novartis.com Mon Oct 3 21:06:16 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Mon Oct 3 21:17:47 2005
Subject: [Biojava-dev] JDK 1.5
Message-ID:

Hello -

Biojava is still officially using JDK 1.4.2

I know many people have changed to JDK1.5

While no-one is using generics etc in the code base there have been a number of method calls that have slipped in that rely on JDK 1.5.

The most common one is Integer.valueOf(int i)

This is only introduced in 1.5 please use the alternative new Integer(i)

It even has less typing : )

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax +65 6722 2910

From wetrull at yahoo.com Wed Oct 5 18:11:10 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Wed Oct 5 18:18:20 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <20051005221110.36020.qmail@web81407.mail.yahoo.com>

Hello all,

I'm new to the list, but have done as much archive searching, Google searching, and debugging as I can on the problem I describe here.

I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), but keep getting a NullPointerException.
One of my searches turned up using BlastEcho to debug the problem, but that also throws the NullPointerException:

startSearch
SearchProp: program: ncbi-blastp
SearchProp: version: 2.0.11
java.lang.NullPointerException
        at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215)
        at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
        at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:311)
        at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:274)
        at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
        at com.pfizer.search.sequence.BlastEcho.echo(BlastEcho.java:42)
        at com.pfizer.search.sequence.BlastEcho.main(BlastEcho.java:88)
Exception in thread "main"

Stepping through the code in a debugger shows that the while loop added in revision 1.13 of /biojava-live/src/org/biojava/bio/program/sax/BlastSAXParser.java (fixed truncation of database id) reads all the lines without ever matching the "Searching" string. At first I thought it was because I was using a later version of BLAST, but then I tried 2.0.11 and 2.2.3 (supported version) but they also result in a NullPointerException. In the BLAST output for the various versions I never see a "Searching" string anywhere. I've tried all the -m options as well, without success.

Is there a NCBI BLAST option that I need to be using? I'm running on Windows XP (during development) - is the UNIX version output different?

Thanks.

-Eric Trull

From mark.schreiber at novartis.com Wed Oct 5 23:39:59 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct 5 23:39:55 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID:

Hello -

This is very odd.

The JUnit tests currently pass using the files in /tests/files/org/biojava/bio/programs/ssbind These BLAST files all have the string "Searching....". Maybe there is a variation in the windows output?
Can you post at least the header of your output to the list (preferably an entire example output)?

- Mark

From mark.schreiber at novartis.com Thu Oct 6 01:39:58 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 6 01:46:18 2005
Subject: [Biojava-dev] agave, game, game12
Message-ID:

Hello -

Does anyone still require or make use of the following packages:

org.biojava.bio.seq.io.agave
org.biojava.bio.seq.io.game
org.biojava.bio.seq.io.game12

They represent i/o classes for these now redundant formats. If not then I will mark them as deprecated and probably remove them when we make a 1.5 release.

- Mark

From mark.schreiber at novartis.com Thu Oct 6 01:47:23 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 6 01:47:39 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID:

Hello -

No one seemed to object to the idea of officially adopting java 1.5 for the biojava-live branch.

This would mean ...

biojava-live would require java1.5
generics, unboxing and other language features added in 1.5 will start to creep into the codebase.
all 'official' and 'preview' releases after biojava1.4 will require JDK1.5 (Java 5).

If you plan to use new versions of biojava on a machine for which there is or will be no JDK1.5 then you should protest now!
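For readers weighing the change, here is a minimal sketch of what the 1.5 features named above look like next to code that still compiles on JDK 1.4.2. The class and method names are made up for illustration and are not part of BioJava. The 1.4-style method uses new Integer(int), the boxing call recommended earlier on this list; note that later JDKs deprecate that constructor.

```java
import java.util.ArrayList;
import java.util.List;

final class Jdk15FeatureSketch {

    // JDK 1.4.2-compatible: raw List, explicit boxing, explicit cast and unboxing.
    static int sum14(int[] values) {
        List boxed = new ArrayList();
        for (int i = 0; i < values.length; i++) {
            boxed.add(new Integer(values[i])); // works on 1.4; deprecated in later JDKs
        }
        int total = 0;
        for (int i = 0; i < boxed.size(); i++) {
            total += ((Integer) boxed.get(i)).intValue();
        }
        return total;
    }

    // Requires JDK 1.5: generics, enhanced for loop, Integer.valueOf(int), auto-unboxing.
    static int sum15(int[] values) {
        List<Integer> boxed = new ArrayList<Integer>();
        for (int v : values) {
            boxed.add(Integer.valueOf(v)); // Integer.valueOf(int) was added in 1.5
        }
        int total = 0;
        for (Integer v : boxed) {
            total += v; // auto-unboxing of Integer to int
        }
        return total;
    }
}
```

Both methods compute the same sum; only the second one fails to compile with a 1.4 compiler, which is the kind of dependency the proposal would allow into biojava-live.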
- Mark

From mark.schreiber at novartis.com Thu Oct 6 03:36:05 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 6 03:36:07 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID:

Does SPICE rely on biojava-live? If it only requires biojava1.4 then this wouldn't be an issue. However, if you are actively building SPICE with biojava-live (possibly not a good idea) we can keep it as 1.4 for a while.

- Mark

Andreas Prlic
10/06/2005 03:26 PM

To: Mark Schreiber/GP/Novartis@PH
cc: biojava-dev@biojava.org, biojava-l@biojava.org
Subject: Re: [Biojava-dev] Java 1.5 (final chance to object)

Hi!

I use biojava for the SPICE - protein sequence and structure browser.
http://www.efamily.org.uk/software/dasclients/spice/

This application is launched from within a browser using Java Web Start. Since many people still are using java 1.4 on their machines I would not want to force them to upgrade and hence I would prefer biojava to stay with 1.4 still for a while.

Cheers,
Andreas

From ap3 at sanger.ac.uk Thu Oct 6 03:26:24 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu Oct 6 04:15:55 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
In-Reply-To:
References:
Message-ID: <74ebc3fc158aabd8858638a8160106ab@sanger.ac.uk>

Hi!

I use biojava for the SPICE - protein sequence and structure browser.
http://www.efamily.org.uk/software/dasclients/spice/

This application is launched from within a browser using Java Web Start. Since many people still are using java 1.4 on their machines I would not want to force them to upgrade and hence I would prefer biojava to stay with 1.4 still for a while.

Cheers,
Andreas

-----------------------------------------------------------------------

Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

From ady at sanger.ac.uk Thu Oct 6 03:39:10 2005
From: ady at sanger.ac.uk (Andy Yates)
Date: Thu Oct 6 04:32:03 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To:
References:
Message-ID: <4344D49E.5070409@sanger.ac.uk>

Well okay I'll fly the flag for platforms like Alpha where a 1.5 compatible JVM/compiler does not exist nor ever will. I know from the BOF at BOSC there were quite a few people who were reporting a similar situation.

Now if no one else objects to the JDK1.5 move then I'm not going to fly the flag for 1.4 since I don't dev any more on Alpha plus I like more reasons to force people who I work with to upgrade :)

Andy Y

From mbreese at gmail.com Thu Oct 6 03:47:10 2005
From: mbreese at gmail.com (Marcus Breese)
Date: Thu Oct 6 05:24:59 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To:
References:
Message-ID:

You may want to think a bit more about converting completely over to 1.5... There are still a number of platforms that don't have a compatible 1.5 JDK. Mac OS X still comes with 1.4.2 standard (1.5 is available, but not standard). Also, the last time I checked there wasn't an IBM PPC 1.5 JVM, which means that a number of HPC platforms / clusters will not be supported.

My view on it is that 1.5 is good for apps, but still too new for a critical library.

From td2 at sanger.ac.uk Thu Oct 6 06:14:35 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu Oct 6 06:45:16 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To:
References:
Message-ID:

On 6 Oct 2005, at 08:47, Marcus Breese wrote:

> You may want to think a bit more about converting completely over to 1.5...
> There are still a number of platforms that don't have a compatible 1.5 JDK.
> Mac OS X still comes with 1.4.2 standard (1.5 is available, but not
> standard). Also, the last time I checked there wasn't an IBM PPC 1.5 JVM,
> which means that a number of HPC platforms / clusters will not be supported.

IBM do have something out now:

http://www-128.ibm.com/developerworks/java/jdk/java5beta/

Beta software on a time-limited licence, so probably not what people really want to run -- but it does suggest there should be a release version in the not-too-distant future.

Perhaps we should wait until the end of the year then look at how the transition is coming along. I know there's a new release of Mac OS 10.4 coming in the next few weeks, and it sounds like that will include a big pile of bug-fixes (I know the dreaded Eclipse-running-progressively-slower bug has been looked at). That might well encourage more Mac users (who seem to be the biggest group stuck on Java 1.4) to upgrade.

Thomas.

From wetrull at yahoo.com Thu Oct 6 11:10:01 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Thu Oct 6 11:37:30 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <20051006151002.58972.qmail@web81404.mail.yahoo.com>

Hello all,

I'm new to the list so I have not been following the full discussion and don't know all the issues - excuse my ignorance.

I don't have any objections to moving to Java 1.5, except I know that some J2EE application servers (both web tier and business tier (i.e. EJBs)) are slow adopters of the new versions of Java.

That being said, would BioJava 1.4 be maintained (bug fixes) for those stuck on 1.4.2? I know in my case I'm going to be using BioJava in several web services deploying to pre-existing application servers. The owners of the application servers will balk at the cost of upgrading to Java 1.5 (both the JVM and the application server).

Thanks.

-Eric Trull

From mbreese at gmail.com Thu Oct 6 12:27:52 2005
From: mbreese at gmail.com (Marcus Breese)
Date: Thu Oct 6 12:55:47 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To:
References:
Message-ID:

The problem for me is really the HPC environment. I know our cluster admins would be very hesitant to install a beta JVM on our brand new IBM cluster. We also have a (very small) Mac cluster that will be stuck on 10.3 for quite a while as we don't have the cash to upgrade the entire thing. So, our stuff will be stuck at 1.4.2 for a while...

Then again, we aren't actively developing biojava things on those platforms, just the smaller single linux boxes with 1.5.

From mark.schreiber at novartis.com Thu Oct 6 21:53:12 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 6 21:52:41 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID:

OK, there seem to have been a few reasonable objections. The consensus seems to be we will wait until the end of the year. I think from then on we will change over for biojava-live.

There was a suggestion of maintaining two branches, one would be a maintenance of biojava1.4 and only use JDK1.4.2, the other would be biojava-live and use JDK1.5. I have done this in the past and have no desire to do it again, it's really not that much fun.

I would however like to reserve the option of putting JDK1.5 dependent code into the classes of the org.biojavax package. If this happens I will adjust the ANT build script such that these are not compiled if JDK1.5 is not detected. This should be safe as the biojava packages have no dependencies on the biojavax packages. Bug fixes to biojava would still be in the system.
Additionally the org.biojavax packages are undergoing a lot of development right now so you shouldn't be doing any production programming with them anyway.

- Mark

From sicotteh at mail.nih.gov Fri Oct 7 09:59:58 2005
From: sicotteh at mail.nih.gov (Sicotte, Hugues (NIH/NCI))
Date: Fri Oct 7 10:21:07 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD3@nihexchange9.nih.gov>

Last call.. OK

BioJava is one of the most turbulent bioinformatics library projects. Much more turbulent than BioPerl or Biopython. Often code written a few years ago does not work because older APIs get deprecated or changed. This is especially tough for people who use BioJava on and off. We think two sets of APIs are compatible, but there are lots of deprecated features. We find online examples, only to find out they no longer work.

Stability and backward compatibility are crucial to the success of BioJava. In fact, if the code is not too complex, I often have to write it myself rather than relying on BioJava. I want to write code that will still be in production 5 years from now.

I say that, not because I don't appreciate the hard work of developers, but because I would like the volunteer developers to appreciate that the users of the toolkit need stability.

We live in production environments that will not support 1.5 for a long time. I am still living in a 1.3 world. Only projects that I am starting right now can use 1.4.2_08.

I beg you, please reconsider moving to 1.5, it's only 5% more typing to use 1.4.
Hugues Sicotte
(a user who still doesn't get to use the RegExp package in 1.4)

From wetrull at yahoo.com Fri Oct 7 12:04:34 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Fri Oct 7 12:14:59 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To:
Message-ID: <20051007160434.6853.qmail@web81401.mail.yahoo.com>

Should I raise this as an issue with NCBI?
Seems like it makes writing parsing routines more difficult.

Thanks.

-Eric Trull

--- mark.schreiber@novartis.com wrote:

> Looks like there might be a difference in the Windows output. I will try
> to take a look at this over the next few days. Probably need to change the
> BlastSAXParser to look for something other than Searching so that this
> will get parsed as well.
>
> - Mark
>
> "W. Eric Trull"
> 10/06/2005 11:01 PM
>
> To: biojava-dev@biojava.org
> cc: Mark Schreiber/GP/Novartis@PH
> Subject: Re: [Biojava-dev] NullPointerException from BlastSAXParser.java
>
> Hello Mark,
>
> Here is what I've done, using NCBI Blast 2.0.11, Windows XP, JDK 1.4.2
>
> 1. Downloaded the PDB's pdb_seqres.txt
> 2. Created a blast database (after changing the deflines):
>    C:\blast-2.0.11\formatdb.exe
>      -t "PDB"
>      -i blast\pdb_seqres.txt
>      -l blast\pdb_formatdb.log
>      -o T
>      -n blast\pdb
> 3. BLASTed 26SPS9_Hs:
>    C:\blast-2.0.11\blastall.exe
>      -p blastp
>      -d blast\pdb
>      -i 26SPS9_Hs.fasta
>      -o 26SPS9_Hs.blast
> 4. Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in Anger
>    and BlastEcho, both of which give me the NullPointerException. The beginning
>    of 26SPS9_Hs.blast file is shown below, the entire file is attached.
>
> Please let me know if you see anything obviously wrong with the way I'm doing
> the BLAST. I'm going to cvs checkout the BioJava source code and have a look
> at the JUnit test later today.
>
> Thanks!
>
> -Eric Trull
>
> -------- 26SPS9_Hs.blast --------
> BLASTP 2.0.11 [Jan-20-2000]
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs", Nucleic Acids Res. 25:3389-3402.
>
> Query= 26SPS9_Hs
>          (176 letters)
>
> Database: PDB
>            78,094 sequences; 17,596,117 total letters
>
>                                                       Score    E
> Sequences producing significant alignments:           (bits) Value
>
> pdb|1UFM|A Cop9 Complex Subunit 4                        39  0.003
> .
> .
> .
> -------- 26SPS9_Hs.blast --------

From sicotteh at mail.nih.gov Fri Oct 7 13:27:01 2005
From: sicotteh at mail.nih.gov (Sicotte, Hugues (NIH/NCI))
Date: Fri Oct 7 13:27:12 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD5@nihexchange9.nih.gov>

I've been through this before when I was working for NCBI.

The answer was that the text output of BLAST was never a supported format. The only supported format is the XML Blast Output.
http://ccgb.umn.edu/~crow/projects/xmlblast/example.html

Also, in the case of parsing multiple blast files, breaking on "Searching..." is not a good idea because if the parameters are wrong or the query sequence is of too low complexity, this String is not emitted by the program.

Hugues Sicotte

From wetrull at yahoo.com Fri Oct 7 14:45:46 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Fri Oct 7 14:45:11 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD5@nihexchange9.nih.gov>
Message-ID: <20051007184546.17689.qmail@web81404.mail.yahoo.com>

I'll switch to the XML output and parser...seems more sane anyway.

Thanks!

-Eric Trull

--- "Sicotte, Hugues (NIH/NCI)" wrote:

> I've been through this before when I was working for
> NCBI.
>
> The answer was that the text output of BLAST was never a supported format.
> The only supported format is the XML Blast Output.
> http://ccgb.umn.edu/~crow/projects/xmlblast/example.html > > also > > In the case of parsing multiple blast files, > breaking on "Searching..." is not a good idea > because if the parameters are wrong or the query sequence > too low complexity, this String is not emitted by the program. > > > Hugues Sicotte > > > -----Original Message----- > From: W. Eric Trull [mailto:wetrull@yahoo.com] > Sent: Friday, October 07, 2005 12:05 PM > To: biojava-dev@biojava.org > Cc: mark.schreiber@novartis.com > Subject: Re: [Biojava-dev] NullPointerException from BlastSAXParser.java > > > Should I raise this as an issue with NCBI? Seems like it makes writting > parsing routines more difficult. > > Thanks. > > -Eric Trull > > --- mark.schreiber@novartis.com wrote: > > > Looks like there might be a difference in the Windows output. I will try > > to take a look at this over the next few days. Probably need to change > the > > > BlastSAXParser to look for something other than Searching so that this > > will get parsed as well. > > > > - Mark > > > > > > > > > > > > "W. Eric Trull" > > 10/06/2005 11:01 PM > > > > > > To: biojava-dev@biojava.org > > cc: Mark Schreiber/GP/Novartis@PH > > Subject: Re: [Biojava-dev] NullPointerException from > > BlastSAXParser.java > > > > > > Hello Mark, > > > > Here is what I've done, using NCBI Blast 2.0.11, Windows XP, JDK 1.4.2 > > > > 1. Downloaded the PDB's pdb_seqres.txt > > 2. Created a blast database (after changing the deflines): > > C:\blast-2.0.11\formatdb.exe > > -t "PDB" > > -i blast\pdb_seqres.txt > > -l blast\pdb_formatdb.log > > -o T > > -n blast\pdb > > 3. BLASTed 26SPS9_Hs: > > C:\blast-2.0.11\blastall.exe > > -p blastp > > -d blast\pdb > > -i 26SPS9_Hs.fasta > > -o 26SPS9_Hs.blast > > 4. Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in > > Anger > > and BlastEcho, both of which give me the NullPointerException. The > > beginning > > of 26SPS9_Hs.blast file is shown below, the entire file is attached. 
> > > > Please let me know if you see anything obviously wrong with the way I'm > > doing > > the BLAST. I'm going to cvs checkout the BioJava source code and have a > > look > > at the JUnit test later today. > > > > Thanks! > > > > -Eric Trull > > > > -------- 26SPS9_Hs.blast -------- > > BLASTP 2.0.11 [Jan-20-2000] > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), > > "Gapped BLAST and PSI-BLAST: a new generation of protein database search > > programs", Nucleic Acids Res. 25:3389-3402. > > > > Query= 26SPS9_Hs > > (176 letters) > > > > Database: PDB > > 78,094 sequences; 17,596,117 total letters > > > > > > > > Score > > > E > > Sequences producing significant alignments: (bits) > > > Value > > > > pdb|1UFM|A Cop9 Complex Subunit 4 39 > > > 0.003 > > . > > . > > . > > -------- 26SPS9_Hs.blast -------- > > > > > > --- mark.schreiber@novartis.com wrote: > > > > > Hello - > > > > > > This is very odd. > > > > > > The JUnit tests currently pass using the files in > > > /tests/files/org/biojava/bio/programs/ssbind These BLAST files all > have > > > > > > the string "Searching....". Maybe there is a variation in the windows > > > output? > > > > > > Can you post at least the header of your output to the list (preferably > > > an > > > entire example output)? > > > > > > - Mark > > > > > > > > > > > > > > > > > > "W. Eric Trull" > > > Sent by: biojava-dev-bounces@portal.open-bio.org > > > 10/06/2005 06:11 AM > > > > > > > > > To: biojava-dev@biojava.org > > > cc: (bcc: Mark Schreiber/GP/Novartis) > > > Subject: [Biojava-dev] NullPointerException from > > > BlastSAXParser.java > > > > > > > > > Hello all, > > > > > > I'm new to the list, but have done as much archive searching, Google > > > searching, and debugging as I can on the problem I describe here. 
> > > > > > I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), > > but > > > keep getting a NullPointerException. One of my searches turned up > using > > > BlastEcho to debug the problem, but that also throws the > > > NullPointerException: > > > > > > startSearch > > > SearchProp: program: ncbi-blastp > > > SearchProp: version: 2.0.11 > > > java.lang.NullPointerException > > > at > > > > > > org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215 > ) > > > at > > > > > org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXPars > er.java:311) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser. > java:274) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java > :160) > === message truncated === From mark.schreiber at novartis.com Sun Oct 9 21:34:08 2005 From: mark.schreiber at novartis.com (mark.schreiber@novartis.com) Date: Sun Oct 9 21:33:35 2005 Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java Message-ID: I would like to reiterate this. The BLAST text output has never been consistent between versions. Seems NCBI has problems with backwards compatability too : ) This has made it difficult to maintain the parsers. The XML version is much safer. Although for a while it didn't follow it's own DTD it seems that now it does and has done for a while. - Mark "W. Eric Trull" Sent by: biojava-dev-bounces@portal.open-bio.org 10/08/2005 02:45 AM To: biojava-dev@biojava.org cc: Mark Schreiber/GP/Novartis@PH Subject: RE: [Biojava-dev] NullPointerException from BlastSAXParser.java I'll switch to the XML output and parser...seems more sane anyway. Thanks! -Eric Trull --- "Sicotte, Hugues (NIH/NCI)" wrote: > I've been through this before when I was working for > NCBI. 
> > The answer was that the text output of BLAST was never a supported format. > The only supported format is the XML Blast Output. > http://ccgb.umn.edu/~crow/projects/xmlblast/example.html > > also > > In the case of parsing multiple blast files, > breaking on "Searching..." is not a good idea > because if the parameters are wrong or the query sequence > too low complexity, this String is not emitted by the program. > > > Hugues Sicotte > > > -----Original Message----- > From: W. Eric Trull [mailto:wetrull@yahoo.com] > Sent: Friday, October 07, 2005 12:05 PM > To: biojava-dev@biojava.org > Cc: mark.schreiber@novartis.com > Subject: Re: [Biojava-dev] NullPointerException from BlastSAXParser.java > > > Should I raise this as an issue with NCBI? Seems like it makes writting > parsing routines more difficult. > > Thanks. > > -Eric Trull > > --- mark.schreiber@novartis.com wrote: > > > Looks like there might be a difference in the Windows output. I will try > > to take a look at this over the next few days. Probably need to change > the > > > BlastSAXParser to look for something other than Searching so that this > > will get parsed as well. > > > > - Mark > > > > > > > > > > > > "W. Eric Trull" > > 10/06/2005 11:01 PM > > > > > > To: biojava-dev@biojava.org > > cc: Mark Schreiber/GP/Novartis@PH > > Subject: Re: [Biojava-dev] NullPointerException from > > BlastSAXParser.java > > > > > > Hello Mark, > > > > Here is what I've done, using NCBI Blast 2.0.11, Windows XP, JDK 1.4.2 > > > > 1. Downloaded the PDB's pdb_seqres.txt > > 2. Created a blast database (after changing the deflines): > > C:\blast-2.0.11\formatdb.exe > > -t "PDB" > > -i blast\pdb_seqres.txt > > -l blast\pdb_formatdb.log > > -o T > > -n blast\pdb > > 3. BLASTed 26SPS9_Hs: > > C:\blast-2.0.11\blastall.exe > > -p blastp > > -d blast\pdb > > -i 26SPS9_Hs.fasta > > -o 26SPS9_Hs.blast > > 4. 
Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in > > Anger > > and BlastEcho, both of which give me the NullPointerException. The > > beginning > > of 26SPS9_Hs.blast file is shown below, the entire file is attached. > > > > Please let me know if you see anything obviously wrong with the way I'm > > doing > > the BLAST. I'm going to cvs checkout the BioJava source code and have a > > look > > at the JUnit test later today. > > > > Thanks! > > > > -Eric Trull > > > > -------- 26SPS9_Hs.blast -------- > > BLASTP 2.0.11 [Jan-20-2000] > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), > > "Gapped BLAST and PSI-BLAST: a new generation of protein database search > > programs", Nucleic Acids Res. 25:3389-3402. > > > > Query= 26SPS9_Hs > > (176 letters) > > > > Database: PDB > > 78,094 sequences; 17,596,117 total letters > > > > > > > > Score > > > E > > Sequences producing significant alignments: (bits) > > > Value > > > > pdb|1UFM|A Cop9 Complex Subunit 4 39 > > > 0.003 > > . > > . > > . > > -------- 26SPS9_Hs.blast -------- > > > > > > --- mark.schreiber@novartis.com wrote: > > > > > Hello - > > > > > > This is very odd. > > > > > > The JUnit tests currently pass using the files in > > > /tests/files/org/biojava/bio/programs/ssbind These BLAST files all > have > > > > > > the string "Searching....". Maybe there is a variation in the windows > > > output? > > > > > > Can you post at least the header of your output to the list (preferably > > > an > > > entire example output)? > > > > > > - Mark > > > > > > > > > > > > > > > > > > "W. 
Eric Trull" > > > Sent by: biojava-dev-bounces@portal.open-bio.org > > > 10/06/2005 06:11 AM > > > > > > > > > To: biojava-dev@biojava.org > > > cc: (bcc: Mark Schreiber/GP/Novartis) > > > Subject: [Biojava-dev] NullPointerException from > > > BlastSAXParser.java > > > > > > > > > Hello all, > > > > > > I'm new to the list, but have done as much archive searching, Google > > > searching, and debugging as I can on the problem I describe here. > > > > > > I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), > > but > > > keep getting a NullPointerException. One of my searches turned up > using > > > BlastEcho to debug the problem, but that also throws the > > > NullPointerException: > > > > > > startSearch > > > SearchProp: program: ncbi-blastp > > > SearchProp: version: 2.0.11 > > > java.lang.NullPointerException > > > at > > > > > > org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215 > ) > > > at > > > > > org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXPars > er.java:311) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser. > java:274) > > > at > > > > > > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java > :160) > === message truncated === _______________________________________________ biojava-dev mailing list biojava-dev@biojava.org http://biojava.org/mailman/listinfo/biojava-dev From mark.schreiber at novartis.com Sun Oct 9 21:58:34 2005 From: mark.schreiber at novartis.com (mark.schreiber@novartis.com) Date: Sun Oct 9 21:57:58 2005 Subject: [Biojava-dev] Java 1.5 (final chance to object) Message-ID: >Stability and backward compatibility are crucial to the success of BioJava. >In fact, if the code is not too complex, I often have to write it myself >rather than >relying on BioJava. 
I want to write code that will still be in production
>5 years from now.

This stems from the fact that for several years package stability wasn't
really a design goal. This was good, because often our first attempts were
not that good. Having said that, stability is definitely a goal now. I
believe there has been a great reduction in major API changes between recent
versions. Indeed, many core interfaces have not changed in a long time. But
we do need to remain vigilant. Basic rule: do not break the core
org.biojava.* APIs! There are some reservations to that.

1) Deprecation will happen. There is nothing wrong with deprecating a bad or
unsupported API, as long as it is flagged well before releases come out.
Preferably there should be one or two releases where a method or class is
deprecated before it is (if ever) finally removed.

2) New packages in CVS should never be considered stable. If an API has not
been part of a release version, I don't think we need to guarantee stability.

When org.biojavax is released, no org.biojava.* API will be removed or
changed. Some will be deprecated, as the org.biojavax APIs may give you a
better alternative. They are not really separate APIs, more extensions that
give more flexibility and sometimes do things better (as Swing is to AWT).
The part where people may see problems is interactions with BioSQL.
Previously BioJava worked with BioSQL, but not in the way it should according
to the BioSQL specs. The new version will bring it into line with BioSQL. The
old APIs will remain should people need to access legacy data in BioSQL DBs
created with the old API.

>I say that, not because I don't appreciate the hard work of developers,
>but because I would like the volunteer developers to appreciate that
>the users of the toolkit need stability. We live in production environments
>that will not support 1.5 for a long time. I am still living in a 1.3 world.
>Only projects that I am starting right now can use 1.4.2_08.
The major releases from the past support the following versions: biojava1.3
and biojava1.3.1 work with JDK 1.3.x, and biojava1.4 works with JDK 1.4.2.
You should never prepare production code from versions of BioJava in CVS.

The best way to make production code is to bundle all your application
dependencies into the application JAR. This way everything is in one place
and no external changes affect it (and keep the required version of the JDK
available if you upgrade to a new one). It's not elegant, but it is
bulletproof. There are other approaches that work too, but they need more
management.

>I beg you, please reconsider moving to 1.5, it's only 5% more typing to use
>1.4.

Sure, people just need to stop dropping 1.5-dependent code into CVS.

Please note though, biojava1.4 will never require Java 1.5; it would only
affect future versions. Richard is planning a preview of biojavax in
December. We may revisit the issue then.

PS: what JDK is caBIO running on now?

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax +65 6722 2910

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev

From mark.schreiber at novartis.com Mon Oct 10 01:27:50 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Mon Oct 10 01:27:32 2005
Subject: [Biojava-dev] TaxonSQL not 1.4 compliant
Message-ID:

Hello -

Someone has placed a modification of TaxonSQL in CVS that contains methods
from JDK 1.5. You know who you are : )

Could you please use JDK 1.4 "equivalents"?

Specifically, there are a number of Integer.valueOf(int i) calls which I have
fixed in CVS.
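For anyone unsure what a JDK 1.4 "equivalent" looks like in practice, here is
a minimal sketch. The class Jdk14Boxing and its methods are hypothetical
names made up for illustration (they are not BioJava code), and the numbers
in main are just sample values. It deliberately avoids generics and
autoboxing so that it should compile on 1.4 as well as on later JDKs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of BioJava: boxes ints using only JDK 1.4 API. */
public class Jdk14Boxing {

    /** JDK 1.4-safe stand-in for Integer.valueOf(int), which is marked "Since: 1.5". */
    public static Integer box(int i) {
        // The Integer(int) constructor has been available since JDK 1.0.
        return new Integer(i);
    }

    /** Boxes every element into a raw (pre-generics) List, as 1.4 code has to. */
    public static List boxAll(int[] values) {
        List boxed = new ArrayList();
        for (int n = 0; n < values.length; n++) {
            boxed.add(box(values[n]));
        }
        return boxed;
    }

    public static void main(String[] args) {
        // Sample values only; any ints would do.
        List ids = boxAll(new int[] {9606, 10090});
        System.out.println(ids);
    }
}
```

Newer compilers will warn about the raw List and the Integer constructor;
that is the price of staying 1.4-compatible for now.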
There are also a couple of uses of the String methods matches(String regex)
and replace(CharSequence, CharSequence) which I have not tried to fix. Note
that matches(String regex) is available from Java 1.4; it is the CharSequence
form of replace that was only introduced in Java 1.5.

Can people please try to avoid using methods that only compile with JDK 1.5
for the time being? They are clearly documented in the javadocs with
"Since: 1.5".

Thanks,

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax +65 6722 2910

From kheeteck at yahoo.com Thu Oct 6 08:50:32 2005
From: kheeteck at yahoo.com (kheeteck)
Date: Tue Oct 11 11:05:35 2005
Subject: [Biojava-dev] retrieve property "ORIGIN"
Message-ID: <20051006125032.87281.qmail@web32403.mail.mud.yahoo.com>

Hi,

Does anybody know how to retrieve the annotation property "ORIGIN"?

My code looks like this:

    // needs java.util.Iterator, org.biojava.bio.Annotation,
    // org.biojava.bio.BioException
    while (sequences.hasNext()) {
        try {
            seq = sequences.nextSequence();
            // annotation bundle attached to the sequence
            Annotation anno = seq.getAnnotation();
            // print each key/value pair
            for (Iterator i = anno.keys().iterator(); i.hasNext(); ) {
                Object key = i.next();
                System.out.println(key + " : " + anno.getProperty(key));
            } // for
        } catch (BioException e) {
            // nextSequence() throws BioException when a record cannot be read
            e.printStackTrace();
        }
    } // while

It prints out the value for each annotation key, but for ORIGIN there is
nothing.

Regards
KheeTeck

From wetrull at yahoo.com Thu Oct 6 11:01:01 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Tue Oct 11 11:05:38 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To:
Message-ID: <20051006150102.155.qmail@web81403.mail.yahoo.com>

Hello Mark,

Here is what I've done, using NCBI BLAST 2.0.11, Windows XP, JDK 1.4.2:

1. Downloaded the PDB's pdb_seqres.txt
2. Created a BLAST database (after changing the deflines):
   C:\blast-2.0.11\formatdb.exe
     -t "PDB"
     -i blast\pdb_seqres.txt
     -l blast\pdb_formatdb.log
     -o T
     -n blast\pdb
3.
BLASTed 26SPS9_Hs: C:\blast-2.0.11\blastall.exe -p blastp -d blast\pdb -i 26SPS9_Hs.fasta -o 26SPS9_Hs.blast 4. Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in Anger and BlastEcho, both of which give me the NullPointerException. The beginning of 26SPS9_Hs.blast file is shown below, the entire file is attached. Please let me know if you see anything obviously wrong with the way I'm doing the BLAST. I'm going to cvs checkout the BioJava source code and have a look at the JUnit test later today. Thanks! -Eric Trull -------- 26SPS9_Hs.blast -------- BLASTP 2.0.11 [Jan-20-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= 26SPS9_Hs (176 letters) Database: PDB 78,094 sequences; 17,596,117 total letters Score E Sequences producing significant alignments: (bits) Value pdb|1UFM|A Cop9 Complex Subunit 4 39 0.003 . . . -------- 26SPS9_Hs.blast -------- --- mark.schreiber@novartis.com wrote: > Hello - > > This is very odd. > > The JUnit tests currently pass using the files in > /tests/files/org/biojava/bio/programs/ssbind These BLAST files all have > the string "Searching....". Maybe there is a variation in the windows > output? > > Can you post at least the header of your output to the list (preferably an > entire example output)? > > - Mark > > > > > > "W. Eric Trull" > Sent by: biojava-dev-bounces@portal.open-bio.org > 10/06/2005 06:11 AM > > > To: biojava-dev@biojava.org > cc: (bcc: Mark Schreiber/GP/Novartis) > Subject: [Biojava-dev] NullPointerException from > BlastSAXParser.java > > > Hello all, > > I'm new to the list, but have done as much archive searching, Google > searching, and debugging as I can on the problem I describe here. 
> > I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), but > keep getting a NullPointerException. One of my searches turned up using > BlastEcho to debug the problem, but that also throws the > NullPointerException: > > startSearch > SearchProp: program: ncbi-blastp > SearchProp: version: 2.0.11 > java.lang.NullPointerException > at > org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215) > at > org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164) > at > org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:311) > at > org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:274) > at > org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160) > at > com.pfizer.search.sequence.BlastEcho.echo(BlastEcho.java:42) > at > com.pfizer.search.sequence.BlastEcho.main(BlastEcho.java:88) > Exception in thread "main" > > Stepping through the code in a debugger shows that the while loop added in > revision 1.13 of > /biojava-live/src/org/biojava/bio/program/sax/BlastSAXParser.java (fixed > truncation of database id) reads all the lines without ever matching the > "Searching" string. At first I thought it was because I was using a later > version of BLAST, but then I tried 2.0.11 and 2.2.3 (supported version) > but > they also result in a NullPointerException. In the BLAST output for the > various versions I never see a "Searching" string anywhere. I've tried > all > the -m options as well, without success. > > Is there a NCBI BLAST option that I need to be using? I'm running on > Windows > XP (during development) - is the UNIX version output different? > > Thanks. 
> > -Eric Trull > > > _______________________________________________ > biojava-dev mailing list > biojava-dev@biojava.org > http://biojava.org/mailman/listinfo/biojava-dev > > > > -------------- next part -------------- BLASTP 2.0.11 [Jan-20-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= 26SPS9_Hs (176 letters) Database: PDB 78,094 sequences; 17,596,117 total letters Score E Sequences producing significant alignments: (bits) Value pdb|1UFM|A Cop9 Complex Subunit 4 39 0.003 pdb|1YM7|D Beta-Adrenergic Receptor Kinase 1 29 3.3 pdb|1YM7|C Beta-Adrenergic Receptor Kinase 1 29 3.3 pdb|1YM7|B Beta-Adrenergic Receptor Kinase 1 29 3.3 pdb|1YM7|A Beta-Adrenergic Receptor Kinase 1 29 3.3 pdb|1OMW|A G-Protein Coupled Receptor Kinase 2 29 3.3 >pdb|1UFM|A Cop9 Complex Subunit 4 Length = 84 Score = 39.1 bits (89), Expect = 0.003 Identities = 15/56 (26%), Positives = 35/56 (61%) Query: 114 LLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVL 169 ++E NL+ + ++ + E + +L+++ A E+ SQMI + + +G +DQ +G++ Sbjct: 16 VIEHNLLSASKLYNNITFEELGALLEIPAAKAEKIASQMITEGRMNGFIDQIDGIV 71 >pdb|1YM7|D Beta-Adrenergic Receptor Kinase 1 Length = 689 Score = 29.0 bits (63), Expect = 3.3 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%) Query: 73 CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131 C+ + + L +F + + Y + E ++ ++ + +++D + + L+ PFS+ I Sbjct: 72 CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129 Query: 132 EHISSLIKLSKADVERKLSQMILDK 156 EH+ L K V L Q +++ Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152 >pdb|1YM7|C Beta-Adrenergic Receptor Kinase 1 Length = 689 Score = 29.0 bits (63), Expect = 3.3 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%) Query: 73 CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131 
C+ + + L +F + + Y + E ++ ++ + +++D + + L+ PFS+ I Sbjct: 72 CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129 Query: 132 EHISSLIKLSKADVERKLSQMILDK 156 EH+ L K V L Q +++ Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152 >pdb|1YM7|B Beta-Adrenergic Receptor Kinase 1 Length = 689 Score = 29.0 bits (63), Expect = 3.3 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%) Query: 73 CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131 C+ + + L +F + + Y + E ++ ++ + +++D + + L+ PFS+ I Sbjct: 72 CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129 Query: 132 EHISSLIKLSKADVERKLSQMILDK 156 EH+ L K V L Q +++ Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152 >pdb|1YM7|A Beta-Adrenergic Receptor Kinase 1 Length = 689 Score = 29.0 bits (63), Expect = 3.3 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%) Query: 73 CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131 C+ + + L +F + + Y + E ++ ++ + +++D + + L+ PFS+ I Sbjct: 72 CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129 Query: 132 EHISSLIKLSKADVERKLSQMILDK 156 EH+ L K V L Q +++ Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152 >pdb|1OMW|A G-Protein Coupled Receptor Kinase 2 Length = 689 Score = 29.0 bits (63), Expect = 3.3 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%) Query: 73 CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131 C+ + + L +F + + Y + E ++ ++ + +++D + + L+ PFS+ I Sbjct: 72 CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129 Query: 132 EHISSLIKLSKADVERKLSQMILDK 156 EH+ L K V L Q +++ Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152 Database: PDB Posted date: Oct 6, 2005 7:42 AM Number of letters in database: 17,596,117 Number of sequences in database: 78,094 Lambda K H 0.319 0.136 0.379 Gapped Lambda K H 0.270 0.0470 0.230 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Hits to DB: 5635599 Number of Sequences: 78094 Number of extensions: 193971 Number of successful extensions: 758 Number 
of sequences better than 10.0: 6 Number of HSP's better than 10.0 without gapping: 1 Number of HSP's successfully gapped in prelim test: 5 Number of HSP's that attempted gapping in prelim test: 757 Number of HSP's gapped (non-prelim): 6 length of query: 176 length of database: 17,596,117 effective HSP length: 50 effective length of query: 126 effective length of database: 13,691,417 effective search space: 1725118542 effective search space used: 1725118542 T: 11 A: 40 X1: 16 ( 7.4 bits) X2: 38 (14.8 bits) X3: 64 (24.9 bits) S1: 41 (21.7 bits) S2: 59 (27.4 bits) From mark.schreiber at novartis.com Thu Oct 6 21:52:05 2005 From: mark.schreiber at novartis.com (mark.schreiber@novartis.com) Date: Tue Oct 11 11:05:39 2005 Subject: [Biojava-dev] Java 1.5 (final chance to object) Message-ID: An HTML attachment was scrubbed... URL: http://portal.open-bio.org/pipermail/biojava-dev/attachments/20051007/4b427dc3/attachment-0001.htm From hzhang at ceres-inc.com Fri Oct 7 18:02:03 2005 From: hzhang at ceres-inc.com (Hongyu Zhang) Date: Tue Oct 11 11:05:40 2005 Subject: [Biojava-dev] a bug in WU-BLAST parser Message-ID: I am reporting an error in the WU-BLAST parser to the Biojava developer's list. The Biojava version is the latest 1.4, and the bug is in file src/org/biojava/bio/program/sax/WuBlastSummaryLineHelper.java. The error happened when the description field of the BLAST hit summary lines is empty. For example: ADL26502 1430 8.5e-145 1 Cause of the error: A parameter, iGrab, in the code, affects the behavior of the parser dependent on the WU-BLAST version. When the version is BLASTX, TBLASTX or TBLASTN, igrab is set to 4, and the code at the line 120 will try to read the non-existed "Frame" field from the WU-BLAST summary lines. Usually, the description field in the summary line is not empty, so this line of code will grab the last word from the description field as the "Frame" field mistakenly. 
This mistake usually won't matter to the following codes and therefore is hidden in most situations. In the example above, however, since the description field of the hit is empty, the code will mistakenly shift and read the next "High Score" field (1430 in this case) as the "Frame" field and cause the StringTokenizer to throw an error. Hongyu Zhang, Ph.D. Computational Biologist Ceres Inc. 1535 Rancho Conejo Blvd Thousand oaks, CA 91320 Phone: (805)376-6504 ext. 1204 Fax: (805)376-6537 ********************************************************************** This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. Ceres, Inc. declines any liability for any viruses or other potentially harmful code which may be transmitted by or accompanying this email or any attachment. ********************************************************************** From zhaozhg at keylab.net Sat Oct 8 03:34:18 2005 From: zhaozhg at keylab.net (=?gb2312?B?1dTWvrjV?=) Date: Tue Oct 11 11:05:41 2005 Subject: [Biojava-dev] Which package the class NestedError is in? Message-ID: <20051008073456.85FC44047@mail.keylab.net> Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: fox.gif Type: image/gif Size: 9519 bytes Desc: not available Url : http://portal.open-bio.org/pipermail/biojava-dev/attachments/20051008/6d27add4/fox-0001.gif From zhaozhg at keylab.net Sun Oct 9 23:08:26 2005 From: zhaozhg at keylab.net (=?gb2312?B?1dTWvrjV?=) Date: Tue Oct 11 11:05:42 2005 Subject: [Biojava-dev] who have finished the example "Changeability examples" ? 
Message-ID: <20051010030901.A215B40B3@mail.keylab.net>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fox.gif
Type: image/gif
Size: 9519 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biojava-dev/attachments/20051010/081af54a/fox.gif

From mark.schreiber at novartis.com Tue Oct 11 21:00:13 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 11 20:59:34 2005
Subject: [Biojava-dev] a bug in WU-BLAST parser
Message-ID:

Hi -

I will take a look at this. However, if you know of a solution, could you post it to me and I will commit it to CVS.

Thanks.

- Mark

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev

From mark.schreiber at novartis.com Tue Oct 11 21:02:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 11 21:02:03 2005
Subject: [Biojava-dev] who have finished the example "Changeability examples" ?
Message-ID:

Hello -

Some of the old examples are getting out of date. Fixing them is on the ever-increasing list of things to do. More up-to-date examples can be found at http://www.biojava.org/docs/bj_in_anger/index.htm

Thanks for pointing out the bug.

- Mark

"赵志刚"
Sent by: biojava-dev-bounces@portal.open-bio.org
10/10/2005 11:08 AM
To: "biojava developers"
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: [Biojava-dev] who have finished the example "Changeability examples" ?
Has anyone finished the example "Changeability examples" in the Biojava tutorial? The URL is http://www.biojava.org/tutorials/events2.html. I can't finish it with JDK 1.4.2 and Biojava 1.4. I found some errors in the source Roulet.java (http://www.biojava.org/tutorials/Roulet.java). Besides, the demo (http://www.biojava.org/tutorials/Roulet.html) doesn't work.

Thanks!

[ Attachment ''FOX.GIF'' removed by Mark Schreiber ]

From sylvain.foisy at bioneq.qc.ca Tue Oct 18 12:58:45 2005
From: sylvain.foisy at bioneq.qc.ca (Sylvain Foisy)
Date: Tue Oct 18 14:05:09 2005
Subject: [Biojava-dev] JDK1.5 dependencies woes
Message-ID:

Hi to all,

Still some JDK1.5 stuff creeping into CVS, this time in UniProtXMLFormat:

compile-biojava:
    [javac] Compiling 40 source files to /Volumes/BIONEQ-The_Brain/Java-Librairies/biojava-live/ant-build/classes/biojava
    [javac] /Volumes/BIONEQ-The_Brain/Java-Librairies/biojava-live/src/org/biojavax/bio/seq/io/UniProtXMLFormat.java:389: cannot resolve symbol
    [javac] symbol  : method contains (java.lang.String)
    [javac] location: class java.lang.String
    [javac]         if (line.contains("<"+COPYRIGHT_TAG)) XMLTools.readXMLChunk(reader, m_handler, COPYRIGHT_TAG);
    [javac]                 ^
    [javac] 1 error

I would like to move on to JDK1.5, but because we provide a service to many, we will not be moving toward this anytime soon...

Best regards

Sylvain

===================================================================
Sylvain Foisy, Ph. D.
Directeur - operations / Project Manager
BioneQ - Reseau quebecois de bio-informatique
U. de Montreal / Genome-Quebec

Adresse postale:
Departement de biochimie
Pavillon principal
2900, boul.
Édouard-Montpetit
Montréal (Québec) H3T 1J4
Tel: (514) 343-6111 x.2545
Fax: (514) 343-7759
Courriel: sylvain.foisy@bioneq.qc.ca
===================================================================

From kalle.naslund at genpat.uu.se Tue Oct 18 14:04:39 2005
From: kalle.naslund at genpat.uu.se (Kalle Näslund)
Date: Tue Oct 18 14:24:11 2005
Subject: [Biojava-dev] Serialization problems, "-" turns to "n" after serializing sequence
Message-ID: <43553937.406@genpat.uu.se>

Hi!

I seem to be stuck with a serialization issue, somewhere deep in the alphabet stuff. The problem is that "-" turns into "n". This happens both with fairly new CVS code and with the 1.4 release code. The code I am using is the following:

import java.util.*;
import java.io.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
import org.biojava.utils.*;
import org.biojava.bio.*;

/**
 * Temp class, just to check out some serialization issues im having.
 *
 * @author kalle
 */
public class AlignmentSerializationTest {

    public void run() throws Exception {
        Sequence dnaSeq1 = DNATools.createDNASequence("---ATGC---ATGC---", "seq1" );
        dumpInfoAboutSequence( dnaSeq1 );

        System.out.println("Writing alignment to disk");
        File file = new File("/tmp/ali.obj");
        FileOutputStream fOS = new FileOutputStream( file );
        ObjectOutputStream oOS = new ObjectOutputStream( fOS );
        oOS.writeObject( dnaSeq1 );
        oOS.close();
        fOS.close();

        System.out.println( "Loading alignment from disk" );
        FileInputStream fIS = new FileInputStream( file );
        ObjectInputStream oIS = new ObjectInputStream( fIS );
        Sequence serSeq = ( Sequence )oIS.readObject();
        dumpInfoAboutSequence( serSeq );
    }

    public static void main( String[] flags ) throws Exception {
        AlignmentSerializationTest myAST = new AlignmentSerializationTest();
        myAST.run();
    }

    private void dumpInfoAboutSequence( Sequence sequence ) throws Exception {
        System.out.println("Name :" + sequence.getName() );
        System.out.println("Alphabet :" + sequence.getAlphabet() );
        System.out.println("GapSymbol :" + sequence.getAlphabet().getGapSymbol() );
        System.out.println("Sequence :" + sequence.seqString() );
        System.out.println("Tokeniz :" + sequence.getAlphabet().getTokenization( "token" ) );
    }
}

And the output I get is:

Name :seq1
Alphabet :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence :---atgc---atgc---
Tokeniz :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56
Writing alignment to disk
Loading alignment from disk
Name :seq1
Alphabet :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence :nnnatgcnnnatgcnnn
Tokeniz :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56

I have spent some time using a debugger and stepping through the bj code, but realised that it will most likely take me loads of time. I was hoping that some of you who have more experience with the alphabet stuff could at least point me in the right direction, if not outright recognize the bug =)

kind regards
Kalle

From mark.schreiber at novartis.com Tue Oct 18 23:19:17 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 18 23:25:05 2005
Subject: [Biojava-dev] Serialization problems, "-" turns to "n" after serializing sequence
Message-ID:

Hello -

What should happen is that a method called readResolve() is called by the JVM on deserialization to replace the gap symbol that was deserialized with the gap symbol of the local AlphabetManager. This prevents you from having a gap that is not == the gap provided by the alphabet manager. It seems that somehow it is instead being replaced by the ambiguity symbol n.

It may take me a while to get around to looking at this. If you find it, please let me know.
If I forget, please remind me : )

- Mark

From ml-it-biojava-dev at epigenomics.com Wed Oct 19 04:41:38 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 04:47:36 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID:

Hi,

I had a lot of trouble using SeqIOTools.writeFasta on large sequences. The subStr method of SymbolList seems to introduce a memory leak (I did not track that in detail!).
Anyway, I would suggest changing FastaFormat from:

    public void writeSequence(Sequence seq, PrintStream os) throws IOException {
        os.print(">");
        os.println(describeSequence(seq));
        int length = seq.length();
        for (int pos = 1; pos <= length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth - 1, length);
            os.println(seq.subStr(pos, end));
        }
    }

to

    public void writeSequence(Sequence seq, PrintStream os) throws IOException {
        os.print(">");
        os.println(describeSequence(seq));
        int length = seq.length();
        String seqString = seq.seqString();
        for (int pos = 0; pos < length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth, length);
            String sub = seqString.substring(pos, end);
            os.println(sub);
        }
    }

Since it is String manipulation that takes place in the loop, I think there is no point in using SymbolList subStr anyway.

ciao
dirk

--
Dirk Habighorst
Software Engineer/ Bioinformatician
Epigenomics AG
Kleine Praesidentenstr. 1
10178 Berlin, Germany
phone:+49-30-24345-372
fax:+49-30-24345-555
http://www.epigenomics.com
dirk.habighorst@epigenomics.com

From td2 at sanger.ac.uk Wed Oct 19 09:53:39 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Wed Oct 19 10:40:16 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To:
References:
Message-ID: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>

Hi,

I'd argue against this patch since it could potentially generate some really huge strings. Suppose I've got a Sequence object representing human chromosome 1 (somewhere around 220Mb). If this is a database-backed object with chunks of sequence lazy-loaded on demand (biojava-ensembl does this, for example) then there'll be no problem working with it even on a fairly modest PC. But converting the whole thing to a String is going to use at least 440Mb of RAM, and could easily cause an OutOfMemoryError.

I'd be fine with stringifying sequences in larger chunks rather than one line at a time -- but I think we should be cautious about stringifying complete large sequences.

Do you have any idea where the memory leak might be? I'd be interested to track it down. What sort of sequences were you using?
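
A minimal sketch of the chunked stringification suggested above: a fixed number of output lines is fetched per call, rather than one line or the whole sequence. The `SubStrSource` interface, the `fromString` helper and the `linesPerChunk` parameter are hypothetical stand-ins so that the example runs without BioJava; only the 1-based, inclusive `subStr(start, end)` convention is taken from the FastaFormat code posted in this thread.

```java
import java.io.PrintStream;

class ChunkedFastaSketch {

    /** Hypothetical stand-in for the two SymbolList calls used here (1-based, inclusive). */
    interface SubStrSource {
        int length();
        String subStr(int start, int end);
    }

    /**
     * Writes the sequence body in lines of lineWidth characters, but asks the
     * source for text only once per chunk of linesPerChunk lines.
     */
    static void writeBody(SubStrSource seq, PrintStream os, int lineWidth, int linesPerChunk) {
        int chunkSize = lineWidth * linesPerChunk;
        int length = seq.length();
        for (int chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = Math.min(chunkStart + chunkSize - 1, length);
            // Only this chunk is ever held as a String.
            String chunk = seq.subStr(chunkStart, chunkEnd);
            // chunkSize is a multiple of lineWidth, so line breaks fall in the
            // same places as in the line-at-a-time version.
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                os.println(chunk.substring(pos, Math.min(pos + lineWidth, chunk.length())));
            }
        }
    }

    /** Wraps a plain String so the sketch can be exercised without a real sequence. */
    static SubStrSource fromString(final String s) {
        return new SubStrSource() {
            public int length() { return s.length(); }
            public String subStr(int start, int end) { return s.substring(start - 1, end); }
        };
    }

    public static void main(String[] args) {
        // 15 symbols, 4 per line, 2 lines per chunk: lines ACGT, ACGT, ACGT, ACG
        writeBody(fromString("ACGTACGTACGTACG"), System.out, 4, 2);
    }
}
```

With, say, 60 symbols per line and 1000 lines per chunk, each call would cover 60,000 symbols, so a lazily loaded sequence is still read piecewise while the number of subStr calls drops by a factor of 1000.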
Thomas

From ml-it-biojava-dev at epigenomics.com Wed Oct 19 11:09:27 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 11:08:32 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
References: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
Message-ID:

Hi Thomas,

I experienced performance problems (even OutOfMemoryError) when working with large Sequences (not lazy loaded). You might want to check this little example:

package test;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.ensembl.datamodel.CoordinateSystem;
import org.ensembl.datamodel.Location;
import org.ensembl.datamodel.Sequence;
import org.ensembl.datamodel.SequenceRegion;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.CoreDriver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.SequenceAdaptor;
import org.ensembl.driver.SequenceRegionAdaptor;

public class ExportFasta {

    /**
     * @param args
     */
    public static void main (String[] args) {
        // TODO Auto-generated method stub
        Properties props = createDriverProperties (args);
        try {
            OutputStream os;
            os = new FileOutputStream (args[3]);

            CoreDriver coreDriver = DriverManager.loadDriver (props);
            SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
            SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
            CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
            SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);
            int size = Integer.parseInt(args[5]);
            for (SequenceRegion seqRegion : srs) {
                Location loc = null;
                int length = (int) seqRegion.getLength();
                int start = 1;
                int end;
                while (start < length) {
                    end = start + size - 1 < length ? start + size - 1: length;
                    loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
                    System.out.println(loc);
                    start = end + 1;
                    Sequence seq = sa.fetch(loc);
                    org.biojava.bio.seq.Sequence bioseq = DNATools.createDNASequence(seq.getString(), loc.toString());
                    SeqIOTools.writeFasta(os, bioseq);
                }
            }
        }
        catch (ConfigurationException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (AdaptorException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (IllegalSymbolException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private static Properties createDriverProperties (String[] args) {
        Properties props = new Properties ();
        props.setProperty("host", args[0]);
        props.setProperty("user", args[1]);
        props.setProperty("database", args[2]);
        return props;
    }

}

java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE

Since the chunk size is stable, the memory required should be stable. With large chunks (1000000), allocated memory keeps growing!

hope that helps,
dirk

--
Dirk Habighorst
Software Engineer/ Bioinformatician
Epigenomics AG
Kleine Praesidentenstr.
1 10178 Berlin, Germany
phone:+49-30-24345-372
fax:+49-30-24345-555
http://www.epigenomics.com
dirk.habighorst@epigenomics.com

From ml-it-biojava-dev at epigenomics.com Wed Oct 19 12:28:37 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 12:27:46 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To:
References: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
Message-ID:

Hi Thomas,

I did a little debugging myself and found an interesting place to look at!
The SimpleSymbolList backing Sequences created with DNATools implements subList like this:

    public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                "Sublist index out of bounds " + length() + ":" + start + "," + end
            );
        }
        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start + ", end=" + end
            );
        }
        SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }

So every call registers the new SymbolList with the original Sequence via the addChangeListener method, and those references keep piling up. It appears that garbage collection can't keep up with that if the Sequence is too long. I have not checked this in detail though.

ciao,
dirk

--
Dirk Habighorst
Software Engineer/ Bioinformatician
Epigenomics AG
Kleine Praesidentenstr. 1
10178 Berlin, Germany
phone:+49-30-24345-372
fax:+49-30-24345-555
http://www.epigenomics.com
dirk.habighorst@epigenomics.com

From mark.schreiber at novartis.com Wed Oct 19 21:05:56 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct 19 21:05:04 2005
Subject: [Biojava-dev] Serialization problems, "-" turns to "n" after serializing sequence
Message-ID:

Hello -

Found out what was happening. It is not a problem with serialization but a problem with the createDNASequence method. This method wasn't dealing well with gaps. There is another method, DNATools.createGappedDNASequence(), that is supposed to do what you want. Ideally you shouldn't use the createDNASequence method with gap symbols. I have changed it now so that if it detects one it calls the createGapped method. This is in CVS. Your test seems to work now.

More generally I may need to apply this to RNATools and ProteinTools as well. I'll have a look.
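
The dispatch described above can be pictured roughly as follows. This is only a sketch of the idea, not the code committed to CVS: the `Factory` interface and the two stand-in factories are hypothetical replacements for `DNATools.createDNASequence` and `DNATools.createGappedDNASequence`, and treating '-' as the gap character is an assumption taken from the test input in this thread.

```java
class GapDispatchSketch {

    /** Hypothetical stand-in for a sequence factory; returns a label instead of a Sequence. */
    interface Factory {
        String create(String residues, String name);
    }

    /** Assumed gap character, as used in the "---ATGC---ATGC---" test input. */
    static final char GAP = '-';

    /** True if the residue string contains at least one gap character. */
    static boolean containsGap(String residues) {
        return residues.indexOf(GAP) >= 0;
    }

    /** Routes gapped input to the gap-aware factory and everything else to the plain one. */
    static String create(String residues, String name, Factory plain, Factory gapped) {
        Factory chosen = containsGap(residues) ? gapped : plain;
        return chosen.create(residues, name);
    }

    public static void main(String[] args) {
        Factory plain = new Factory() {
            public String create(String residues, String name) { return "plain:" + name; }
        };
        Factory gapped = new Factory() {
            public String create(String residues, String name) { return "gapped:" + name; }
        };
        System.out.println(create("---ATGC---ATGC---", "seq1", plain, gapped));
        System.out.println(create("ATGC", "seq2", plain, gapped));
    }
}
```

Callers who know their input contains gaps can skip the check and call the gapped method directly.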
- Mark Mark Schreiber/GP/Novartis@PH Sent by: biojava-dev-bounces@portal.open-bio.org 10/19/2005 11:19 AM To: Kalle N?slund cc: biojava-dev@biojava.org, (bcc: Mark Schreiber/GP/Novartis) Subject: Re: [Biojava-dev] Serialization problems, "-" turns to "n" after serializing sequence Hello - What should happen is that a method called readResolve() should be called by the JVM on deserialization to replace the gap symbol that was deserialized with the gap symbol of the local AlphabetManager. This prevents you from having a gap that is not == the gap provided by the alphabet manager. It seems that somehow it is instead being replaced by the ambiguity symbol n. It may take me a while to get around to looking at this. If you find it, please let me know. If I forget, please remind me : ) - Mark Kalle N?slund Sent by: biojava-dev-bounces@portal.open-bio.org 10/19/2005 02:04 AM To: biojava-dev@biojava.org cc: (bcc: Mark Schreiber/GP/Novartis) Subject: [Biojava-dev] Serialization problems, "-" turns to "n" after serializing sequence Hi! I seem to be stuck with a serialization issue, somewhere deep in the alphabet stuff. The problem is that "-" turns into "n". This happens both with farily new CVS code as well as 1.4 release code. The code i am using is the following: import java.util.*; import java.io.*; import org.biojava.bio.seq.*; import org.biojava.bio.symbol.*; import org.biojava.utils.*; import org.biojava.bio.*; /** * Temp class, just to check out some serialization issues im having. 
* * @author kalle */ public class AlignmentSerializationTest { public void run() throws Exception { Sequence dnaSeq1 = DNATools.createDNASequence("---ATGC---ATGC---", "seq1" ); dumpInfoAboutSequence( dnaSeq1 ); System.out.println("Writing alignment to disk"); File file = new File("/tmp/ali.obj"); FileOutputStream fOS = new FileOutputStream( file ); ObjectOutputStream oOS = new ObjectOutputStream( fOS ); oOS.writeObject( dnaSeq1 ); oOS.close(); fOS.close(); System.out.println( "Loading alignment from disk" ); FileInputStream fIS = new FileInputStream( file ); ObjectInputStream oIS = new ObjectInputStream( fIS ); Sequence serSeq = ( Sequence )oIS.readObject(); dumpInfoAboutSequence( serSeq ); } public static void main( String[] flags ) throws Exception { AlignmentSerializationTest myAST = new AlignmentSerializationTest(); myAST.run(); } private void dumpInfoAboutSequence( Sequence sequence ) throws Exception { System.out.println("Name :" + sequence.getName() ); System.out.println("Alphabet :" + sequence.getAlphabet() ); System.out.println("GapSymbol :" + sequence.getAlphabet().getGapSymbol() ); System.out.println("Sequence :" + sequence.seqString() ); System.out.println("Tokeniz :" + sequence.getAlphabet().getTokenization( "token" ) ); } } And the output i get is : Name :seq1 Alphabet :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: [] Sequence :---atgc---atgc--- Tokeniz :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56 Writing alignment to disk Loading alignment from disk Name :seq1 Alphabet :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: [] Sequence :nnnatgcnnnatgcnnn Tokeniz :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56 I have spent some time using a debugger and stepping trough the bj code but realised that it will most likely take me 
loads of time, and was hoping that some of you guys that have some more experience with the alphabet stuff could at least point me in the right direction, if not outright recognize the bug =)

kind regards
Kalle

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev

From mark.schreiber at novartis.com Wed Oct 19 21:12:35 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct 19 21:11:43 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID:

Hi Thomas -

I can confirm this. I ran a profiler a while back after getting a similar complaint. It seems that every time you call subList you add a reference to the parent SymbolList. For some reason this reference remains even when the sub list is garbage collected. Also oddly if you ever do an edit operation then all the old references disappear.

The best way to see it happen is to assign lots of memory to the JVM and infinitely loop over a sublist operation:

Sequence seq = ...
while(true){
    SymbolList sl = seq.subList(1, 10);
}

You quickly accumulate thousands of references. I could never figure out why they don't get released.

- Mark

ml-it-biojava-dev@epigenomics.com
Sent by: biojava-dev-bounces@portal.open-bio.org
10/20/2005 12:28 AM

To: biojava-dev@biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-dev] FastaFormat performance enhancement

Dirk Habighorst wrote:
> Thomas Down wrote:
>
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev@epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large
>>> sequences. The subStr method of SymbolList seems to introduce a
>>> memory leak (I did not track that in detail!).
>>> Anyway I would suggest to change FastaFormat:
>>>     public void writeSequence(Sequence seq, PrintStream os)
>>>             throws IOException {
>>>         os.print(">");
>>>         os.println(describeSequence(seq));
>>>         int length = seq.length();
>>>         for (int pos = 1; pos <= length; pos += lineWidth) {
>>>             int end = Math.min(pos + lineWidth - 1, length);
>>>             os.println(seq.subStr(pos, end));
>>>         }
>>>     }
>>>
>>> to
>>>     public void writeSequence(Sequence seq, PrintStream os)
>>>             throws IOException {
>>>         os.print(">");
>>>         os.println(describeSequence(seq));
>>>         int length = seq.length();
>>>         String seqString = seq.seqString();
>>>         for (int pos = 0; pos < length; pos += lineWidth) {
>>>             int end = Math.min(pos + lineWidth, length);
>>>             String sub = seqString.substring(pos, end);
>>>             os.println(sub);
>>>         }
>>>     }
>>>
>>> since it is String manipulation that takes place in the loop, I
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some
>> really huge strings. Suppose I've got a Sequence object representing
>> human chromosome 1 (somewhere around 220Mb). If this is a database-
>> backed object with chunks of sequence lazy-loaded on demand (biojava-
>> ensembl does this, for example) then there'll be no problem working
>> with it even on a fairly modest PC. But converting the whole thing
>> to a String is going to use at least 440Mb of RAM, and could easily
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than
>> one line at a time -- but I think we should be cautious about
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be? I'd be
>> interested to track it down. What sort of sequences were you using?
>>
>> Thomas
>>
> Hi thomas,
>
> I experienced performance problems (even OutOfMemoryError) when working
> with large Sequences (not lazy loaded).
> You might want to check this
> little example:
>
> package test;
>
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
>
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
>
>
> public class ExportFasta
> {
>
>     /**
>      * @param args
>      */
>     public static void main (String[] args) {
>         // TODO Auto-generated method stub
>         Properties props = createDriverProperties (args);
>         try {
>             OutputStream os;
>             os = new FileOutputStream (args[3]);
>
>             CoreDriver coreDriver = DriverManager.loadDriver (props);
>             SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>             SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>             CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>             SequenceRegion[] srs =
>                 sra.fetchAllByCoordinateSystem(coordinateSystem);
>             int size = Integer.parseInt(args[5]);
>             for (SequenceRegion seqRegion : srs) {
>                 Location loc = null;
>                 int length = (int) seqRegion.getLength();
>                 int start = 1;
>                 int end;
>                 while (start < length) {
>                     end = start + size - 1 < length ? start + size - 1: length;
>                     loc = new Location (coordinateSystem, seqRegion.getName(),
>                         start, end, 1);
>                     System.out.println(loc);
>                     start = end + 1;
>                     Sequence seq = sa.fetch(loc);
>                     org.biojava.bio.seq.Sequence bioseq =
>                         DNATools.createDNASequence(seq.getString(), loc.toString());
>                     SeqIOTools.writeFasta(os, bioseq);
>                 }
>             }
>         }
>         catch (ConfigurationException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (AdaptorException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (FileNotFoundException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IllegalSymbolException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
>
>     private static Properties createDriverProperties (String[] args) {
>         Properties props = new Properties ();
>         props.setProperty("host", args[0]);
>         props.setProperty("user", args[1]);
>         props.setProperty("database", args[2]);
>         return props;
>     }
>
> }
>
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
>
> since the chunksize is stable the memory required should be stable. With
> large chunks (1000000) allocated memory keeps growing!
> hope that helps, dirk

Hi thomas,

I did a little debugging myself and found an interesting place to look at!
The SimpleSymbolList backing Sequences created with the DNATools implements subList like this:

public SymbolList subList(int start, int end){
    if (start < 1 || end > length()) {
        throw new IndexOutOfBoundsException(
            "Sublist index out of bounds " + length() + ":" + start + "," + end
        );
    }

    if (end < start) {
        throw new IllegalArgumentException(
            "end must not be lower than start: start=" + start + ", end=" + end
        );
    }

    SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
    if (isView){
        referenceSymbolList.addChangeListener(sl);
    }else{
        this.addChangeListener(sl);
    }
    return sl;
}

so it keeps adding references to SymbolLists via the addChangeListener method to the original Sequence. It appears that the garbage collection can't keep up with that if the Sequence is too long. I have not checked this in detail though.

ciao,
dirk

--
Dirk Habighorst
Software Engineer/ Bioinformatician
Epigenomics AG
Kleine Praesidentenstr. 1
10178 Berlin, Germany
phone:+49-30-24345-372
fax:+49-30-24345-555
http://www.epigenomics.com
dirk.habighorst@epigenomics.com

From mark.schreiber at novartis.com Thu Oct 20 03:16:39 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 20 03:22:38 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID:

Hello -

I think I may have solved this. I found that the ChangeSupport had WeakReferences to the SymbolLists that are created in the subList() method. Obviously the things that were referenced were becoming weakly referenced and getting garbage collected, but the ChangeSupport was not clearing out the WeakReference objects that no longer pointed to anything. There was a provision for this if someone did something that fired a change event but not if they did not.
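The stale-entry behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual ChangeSupport code: the class WeakListenerRegistry and all of its methods are hypothetical names. The point it shows is that a holder of WeakReferences has to drop cleared references itself, for example just before it would otherwise enlarge its array.

```java
import java.lang.ref.WeakReference;
import java.util.Arrays;

// Hypothetical registry, for illustration only; not BioJava's ChangeSupport.
class WeakListenerRegistry {
    private WeakReference<Object>[] refs;
    private int count = 0;

    @SuppressWarnings("unchecked")
    WeakListenerRegistry(int capacity) {
        refs = (WeakReference<Object>[]) new WeakReference[Math.max(1, capacity)];
    }

    // Registers a listener weakly and returns the reference that was stored.
    WeakReference<Object> add(Object listener) {
        if (count == refs.length) {
            purge();                       // try to reclaim slots first
            if (count == refs.length) {    // still full, so grow the array
                refs = Arrays.copyOf(refs, refs.length * 2);
            }
        }
        WeakReference<Object> ref = new WeakReference<Object>(listener);
        refs[count++] = ref;
        return ref;
    }

    // Compacts the array, dropping references whose target is gone.
    // Returns the number of entries removed.
    int purge() {
        int kept = 0;
        for (int i = 0; i < count; i++) {
            if (refs[i].get() != null) {
                refs[kept++] = refs[i];
            }
        }
        int removed = count - kept;
        Arrays.fill(refs, kept, count, null);
        count = kept;
        return removed;
    }

    int size() { return count; }

    int capacity() { return refs.length; }

    public static void main(String[] args) {
        WeakListenerRegistry registry = new WeakListenerRegistry(2);
        Object a = new Object();
        Object b = new Object();
        Object c = new Object();
        WeakReference<Object> refA = registry.add(a);
        registry.add(b);
        refA.clear();   // stands in for the garbage collector clearing the reference
        registry.add(c);
        System.out.println("size=" + registry.size() + " capacity=" + registry.capacity());
    }
}
```

In the small demo, clear() is called by hand so the effect does not depend on when the collector runs; the third add() then reuses the freed slot instead of doubling the array.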
I've tweaked ChangeSupport a bit so that when it tries to grow its array of WeakReferences it first checks if it can purge some. This seems to stabilize the number of WeakReferences at about 1500 on my machine; each typically lasts about 4 GC cycles on average. I will check this into CVS.

I'm still a little concerned by the gradual increase of java.lang.ref.Finalizer objects. However, these are package private and only used by the JVM, so I don't think they are anything to do with what biojava is doing (directly), so hopefully they will sort themselves out given enough time.

- Mark

From mark.schreiber at novartis.com Fri Oct 21 03:58:05 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Fri Oct 21 03:57:28 2005
Subject: [Biojava-dev] gaps and basis symbols
Message-ID:

Hello -

There seems to be a slightly strange relationship between gaps and AlphabetManager.getGapSymbol(). If I take (for example) the SymbolTokenization of DNA and ask it for the Symbol associated with "-" it gives me back a BasisSymbol that is composed of a List that contains only the GapSymbol from AlphabetManager.

This leads to the slightly weird problem that the Symbol returned != AlphabetManager.getGapSymbol() which is what I expected. This also causes some curious problems with serialization that may or may not be related.
Regardless, why does the "-" token not map directly to the GapSymbol in a singleton manner rather than mapping to the BasisSymbol composed of a List of only the GapSymbol?

Can any biojava mystics elucidate some wisdom on this?

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax +65 6722 2910

From mark.schreiber at novartis.com Fri Oct 21 04:42:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Fri Oct 21 04:42:01 2005
Subject: [Biojava-dev] gaps and basis symbols
Message-ID:

Further to this ...

Investigating a bit further it seems that AlphabetManager.xml denotes an for "-" , "." and " ". It denotes a for "~".

I'm not sure if this is an oversight or if this was intentional. Should they not all be s?? Are not all gaps created equal? If I edit this in my copy of AlphabetManager.xml then everything seems to work and the JUnit tests still pass. It seems odd though; given that this has not been spotted before, I am thinking it is intentional.

Should I commit these changes to CVS???

- Mark

From matthew.pocock at ncl.ac.uk Tue Oct 25 12:07:28 2005
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue Oct 25 12:25:59 2005
Subject: [Biojava-dev] hello all
Message-ID: <200510251707.29364.matthew.pocock@ncl.ac.uk>

Hi,

Due to work's (very helpful) network security I wasn't able to access my Yahoo account for a while, so my old e-mail address lapsed. Anyhoo - I've now re-subscribed to both mailing lists with this address, so should be back in the loop now. I trust nothing too exciting has happened in the year that I've been away :-)

Matthew

From kalle.naslund at genpat.uu.se Wed Oct 26 05:05:15 2005
From: kalle.naslund at genpat.uu.se (Kalle Näslund)
Date: Wed Oct 26 05:27:24 2005
Subject: [Biojava-dev] gaps and basis symbols
In-Reply-To:
References:
Message-ID: <435F46CB.60009@genpat.uu.se>

mark.schreiber@novartis.com wrote:

>Further to this ...
>
>Investigating a bit further it seems that AlphabetManager.xml denotes an
> for "-" , "." and " ". It denotes a for
>"~".
>
>I'm not sure if this is an oversight or if this was intentional. Should
>they not all be s?? Are not all gaps created equal? If I edit
>this in my copy of AlphabetManager.xml then everything seems to work and
>the JUnit tests still pass. It seems odd though, given that this has not
>been spotted before I am thinking it is intentional.
>
>Should I commit these changes to CVS???
>
>- Mark
>
>
>
Hi!

I really don't have much of a clue myself, but I have been digging around in the serialization code, and have come to a similar conclusion as you.

In regard to the "~", I THINK the idea is that there are different gaps in biojava. One gap, the "-", is for gaps inside a sequence, while "~" is for gaps that really do not exist in the sequence; they are there because there is no sequence. Normally this would be in a multiple alignment, where any initial and terminal gaps are "~" and any gaps inside the actual sequence are "-". I think this is used somewhere as well, perhaps in the HMM code?

If we are nasty we could always give Matthew a "nice" welcome back present =P

Kalle

From matthew.pocock at ncl.ac.uk Wed Oct 26 07:20:40 2005
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Wed Oct 26 07:19:33 2005
Subject: [Biojava-dev] gaps and holes
Message-ID: <200510261220.40776.matthew.pocock@ncl.ac.uk>

Hi,

I understand there's been a thread about the biojava gaps model. For those of you who want a not very good explanation, you can track down my PhD on the Sanger web site and scan through chapter 3. If my memory serves, there is a section in there about the symbol group theory. Anyhoo - here is the long-and-short of the problem...

The BioJava symbol model was designed to support algorithms. This led us to go a set-theoretic route for modelling the DNA/RNA/Protein/(insert biopolymer here). This is visible in two places.

1) ambiguity is modelled using sets
 * The base nucleotides a/g/c/t are interconvertible with the sets {a}, {g}, {c} and {t}
 * Ambiguities are interconvertible with the sets that they range over e.g.
   n -> {a,g,c,t}
 * If you take the power set of DNA, you naturally have the bases, N and {} - and as if by magic, we also need gaps, so gap -> {} seems like an obvious rule

2) columns of an alignment are modelled as elements of cross-products of symbol sets
 * an alignment between two DNA sequences is a string from the alphabet DNA x DNA, which by convention in BioJava is a string over Pow({a,g,c,t})^2
 * for a symbol to be a member of DNA^2, it must be a cross-product symbol of dimension 2, where each component s_1 and s_2 is from the DNA alphabet, written [s_1, s_2].

OK - so, this all sounds fairly reasonable so far. Now for the more annoying bits.

a) what happens if we have an ambiguous symbol in an alignment?
 * Well, let's say we have the symbols s_1 and s_2, and s_1 is ambiguous - that is, it maps to a set of symbols that does not have size=1. This can be displayed as a column in an alignment just fine. E.g. the column [n,a] can be expanded to [{a,g,c,t},{a}] and this in turn can be expanded to {[a,a], [g,a], [c,a], [t,a]}.
 * Just for the fun of it, let's take {[a,g], [g,a]} - we can't write this down in the form [{i,j,...},{x,y,...}], so it can not be the column of an alignment.

The symbols that can be a column in an alignment are basis symbols. Those that can not are just Symbol instances. Every basis symbol could be used as a basis function in a probability distribution. Every single-dimensional symbol is a basis symbol, but we tend to be even more specific and call these atomic symbols.

b) Specifically, what happens if we have gaps in an alignment?
 * Let's have ~ for {} - you'll see why in a moment...
 * Let's have DNA^2
 * We could write [~, a] to represent a column in an alignment where there was a gap in the 1st sequence and a in the second. Similarly, [~,~] could represent a gap in both.
 * It follows by analogy that we would use [~] for the one-dimensional case. If we push this 1-dimensional case notation back up to the 2d one, we get the unwieldy result [[~],[~]]
 * Let's clean up the notation by keeping ~ -> {} and adding - -> [~]. Now we can write [-,-] for the 2d case, - for the 1d case and ~ for the 0d case.

c) Wait a minute - the 0d case???
 * Consider the empty alphabet. It is defined by the symbol set {}, so it contains ~ only. Now let's say there's a finite-state machine that is generating symbol lists. If the machine can never reach part of the symbol list to generate it, then the symbol there is ~
 * So - if you have a DNA sequence generated by a FSM, then the portion of the tape before and after the generated symbol list is populated by ~, which is kind of nice notationally because this is what multi-fasta uses to pad out before & after sequences in alignments

d) Back to alignments, from a FSM-centric point of view
 * Now we can use - to represent the case when the FSM advanced through a state silently. That is, the FSM moved on one state, but the emitted symbol list did not. If we choose to capture this 'emission slippage', then we need to notate that nothing was emitted but that it took up one symbol's space in the symbol list because one state was advanced. Hence, [~] is a reasonable choice here.
 * In pair-wise alignment, we can now use [-,x] and [x,-] to represent the case where the FSM emitted nothing on the first or second tape, respectively. [-,-] would represent the case where the FSM emitted on neither tape but still advanced a state. ~ can still be used for the case when the FSM could never generate a symbol, for example, outside the alignment matrix.

I hope this has made part of the rationale for structured gaps more clear. I agree that it is a bit strange, but if you want a consistent structure for representing symbols as sets and for representing alignments, it pretty much drops out as The One True Way.
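To make point (a) above a little more concrete, here is a small sketch of the set reading of alignment columns. It is an illustration only: SymbolSetSketch and its methods are hypothetical names, symbols are modelled as plain sets of characters, and none of this is BioJava API. It expands a column such as [n,a] into the atomic columns it covers, and checks whether a given set of atomic columns can be written back in the form [{i,j,...},{x,y,...}].

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; these names are hypothetical and not BioJava API.
class SymbolSetSketch {

    // Treats a string such as "agct" as the set of atomic symbols {a,g,c,t}.
    static Set<Character> sym(String atoms) {
        Set<Character> set = new TreeSet<Character>();
        for (char c : atoms.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Expands the column [s1, s2] into the set of atomic columns it covers.
    static Set<String> expand(Set<Character> s1, Set<Character> s2) {
        Set<String> columns = new TreeSet<String>();
        for (char x : s1) {
            for (char y : s2) {
                columns.add("[" + x + "," + y + "]");
            }
        }
        return columns;
    }

    // True if the atomic columns equal the cross product of their own
    // per-position projections, i.e. can be written as [{i,j,...},{x,y,...}].
    static boolean isBasis(Set<String> columns) {
        Set<Character> first = new TreeSet<Character>();
        Set<Character> second = new TreeSet<Character>();
        for (String col : columns) {
            first.add(col.charAt(1));   // character after '['
            second.add(col.charAt(3));  // character after ','
        }
        return expand(first, second).equals(columns);
    }

    public static void main(String[] args) {
        // [n,a] with n -> {a,g,c,t}
        Set<String> nA = expand(sym("agct"), sym("a"));
        System.out.println(nA + " basis? " + isBasis(nA));

        // {[a,g], [g,a]} cannot be written as a single column
        Set<String> odd = new TreeSet<String>();
        odd.add("[a,g]");
        odd.add("[g,a]");
        System.out.println(odd + " basis? " + isBasis(odd));
    }
}
```

Run as is, the first set (four atomic columns) is reported as a basis symbol and the second is not, which matches the two bullets under (a). The gap cases in (b) to (d) are deliberately left out of the sketch.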
We can split hairs about exactly when ~ and - get used, but they are different things, and if you confuse the two then inside things like DP recursions, Very Bad Things happen which require boundary conditions and nasty hacks to correct. Perhaps we need to use a GapSymbol interface or have isGap on Symbol or something to make life easier.

It's a pity the Java type system plays so badly with sets. Pity ML isn't generally accepted as being a useable language :-(

Matthew

From atariml at gmail.com Wed Oct 26 10:29:13 2005
From: atariml at gmail.com (Andrea Franceschini)
Date: Wed Oct 26 16:14:08 2005
Subject: [Biojava-dev] Automated upstream region sequence retrieval
Message-ID: <001501c5da39$a5205dc0$0801a8c0@atarippc>

Hi

We are looking to build a simple utility in Java to retrieve DNA sequences starting from a list of Entrez geneIds (for example, a user will be able to extract all the 2k upstream sequences of a list of geneIds). If none of you has already done something like this, we will be happy to do it, and if you're interested we could integrate our code into BioJava, following your guidance.

Thank you very much

Andrea Franceschini
University Politecnico of Milan (Italy)

From mmccormi at fhcrc.org Fri Oct 28 12:48:37 2005
From: mmccormi at fhcrc.org (Michael McCormick)
Date: Fri Oct 28 12:47:27 2005
Subject: [Biojava-dev] Potential Enhancements, Defect
Message-ID:

Greetings,

Ruihan Wang and I are developing an application that uses biojava in a J2EE environment. We have made a few changes and would like to add them to the biojava code. All of the changes except for one class involve serialization issues. Here is a brief summary. Please let me know if you are interested in adding these changes and how they should be submitted. Thanks.
Mike

Michael McCormick
Systems Analyst
Fred Hutchinson Cancer Research Center

/org/biojava/bio/search/SeqSimilaritySearchHit should be Serializable
/org/biojava/bio/search/SeqSimilaritySearchResult should be Serializable
/org/biojava/bio/search/SeqSimilaritySearchSubHit should be Serializable
/org/biojava/bio/seq/FeatureHolder should be Serializable
/org/biojava/bio/seq/db/SequenceDB should be Serializable
/org/biojava/bio/symbol/Symbol should be Serializable
/org/biojava/bio/symbol/SymbolList should be Serializable

/org/biojava/bio/symbol/SimpleAtomicSymbol and /org/biojava/bio/symbol/SimpleBasisSymbol do not serialize correctly; however, the mailing list provided a workaround by commenting out the defective code.

org/biojava/bio/program/abi/ABIFChromatogram.java has a few issues.
1. Should be Serializable.
2. We experienced file handle count resource exceptions since File access was not being closed! This still needs future refactoring since the new close does not occur within a finally block.
3. Modify class to use readFully(). In our environment, this change allowed us to parse chromats at least 10 times faster.
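The readFully() change in item 3 can be sketched in isolation as follows. This is a minimal illustration, not the ABIFChromatogram code: BulkShortRead and its method names are made up for the example, and the input is an in-memory byte array rather than a chromatogram file. It shows the two ways of reading big-endian unsigned 16-bit values from a java.io.DataInput: one call per element, or one bulk read followed by decoding in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

// Illustrative sketch only; not the ABIFChromatogram code itself.
class BulkShortRead {

    // Decodes big-endian unsigned 16-bit values from a raw byte buffer.
    static int[] decode(byte[] raw) {
        int[] values = new int[raw.length / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = ((raw[2 * i] & 0xff) << 8) | (raw[2 * i + 1] & 0xff);
        }
        return values;
    }

    // One bulk read followed by in-memory decoding.
    static int[] readBulk(DataInput in, int count) throws IOException {
        byte[] raw = new byte[2 * count];
        in.readFully(raw);
        return decode(raw);
    }

    // One call to the underlying input per element, for comparison.
    static int[] readOneByOne(DataInput in, int count) throws IOException {
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = in.readShort() & 0xffff;
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x00, 0x01, (byte) 0xff, (byte) 0xfe, (byte) 0x80, 0x00};
        int[] bulk = readBulk(new DataInputStream(new ByteArrayInputStream(data)), 3);
        int[] single = readOneByOne(new DataInputStream(new ByteArrayInputStream(data)), 3);
        System.out.println(Arrays.toString(bulk));
        System.out.println("same values: " + Arrays.equals(bulk, single));
    }
}
```

Both methods return the same values; the difference is only in how often the underlying input is touched. With an unbuffered source such as a RandomAccessFile, fewer and larger reads are presumably where the reported speed-up comes from, though that is not measured here.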
diff for org/biojava/bio/program/abi/ABIFChromatogram.java

27,28d26
< import java.io.RandomAccessFile;
< import java.io.Serializable;
57c55
< public class ABIFChromatogram extends AbstractChromatogram implements Serializable {
---
> public class ABIFChromatogram extends AbstractChromatogram {
141d138
<
151a149
>
153d150
< ((RandomAccessFile)getDataAccess()).close();
164a162
>
166,171c164,166
< byte[] shortArray = new byte[2 * count];
< getDataAccess().readFully(shortArray);
< int i = 0;
< for (int s = 0; s < shortArray.length; s += 2) {
< trace[i] = ((short)((shortArray[s] << 8) | (shortArray[s + 1] & 0xff))) & 0xffff;
< max = Math.max(trace[i++], max);
---
> for (int i = 0 ; i < count ; i++) {
> trace[i] = getDataAccess().readShort() & 0xffff;
> max = Math.max(trace[i], max);
175,178c170,171
< byte[] byteArray = new byte[count];
< getDataAccess().readFully(byteArray);
< for (int i = 0; i < byteArray.length; i++) {
< trace[i] = byteArray[i] & 0xff;
---
> for (int i = 0 ; i < count ; i++) {
> trace[i] = getDataAccess().readByte() & 0xff;
185c178
<
---
>
212,216c205,206
< byte[] shortArray = new byte[2 * count];
< getDataAccess().readFully(shortArray);
< IntegerAlphabet integerAlphabet = IntegerAlphabet.getInstance();
< for (int s = 0; s < shortArray.length; s += 2) {
< offsets.add(integerAlphabet.getSymbol(((short) ((shortArray[s] << 8) | (shortArray[s + 1] & 0xff))) & 0xffff));
---
> for (int i = 0 ; i < offsetsPtr.numberOfElements ; i++) {
> offsets.add(IntegerAlphabet.getInstance().getSymbol(getDataAccess().readShort() & 0xffff));
220,224c210,211
< byte[] byteArray = new byte[count];
< getDataAccess().readFully(byteArray);
< IntegerAlphabet integerAlphabet = IntegerAlphabet.getInstance();
< for (int i = 0 ; i < byteArray.length; i++) {
< offsets.add(integerAlphabet.getSymbol(byteArray[i] & 0xff));
---
> for (int i = 0 ; i < offsetsPtr.numberOfElements ; i++) {
> offsets.add(IntegerAlphabet.getInstance().getSymbol(getDataAccess().readByte() & 0xff));
234,237c221,224
< byte[] byteArray = new byte[(int) basesPtr.numberOfElements];
< getDataAccess().readFully(byteArray);
< for (int i = 0; i < byteArray.length; i++) {
< dna.add(ABIFParser.decodeDNAToken((char) byteArray[i]));
---
> char token;
> for (int i = 0 ; i < basesPtr.numberOfElements ; i++) {
> token = (char) getDataAccess().readByte();
> dna.add(ABIFParser.decodeDNAToken(token));