From mark.schreiber at novartis.com  Mon Oct  3 21:06:16 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Mon Oct  3 21:17:47 2005
Subject: [Biojava-dev] JDK 1.5
Message-ID: <OFB53AE077.42043982-ON48257090.0005CCE1-48257090.00061179@EU.novartis.net>

Hello -

BioJava is still officially using JDK 1.4.2, although I know many people 
have changed to JDK 1.5.

While no one is using generics etc. in the code base, a number of method 
calls that rely on JDK 1.5 have slipped in. The most common one is 

Integer.valueOf(int i)

This was only introduced in 1.5; please use the alternative

new Integer(i)

It is even less typing : )
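
For illustration, here is a minimal sketch of the two boxing calls side by
side. The names BoxingCompat, box14 and box15 are made up for this example
and are not part of BioJava. Note that on current JDKs the constructor is
itself deprecated, so the advice applies only to the 1.4.2 baseline discussed
here.

```java
public class BoxingCompat {
    // Works on JDK 1.4.2: the Integer(int) constructor predates 1.5.
    static Integer box14(int i) {
        return new Integer(i);
    }

    // Needs JDK 1.5 or later: Integer.valueOf(int) was added in 1.5.
    static Integer box15(int i) {
        return Integer.valueOf(i);
    }

    public static void main(String[] args) {
        // Both calls produce equal Integer values.
        System.out.println(box14(42).equals(box15(42)));
    }
}
```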

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910

From wetrull at yahoo.com  Wed Oct  5 18:11:10 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Wed Oct  5 18:18:20 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <20051005221110.36020.qmail@web81407.mail.yahoo.com>

Hello all,

I'm new to the list, but have done as much archive searching, Google
searching, and debugging as I can on the problem I describe here.

I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), but
keep getting a NullPointerException.  One of my searches turned up using
BlastEcho to debug the problem, but that also throws the
NullPointerException:

startSearch
	SearchProp:	program: ncbi-blastp
	SearchProp:	version: 2.0.11
java.lang.NullPointerException
	at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215)
	at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
	at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:311)
	at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:274)
	at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
	at com.pfizer.search.sequence.BlastEcho.echo(BlastEcho.java:42)
	at com.pfizer.search.sequence.BlastEcho.main(BlastEcho.java:88)
Exception in thread "main" 

Stepping through the code in a debugger shows that the while loop added in
revision 1.13 of
/biojava-live/src/org/biojava/bio/program/sax/BlastSAXParser.java (fixed
truncation of database id) reads all the lines without ever matching the
"Searching" string.  At first I thought it was because I was using a later
version of BLAST, but then I tried 2.0.11 and 2.2.3 (supported version) but
they also result in a NullPointerException.  In the BLAST output for the
various versions I never see a "Searching" string anywhere.  I've tried all
the -m options as well, without success.
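
Revision 1.13 is not reproduced here, so the following is only a sketch of
the failure described above. It assumes the loop calls a method on the result
of readLine() without checking for null at end of input. MarkerScan and
findMarkerLine are hypothetical names, not BioJava code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class MarkerScan {
    /**
     * Returns the 1-based number of the first line that starts with marker,
     * or -1 if the input ends before the marker is seen.
     */
    static int findMarkerLine(String text, String marker) {
        BufferedReader in = new BufferedReader(new StringReader(text));
        try {
            int lineNo = 0;
            String line;
            // readLine() returns null at end of input. Testing for null first
            // avoids the NullPointerException that line.startsWith(...) would
            // throw if the marker never appears.
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.startsWith(marker)) {
                    return lineNo;
                }
            }
            return -1;
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new IllegalStateException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        String noMarker = "BLASTP 2.0.11 [Jan-20-2000]\nQuery= 26SPS9_Hs\n";
        String withMarker = "BLASTP 2.0.11 [Jan-20-2000]\nSearching....done\n";
        System.out.println(findMarkerLine(noMarker, "Searching"));
        System.out.println(findMarkerLine(withMarker, "Searching"));
    }
}
```

A caller that gets -1 back can report "marker not found" instead of failing
with an exception deep inside the parser.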

Is there a NCBI BLAST option that I need to be using?  I'm running on Windows
XP (during development) - is the UNIX version output different?  

Thanks.

-Eric Trull


From mark.schreiber at novartis.com  Wed Oct  5 23:39:59 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct  5 23:39:55 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <OF40F8AA06.6011FF06-ON48257092.0013D988-48257092.0014244A@EU.novartis.net>

Hello -

This is very odd.

The JUnit tests currently pass using the files in 
/tests/files/org/biojava/bio/programs/ssbind. These BLAST files all have 
the string "Searching....". Maybe there is a variation in the Windows 
output?

Can you post at least the header of your output to the list (preferably an 
entire example output)?

- Mark

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev


From mark.schreiber at novartis.com  Thu Oct  6 01:39:58 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct  6 01:46:18 2005
Subject: [Biojava-dev] agave, game, game12
Message-ID: <OF67C2A320.C66ADBF9-ON48257092.001E9EA9-48257092.001F206E@EU.novartis.net>

Hello -

Does anyone still require or make use of the following packages:

org.biojava.bio.seq.io.agave
org.biojava.bio.seq.io.game
org.biojava.bio.seq.io.game12

They represent i/o classes for these now redundant formats.

If not then I will mark them as deprecated and probably remove them when 
we make a 1.5 release.

- Mark



From mark.schreiber at novartis.com  Thu Oct  6 01:47:23 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct  6 01:47:39 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>

Hello -

No one seemed to object to the idea of officially adopting java 1.5 for 
the biojava-live branch.

This would mean:

- biojava-live would require Java 1.5.
- Generics, unboxing and other language features added in 1.5 will start to 
  creep into the codebase.
- All 'official' and 'preview' releases after biojava1.4 will require JDK 1.5 
  (Java 5).

If you plan to use new versions of biojava on a machine for which there is 
or will be no JDK1.5 then you should protest now!
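
As a rough illustration of what such code would look like, the sketch below
writes the same sum in 1.4 style and in 1.5 style. Java5Features, sum14 and
sum15 are hypothetical names, not BioJava code. The second method and the
generic list in main need a 1.5 compiler.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Java5Features {
    // JDK 1.4 style: raw List, explicit Iterator, cast and intValue().
    static int sum14(List values) {
        int total = 0;
        for (Iterator it = values.iterator(); it.hasNext();) {
            total += ((Integer) it.next()).intValue();
        }
        return total;
    }

    // JDK 1.5 style: generics, enhanced for loop and automatic unboxing.
    static int sum15(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<Integer>();
        values.add(1); // autoboxing, also a 1.5 feature
        values.add(2);
        values.add(3);
        System.out.println(sum14(values) == sum15(values));
    }
}
```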

- Mark


From mark.schreiber at novartis.com  Thu Oct  6 03:36:05 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct  6 03:36:07 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <OF06E385C1.530EF49E-ON48257092.00298B76-48257092.0029C1DB@EU.novartis.net>

Does SPICE rely on biojava-live? If it only requires biojava1.4 then this 
wouldn't be an issue. However, if you are actively building SPICE with 
biojava-live (possibly not a good idea) we can keep it as 1.4 for a while.

- Mark



From ap3 at sanger.ac.uk  Thu Oct  6 03:26:24 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu Oct  6 04:15:55 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
In-Reply-To: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
References: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
Message-ID: <74ebc3fc158aabd8858638a8160106ab@sanger.ac.uk>

Hi!

I use biojava for the SPICE - protein sequence and structure browser.

http://www.efamily.org.uk/software/dasclients/spice/

This application is launched from within a browser using Java Web Start.
Since many people are still using Java 1.4 on their machines, I would not
want to force them to upgrade, and hence I would prefer BioJava to stay
with 1.4 for a while yet.

Cheers,
Andreas


-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                               Hinxton, Cambridge CB10 1SA, UK
			 +44 (0) 1223 49 6891

From ady at sanger.ac.uk  Thu Oct  6 03:39:10 2005
From: ady at sanger.ac.uk (Andy Yates)
Date: Thu Oct  6 04:32:03 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
References: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
Message-ID: <4344D49E.5070409@sanger.ac.uk>

Well okay I'll fly the flag for platforms like Alpha where a 1.5 
compatible JVM/compiler does not exist nor ever will. I know from the 
BOF at BOSC there were quite a few people who were reporting a similar 
situation.

Now if no one else objects to the JDK 1.5 move then I'm not going to fly 
the flag for 1.4, since I don't dev any more on Alpha, plus I like more 
reasons to force people who I work with to upgrade :)

Andy Y

From mbreese at gmail.com  Thu Oct  6 03:47:10 2005
From: mbreese at gmail.com (Marcus Breese)
Date: Thu Oct  6 05:24:59 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
References: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
Message-ID: <d85061200510060047u6aaf2e01x26d0a945fb2053df@mail.gmail.com>

You may want to think a bit more about converting completely over to 1.5...
There are still a number of platforms that don't have a compatible 1.5 JDK.
Mac OS X still comes with 1.4.2 standard (1.5 is available, but not
standard). Also, the last time I checked there wasn't an IBM PPC 1.5 JVM,
which means that a number of HPC platforms / clusters will not be supported.

My view on it is that 1.5 is good for apps, but still too new for a critical
library.



From td2 at sanger.ac.uk  Thu Oct  6 06:14:35 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu Oct  6 06:45:16 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To: <d85061200510060047u6aaf2e01x26d0a945fb2053df@mail.gmail.com>
References: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
	<d85061200510060047u6aaf2e01x26d0a945fb2053df@mail.gmail.com>
Message-ID: <D7A33E05-144A-4CA1-9F72-4F56F7BD00EC@sanger.ac.uk>


On 6 Oct 2005, at 08:47, Marcus Breese wrote:

> You may want to think a bit more about converting completely over  
> to 1.5...
> There are still a number of platforms that don't have a compatible  
> 1.5 JDK.
> Mac OS X still comes with 1.42 standard (1.5 is available, but not
> standard). Also, the last time I checked there wasn't an IBM PPC  
> 1.5 JVM,
> which means that a number of HPC platforms / clusters will not be  
> supported.

IBM do have something out now:

             http://www-128.ibm.com/developerworks/java/jdk/java5beta/

Beta software on a time-limited licence, so probably not what people  
really want to run -- but it does suggest there should be a release  
version in the not-too-distant future.

Perhaps we should wait until the end of the year then look at how the  
transition is coming along.  I know there's a new release of Mac OS  
10.4 coming in the next few weeks, and it sounds like that will  
include a big pile of bug fixes (I know the dreaded  
Eclipse-running-progressively-slower bug has been looked at).  That  
might well encourage more Mac users (who seem to be the biggest group  
stuck on Java 1.4) to upgrade.

             Thomas.


From wetrull at yahoo.com  Thu Oct  6 11:10:01 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Thu Oct  6 11:37:30 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <20051006151002.58972.qmail@web81404.mail.yahoo.com>

Hello all,

I'm new to the list so I have not been following the full discussion and
don't know all the issues - excuse my ignorance.

I don't have any objections to moving to Java 1.5, except that I know some
J2EE application servers (both web tier and business tier, i.e. EJBs) are
slow adopters of new versions of Java.  That being said, would BioJava 1.4
be maintained (bug fixes) for those stuck on 1.4.2?

I know in my case I'm going to be using BioJava in several web services
deploying to pre-existing application servers.  The owners of the application
servers will balk at the cost of upgrading to Java 1.5 (both the JVM and the
application server).

Thanks.

-Eric Trull
From mbreese at gmail.com  Thu Oct  6 12:27:52 2005
From: mbreese at gmail.com (Marcus Breese)
Date: Thu Oct  6 12:55:47 2005
Subject: [Biojava-dev] Re: [Biojava-l] Java 1.5 (final chance to object)
In-Reply-To: <D7A33E05-144A-4CA1-9F72-4F56F7BD00EC@sanger.ac.uk>
References: <OFE6F0860B.8D790B20-ON48257092.001F57D0-48257092.001FCE19@EU.novartis.net>
	<d85061200510060047u6aaf2e01x26d0a945fb2053df@mail.gmail.com>
	<D7A33E05-144A-4CA1-9F72-4F56F7BD00EC@sanger.ac.uk>
Message-ID: <d85061200510060927n3e653119m8af05525c4d439e8@mail.gmail.com>

The problem for me is really the HPC environment. I know our cluster admins
would be very hesitant to install a beta JVM on our brand new IBM cluster. We
also have a (very small) Mac cluster that will be stuck on 10.3 for quite a
while as we don't have the cash to upgrade the entire thing. So, our stuff
will be stuck at 1.4.2 for a while... Then again, we aren't actively
developing biojava things on those platforms, just on the smaller single
Linux boxes with 1.5.


From mark.schreiber at novartis.com  Thu Oct  6 21:53:12 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct  6 21:52:41 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <OF9ECEE258.5F18AD26-ON48257093.000A4C5C-48257093.000A5D6E@EU.novartis.net>

OK, there seem to have been a few reasonable objections. The consensus 
seems to be we will wait until the end of the year. I think from then on 
we will change over for biojava-live.

There was a suggestion of maintaining two branches, one would be a 
maintenance of biojava1.4 and only use JDK1.4.2, the other would be 
biojava-live and use JDK1.5. I have done this in the past and have no 
desire to do it again, it's really not that much fun. I would however like 
to reserve the option of putting JDK1.5 dependent code into the classes of 
the org.biojavax package. If this happens I will adjust the ANT build 
script such that these are not compiled if JDK1.5 is not detected. This 
should be safe as the biojava packages have no dependencies on the 
biojavax packages. Bug fixes to biojava would still be in the system. 
Additionally the org.biojavax packages are undergoing a lot of development 
right now so you shouldn't be doing any production programming with them 
anyway.
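
The Ant side is not shown here, but the decision the build has to make can be
sketched in plain Java, assuming the version is read from the standard
java.specification.version system property. JdkCheck and supportsJava5 are
hypothetical names, and the real build script may well detect the JDK
differently.

```java
public class JdkCheck {
    /**
     * Returns true if a specification version string such as "1.4", "1.5"
     * or "17" denotes Java 5 or later.
     */
    static boolean supportsJava5(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        if (major > 1) {
            // Newer JDKs report a single number such as "9" or "17".
            return true;
        }
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return minor >= 5;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.specification.version");
        // A build could use this answer to decide whether the JDK 1.5
        // dependent sources are compiled or skipped.
        System.out.println(version + " supports Java 5 code: "
                + supportsJava5(version));
    }
}
```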

- Mark



From sicotteh at mail.nih.gov  Fri Oct  7 09:59:58 2005
From: sicotteh at mail.nih.gov (Sicotte, Hugues (NIH/NCI))
Date: Fri Oct  7 10:21:07 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD3@nihexchange9.nih.gov>

Last call.. OK

BioJava is one of the most turbulent bioinformatics library projects,
much more turbulent than BioPerl or Biopython.
Often code written a few years ago does not work because older APIs
get deprecated or changed. This is especially tough for people who use
BioJava on and off. We think two sets of APIs are compatible, but there 
are lots of deprecated features. We find online examples, only to find
out they no longer work.

Stability and backward compatibility are crucial to the success of BioJava.
In fact, if the code is not too complex, I often have to write it myself
rather than relying on BioJava. I want to write code that will still be in
production 5 years from now.

I say that, not because I don't appreciate the hard work of developers,
but because I would like the volunteer developers to appreciate that
the users of the toolkit need stability. We live in production environments
that will not support 1.5 for a long time. I am still living in a 1.3 world.
Only projects that I am starting right now can use 1.4.2_08.


I beg you, please reconsider moving to 1.5; it's only 5% more typing to use
1.4.

Hugues Sicotte
(a user who still doesn't get to use the RegExp package in 1.4)


From wetrull at yahoo.com  Fri Oct  7 12:04:34 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Fri Oct  7 12:14:59 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To: <OFEEE7CD57.12B0D3C2-ON48257093.000B557A-48257093.000BA282@EU.novartis.net>
Message-ID: <20051007160434.6853.qmail@web81401.mail.yahoo.com>

Should I raise this as an issue with NCBI?  It seems to make writing
parsing routines more difficult.

Thanks.

-Eric Trull

--- mark.schreiber@novartis.com wrote:

> Looks like there might be a difference in the Windows output. I will try 
> to take a look at this over the next few days. Probably need to change the 
> BlastSAXParser to look for something other than Searching so that this 
> will get parsed as well.
> 
> - Mark
> 
> 
> 
> 
> 
> "W. Eric Trull" <wetrull@yahoo.com>
> 10/06/2005 11:01 PM
> 
>  
>         To:     biojava-dev@biojava.org
>         cc:     Mark Schreiber/GP/Novartis@PH
>         Subject:        Re: [Biojava-dev] NullPointerException from
> BlastSAXParser.java
> 
> 
> Hello Mark,
> 
> Here is what I've done, using NCBI Blast 2.0.11, Windows XP, JDK 1.4.2
> 
> 1.  Downloaded the PDB's pdb_seqres.txt
> 2.  Created a blast database (after changing the deflines):
>         C:\blast-2.0.11\formatdb.exe
>             -t "PDB" 
>             -i blast\pdb_seqres.txt
>             -l blast\pdb_formatdb.log
>             -o T
>             -n blast\pdb
> 3.  BLASTed 26SPS9_Hs:
>         C:\blast-2.0.11\blastall.exe
>             -p blastp
>             -d blast\pdb
>             -i 26SPS9_Hs.fasta
>             -o 26SPS9_Hs.blast
> 4.  Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in Anger
> and BlastEcho, both of which give me the NullPointerException.  The beginning
> of 26SPS9_Hs.blast file is shown below, the entire file is attached. 
> 
> Please let me know if you see anything obviously wrong with the way I'm doing
> the BLAST.  I'm going to cvs checkout the BioJava source code and have a look
> at the JUnit test later today.
> 
> Thanks!
> 
> -Eric Trull
> 
> -------- 26SPS9_Hs.blast --------
> BLASTP 2.0.11 [Jan-20-2000]
> 
> 
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs",  Nucleic Acids Res. 25:3389-3402.
> 
> Query= 26SPS9_Hs 
>          (176 letters)
> 
> Database: PDB
>            78,094 sequences; 17,596,117 total letters
> 
> 
> 
>                                                                    Score  E
> Sequences producing significant alignments:                        (bits) Value
> 
> pdb|1UFM|A Cop9 Complex Subunit 4                                      39 0.003
> .
> .
> .
> -------- 26SPS9_Hs.blast --------

From sicotteh at mail.nih.gov  Fri Oct  7 13:27:01 2005
From: sicotteh at mail.nih.gov (Sicotte, Hugues (NIH/NCI))
Date: Fri Oct  7 13:27:12 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD5@nihexchange9.nih.gov>

I've been through this before, when I was working for NCBI.

The answer was that the text output of BLAST was never a supported format.
The only supported format is the XML BLAST output:
http://ccgb.umn.edu/~crow/projects/xmlblast/example.html

Also, when parsing multiple BLAST files, breaking on "Searching..." is not
a good idea: if the parameters are wrong or the query sequence is of too
low complexity, the program does not emit this string.
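
The failure mode described in this thread can be illustrated with a small
sketch. It is not the BlastSAXParser code: the class HeaderScan, the method
findMarkerLine and the choice of marker strings are assumptions made for
illustration only. It shows a read loop that treats end of input as a
normal outcome, since BufferedReader.readLine() returns null there and a
loop that only waits for "Searching" would otherwise end up dereferencing
that null.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class HeaderScan {
    /**
     * Returns the first line of the report that starts with any of the
     * given markers, or null if the report ends before one is seen.
     * The marker list is a hypothetical choice, not BioJava's.
     */
    public static String findMarkerLine(String report, String[] markers) {
        BufferedReader in = new BufferedReader(new StringReader(report));
        try {
            String line;
            while ((line = in.readLine()) != null) {  // null means end of input
                for (int i = 0; i < markers.length; i++) {
                    if (line.startsWith(markers[i])) {
                        return line;
                    }
                }
            }
            return null;  // no marker: the caller decides how to report it
        } catch (IOException e) {
            // Not expected when reading from a String
            throw new IllegalStateException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Header shaped like the 26SPS9_Hs.blast excerpt: no "Searching" line
        String report = "BLASTP 2.0.11 [Jan-20-2000]\n\n"
                + "Query= 26SPS9_Hs\n         (176 letters)\n\n"
                + "Database: PDB\n\n"
                + "Sequences producing significant alignments:   (bits) Value\n";
        String[] markers = { "Searching", "Sequences producing significant alignments" };
        String hit = findMarkerLine(report, markers);
        System.out.println(hit == null ? "no marker found" : "stopped at: " + hit);
    }
}
```

Accepting more than one marker only softens the problem; a report with no
hits at all would still need the null case handled by the caller.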


Hugues Sicotte


-----Original Message-----
From: W. Eric Trull [mailto:wetrull@yahoo.com]
Sent: Friday, October 07, 2005 12:05 PM
To: biojava-dev@biojava.org
Cc: mark.schreiber@novartis.com
Subject: Re: [Biojava-dev] NullPointerException from BlastSAXParser.java


Should I raise this as an issue with NCBI?  Seems like it makes writing
parsing routines more difficult.

Thanks.

-Eric Trull

--- mark.schreiber@novartis.com wrote:

> Looks like there might be a difference in the Windows output. I will try
> to take a look at this over the next few days. Probably need to change the
> BlastSAXParser to look for something other than Searching so that this
> will get parsed as well.
> 
> - Mark

_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev
From wetrull at yahoo.com  Fri Oct  7 14:45:46 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Fri Oct  7 14:45:11 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To: <27C204BD76CBC142BA1AE46D62A8548E0F4E5DD5@nihexchange9.nih.gov>
Message-ID: <20051007184546.17689.qmail@web81404.mail.yahoo.com>

I'll switch to the XML output and parser...seems more sane anyway.

Thanks!

-Eric Trull

--- "Sicotte, Hugues (NIH/NCI)" <sicotteh@mail.nih.gov> wrote:

> I've been through this before when I was working for
> NCBI.
> 
> The answer was that the text output of BLAST was never a supported format.
> The only supported format is the XML Blast Output.
> http://ccgb.umn.edu/~crow/projects/xmlblast/example.html
> 
> also
> 
> In the case of parsing multiple blast files,
> breaking on "Searching..." is not a good idea
> because if the parameters are wrong or the query sequence
> too low complexity, this String is not emitted by the program.
> 
> 
> Hugues Sicotte

From mark.schreiber at novartis.com  Sun Oct  9 21:34:08 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Sun Oct  9 21:33:35 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
Message-ID: <OFABC5F9A0.3903B259-ON48257096.0005CEC8-48257096.00089E9C@EU.novartis.net>

I would like to reiterate this.

The BLAST text output has never been consistent between versions. Seems 
NCBI has problems with backwards compatibility too : ) This has made it 
difficult to maintain the parsers.

The XML version is much safer. Although for a while it didn't follow its 
own DTD, it seems that it does now and has done for a while.
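
As a rough illustration of the XML route, here is a minimal sketch that
uses only the SAX parser bundled with the JDK, not BioJava's own parser.
The class BlastXmlHits and the method hitDefs are hypothetical names, and
the XML in main is a hand-written, heavily trimmed fragment that borrows
element names from the NCBI BLAST XML format (Hit, Hit_id, Hit_def); it is
not real blastall output. The sketch uses generics, so it needs a newer
JDK than 1.4. As far as I know, legacy blastall writes XML when run with
-m 7.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BlastXmlHits {
    /** Collects the text content of every Hit_def element. */
    static class HitDefHandler extends DefaultHandler {
        final List<String> hitDefs = new ArrayList<String>();
        private StringBuilder text;  // non-null only while inside Hit_def

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("Hit_def".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);  // may arrive in several chunks
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("Hit_def".equals(qName)) {
                hitDefs.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Parses the XML and returns the hit descriptions in document order. */
    public static List<String> hitDefs(String xml) {
        try {
            HitDefHandler handler = new HitDefHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.hitDefs;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse BLAST XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written fragment for illustration only
        String xml = "<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>"
                + "<Hit><Hit_id>pdb|1UFM|A</Hit_id><Hit_def>Cop9 Complex Subunit 4</Hit_def></Hit>"
                + "<Hit><Hit_id>pdb|1OMW|A</Hit_id><Hit_def>G-Protein Coupled Receptor Kinase 2</Hit_def></Hit>"
                + "</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>";
        System.out.println(hitDefs(xml));
    }
}
```

Because the handler keys on element names rather than on line layout, a
missing progress line such as "Searching..." cannot affect it.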

- Mark


"W. Eric Trull" <wetrull@yahoo.com>
Sent by: biojava-dev-bounces@portal.open-bio.org
10/08/2005 02:45 AM

 
        To:     biojava-dev@biojava.org
        cc:     Mark Schreiber/GP/Novartis@PH
        Subject:        RE: [Biojava-dev] NullPointerException from BlastSAXParser.java


I'll switch to the XML output and parser...seems more sane anyway.

Thanks!

-Eric Trull




From mark.schreiber at novartis.com  Sun Oct  9 21:58:34 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Sun Oct  9 21:57:58 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <OF89737996.AA24E447-ON48257096.0008C4EC-48257096.000ADB47@EU.novartis.net>

>Stability and backward compatibility are crucial to the success of BioJava.
>In fact, if the code is not too complex, I often have to write it myself
>rather than relying on BioJava. I want to write code that will still be in
>production 5 years from now.

This stems from the fact that for several years package stability wasn't 
really a design goal. This was good because often our first attempts were 
not that good. Having said that, stability is definitely a goal now. I 
believe there has been a great reduction in major API changes between 
recent versions. Indeed many core interfaces have not changed in a long 
time. But we do need to remain vigilant. Basic rule: do not break the core 
org.biojava.* APIs!

There are some reservations to that. 1) Deprecation will happen. There is 
nothing wrong with deprecating a bad or unsupported API as long as it is 
flagged well before releases come out. Preferably there should be one or 
two releases where a method or class is deprecated before it is (if ever) 
finally removed. 2) New packages in CVS should never be considered stable. 
If an API has not been part of a release version I don't think we need to 
guarantee stability.

When org.biojavax is released no org.biojava.* API will be removed or 
changed. Some will be deprecated as the org.biojavax APIs may give you a 
better alternative. They are not really separate APIs, more extensions 
that give more flexibility and sometimes do things better (like Swing is to 
AWT). The part where people may see problems is interactions with BioSQL. 
Previously BioJava worked with BioSQL but not in the way it should 
according to the BioSQL specs. The new version will bring it into line 
with BioSQL. The old APIs will remain should people need to access legacy 
data in BioSQL DBs created with the old API.

>I say that, not because I don't appreciate the hard work of developers,
>but because I would like the volunteer developers to appreciate that
>the users of the toolkit need stability. We live in production environments
>that will not support 1.5 for a long time. I am still living in a 1.3 world.
>Only projects that I am starting right now can use 1.4.2_08.

The major releases from the past support the following versions: 
biojava1.3 and biojava1.3.1 work with JDK1.3.x and biojava1.4 works with 
JDK1.4.2. You should never prepare production code from versions of 
biojava in CVS. The best way to make production code is to bundle all your 
application dependencies into the application JAR. This way everything is 
in one place and no external changes affect it (and keep the required 
version of the JDK available if you upgrade to a new one). It's not 
elegant but it is bulletproof. There are other approaches that work too 
but need more management.

>I beg you, please reconsider moving to 1.5; it's only 5% more typing to use
>1.4.

Sure, people just need to stop dropping 1.5-dependent code into CVS. 
Please note though, biojava1.4 will never require Java 1.5; it would only 
affect future versions. Richard is planning a preview of biojavax in 
December. We may revisit the issue then.

P.S. What JDK is caBIO running on now?


Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910




From mark.schreiber at novartis.com  Mon Oct 10 01:27:50 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Mon Oct 10 01:27:32 2005
Subject: [Biojava-dev] TaxonSQL not 1.4 compliant
Message-ID: <OFDB13D474.01406DFA-ON48257096.0014D8C7-48257096.001E03DA@EU.novartis.net>

Hello -

Someone has placed a modification of TaxonSQL in CVS that contains methods 
from JDK1.5. You know who you are : )

Could you please use JDK1.4 "equivalents"?

Specifically, there are a number of Integer.valueOf(int i) calls which 
I have fixed in CVS. There are also a couple of uses of the String methods 
matches(String regex) and replace(String regex) which I have not tried to 
fix. These were only introduced in java 1.5.

Can people please try and avoid using methods that only compile with JDK 
1.5 for the time being? They are clearly documented in the javadocs with 
Since: 1.5
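
For anyone looking for such "equivalents", here is a minimal sketch that
sticks to methods available in JDK 1.4. The class Jdk14Compat and its
method names are hypothetical and are not part of BioJava or of the
TaxonSQL fix in CVS. Going by the javadocs, Integer.valueOf(int) and
String.replace(CharSequence, CharSequence) are marked Since: 1.5, while
String.matches(String) is listed as Since: 1.4, so it is worth checking
the Since: tag of each method individually.

```java
public class Jdk14Compat {
    /**
     * Boxes an int without Integer.valueOf(int).
     * Recent compilers flag this constructor as deprecated; it is used
     * here only because it is the form that JDK 1.4 offers.
     */
    public static Integer box(int i) {
        return new Integer(i);
    }

    /**
     * Replaces every literal occurrence of target in s, using only
     * indexOf, substring and StringBuffer. No regular expressions are
     * involved, so characters such as '.' need no escaping.
     * Assumption: an empty target leaves s unchanged, which differs
     * from what String.replace does in later JDKs.
     */
    public static String replaceLiteral(String s, String target, String replacement) {
        if (target.length() == 0) {
            return s;  // avoids an endless loop on an empty target
        }
        StringBuffer out = new StringBuffer();
        int from = 0;
        int at;
        while ((at = s.indexOf(target, from)) >= 0) {
            out.append(s.substring(from, at)).append(replacement);
            from = at + target.length();  // continue after the match
        }
        out.append(s.substring(from));    // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(box(42));
        System.out.println(replaceLiteral("Homo.sapiens", ".", " "));
    }
}
```

String.replaceAll(String, String) is also available in 1.4, but it treats
its first argument as a regular expression, which is why the sketch uses a
literal search instead.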

Thanks,

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910

From kheeteck at yahoo.com  Thu Oct  6 08:50:32 2005
From: kheeteck at yahoo.com (kheeteck)
Date: Tue Oct 11 11:05:35 2005
Subject: [Biojava-dev] retrieve property "ORIGIN"
Message-ID: <20051006125032.87281.qmail@web32403.mail.mud.yahoo.com>

Hi..

Does anybody know how to retrieve the annotation property "ORIGIN"?

My code looks like this:

while (sequences.hasNext()) {
    try {
        seq = sequences.nextSequence();
        Annotation anno = seq.getAnnotation();

        // print each key/value pair
        for (Iterator i = anno.keys().iterator(); i.hasNext(); ) {
            Object key = i.next();
            System.out.println(key + " : " + anno.getProperty(key));
        }
    } catch (BioException e) {
        // catch block assumed here; the posted snippet left the try unclosed
        e.printStackTrace();
    }
}

It prints out the value for each annotation key, but for ORIGIN there is
nothing.

Regards
KheeTeck


From wetrull at yahoo.com  Thu Oct  6 11:01:01 2005
From: wetrull at yahoo.com (W. Eric Trull)
Date: Tue Oct 11 11:05:38 2005
Subject: [Biojava-dev] NullPointerException from BlastSAXParser.java
In-Reply-To: <OF40F8AA06.6011FF06-ON48257092.0013D988-48257092.0014244A@EU.novartis.net>
Message-ID: <20051006150102.155.qmail@web81403.mail.yahoo.com>

Hello Mark,

Here is what I've done, using NCBI Blast 2.0.11, Windows XP, JDK 1.4.2

1.  Downloaded the PDB's pdb_seqres.txt
2.  Created a blast database (after changing the deflines):
        C:\blast-2.0.11\formatdb.exe
            -t "PDB" 
            -i blast\pdb_seqres.txt
            -l blast\pdb_formatdb.log
            -o T
            -n blast\pdb
3.  BLASTed 26SPS9_Hs:
        C:\blast-2.0.11\blastall.exe
            -p blastp
            -d blast\pdb
            -i 26SPS9_Hs.fasta
            -o 26SPS9_Hs.blast
4.  Tried to parse 26SPS9_Hs.blast using the class shown in BioJava in Anger
and BlastEcho, both of which give me the NullPointerException.  The beginning
of 26SPS9_Hs.blast file is shown below, the entire file is attached. 

Please let me know if you see anything obviously wrong with the way I'm doing
the BLAST.  I'm going to cvs checkout the BioJava source code and have a look
at the JUnit test later today.

Thanks!

-Eric Trull

-------- 26SPS9_Hs.blast --------
BLASTP 2.0.11 [Jan-20-2000]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= 26SPS9_Hs 
         (176 letters)

Database: PDB
           78,094 sequences; 17,596,117 total letters


                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

pdb|1UFM|A Cop9 Complex Subunit 4                                      39  0.003
.
.
.
-------- 26SPS9_Hs.blast --------


--- mark.schreiber@novartis.com wrote:

> Hello -
> 
> This is very odd.
> 
> The JUnit tests currently pass using the files in 
> /tests/files/org/biojava/bio/programs/ssbind  These BLAST files all have 
> the string "Searching....". Maybe there is a variation in the windows 
> output?
> 
> Can you post at least the header of your output to the list (preferably an 
> entire example output)?
> 
> - Mark
> 
> 
> 
> 
> 
> "W. Eric Trull" <wetrull@yahoo.com>
> Sent by: biojava-dev-bounces@portal.open-bio.org
> 10/06/2005 06:11 AM
> 
>  
>         To:     biojava-dev@biojava.org
>         cc:     (bcc: Mark Schreiber/GP/Novartis)
>         Subject:        [Biojava-dev] NullPointerException from
> BlastSAXParser.java
> 
> 
> Hello all,
> 
> I'm new to the list, but have done as much archive searching, Google
> searching, and debugging as I can on the problem I describe here.
> 
> I'm trying to parse NCBI BLAST output (as shown in BioJava in Anger), but
> keep getting a NullPointerException.  One of my searches turned up using
> BlastEcho to debug the problem, but that also throws the
> NullPointerException:
> 
> startSearch
>                  SearchProp:             program: ncbi-blastp
>                  SearchProp:             version: 2.0.11
> java.lang.NullPointerException
>                  at org.biojava.bio.program.sax.BlastSAXParser.interpret(BlastSAXParser.java:215)
>                  at org.biojava.bio.program.sax.BlastSAXParser.parse(BlastSAXParser.java:164)
>                  at org.biojava.bio.program.sax.BlastLikeSAXParser.onNewDataSet(BlastLikeSAXParser.java:311)
>                  at org.biojava.bio.program.sax.BlastLikeSAXParser.interpret(BlastLikeSAXParser.java:274)
>                  at org.biojava.bio.program.sax.BlastLikeSAXParser.parse(BlastLikeSAXParser.java:160)
>                  at com.pfizer.search.sequence.BlastEcho.echo(BlastEcho.java:42)
>                  at com.pfizer.search.sequence.BlastEcho.main(BlastEcho.java:88)
> Exception in thread "main" 
> 
> Stepping through the code in a debugger shows that the while loop added in
> revision 1.13 of
> /biojava-live/src/org/biojava/bio/program/sax/BlastSAXParser.java (fixed
> truncation of database id) reads all the lines without ever matching the
> "Searching" string.  At first I thought it was because I was using a later
> version of BLAST, but then I tried 2.0.11 and 2.2.3 (supported version) but
> they also result in a NullPointerException.  In the BLAST output for the
> various versions I never see a "Searching" string anywhere.  I've tried all
> the -m options as well, without success.
> 
> Is there a NCBI BLAST option that I need to be using?  I'm running on Windows
> XP (during development) - is the UNIX version output different? 
> 
> Thanks.
> 
> -Eric Trull
> 
> 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev@biojava.org
> http://biojava.org/mailman/listinfo/biojava-dev
> 
> 
> 
> 
-------------- next part --------------
BLASTP 2.0.11 [Jan-20-2000]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= 26SPS9_Hs 
         (176 letters)

Database: PDB
           78,094 sequences; 17,596,117 total letters


                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

pdb|1UFM|A Cop9 Complex Subunit 4                                      39  0.003
pdb|1YM7|D Beta-Adrenergic Receptor Kinase 1                           29  3.3
pdb|1YM7|C Beta-Adrenergic Receptor Kinase 1                           29  3.3
pdb|1YM7|B Beta-Adrenergic Receptor Kinase 1                           29  3.3
pdb|1YM7|A Beta-Adrenergic Receptor Kinase 1                           29  3.3
pdb|1OMW|A G-Protein Coupled Receptor Kinase 2                         29  3.3

>pdb|1UFM|A Cop9 Complex Subunit 4
           Length = 84
           
 Score = 39.1 bits (89), Expect = 0.003
 Identities = 15/56 (26%), Positives = 35/56 (61%)

Query: 114 LLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVL 169
           ++E NL+   + ++ +  E + +L+++  A  E+  SQMI + + +G +DQ +G++
Sbjct: 16  VIEHNLLSASKLYNNITFEELGALLEIPAAKAEKIASQMITEGRMNGFIDQIDGIV 71


>pdb|1YM7|D Beta-Adrenergic Receptor Kinase 1
           Length = 689
           
 Score = 29.0 bits (63), Expect = 3.3
 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%)

Query: 73  CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131
           C+    + + L +F + +  Y + E  ++ ++ +   +++D  + + L+    PFS+  I
Sbjct: 72  CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129

Query: 132 EHISSLIKLSKADVERKLSQMILDK 156
           EH+     L K  V   L Q  +++
Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152


>pdb|1YM7|C Beta-Adrenergic Receptor Kinase 1
           Length = 689
           
 Score = 29.0 bits (63), Expect = 3.3
 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%)

Query: 73  CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131
           C+    + + L +F + +  Y + E  ++ ++ +   +++D  + + L+    PFS+  I
Sbjct: 72  CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129

Query: 132 EHISSLIKLSKADVERKLSQMILDK 156
           EH+     L K  V   L Q  +++
Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152


>pdb|1YM7|B Beta-Adrenergic Receptor Kinase 1
           Length = 689
           
 Score = 29.0 bits (63), Expect = 3.3
 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%)

Query: 73  CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131
           C+    + + L +F + +  Y + E  ++ ++ +   +++D  + + L+    PFS+  I
Sbjct: 72  CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129

Query: 132 EHISSLIKLSKADVERKLSQMILDK 156
           EH+     L K  V   L Q  +++
Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152


>pdb|1YM7|A Beta-Adrenergic Receptor Kinase 1
           Length = 689
           
 Score = 29.0 bits (63), Expect = 3.3
 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%)

Query: 73  CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131
           C+    + + L +F + +  Y + E  ++ ++ +   +++D  + + L+    PFS+  I
Sbjct: 72  CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129

Query: 132 EHISSLIKLSKADVERKLSQMILDK 156
           EH+     L K  V   L Q  +++
Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152


>pdb|1OMW|A G-Protein Coupled Receptor Kinase 2
           Length = 689
           
 Score = 29.0 bits (63), Expect = 3.3
 Identities = 18/85 (21%), Positives = 41/85 (48%), Gaps = 5/85 (5%)

Query: 73  CVAQASKNRSLADFEKALTDY-RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQI 131
           C+    + + L +F + +  Y + E  ++ ++ +   +++D  + + L+    PFS+  I
Sbjct: 72  CLKHLEEAKPLVEFYEEIKKYEKLETEEERLVCSR--EIFDTYIMKELLACSHPFSKSAI 129

Query: 132 EHISSLIKLSKADVERKLSQMILDK 156
           EH+     L K  V   L Q  +++
Sbjct: 130 EHVQG--HLVKKQVPPDLFQPYIEE 152


  Database: PDB
    Posted date:  Oct 6, 2005  7:42 AM
  Number of letters in database: 17,596,117
  Number of sequences in database:  78,094
  
Lambda     K      H
   0.319    0.136    0.379 

Gapped
Lambda     K      H
   0.270   0.0470    0.230 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 5635599
Number of Sequences: 78094
Number of extensions: 193971
Number of successful extensions: 758
Number of sequences better than 10.0: 6
Number of HSP's better than 10.0 without gapping: 1
Number of HSP's successfully gapped in prelim test: 5
Number of HSP's that attempted gapping in prelim test: 757
Number of HSP's gapped (non-prelim): 6
length of query: 176
length of database: 17,596,117
effective HSP length: 50
effective length of query: 126
effective length of database: 13,691,417
effective search space: 1725118542
effective search space used: 1725118542
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.7 bits)
S2: 59 (27.4 bits)
From mark.schreiber at novartis.com  Thu Oct  6 21:52:05 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 11 11:05:39 2005
Subject: [Biojava-dev] Java 1.5 (final chance to object)
Message-ID: <OFDE43E947.A0B4CA31-ON48257093.00094EAB-48257093.000A4351@EU.novartis.net>

An HTML attachment was scrubbed...
URL: http://portal.open-bio.org/pipermail/biojava-dev/attachments/20051007/4b427dc3/attachment-0001.htm
From hzhang at ceres-inc.com  Fri Oct  7 18:02:03 2005
From: hzhang at ceres-inc.com (Hongyu Zhang)
Date: Tue Oct 11 11:05:40 2005
Subject: [Biojava-dev] a bug in WU-BLAST parser
Message-ID: <AEC65ADD13FF6E4B8E715F407484D2F20181D596@mailman.ceres-inc.com>

I am reporting an error in the WU-BLAST parser to the Biojava
developers' list. The Biojava version is the latest, 1.4, and the bug is
in the file src/org/biojava/bio/program/sax/WuBlastSummaryLineHelper.java.
The error occurs when the description field of a BLAST hit summary
line is empty. For example:

ADL26502                                                      1430  8.5e-145  1

Cause of the error:

A parameter in the code, iGrab, changes the behavior of the parser
depending on the WU-BLAST program. When the program is BLASTX, TBLASTX
or TBLASTN, iGrab is set to 4, and the code at line 120 tries to
read a non-existent "Frame" field from the WU-BLAST summary line.
Usually the description field in the summary line is not empty, so this
line of code mistakenly grabs the last word of the description as the
"Frame" field. This mistake usually does not matter to the code that
follows and therefore stays hidden in most situations. In the
example above, however, the description field of the hit is empty, so
the code shifts one field too far, reads the "High Score" field
(1430 in this case) as the "Frame" field, and causes the StringTokenizer
to throw an error.
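
One way to make the summary-line parsing robust to an empty description is to read the fixed numeric columns from the right-hand end of the line instead of counting tokens from the left. The sketch below is not the code in WuBlastSummaryLineHelper: the class and field names are hypothetical, and it assumes that every summary line ends with the High Score, P-value and N columns, as in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper, not part of BioJava: parses one WU-BLAST hit summary line. */
class SummaryLineSketch {
    final String id;
    final String description; // may be empty
    final String score;
    final String pValue;
    final String n;

    SummaryLineSketch(String line) {
        // Split the line on whitespace, as a StringTokenizer does by default.
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        if (tokens.size() < 4) {
            throw new IllegalArgumentException("Too few fields in summary line: " + line);
        }
        int last = tokens.size() - 1;
        // Assumed layout: the last three tokens are always High Score, P-value and N.
        n = tokens.get(last);
        pValue = tokens.get(last - 1);
        score = tokens.get(last - 2);
        id = tokens.get(0);
        // Whatever lies between the id and the score is the description.
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < last - 2; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens.get(i));
        }
        description = sb.toString();
    }
}
```

Because the numeric fields are located relative to the end of the line, an empty description simply yields an empty string and nothing shifts.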

 
Hongyu Zhang, Ph.D.

Computational Biologist

Ceres Inc.

1535 Rancho Conejo Blvd

Thousand oaks, CA 91320

Phone: (805)376-6504 ext. 1204

Fax: (805)376-6537

 

From zhaozhg at keylab.net  Sat Oct  8 03:34:18 2005
From: zhaozhg at keylab.net (=?gb2312?B?1dTWvrjV?=)
Date: Tue Oct 11 11:05:41 2005
Subject: [Biojava-dev] Which package  the class NestedError is in? 
Message-ID: <20051008073456.85FC44047@mail.keylab.net>

Skipped content of type multipart/alternative
From zhaozhg at keylab.net  Sun Oct  9 23:08:26 2005
From: zhaozhg at keylab.net (=?gb2312?B?1dTWvrjV?=)
Date: Tue Oct 11 11:05:42 2005
Subject: [Biojava-dev] who have finished the example "Changeability
	examples" ?
Message-ID: <20051010030901.A215B40B3@mail.keylab.net>

Skipped content of type multipart/alternative
From mark.schreiber at novartis.com  Tue Oct 11 21:00:13 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 11 20:59:34 2005
Subject: [Biojava-dev] a bug in WU-BLAST parser
Message-ID: <OF19C84285.7E512272-ON48257098.0005737C-48257098.0005835E@EU.novartis.net>

Hi -

I will take a look at this. However, if you know of a solution, could you 
post it to me so that I can commit it to CVS?

Thanks.

- Mark


From mark.schreiber at novartis.com  Tue Oct 11 21:02:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 11 21:02:03 2005
Subject: [Biojava-dev] who have finished the example
	"Changeability	examples" ?
Message-ID: <OFE3AF6F6B.4866C41A-ON48257098.00059033-48257098.0005BD24@EU.novartis.net>

Hello -

Some of the old examples are getting out of date. Fixing them is on the 
ever increasing list of things to do. More up to date examples can be 
found at http://www.biojava.org/docs/bj_in_anger/index.htm

Thanks for pointing out the bug.

- Mark


"赵志刚" <zhaozhg@keylab.net>
Sent by: biojava-dev-bounces@portal.open-bio.org
10/10/2005 11:08 AM

 
        To:     "biojava" <biojava-dev@biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-dev] who have finished the example "Changeability examples" ?


Has anyone finished the "Changeability examples" example in the 
Biojava tutorial? The URL is http://www.biojava.org/tutorials/events2.html.
I can't get it to work with JDK 1.4.2 and Biojava 1.4. I find some errors in 
the source Roulet.java (http://www.biojava.org/tutorials/Roulet.java). Besides, the demo (http://www.biojava.org/tutorials/Roulet.html) doesn't work.
Thanks!


From sylvain.foisy at bioneq.qc.ca  Tue Oct 18 12:58:45 2005
From: sylvain.foisy at bioneq.qc.ca (Sylvain Foisy)
Date: Tue Oct 18 14:05:09 2005
Subject: [Biojava-dev] JDK1.5 dependencies woes
Message-ID: <BF7AA205.7626%sylvain.foisy@bioneq.qc.ca>

Hi to all,

Still some JDK1.5 stuff creeping in CVS in UniProtXMLFormat:

compile-biojava:
    [javac] Compiling 40 source files to /Volumes/BIONEQ-The_Brain/Java-Librairies/biojava-live/ant-build/classes/biojava
    [javac] /Volumes/BIONEQ-The_Brain/Java-Librairies/biojava-live/src/org/biojavax/bio/seq/io/UniProtXMLFormat.java:389: cannot resolve symbol
    [javac] symbol  : method contains (java.lang.String)
    [javac] location: class java.lang.String
    [javac]             if (line.contains("<"+COPYRIGHT_TAG)) XMLTools.readXMLChunk(reader, m_handler, COPYRIGHT_TAG);
    [javac]                     ^
    [javac] 1 error

I would like to move on to JDK1.5 but because we provide a service to many,
we will not be moving toward this anytime soon...
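
For reference, String.contains() only arrived in JDK 1.5, whereas String.indexOf() is available in 1.4 and gives the same answer for a plain substring test. A minimal sketch of a 1.4-compatible substitute follows; the class name Jdk14Strings is made up for illustration and is not part of BioJava.

```java
/** Hypothetical utility showing a JDK 1.4-compatible substitute for String.contains(). */
class Jdk14Strings {
    /** Returns true if haystack contains needle, using only String.indexOf. */
    static boolean contains(String haystack, String needle) {
        return haystack.indexOf(needle) != -1;
    }
}
```

Applied to the line above, the test would read if (line.indexOf("<"+COPYRIGHT_TAG) != -1) and should compile under 1.4.2.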

Best regards

Sylvain

===================================================================
Sylvain Foisy, Ph. D.
Directeur - operations / Project Manager
BioneQ - Reseau quebecois de bio-informatique
U. de Montreal / Genome-Quebec

Adresse postale:

Departement de biochimie
Pavillon principal
2900, boul. Édouard-Montpetit
Montréal (Québec) H3T 1J4

Tel: (514) 343-6111 x.2545
Fax: (514) 343-7759
Courriel: sylvain.foisy@bioneq.qc.ca
===================================================================


From kalle.naslund at genpat.uu.se  Tue Oct 18 14:04:39 2005
From: kalle.naslund at genpat.uu.se (=?ISO-8859-1?Q?Kalle_N=E4slund?=)
Date: Tue Oct 18 14:24:11 2005
Subject: [Biojava-dev] Serialization problems,
	"-" turns to "n" after serializing sequence
Message-ID: <43553937.406@genpat.uu.se>

Hi!

I seem to be stuck with a serialization issue, somewhere deep in the 
alphabet stuff. The problem is that "-" turns into "n". This happens 
both with fairly new CVS code and with the 1.4 release code.

The code I am using is the following:

import java.util.*;
import java.io.*;

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
import org.biojava.utils.*;
import org.biojava.bio.*;

/**
 * Temp class, just to check out some serialization issues I'm having.
 *
 * @author kalle
 */
public class AlignmentSerializationTest {

    public void run() throws Exception {
        // Sequence with gaps at the start, in the middle and at the end
        Sequence dnaSeq1 = DNATools.createDNASequence("---ATGC---ATGC---", "seq1" );

        dumpInfoAboutSequence( dnaSeq1 );

        System.out.println("Writing alignment to disk");

        File file = new File("/tmp/ali.obj");
        FileOutputStream fOS = new FileOutputStream( file );
        ObjectOutputStream oOS = new ObjectOutputStream( fOS );

        oOS.writeObject( dnaSeq1 );

        oOS.close();
        fOS.close();

        System.out.println( "Loading alignment from disk" );
        FileInputStream     fIS = new FileInputStream( file );
        ObjectInputStream   oIS = new ObjectInputStream( fIS );

        Sequence  serSeq  = ( Sequence )oIS.readObject();

        oIS.close();
        fIS.close();

        // The gaps come back as "n" in this second dump
        dumpInfoAboutSequence( serSeq );
    }

    public static void main( String[] flags ) throws Exception {
        AlignmentSerializationTest myAST = new AlignmentSerializationTest();
        myAST.run();
    }

    // Prints the name, alphabet, gap symbol, sequence string and tokenization
    private void dumpInfoAboutSequence( Sequence sequence ) throws Exception {
        System.out.println("Name      :" + sequence.getName() );
        System.out.println("Alphabet  :" + sequence.getAlphabet() );
        System.out.println("GapSymbol :" + sequence.getAlphabet().getGapSymbol() );
        System.out.println("Sequence  :" + sequence.seqString() );
        System.out.println("Tokeniz   :" + sequence.getAlphabet().getTokenization( "token" ) );
    }
}


And the output i get is :

Name      :seq1
Alphabet  :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence  :---atgc---atgc---
Tokeniz   :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56

Writing alignment to disk

Loading alignment from disk

Name      :seq1
Alphabet  :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence  :nnnatgcnnnatgcnnn
Tokeniz   :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56


I have spent some time stepping through the bj code in a debugger, 
but realised that it will most likely take me loads of time. I was 
hoping that some of you who have more experience with the 
alphabet stuff could at least point me in the right direction, if not 
outright recognize the bug =)

kind regards Kalle
From mark.schreiber at novartis.com  Tue Oct 18 23:19:17 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Oct 18 23:25:05 2005
Subject: [Biojava-dev] Serialization problems,	"-" turns to "n" after
	serializing sequence
Message-ID: <OFC7252551.60DE74B4-ON4825709F.0011C7F1-4825709F.00123EDE@EU.novartis.net>

Hello -

What should happen is that a method called readResolve() should be called 
by the JVM on deserialization to replace the gap symbol that was 
deserialized with the gap symbol of the local AlphabetManager.

This prevents you from having a gap that is not == the gap provided by the 
alphabet manager. It seems that somehow it is instead being replaced by 
the ambiguity symbol n.
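
The readResolve() mechanism can be shown in isolation with a small stand-alone sketch. GapSketch below is a hypothetical stand-in for a singleton symbol, not the BioJava gap symbol or AlphabetManager code, and it only illustrates how the hook is meant to preserve == identity across a serialization round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a singleton symbol such as the gap; not BioJava code. */
class GapSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    /** The one instance that every caller should see. */
    static final GapSketch INSTANCE = new GapSketch();

    private GapSketch() {
    }

    /** Called by the serialization machinery after the object has been read. */
    private Object readResolve() {
        return INSTANCE; // swap the freshly read copy for the local singleton
    }

    /** Serializes obj to a byte array and reads it straight back. */
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(obj);
            out.close();
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With readResolve() in place, GapSketch.roundTrip(GapSketch.INSTANCE) == GapSketch.INSTANCE holds. A readResolve() that returns the wrong object would produce a substitution like the one seen here, though that is only a guess at the cause.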

It may take me a while to get around to looking at this. If you find it, 
please let me know. If I forget, please remind me : )

- Mark


From ml-it-biojava-dev at epigenomics.com  Wed Oct 19 04:41:38 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 04:47:36 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID: <dj50s2$dus$1@perl.epigenomics.epi>

Hi, 

I had a lot of trouble using SeqIOTools.writeFasta on large sequences. The subStr method of SymbolList seems to introduce a memory leak (I did not track this down in detail!). Anyway, I would suggest changing FastaFormat:
  
    public void writeSequence(Sequence seq, PrintStream os)
    throws IOException {
        os.print(">");
        os.println(describeSequence(seq));
        
        int length = seq.length();
        
        for (int pos = 1; pos <= length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth - 1, length);
            os.println(seq.subStr(pos, end));
        }
    }

to 

    public void writeSequence(Sequence seq, PrintStream os)
    throws IOException {
        os.print(">");
        os.println(describeSequence(seq));
        
        int length = seq.length();
        String seqString = seq.seqString();
        for (int pos = 0; pos < length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth, length);
            String sub = seqString.substring(pos, end);
            os.println(sub);
        }
    }

Since it is String manipulation that takes place in the loop, I think there is no point in using SymbolList subStr anyway.

ciao dirk
  
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst@epigenomics.com
From td2 at sanger.ac.uk  Wed Oct 19 09:53:39 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Wed Oct 19 10:40:16 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To: <dj50s2$dus$1@perl.epigenomics.epi>
References: <dj50s2$dus$1@perl.epigenomics.epi>
Message-ID: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>


On 19 Oct 2005, at 09:41, ml-it-biojava-dev@epigenomics.com wrote:

> Hi,
> I had a lot of trouble using SeqIOTools.writeFasta on large  
> sequences. The subStr method of SymbolList seems to introduce a  
> memory leak (I did not track that in detail!). Anyway I would  
> suggest to change FastaFormat:
>     public void writeSequence(Sequence seq, PrintStream os)
>    throws IOException {
>        os.print(">");
>        os.println(describeSequence(seq));
>               int length = seq.length();
>               for (int pos = 1; pos <= length; pos += lineWidth) {
>            int end = Math.min(pos + lineWidth - 1, length);
>            os.println(seq.subStr(pos, end));
>        }
>    }
>
> to
>    public void writeSequence(Sequence seq, PrintStream os)
>    throws IOException {
>        os.print(">");
>        os.println(describeSequence(seq));
>               int length = seq.length();
>        String seqString = seq.seqString();
>        for (int pos = 0; pos < length; pos += lineWidth) {
>            int end = Math.min(pos + lineWidth, length);
>            String sub = seqString.substring(pos, end);
>            os.println(sub);
>        }
>    }
>
> since it is String manipulation that takes place in the loop, I  
> think there is no point in using SymbolList subStr anyway.

Hi,

I'd argue against this patch since it could potentially generate some  
really huge strings.  Suppose I've got a Sequence object representing  
human chromosome 1 (somewhere around 220Mb).  If this is a database- 
backed object with chunks of sequence lazy-loaded on demand (biojava- 
ensembl does this, for example) then there'll be no problem working  
with it even on a fairly modest PC.  But converting the whole thing  
to a String is going to use at least 440Mb of RAM, and could easily  
cause an OutOfMemoryError.

I'd be fine with stringifying sequences in larger chunks rather than  
one line at a time -- but I think we should be cautious about  
stringifying complete large sequences.
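
As a rough sketch of what I mean by larger chunks: fetch several lines' worth of symbols per subStr call, then split that chunk into lines with String.substring. The SubStrSource interface and the linesPerChunk parameter below are invented for the example (SubStrSource stands in for the two SymbolList methods used, with 1-based inclusive coordinates), so treat it as an illustration and not a patch against FastaFormat.

```java
import java.io.PrintStream;

/** Hypothetical sketch of chunked FASTA body output; not the BioJava FastaFormat code. */
class ChunkedFastaSketch {

    /** Minimal stand-in for the part of SymbolList used here (1-based, inclusive). */
    interface SubStrSource {
        int length();
        String subStr(int start, int end);
    }

    /**
     * Writes the sequence body, fetching linesPerChunk lines' worth of symbols
     * per subStr call and splitting each chunk into lines with String.substring.
     */
    static void writeBody(SubStrSource seq, PrintStream os, int lineWidth, int linesPerChunk) {
        // A multiple of lineWidth keeps the line breaks aligned across chunks.
        int chunkSize = lineWidth * linesPerChunk;
        int length = seq.length();
        for (int pos = 1; pos <= length; pos += chunkSize) {
            int end = Math.min(pos + chunkSize - 1, length);
            String chunk = seq.subStr(pos, end); // at most chunkSize characters held at once
            for (int i = 0; i < chunk.length(); i += lineWidth) {
                os.println(chunk.substring(i, Math.min(i + lineWidth, chunk.length())));
            }
        }
    }
}
```

The memory held as a String is then bounded by the chunk size and not by the sequence length, so a lazily loaded chromosome never has to be stringified in one go.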

Do you have any idea where the memory leak might be?  I'd be  
interested to track it down.  What sort of sequences were you using?

              Thomas
From ml-it-biojava-dev at epigenomics.com  Wed Oct 19 11:09:27 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 11:08:32 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To: <7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
References: <dj50s2$dus$1@perl.epigenomics.epi>
	<7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
Message-ID: <dj5nj7$75u$1@perl.epigenomics.epi>

Hi Thomas,

I experienced performance problems (even an OutOfMemoryError) when working with large Sequences (not lazy-loaded). You might want to check this little example:

package test;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.ensembl.datamodel.CoordinateSystem;
import org.ensembl.datamodel.Location;
import org.ensembl.datamodel.Sequence;
import org.ensembl.datamodel.SequenceRegion;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.CoreDriver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.SequenceAdaptor;
import org.ensembl.driver.SequenceRegionAdaptor;


public class ExportFasta
{

  /**
   * @param args
   */
  public static void main (String[] args) {
    // TODO Auto-generated method stub
    Properties props = createDriverProperties (args);
    try {
      OutputStream os;
      os = new FileOutputStream (args[3]);

      CoreDriver coreDriver = DriverManager.loadDriver (props);
      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
      SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);
      
      int size = Integer.parseInt(args[5]);
      for (SequenceRegion seqRegion : srs) {
        Location loc = null;
        int length = (int) seqRegion.getLength();
        int start = 1;
        int end;
        // "<=" so that a final one-base chunk (start == length) is not skipped
        while (start <= length) {
          end = Math.min(start + size - 1, length);
          loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
          System.out.println(loc);
          start = end + 1;
          Sequence seq = sa.fetch(loc);
          org.biojava.bio.seq.Sequence bioseq = DNATools.createDNASequence(seq.getString(), loc.toString());
          SeqIOTools.writeFasta(os, bioseq);
        }
      }
    }
    catch (ConfigurationException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    catch (AdaptorException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    catch (FileNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    catch (IllegalSymbolException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }

  private static Properties createDriverProperties (String[] args) {
    Properties props = new Properties ();
    props.setProperty("host", args[0]);
    props.setProperty("user", args[1]);
    props.setProperty("database", args[2]);
    
    return props;
  }

}

java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE

Since the chunk size is constant, the memory required should stay constant too. With large chunks (1000000), however, allocated memory keeps growing!
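
To put numbers on the growth, heap use can be logged after each chunk with the standard Runtime calls. The helper below is a hypothetical addition (HeapSketch is not part of BioJava or the Ensembl API), and the figures it reports are approximate because they depend on when the garbage collector last ran.

```java
/** Hypothetical helper for logging heap use between chunks; not part of BioJava or Ensembl. */
class HeapSketch {
    /** Approximate number of bytes currently in use on the Java heap. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats a byte count as whole megabytes (1 MB = 1024 * 1024 bytes). */
    static String asMegabytes(long bytes) {
        return (bytes / (1024L * 1024L)) + " MB";
    }
}
```

Printing HeapSketch.asMegabytes(HeapSketch.usedHeapBytes()) right after each SeqIOTools.writeFasta call, ideally following a System.gc() hint, should show whether the retained memory really climbs from chunk to chunk.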

hope that helps, dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst@epigenomics.com
From ml-it-biojava-dev at epigenomics.com  Wed Oct 19 12:28:37 2005
From: ml-it-biojava-dev at epigenomics.com (ml-it-biojava-dev@epigenomics.com)
Date: Wed Oct 19 12:27:46 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
In-Reply-To: <dj5nj7$75u$1@perl.epigenomics.epi>
References: <dj50s2$dus$1@perl.epigenomics.epi>
	<7A7E9CB2-D412-4A12-957B-401F08A7BD8A@sanger.ac.uk>
	<dj5nj7$75u$1@perl.epigenomics.epi>
Message-ID: <dj5s7l$cam$1@perl.epigenomics.epi>

Dirk Habighorst wrote:
> Thomas Down wrote:
> 
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev@epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large  
>>> sequences. The subStr method of SymbolList seems to introduce a  
>>> memory leak (I did not track that in detail!). Anyway I would  
>>> suggest to change FastaFormat:
>>>     public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>               int length = seq.length();
>>>               for (int pos = 1; pos <= length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth - 1, length);
>>>            os.println(seq.subStr(pos, end));
>>>        }
>>>    }
>>>
>>> to
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>               int length = seq.length();
>>>        String seqString = seq.seqString();
>>>        for (int pos = 0; pos < length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth, length);
>>>            String sub = seqString.substring(pos, end);
>>>            os.println(sub);
>>>        }
>>>    }
>>>
>>> since it is String manipulation that takes place in the loop, I  
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some  
>> really huge strings.  Suppose I've got a Sequence object representing  
>> human chromosome 1 (somewhere around 220Mb).  If this is a database- 
>> backed object with chunks of sequence lazy-loaded on demand (biojava- 
>> ensembl does this, for example) then there'll be no problem working  
>> with it even on a fairly modest PC.  But converting the whole thing  
>> to a String is going to use at least 440Mb of RAM, and could easily  
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than  
>> one line at a time -- but I think we should be cautious about  
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be?  I'd be  
>> interested to track it down.  What sort of sequences were you using?
>>
>>              Thomas
>>
> Hi thomas,
> 
> I experienced performance problems (even OutOfMemoryError) when working 
> with large Sequences (not lazy loaded). You might want to check this 
> little example:
> 
> package test;
> 
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
> 
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
> 
> 
> public class ExportFasta
> {
> 
>  /**
>   * @param args
>   */
>  public static void main (String[] args) {
>    // TODO Auto-generated method stub
>    Properties props = createDriverProperties (args);
>    try {
>      OutputStream os;
>      os = new FileOutputStream (args[3]);
> 
>      CoreDriver coreDriver = DriverManager.loadDriver (props);
>      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>      SequenceRegion[] srs = 
> sra.fetchAllByCoordinateSystem(coordinateSystem);
>           int size = Integer.parseInt(args[5]);
>      for (SequenceRegion seqRegion : srs) {
>        Location loc = null;
>        int length = (int) seqRegion.getLength();
>        int start = 1;
>        int end;
>        while (start < length) {
>          end = start + size - 1 < length ? start + size - 1: length;
>          loc = new Location (coordinateSystem, seqRegion.getName(), 
> start, end, 1);
>          System.out.println(loc);
>          start = end + 1;
>          Sequence seq = sa.fetch(loc);
>          org.biojava.bio.seq.Sequence bioseq = 
> DNATools.createDNASequence(seq.getString(), loc.toString());
>          SeqIOTools.writeFasta(os, bioseq);
>        }
>      }
>    }
>    catch (ConfigurationException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (AdaptorException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (FileNotFoundException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IllegalSymbolException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IOException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>  }
> 
>  private static Properties createDriverProperties (String[] args) {
>    Properties props = new Properties ();
>    props.setProperty("host", args[0]);
>    props.setProperty("user", args[1]);
>    props.setProperty("database", args[2]);
>       return props;
>  }
> 
> }
> 
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE 
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
> 
> since the chunksize is stable the memory required should be stable. With 
> large chunks (1000000) allocated memory keeps growing!
> hope that helps, dirk

Hi thomas,

I did a little debugging myself and found an interesting place to look at! The SimpleSymbolList backing Sequences created with DNATools implements subList like this:

     public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                      "Sublist index out of bounds " + length() + ":" + start + "," + end
                      );
        }

        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start + ", end=" + end
                );
        }

        SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }

So it keeps adding references to SymbolLists to the original Sequence via the addChangeListener method. It appears that garbage collection can't keep up with that if the Sequence is too long. I have not checked this in detail, though.

ciao, dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst@epigenomics.com
From mark.schreiber at novartis.com  Wed Oct 19 21:05:56 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct 19 21:05:04 2005
Subject: [Biojava-dev] Serialization problems,	"-" turns to "n" after
	serializing sequence
Message-ID: <OF3BB6354E.8EED48E4-ON482570A0.0005C1BF-482570A0.000609A5@EU.novartis.net>

Hello -

Found out what was happening. Not a problem with serialization but a 
problem with the createDNASequence method. This method wasn't dealing well 
with gaps. There is another DNATools.createGappedDNASequence() that is 
supposed to do what you want. Ideally you shouldn't use the 
createDNASequence method with gap symbols. I have changed it now so that 
if it detects one it calls the createGapped method. This is in CVS. Your 
test seems to work now.

More generally, I may need to apply this to RNATools and ProteinTools as 
well. I'll have a look.

- Mark


Mark Schreiber/GP/Novartis@PH
Sent by: biojava-dev-bounces@portal.open-bio.org
10/19/2005 11:19 AM

 
        To:     Kalle Näslund <kalle.naslund@genpat.uu.se>
        cc:     biojava-dev@biojava.org, (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-dev] Serialization problems,       "-" turns to "n" after 
serializing sequence


Hello -

What should happen is that a method called readResolve() should be called 
by the JVM on deserialization to replace the gap symbol that was 
deserialized with the gap symbol of the local AlphabetManager.

This prevents you from having a gap that is not == the gap provided by the 
alphabet manager. It seems that somehow it is instead being replaced by 
the ambiguity symbol n.
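
For readers unfamiliar with the mechanism, here is a minimal, self-contained 
sketch of the readResolve() idiom described above. GapToken and 
ReadResolveDemo are hypothetical names made up for illustration; this is not 
BioJava's gap symbol class and it does not reproduce the "-" to "n" bug. It 
only shows how a deserialized copy can be swapped for the local singleton so 
that == comparisons keep working.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a singleton symbol; not a BioJava class.
final class GapToken implements Serializable {
    private static final long serialVersionUID = 1L;

    static final GapToken INSTANCE = new GapToken();

    private GapToken() {}

    // Invoked by the serialization machinery after the object has been read.
    // Returning INSTANCE discards the freshly deserialized copy.
    private Object readResolve() {
        return INSTANCE;
    }
}

final class ReadResolveDemo {
    // Serialize an object to a byte array and read it back.
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Object copy = roundTrip(GapToken.INSTANCE);
        // Identity is preserved because readResolve() substituted the singleton.
        System.out.println(copy == GapToken.INSTANCE);
    }
}
```

Without the readResolve() method, the round trip would yield a second 
GapToken instance and the identity comparison would fail.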

It may take me a while to get around to looking at this. If you find it, 
please let me know. If I forget, please remind me : )

- Mark


Kalle Näslund <kalle.naslund@genpat.uu.se>
Sent by: biojava-dev-bounces@portal.open-bio.org
10/19/2005 02:04 AM

 
        To:     biojava-dev@biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-dev] Serialization problems,   "-" turns 
to "n" after serializing 
sequence


Hi!

I seem to be stuck with a serialization issue, somewhere deep in the 
alphabet stuff. The problem is that "-" turns into "n". This happens 
both with fairly new CVS code and with the 1.4 release code.

The code I am using is the following:

import java.util.*;
import java.io.*;

import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
import org.biojava.utils.*;
import org.biojava.bio.*;

/**
 * Temp class, just to check out some serialization issues im having.
 *
 * @author kalle
 */
public class AlignmentSerializationTest {

    public void run() throws Exception {
        Sequence dnaSeq1 = DNATools.createDNASequence("---ATGC---ATGC---", "seq1" );

        dumpInfoAboutSequence( dnaSeq1 );

        System.out.println("Writing alignment to disk");

        File file = new File("/tmp/ali.obj");
        FileOutputStream fOS = new FileOutputStream( file );
        ObjectOutputStream oOS = new ObjectOutputStream( fOS );

        oOS.writeObject( dnaSeq1 );

        oOS.close();
        fOS.close();

        System.out.println( "Loading alignment from disk" );
        FileInputStream     fIS = new FileInputStream( file );
        ObjectInputStream   oIS = new ObjectInputStream( fIS );

        Sequence  serSeq  = ( Sequence )oIS.readObject();

        dumpInfoAboutSequence( serSeq );
    }

    public static void main( String[] flags ) throws Exception {
        AlignmentSerializationTest myAST = new AlignmentSerializationTest();
        myAST.run();
    }

    private void dumpInfoAboutSequence( Sequence sequence ) throws Exception {
        System.out.println("Name      :" + sequence.getName() );
        System.out.println("Alphabet  :" + sequence.getAlphabet() );
        System.out.println("GapSymbol :" + sequence.getAlphabet().getGapSymbol() );
        System.out.println("Sequence  :" + sequence.seqString() );
        System.out.println("Tokeniz   :" + sequence.getAlphabet().getTokenization( "token" ) );
    }
}


And the output i get is :

Name      :seq1
Alphabet  :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence  :---atgc---atgc---
Tokeniz   :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56

Writing alignment to disk

Loading alignment from disk

Name      :seq1
Alphabet  :org.biojava.bio.symbol.AlphabetManager$ImmutableWellKnownAlphabetWrapper@1bc887b
GapSymbol :org.biojava.bio.symbol.SimpleBasisSymbol: []
Sequence  :nnnatgcnnnatgcnnn
Tokeniz   :org.biojava.bio.symbol.AlphabetManager$WellKnownTokenizationWrapper@120cc56


I have spent some time stepping through the bj code in a debugger, but 
realised that it will most likely take me loads of time. I was hoping 
that some of you who have more experience with the alphabet stuff could 
at least point me in the right direction, if not outright recognize the 
bug =)

kind regards Kalle
_______________________________________________
biojava-dev mailing list
biojava-dev@biojava.org
http://biojava.org/mailman/listinfo/biojava-dev




From mark.schreiber at novartis.com  Wed Oct 19 21:12:35 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Oct 19 21:11:43 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID: <OF0DB6D8E6.B6F636E3-ON482570A0.00062FB2-482570A0.0006A555@EU.novartis.net>

Hi Thomas -

I can confirm this. I ran a profiler a while back after getting a similar 
complaint. It seems that every time you call subList you add a reference 
to the parent SymbolList. For some reason this reference remains even when 
the sub list is garbage collected. Also oddly if you ever do an edit 
operation then all the old references disappear.

The best way to see it happen is to assign lots of memory to the JVM and 
infinitely loop over a sublist operation:


Sequence seq = ...
while(true){
    SymbolList sl = seq.subList(1, 10);
}


You quickly accumulate thousands of references. I could never figure out 
why they don't get released.
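
One mechanism that would produce this pattern is a listener store that keeps 
a small wrapper object per registration and never discards the wrappers once 
their listeners are gone. The sketch below illustrates that idea only; 
LeakyRegistry is a hypothetical class, not BioJava's ChangeSupport, and 
calling clear() on each reference stands in for the garbage collector 
reclaiming a sub-list.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical registry for illustration; not a BioJava class.
// It stores one WeakReference wrapper per listener and never removes any.
final class LeakyRegistry {
    private final List<WeakReference<Object>> slots = new ArrayList<>();

    // Register a listener and return the wrapper that now occupies a slot.
    WeakReference<Object> add(Object listener) {
        WeakReference<Object> ref = new WeakReference<>(listener);
        slots.add(ref);
        return ref;
    }

    // Number of wrappers held, whether or not their listener still exists.
    int slotCount() {
        return slots.size();
    }

    // Number of wrappers whose listener has not been cleared or collected.
    int liveCount() {
        int live = 0;
        for (WeakReference<Object> ref : slots) {
            if (ref.get() != null) {
                live++;
            }
        }
        return live;
    }
}

final class LeakyRegistryDemo {
    public static void main(String[] args) {
        LeakyRegistry registry = new LeakyRegistry();
        for (int i = 0; i < 1000; i++) {
            // clear() simulates the collector reclaiming the listener.
            registry.add(new Object()).clear();
        }
        // Prints "1000 slots, 0 live": the listeners are gone, the wrappers stay.
        System.out.println(registry.slotCount() + " slots, " + registry.liveCount() + " live");
    }
}
```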

- Mark




From mark.schreiber at novartis.com  Thu Oct 20 03:16:39 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Thu Oct 20 03:22:38 2005
Subject: [Biojava-dev] FastaFormat performance enhancement
Message-ID: <OF85FCB8E4.7B3F8EB5-ON482570A0.0027225D-482570A0.0027FA58@EU.novartis.net>

Hello -

Think I may have solved this. I found that the ChangeSupport held 
WeakReferences to the SymbolLists that are created in the subList() 
method. The referenced SymbolLists were becoming weakly reachable and 
getting garbage collected, but the ChangeSupport was not clearing out the 
WeakReference objects that no longer pointed to anything. There was a 
provision for this if someone did something that fired a change event, 
but not otherwise.

I've tweaked ChangeSupport a bit so that when it tries to grow its array 
of WeakReferences it first checks whether it can purge some. This seems to 
stabilize the number of WeakReferences at about 1500 on my machine; each 
typically lasts about 4 GC cycles. I will check this into CVS.
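
The purge-before-grow idea can be sketched roughly as follows. This is an 
illustration under assumptions, not the code checked into CVS: 
PurgingRegistry, its initial capacity of 4, and its method names are all 
hypothetical. The store compacts away cleared WeakReference wrappers when 
its array is full and only doubles the array if nothing could be purged.

```java
import java.lang.ref.WeakReference;
import java.util.Arrays;

// Hypothetical listener store for illustration; not BioJava's ChangeSupport.
final class PurgingRegistry {
    private WeakReference<Object>[] refs = newArray(4);
    private int size = 0;

    @SuppressWarnings("unchecked")
    private static WeakReference<Object>[] newArray(int n) {
        return (WeakReference<Object>[]) new WeakReference[n];
    }

    // Register a listener. When the array is full, try purging dead
    // wrappers first and grow only if that freed no space.
    WeakReference<Object> add(Object listener) {
        if (size == refs.length) {
            purge();
            if (size == refs.length) {
                refs = Arrays.copyOf(refs, refs.length * 2);
            }
        }
        WeakReference<Object> ref = new WeakReference<>(listener);
        refs[size++] = ref;
        return ref;
    }

    // Compact the array in place, dropping wrappers whose referent is gone.
    private void purge() {
        int kept = 0;
        for (int i = 0; i < size; i++) {
            if (refs[i].get() != null) {
                refs[kept++] = refs[i];
            }
        }
        Arrays.fill(refs, kept, size, null);
        size = kept;
    }

    int size() {
        return size;
    }

    int capacity() {
        return refs.length;
    }
}

final class PurgingRegistryDemo {
    public static void main(String[] args) {
        PurgingRegistry registry = new PurgingRegistry();
        for (int i = 0; i < 1000; i++) {
            // clear() simulates the collector reclaiming the listener.
            registry.add(new Object()).clear();
        }
        // The capacity stays at its initial value because every full array
        // could be purged instead of grown.
        System.out.println("capacity " + registry.capacity());
    }
}
```

Purging only at the point of growth keeps the common path cheap, at the cost 
of letting up to one array's worth of dead wrappers linger between purges.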

I'm still a little concerned by the gradual increase of 
java.lang.ref.Finalizer objects. However, these are package private and 
only used by the JVM, so I don't think they have anything to do with what 
biojava is doing (directly); hopefully they will sort themselves out given 
enough time.

- Mark




From mark.schreiber at novartis.com  Fri Oct 21 03:58:05 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Fri Oct 21 03:57:28 2005
Subject: [Biojava-dev] gaps and basis symbols
Message-ID: <OF001C2511.60E7AC7A-ON482570A1.002BBA1E-482570A1.002BC556@EU.novartis.net>

Hello -

There seems to be a slightly strange relationship between gaps and 
AlphabetManager.getGapSymbol(). If I take (for example) the 
SymbolTokenization of DNA and ask it for the Symbol associated with "-" it 
gives me back a BasisSymbol that is composed of a List that contains only 
the GapSymbol from AlphabetManager.

This leads to the slightly weird problem that the Symbol returned != 
AlphabetManager.getGapSymbol() which is what I expected. This also causes 
some curious problems with serialization that may or may not be related. 
Regardless, why does the "-" token not map directly to the GapSymbol in a 
singleton manner, rather than mapping to a BasisSymbol composed of a List 
of only the GapSymbol?

Can any biojava mystics elucidate this?

- Mark

Mark Schreiber
Research Investigator (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910
From mark.schreiber at novartis.com  Fri Oct 21 04:42:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Fri Oct 21 04:42:01 2005
Subject: [Biojava-dev] gaps and basis symbols
Message-ID: <OF7C97E5A2.FE845D02-ON482570A1.002F5927-482570A1.002FDA5C@EU.novartis.net>

Further to this ...

Investigating a bit further, it seems that AlphabetManager.xml denotes an 
<ambiguityMapping> for "-", "." and " ". It denotes a <gapMapping> for 
"~".

I'm not sure if this is an oversight or if this was intentional. Should 
they not all be <gapMapping>s? Are not all gaps created equal? If I edit 
this in my copy of AlphabetManager.xml then everything seems to work and 
the JUnit tests still pass. It seems odd, though: given that this has not 
been spotted before, I suspect it is intentional.

Should I commit these changes to CVS?

- Mark 




From matthew.pocock at ncl.ac.uk  Tue Oct 25 12:07:28 2005
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Tue Oct 25 12:25:59 2005
Subject: [Biojava-dev] hello all
Message-ID: <200510251707.29364.matthew.pocock@ncl.ac.uk>

Hi,

Due to work's (very helpful) network security I wasn't able to access my Yahoo 
account for a while, so my old e-mail address lapsed. Anyhoo - I've now 
re-subscribed to both mailing lists with this address, so should be back in 
the loop now.

I trust nothing too exciting has happened in the year that I've been away :-)

Matthew
From kalle.naslund at genpat.uu.se  Wed Oct 26 05:05:15 2005
From: kalle.naslund at genpat.uu.se (=?ISO-8859-1?Q?Kalle_N=E4slund?=)
Date: Wed Oct 26 05:27:24 2005
Subject: [Biojava-dev] gaps and basis symbols
In-Reply-To: <OF7C97E5A2.FE845D02-ON482570A1.002F5927-482570A1.002FDA5C@EU.novartis.net>
References: <OF7C97E5A2.FE845D02-ON482570A1.002F5927-482570A1.002FDA5C@EU.novartis.net>
Message-ID: <435F46CB.60009@genpat.uu.se>

mark.schreiber@novartis.com wrote:

>Further to this ...
>
>Investigating a bit further it seems that AlphabetManager.xml denotes an 
><ambiguityMapping> for "-"  ,  "." and " ". It denotes a <gapMapping> for 
>"~".
>
>I'm not sure if this is an oversight or if this was intentional. Should 
>they not all be <gapMapping>s?? Are not all gaps created equal? If I edit 
>this in my copy of AlphabetManager.xml then everything seems to work and 
>the JUnit tests still pass. It seems odd though, given that this has not 
>been spotted before I am thinking it is intentional.
>
>Should I commit these changes to CVS???
>
>- Mark 
>
>  
>
Hi!

I really don't have much of a clue myself, but I have been digging around 
in the serialization code, and have come to a similar conclusion as you.

In regard to the "~", I THINK the idea is that there are different gaps 
in biojava. One gap, the "-", is for gaps inside a sequence, while "~" is 
for gaps that do not really exist in the sequence: they are there because 
there is no sequence. Normally this would be in a multiple alignment, 
where any initial and terminal gaps are "~" and any gaps inside the 
actual sequence are "-".

I think this is used somewhere as well, perhaps in the HMM code?

If we are nasty we could always give Matthew a "nice" welcome back 
present =P

Kalle


From matthew.pocock at ncl.ac.uk  Wed Oct 26 07:20:40 2005
From: matthew.pocock at ncl.ac.uk (Matthew Pocock)
Date: Wed Oct 26 07:19:33 2005
Subject: [Biojava-dev] gaps and holes
Message-ID: <200510261220.40776.matthew.pocock@ncl.ac.uk>

Hi,

I understand there's been a thread about the biojava gaps model. For those of 
you who want a not very good explanation, you can track down my PhD thesis on 
the Sanger web site and scan through chapter 3. If my memory serves, there is 
a section in there about the symbol group theory. Anyhoo - here is the 
long-and-short of the problem...

The BioJava symbol model was designed to support algorithms. This led us to 
go a set-theoretic route for modelling the DNA/RNA/Protein/(insert biopolymer 
here). This is visible in two places.

1) ambiguity is modelled using sets

  * The base nucleotides a/g/c/t are interconvertible with the sets {a}, {g}, 
{c} and {t}

  * Ambiguities are interconvertible with the sets that they range over e.g. n 
-> {a,g,c,t}

  * If you take the power set of DNA, you naturally have the bases, N and {} - 
and as if by magic, we also need gaps, so gap -> {} seems like an obvious 
rule
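
A minimal sketch of point 1, using plain Java sets rather than the BioJava
symbol classes. The class and method names (AmbiguitySets, expand, gap) are
made up for illustration, and only a handful of tokens are mapped; r and y
are the usual IUPAC purine and pyrimidine codes.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model: symbols are represented directly by the sets they range over.
class AmbiguitySets {
    /** Builds the set of characters contained in s. */
    static Set<Character> setOf(String s) {
        Set<Character> out = new TreeSet<Character>();
        for (char c : s.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    /** Maps a token to the set of bases it ranges over (illustrative subset only). */
    static Set<Character> expand(char token) {
        switch (token) {
            case 'a': case 'g': case 'c': case 't':
                return setOf(String.valueOf(token)); // a -> {a}, and so on
            case 'r':
                return setOf("ag");                   // purine
            case 'y':
                return setOf("ct");                   // pyrimidine
            case 'n':
                return setOf("agct");                 // n -> {a,g,c,t}
            default:
                throw new IllegalArgumentException("unmapped token: " + token);
        }
    }

    /** The gap is the one element of the power set that contains no base: {}. */
    static Set<Character> gap() {
        return setOf("");
    }

    /** Number of elements in the power set of an alphabet with n symbols. */
    static int powerSetSize(int n) {
        return 1 << n;
    }

    public static void main(String[] args) {
        System.out.println("n   -> " + expand('n'));
        System.out.println("gap -> " + gap());
        System.out.println("|Pow(DNA)| = " + powerSetSize(4));
    }
}
```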

2) columns of an alignment are modelled as elements of cross-products of 
symbol sets

  * an alignment between two DNA sequences is a string from the alphabet DNA x 
DNA, which by convention in BioJava is a string over Pow({a,g,c,t}) ^2

  * for a symbol to be a member of DNA^2, it must be a cross-product symbol of 
dimension 2, where each component s_1 and s_2 are from the DNA alphabet, 
written [s_1, s_2].

OK - so, this all sounds fairly reasonable so far. Now for the more annoying 
bits.

a) what happens if we have an ambiguous symbol in an alignment?

  * Well, let's say we have the symbols s_1 and s_2, and s_1 is ambiguous - 
that is, it maps to a set of symbols that does not have size=1. This can be 
displayed as a column in an alignment just fine. E.g. the column [n,a] can be 
expanded to [{a,g,c,t},{a}] and this in turn can be expanded to {[a,a], 
[g,a], [c,a], [t,a]}.

  * Just for the fun of it, let's take {[a,g], [g,a]} - we can't write this 
down in the form [{i,j,...},{x,y,...}], so it can not be the column of an 
alignment. The symbols that can be a column in an alignment are basis 
symbols. Those that can not are just Symbol instances. Every basis symbol 
could be used as a basis function in a probability distribution. Every 
single-dimensional symbol is a basis symbol, but we tend to be even more 
specific and call these atomic symbols.
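
The distinction in (a) can be checked mechanically. The sketch below is a
toy model, not the BioJava implementation: tuples are written as strings
("ga" for [g,a]) and the names expand, project and isBasis are made up. A
set of tuples counts as a basis symbol here exactly when it equals the
cross product of its per-position projections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of alignment columns as cross products of symbol sets.
class BasisSymbolSketch {
    /** Builds the set of characters contained in s. */
    static Set<Character> setOf(String s) {
        Set<Character> out = new TreeSet<Character>();
        for (char c : s.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    /** Expands a column such as [{a,g,c,t},{a}] into its tuples "aa", "ca", ... */
    static Set<String> expand(List<Set<Character>> column) {
        Set<String> tuples = new TreeSet<String>();
        tuples.add("");
        for (Set<Character> component : column) {
            Set<String> next = new TreeSet<String>();
            for (String prefix : tuples) {
                for (char c : component) {
                    next.add(prefix + c);
                }
            }
            tuples = next;
        }
        return tuples;
    }

    /** Projects a set of equal-length tuples onto each position. */
    static List<Set<Character>> project(Set<String> tuples, int dimension) {
        List<Set<Character>> column = new ArrayList<Set<Character>>();
        for (int i = 0; i < dimension; i++) {
            column.add(new TreeSet<Character>());
        }
        for (String t : tuples) {
            for (int i = 0; i < dimension; i++) {
                column.get(i).add(t.charAt(i));
            }
        }
        return column;
    }

    /** True if the tuples can be written in the form [{i,j,...},{x,y,...}]. */
    static boolean isBasis(Set<String> tuples, int dimension) {
        return expand(project(tuples, dimension)).equals(tuples);
    }

    public static void main(String[] args) {
        List<Set<Character>> column = new ArrayList<Set<Character>>();
        column.add(setOf("agct")); // n
        column.add(setOf("a"));    // a
        System.out.println(expand(column));

        Set<String> odd = new TreeSet<String>();
        odd.add("ag");
        odd.add("ga");
        System.out.println(isBasis(odd, 2));
    }
}
```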

b) Specifically, what happens if we have gaps in an alignment?

  * Let's have ~ for {} - you'll see why in a moment...

  * Let's have DNA^2

  * We could write [~, a] to represent a column in an alignment where there 
was a gap in the 1st sequence and a in the second. Similarly, [~,~] could 
represent a gap in both.

  * It follows by analogy that we would use [~] for the one-dimensional case. 
If we push this 1-dimensional case notation back up to the 2d one, we get the 
unwieldy result [[~],[~]]

  * Let's clean up the notation by keeping ~ -> {} and adding - -> [~]. Now we 
can write [-,-] for the 2d case, - for the 1d case and ~ for the 0d case.
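
The nesting in (b) can be written down as a small structural sketch. Sym is
a hypothetical type invented for this illustration, not a BioJava class: a
symbol is a name plus the list of component symbols it is built from, and
its dimension is the length of that list. Under this reading, "-" is a
wrapper around "~" and therefore a different object from "~" itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the 0d / 1d / 2d gap notation.
class GapStructureSketch {
    /** Hypothetical symbol type: a name plus its component symbols. */
    static final class Sym {
        final String name;
        final List<Sym> components;

        Sym(String name, Sym... components) {
            this.name = name;
            this.components = Collections.unmodifiableList(Arrays.asList(components));
        }

        /** Number of components: 0 for ~, 1 for -, 2 for [-,-]. */
        int dimension() {
            return components.size();
        }
    }

    static final Sym TILDE = new Sym("~");                     // ~ -> {}
    static final Sym DASH = new Sym("-", TILDE);               // - -> [~]
    static final Sym DASH_DASH = new Sym("[-,-]", DASH, DASH); // gap in both sequences

    public static void main(String[] args) {
        for (Sym s : new Sym[] { TILDE, DASH, DASH_DASH }) {
            System.out.println(s.name + " has dimension " + s.dimension());
        }
        System.out.println("- is the same object as ~: " + (DASH == TILDE));
    }
}
```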

c) Wait a minute - the 0d case???

  * Consider the empty alphabet. It is defined by the symbol set {}, so it 
contains ~ only. Now let's say there's a finite-state machine that is 
generating symbol lists. If the machine can never reach part of the symbol 
list to generate it, then the symbol there is ~

  * So - if you have a DNA sequence generated by a FSM, then the portion of 
the tape before and after the generated symbol list is populated by ~, which 
is kind of nice notationally  because this is what multi-fasta uses to pad 
out before & after sequences in alignments

d) Back to alignments, from a FSM-centric point of view

  * Now we can use - to represent the case when the FSM advanced through a 
state silently. That is, the FSM moved on one state, but the emitted symbol 
list did not. If we choose to capture this 'emission slippage', then we need 
to notate that nothing was emitted but that it took up one symbol's space in 
the symbol list because one state was advanced. Hence, [~] is a reasonable 
choice here.

  * In pair-wise alignment, we can now use [-,x] and [x,-] to represent the 
case where the FSM emitted nothing on the first or second tape, respectively. 
[-,-] would represent the case where the FSM emitted on neither tape but 
still advanced a state. ~ can still be used for the case when the FSM could 
never generate a symbol, for example, outside the alignment matrix.
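
As a small illustration of the ~ versus - convention on a single alignment
row: the sketch below assumes a row in which every gap has been written as
'-', and rewrites the leading and trailing runs as '~' while leaving
internal gaps alone. The class and method names are made up, and this is
plain string handling, not the BioJava alignment API.

```java
// Toy illustration: terminal padding becomes '~', internal gaps stay '-'.
class TerminalGapSketch {
    /** Rewrites leading and trailing '-' runs of an alignment row as '~'. */
    static String markTerminalGaps(String row) {
        // First position that holds a real symbol.
        int start = 0;
        while (start < row.length() && row.charAt(start) == '-') {
            start++;
        }
        // One past the last position that holds a real symbol.
        int end = row.length();
        while (end > start && row.charAt(end - 1) == '-') {
            end--;
        }
        StringBuilder out = new StringBuilder(row.length());
        for (int i = 0; i < row.length(); i++) {
            boolean outside = (i < start) || (i >= end);
            out.append(outside ? '~' : row.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(markTerminalGaps("--ac-gt---"));
    }
}
```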


I hope this has made part of the rationale for structured gaps more clear. I 
agree that it is a bit strange, but if you want a consistent structure for 
representing symbols as sets and for representing alignments, it pretty much 
drops out as The One True Way. We can split hairs about exactly when ~ and - 
get used, but they are different things, and if you confuse the two then 
inside things like DP recursions, Very Bad Things happen which require 
boundary conditions and nasty hacks to correct. Perhaps we need to use a 
GapSymbol interface or have isGap on Symbol or something to make life easier. 
It's a pity the Java type system plays so badly with sets. Pity ML isn't 
generally accepted as being a usable language :-(

Matthew
From atariml at gmail.com  Wed Oct 26 10:29:13 2005
From: atariml at gmail.com (Andrea Franceschini)
Date: Wed Oct 26 16:14:08 2005
Subject: [Biojava-dev] Automated upstream region sequence retrieval
Message-ID: <001501c5da39$a5205dc0$0801a8c0@atarippc>

Hi
We are looking to build a simple utility in Java to retrieve DNA sequences starting from a list of Entrez geneIds 
(for example, a user will be able to extract all the 2k upstream sequences for a list of geneIds).

If none of you has already done something like this, we will be happy to do it,
and if you're interested we could integrate our code into BioJava, following your guidelines.

Thank you very much
Andrea Franceschini
University Politecnico of Milan (Italy)

From mmccormi at fhcrc.org  Fri Oct 28 12:48:37 2005
From: mmccormi at fhcrc.org (Michael McCormick)
Date: Fri Oct 28 12:47:27 2005
Subject: [Biojava-dev] Potential Enhancements, Defect
Message-ID: <BE7583A7-544C-43BF-9B1B-D78FF9AE9FCF@fhcrc.org>

Greetings,

Ruihan Wang and I are developing an application that uses biojava in  
a J2EE environment. We have made a few changes and would like to add  
them to the biojava code. All of the changes except for one class  
involve serialization issues. Here is a brief summary.

Please let me know if you are interested in adding these changes and  
how they should be submitted.

Thanks.
Mike

Michael McCormick
Systems Analyst
Fred Hutchinson Cancer Research Center


/org/biojava/bio/search/SeqSimilaritySearchHit should be Serializable
/org/biojava/bio/search/SeqSimilaritySearchResult should be Serializable
/org/biojava/bio/search/SeqSimilaritySearchSubHit should be Serializable
/org/biojava/bio/seq/FeatureHolder should be Serializable
/org/biojava/bio/seq/db/SequenceDB should be Serializable
/org/biojava/bio/symbol/Symbol should be Serializable
/org/biojava/bio/symbol/SymbolList should be Serializable

/org/biojava/bio/symbol/SimpleAtomicSymbol and
/org/biojava/bio/symbol/SimpleBasisSymbol do not serialize correctly,  
however the mailing list provided a work around by commenting out the  
defective code.

org/biojava/bio/program/abi/ABIFChromatogram.java has a few issues.
1. Should be Serializable.
2. We ran into file handle exhaustion exceptions because the file  
was never closed. This still needs further refactoring,  
since the new close() call does not occur within a finally block.
3. Modify class to use readFully(). In our environment, this change  
allowed us to parse chromats at least 10 times faster.
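
A minimal sketch of items 2 and 3 together, independent of the actual
ABIFChromatogram code: one bulk readFully() call followed by in-memory
decoding of big-endian unsigned 16-bit values, with the close() placed in a
finally block. The class and method names are made up for illustration, and
the offset and count parameters are assumptions about how a caller would
locate the data.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only: bulk read plus decode, with the file closed in finally.
class BulkReadSketch {
    /** Decodes big-endian unsigned 16-bit values from a raw byte array. */
    static int[] decodeUnsignedShorts(byte[] raw) {
        int[] out = new int[raw.length / 2];
        for (int i = 0; i < out.length; i++) {
            // High byte first; masking keeps each byte in the range 0..255.
            out[i] = ((raw[2 * i] & 0xff) << 8) | (raw[2 * i + 1] & 0xff);
        }
        return out;
    }

    /** Reads count unsigned shorts starting at offset, closing the file in all cases. */
    static int[] readUnsignedShorts(File file, long offset, int count) throws IOException {
        RandomAccessFile in = new RandomAccessFile(file, "r");
        try {
            in.seek(offset);
            byte[] raw = new byte[2 * count];
            in.readFully(raw); // one bulk read instead of count readShort() calls
            return decodeUnsignedShorts(raw);
        } finally {
            in.close(); // runs whether or not the read succeeded
        }
    }
}
```

The decoding loop is intended to give the same values as
readShort() & 0xffff, since DataInput.readShort() is also big-endian.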

diff for org/biojava/bio/program/abi/ABIFChromatogram.java
27,28d26
< import java.io.RandomAccessFile;
< import java.io.Serializable;
57c55
< public class ABIFChromatogram extends AbstractChromatogram implements Serializable {
---
> public class ABIFChromatogram extends AbstractChromatogram {
141d138
<
151a149
>
153d150
<             ((RandomAccessFile)getDataAccess()).close();
164a162
>
166,171c164,166
<                 byte[] shortArray = new byte[2 * count];
<                 getDataAccess().readFully(shortArray);
<                 int i = 0;
<                 for (int s = 0; s < shortArray.length; s += 2) {
<                     trace[i] =  ((short)((shortArray[s] << 8) | (shortArray[s + 1] & 0xff))) & 0xffff;
<                     max = Math.max(trace[i++], max);
---
>                 for (int i = 0 ; i < count ; i++) {
>                     trace[i] = getDataAccess().readShort() & 0xffff;
>                     max = Math.max(trace[i], max);
175,178c170,171
<                 byte[] byteArray = new byte[count];
<                 getDataAccess().readFully(byteArray);
<                 for (int i = 0; i < byteArray.length; i++) {
<                     trace[i] = byteArray[i] & 0xff;
---
>                 for (int i = 0 ; i < count ; i++) {
>                     trace[i] = getDataAccess().readByte() & 0xff;
185c178
<
---
>
212,216c205,206
<                 byte[] shortArray = new byte[2 * count];
<                 getDataAccess().readFully(shortArray);
<                 IntegerAlphabet integerAlphabet = IntegerAlphabet.getInstance();
<                 for (int s = 0; s < shortArray.length; s += 2) {
<                     offsets.add(integerAlphabet.getSymbol(((short)((shortArray[s] << 8) | (shortArray[s + 1] & 0xff))) & 0xffff));
---
>                 for (int i = 0 ; i < offsetsPtr.numberOfElements ; i++) {
>                     offsets.add(IntegerAlphabet.getInstance().getSymbol(getDataAccess().readShort() & 0xffff));
220,224c210,211
<                 byte[] byteArray = new byte[count];
<                 getDataAccess().readFully(byteArray);
<                 IntegerAlphabet integerAlphabet = IntegerAlphabet.getInstance();
<                 for (int i = 0 ; i < byteArray.length; i++) {
<                     offsets.add(integerAlphabet.getSymbol(byteArray[i] & 0xff));
---
>                 for (int i = 0 ; i < offsetsPtr.numberOfElements ; i++) {
>                     offsets.add(IntegerAlphabet.getInstance().getSymbol(getDataAccess().readByte() & 0xff));
234,237c221,224
<                 byte[] byteArray = new byte[(int) basesPtr.numberOfElements];
<                 getDataAccess().readFully(byteArray);
<                 for (int i = 0; i < byteArray.length; i++) {
<                     dna.add(ABIFParser.decodeDNAToken((char) byteArray[i]));
---
>                 char token;
>                 for (int i = 0 ; i < basesPtr.numberOfElements ; i++) {
>                     token = (char) getDataAccess().readByte();
>                     dna.add(ABIFParser.decodeDNAToken(token));