[Biojava-dev] Plans for next biojava release - modularization
mark.schreiber at novartis.com
Wed May 13 02:15:27 UTC 2009
Hi -
I think it depends if the code is going to be auto-generated at each build
or only once. I have autogenerated Entity classes for BioSQL tables. My
recommendation would be that these be used for JPA mapping to BioSQL from
BioJava. I think these only need be generated once (unless the BioSQL
schema changes), especially as the autogeneration didn't quite catch some
of the subtleties of the schema. They can also be in their own module,
not the core.
Classes that map to XML or webservice clients can be autogenerated from
XML schema, DTD or WSDL once or at every build (automatically from ANT and
probably Maven). In these cases it may pay to do it with every build
because these classes are completely boilerplate code and should never
need to be manually modified. Also it means the code for these utility
classes will never be in the code base and it will not be possible for
someone to change it accidentally (and the code base will be smaller).
Only the XSD or WSDL will be in subversion (and any higher level code that
makes use of the boilerplate client code). Improvements in the
boilerplate code or changes that come with updates to JAXB and similar
will automatically appear at the next build (when we change JAXB
versions).
Conceptually the BLAST XML parsing module may consist of only the BLAST
XSD (or DTD) and a high-level biojava class like the following:
    import java.io.Serializable;
    import java.net.URL;

    public interface BlastParser {
        // Implementations call the generated boilerplate code.
        Serializable[] parseBlast(URL url);

        // Same, for BLAST XML output that is already held in a String.
        Serializable[] parseBlast(String blastXMLOutput);
    }
The code for the bit that does the JAXB marshalling etc. could be generated
at build time. The Serializable array would be the objects that JAXB
generates. Probably they would be a more specific stub that implements
Serializable (e.g. BlastResult or similar, depending on the XSD).
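To make the shape of such a module concrete, here is a rough sketch of the
String variant. It is only an illustration: the hand-written DOM code stands
in for whatever JAXB would generate, BlastXmlSketch and BlastHit are
hypothetical names, and only the Hit_id, Hit_def and Hit_len elements of the
BLAST XML are read. Real BLAST XML may also carry a DOCTYPE, which this
sketch does not handle specially.

```java
import java.io.ByteArrayInputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class BlastXmlSketch {
    private BlastXmlSketch() {}

    /** Hypothetical result stub; generated classes would mirror the XSD instead. */
    static final class BlastHit implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final String definition;
        final int length;

        BlastHit(String id, String definition, int length) {
            this.id = id;
            this.definition = definition;
            this.length = length;
        }
    }

    /** Stand-in for generated unmarshalling code: collects every Hit element. */
    static Serializable[] parseBlast(String blastXMLOutput) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            blastXMLOutput.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = doc.getElementsByTagName("Hit");
            List<BlastHit> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element hit = (Element) hits.item(i);
                out.add(new BlastHit(text(hit, "Hit_id"), text(hit, "Hit_def"),
                        Integer.parseInt(text(hit, "Hit_len"))));
            }
            return out.toArray(new Serializable[0]);
        } catch (Exception e) {
            // Malformed XML or a missing child element ends up here.
            throw new IllegalArgumentException("Could not parse BLAST XML", e);
        }
    }

    /** Text content of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

With generated JAXB classes the body of parseBlast would shrink to a single
unmarshalling call, which is the argument for regenerating it at build time.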
I think it really comes down to a question of how much the generated code
is boilerplate code that will never be changed. If it is not 'modifiable'
then it can be generated at build. If the autogenerated code is an outline
of a class where method bodies need to be filled in or customized then
they should not be autogenerated at build time. A good example would be
JUnit classes that can be autogenerated to give you a template that will
compile and run but probably will not perform a sensible test. The
developer of the test could autogenerate the template but would then need
to make the test sensible. At that point the test should be in the code
base and should not be regenerated at build time.
- Mark
biojava-dev-bounces at lists.open-bio.org wrote on 05/13/2009 08:45:54 AM:
> The point about auto-generated code actually raises another
> question for me: how shall we deal with auto-generated code?
>
> I also have some code that is currently not part of BioJava but
> might be useful for other people: it parses UniProt XML files
> and serializes / de-serializes the objects to a database using EJBs,
> Hibernate and the UniProt XML files.
>
> How far should biojava go in supporting such auto generated or
> semi-auto generated code?
> A
>
>
> On Tue, May 12, 2009 at 5:09 PM, <mark.schreiber at novartis.com> wrote:
> >
> > A while back I gave Richard some code that uses JAXB to objectify (and
> > deobjectify) BLAST XML output. This might be useful for parsing BLAST
> > results from the webservices which normally use BLAST XML. I could
> > probably dig it up again if needed (it was autogenerated anyway).
> >
> > It would probably be a good object model for BLAST output if people
> > want to parse other types of BLAST output (such as flatfile, but who
> > would want to do that!). The BLAST XML seems to accommodate strange
> > flavours of BLAST such as PSI-BLAST etc and also has been much more
> > stable than the default flat file output.
> >
> > - Mark
> >
> >
> >
> > Andreas Prlic <andreas at sdsc.edu>
> > Sent by: biojava-dev-bounces at lists.open-bio.org
> >
> > 05/13/2009 08:02 AM
> >
> > To: Scooter Willis <HWillis at scripps.edu>
> > cc: biojava-dev <biojava-dev at lists.open-bio.org>
> > Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
> >
> >
> >
> >
> > Hi Scooter,
> >
> > about your suggestion for the blast webservice client code: In
> > principle I like the idea and we have had questions on the mailing
> > list regarding this in the past. Only thing is I think there is
> > already some client code in java available:
> > http://www.ebi.ac.uk/Tools/webservices/clients/blastpgp
> > but I am not sure how good that Java client library is....
> >
> > Besides this, there is the need for work on our blast parser library
> > and if you are interested in working on that you are welcome. As I
> > mentioned, I think this should become its own module, due to the
> > popularity of that code.
> >
> > Andreas
> >
> >
> >
> >
> > On Tue, May 12, 2009 at 6:34 AM, Scooter Willis <HWillis at scripps.edu> wrote:
> >> Mark
> >>
> >>
> >>
> >> It is a challenge to know where to draw the line. Allowing both
> >> options is a reasonable approach. The implementation of the algorithm
> >> is key to allowing it to be multi-threaded or to run in parallel. One
> >> approach is to provide a standard interface in which process() waits
> >> for the result/return value and runs in the parent thread. To run the
> >> algorithm in a thread you can have a startProcess() where you can add
> >> yourself as a progress listener, and when the complete() method is
> >> called you can call getResults(). You can then also have the
> >> corresponding stopProcess(), which would set an internal value to
> >> cause all threads to quit. There are lots of ways to tackle the
> >> problem; the key is to start talking about it and at minimum take
> >> advantage of multiple cores, where the external code can set the
> >> number of cores to use. You can get a dual quad-core machine these
> >> days for < $1000 but most software implementations are not designed
> >> to take advantage of it.
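
A minimal sketch of the process() / startProcess() / stopProcess() idea
described above. The method names come from the paragraph; ProgressListener,
LongRunningTask and the summing loop are hypothetical placeholders, not
existing BioJava API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical callback interface for progress and completion events. */
interface ProgressListener {
    void progress(int done, int total);
    void complete();
}

/** A task that can run in the caller's thread or in a background thread. */
class LongRunningTask {
    private final List<ProgressListener> listeners = new CopyOnWriteArrayList<>();
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final int steps;
    private volatile int result;

    LongRunningTask(int steps) {
        this.steps = steps;
    }

    void addProgressListener(ProgressListener listener) {
        listeners.add(listener);
    }

    /** Blocking variant: runs in the calling thread and returns the result. */
    int process() {
        int sum = 0;
        for (int i = 1; i <= steps && !stopRequested.get(); i++) {
            sum += i; // placeholder for one unit of real work
            for (ProgressListener l : listeners) {
                l.progress(i, steps);
            }
        }
        result = sum; // published before complete() is signalled
        for (ProgressListener l : listeners) {
            l.complete();
        }
        return sum;
    }

    /** Non-blocking variant: same work, run in a new thread. */
    void startProcess() {
        new Thread(() -> process()).start();
    }

    /** Asks the loop to stop at the next iteration. */
    void stopProcess() {
        stopRequested.set(true);
    }

    /** Meaningful once complete() has been called on the listeners. */
    int getResults() {
        return result;
    }
}
```

A caller that wants the simple behaviour uses process(); a GUI would
register a listener, call startProcess() and read getResults() from the
complete() callback.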
> >>
> >>
> >>
> >> The real question is what exists today in the BioJava API that is
> >> considered long running in a normal use case and thus is a candidate
> >> to be run in parallel. It may not be an issue in existing BioJava
> >> code. When I first started using BioJava I went looking for BLAST
> >> code only to find a BLAST parser. I wanted to do a Multiple Sequence
> >> Alignment and it turns out that the BioJava code calls CLUSTALW as an
> >> external processor under the covers. I also needed code to construct
> >> trees from an MSA and found the Summer of Code project that was only
> >> focused on representing the tree.
> >>
> >>
> >>
> >> It would be nice to have a BLAST implementation in Java optimized to
> >> run on a cluster, but who has time to rewrite BLAST in Java when you
> >> can do a BLAST search via the web and focus on parsing the results?
> >> BioJava needs a BLAST API that makes a web services call to an
> >> external service and returns structured results in core BioJava
> >> structures. It is probably not difficult to do a Java version of
> >> CLUSTALW but again we can push the work out to
> >> http://www.ebi.ac.uk/Tools/webservices/services/clustalw and get the
> >> results back in BioJava structures.
> >>
> >>
> >>
> >> I can sign up for doing a BLAST web service -> BioJava and a CLUSTALW
> >> web service -> BioJava code. I haven't done the research but it seems
> >> that http://www.ebi.ac.uk/Tools/webservices/ has done a fair amount of
> >> work to expose common biology computational services. If multiple
> >> external services are offering BLAST via web services where each
> >> picked a different implementation then BioJava could provide an
> >> abstraction over the different services.
> >>
> >>
> >>
> >> Thanks
> >>
> >> Scooter
> >>
> >>
> >>
> >> From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
> >> Sent: Tuesday, May 12, 2009 1:27 AM
> >> To: Scooter Willis
> >> Cc: Andreas Prlic; biojava-dev
> >> Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
> >>
> >>
> >>
> >> Hi -
> >>
> >> This was one thing we discussed previously with respect to biojava 3.
> >> Generally I support the idea because almost all computers are now
> >> multi-core and as you say cloud or utility computing is already a
> >> reality.
> >>
> >> However, I tend to think that biojava should not control threading or
> >> concurrency. This should be done by the developer. This is because
> >> sometimes multithreading can be fast on a slow computer but slow on a
> >> fast computer (due to the overhead in spawning threads) so programs
> >> need to be tunable. Also Java app servers and things like Sun Grid
> >> Engine, EC2 etc don't like people attempting to control their own
> >> threads. What BioJava should do is expose granular and thread-safe
> >> operations that can be threaded or form discrete tasks on a utility
> >> grid or complete in SessionBeans on an App server. For example it
> >> would be better if BioJava had a single threaded method to calculate
> >> the GC of a single sequence rather than a multi-threaded method that
> >> calculates the GC of multiple sequences. This would let the developer
> >> make a multithreaded version if desired or distribute multiple tasks
> >> based on the single threaded version to a compute cloud (and let the
> >> cloud manage all the tasks).
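
The GC example above might look roughly like this. It is a sketch under
stated assumptions: sequences are plain Strings rather than BioJava
sequence objects, and GcContent, gc and gcAll are hypothetical names. The
library part is the stateless gc method; gcAll shows what a developer could
layer on top, with the thread count chosen by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class GcContent {
    private GcContent() {}

    /** Single-threaded and stateless, so safe to call from any thread. */
    static double gc(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }

    /** Caller-side parallelism: one task per sequence, results in input order. */
    static List<Double> gcAll(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String s : sequences) {
                Callable<Double> task = () -> gc(s);
                futures.add(pool.submit(task));
            }
            List<Double> out = new ArrayList<>();
            for (Future<Double> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("GC calculation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

On a grid or app server the gcAll part would simply be dropped and each
gc call submitted as its own task.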
> >>
> >> Possibly the best situation would be to have the single threaded fine
> >> grain operations that let developers or grid engines control
> >> threading and then higher level APIs that do it for you (or good
> >> cookbook examples that show you how to do it). Another idea that was
> >> discussed was the use of properties files to allow people to set how
> >> many CPUs they wanted to make available to the JVM or name packages
> >> that can or cannot use threading.
> >>
> >> Finally, there are lots of times when it is highly desirable to use
> >> Java beans because they play well with dozens of Java APIs; however,
> >> beans don't work well with threads because they have public setter
> >> methods. I would like to see a lot more bean use in a future BioJava
> >> because it would make life so much easier but a lot of care would
> >> need to be taken to make sure thread safety is preserved. There are
> >> many patterns that can be used, such as synchronization locks etc, to
> >> make things thread safe so I think this can be achieved as long as we
> >> are disciplined and consider that all methods may be used in a
> >> multi-threaded application (even if we write the method as a single
> >> thread). If there are code checkers that make suggestions on thread
> >> safety it would be great to have these as part of the standard build
> >> process. Good documentation would go a long way as well. Are there
> >> unit test patterns that can catch these problems as well? Suggestions
> >> would be great.
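
One of the simpler patterns alluded to above is to guard every accessor of
a bean with the same lock. A minimal sketch, with SequenceBean and its two
properties as hypothetical examples:

```java
/** A bean whose getters and setters all synchronize on the bean itself. */
final class SequenceBean {
    private String name = "";
    private String residues = "";

    public synchronized String getName() {
        return name;
    }

    public synchronized void setName(String name) {
        this.name = name;
    }

    public synchronized String getResidues() {
        return residues;
    }

    public synchronized void setResidues(String residues) {
        this.residues = residues;
    }

    /** Updates both properties under one lock so readers never see a mix. */
    public synchronized void setAll(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }

    /** Reads both properties under the same lock for a consistent snapshot. */
    public synchronized String describe() {
        return name + ":" + residues.length();
    }
}
```

Synchronizing individual setters only protects each property on its own;
anything that must change or be read together needs a combined method such
as setAll or describe, which is where the discipline comes in.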
> >>
> >> Progress Listener patterns are good but it depends on the situation
> >> and might be better handled in high level APIs or left to the
> >> developer. For example in your NJ code a progress listener would be
> >> good if someone fed 1000 sequences into the method but not if they
> >> only put in 10. Also code running on an old machine might need a
> >> progress listener but the same problem on a new machine may complete
> >> almost instantly. Probably a pluggable listener would be the way to
> >> go. Also it might be possible to do this using the new JDK APIs that
> >> let you take a peek at the stack trace. Even if your NJ method didn't
> >> allow for a progress listener a developer could still make one by
> >> looking at the method calls in the stack. As long as your NJ method
> >> called other methods internally for each sequence (quite likely) it
> >> would be possible to observe the cycle of method calls from the
> >> stack. This might make it possible to have a very general BioJava
> >> progress listener that can be told to count the number of times a
> >> method is called in the stack. The name of the method would be the
> >> argument. If the application runs in a Java App server you can also
> >> do this very easily with a method Interceptor.
> >>
> >> - Mark
> >>
> >> biojava-dev-bounces at lists.open-bio.org wrote on 05/11/2009 09:50:58 PM:
> >>
> >>> Andreas
> >>>
> >>> Another theme that should be considered is providing a multi-thread
> >>> version of any module with long run time. This would have a couple
> >>> elements. A progress listener interface should be standard where
> >>> core code would update progress messages to listeners that can be
> >>> used by external code to display feedback to the user. I did this
> >>> with the Neighbor Joining code for tree construction and it provides
> >>> needed feedback in a GUI. If not the user gets frustrated because
> >>> they don't know the code they are about to execute may take 10
> >>> minutes or 8 hours to complete and they think the software is not
> >>> working. The reverse is also true for canceling an operation where
> >>> you want to have core code stop processing a long running loop. Once
> >>> the code has completed then the listener interface for process
> >>> complete is called allowing the next step in the external code to
> >>> continue. The developer would have the choice to call the "process"
> >>> method or run it in a thread and wait for the callback complete
> >>> method to be called.
> >>>
> >>> This is the first step in the ability to have the core/long running
> >>> processes take advantage of multiple threads to complete the
> >>> computational task faster. Not all code can be parallelized easily
> >>> but if the algorithm can take advantage of running in parallel then
> >>> it should. This then opens up a couple of cloud computing frameworks
> >>> that extend the multi-threaded concepts in Java across a cluster
> >>> http://www.terracotta.org/. If we put an emphasis on having code
> >>> that runs well in a thread we are one step closer to an architecture
> >>> that can run in a cloud. The computational problems are only going
> >>> to get bigger and with Amazon EC2 and http://www.eucalyptus.com/
> >>> approaches computational IO cycles are going to be cheap as long as
> >>> the software/libraries can easily take advantage of it.
> >>>
> >>> Thanks
> >>>
> >>> Scooter
> >>>
> >>> -----Original Message-----
> >>> From: biojava-dev-bounces at lists.open-bio.org
> >>> [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas Prlic
> >>> Sent: Monday, May 11, 2009 12:27 AM
> >>> To: biojava-dev
> >>> Subject: [Biojava-dev] Plans for next biojava release - modularization
> >>>
> >>> Hi biojava-devs,
> >>>
> >>> It is time to start working on the next biojava release. I would
> >>> like to modularize the current code base and apply some of the ideas
> >>> that have emerged around Richard's "biojava 3" code. In principle
> >>> the idea is that all changes should be backwards compatible with the
> >>> interfaces provided by the current biojava 1.7 release. Backwards
> >>> compatibility shall only be broken if the functionality is being
> >>> replaced with something that works better, and gets documented
> >>> accordingly. For the build functionality I would suggest sticking
> >>> with what Richard's biojava 3 code base already provides. Since we
> >>> will try to be backwards compatible all code development should be
> >>> part of the biojava-trunk and the first step will be to move the
> >>> ant-build scripts to a maven build process. Following this procedure
> >>> will allow us to use e.g. the code refactoring tools provided by
> >>> Eclipse, which should come in handy.
> >>>
> >>> The modules I would like to see should provide self-contained
> >>> functionality and cross dependencies should be restricted to a
> >>> minimum. I would suggest to have the following modules:
> >>>
> >>> biojava-core: Contains everything that can not easily be modularized
> >>> or for which nobody volunteers to become a module maintainer.
> >>> biojava-phylogeny: Scooter expressed some interest in providing such
> >>> a module and becoming package maintainer for it.
> >>> biojava-structure: Everything protein structure related. I would be
> >>> package maintainer.
> >>> biojava-blast: Blast parsing is a frequently requested functionality
> >>> and it would be good to have this code self-contained. A package
> >>> maintainer for this still will need to be nominated at a later
> >>> stage.
> >>> Any suggestions for other modules?
> >>>
> >>> Let me know what you think about this.
> >>>
> >>> Andreas
> >>> _______________________________________________
> >>> biojava-dev mailing list
> >>> biojava-dev at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>>
> >>
> >
> >
> >
>