From ap3 at sanger.ac.uk Sat Apr 5 08:53:53 2008
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sat, 5 Apr 2008 13:53:53 +0100
Subject: [Biojava-dev] preparations for release 1.6 - svn freeze
In-Reply-To: <78ECCE6A-F8CC-45AA-862B-F7D8BFC65EA0@sanger.ac.uk>
References: <78ECCE6A-F8CC-45AA-862B-F7D8BFC65EA0@sanger.ac.uk>
Message-ID: <65617D9D-0E0C-476F-A515-1222733DA9C2@sanger.ac.uk>

Hi,

In preparation for the 1.6 release, please do not commit any new features into svn from now until the release. Javadoc improvements are still welcome.

There were 2 patches at the end of last week, regarding the Genetic Algorithms and PDB file header parsing. I suggest giving those a week to make sure they are fine and targeting next weekend for the release.

Andreas

On 26 Mar 2008, at 13:55, Andreas Prlic wrote:

> Hi,
>
> The biojava 1.6 release candidate 1 has been available for a while now and I would like to proceed with releasing the final biojava 1.6.
>
> I ran doccheck on the latest SVN and we still could do with some javadoc improvements:
> http://www.spice-3d.org/doccheck/biojava-svn/biojava/PackageStatistics.html
>
> Please commit any remaining bug fixes to SVN until
>
> Friday, April 4th 18:00 GMT
>
> I will do the release (and SVN branch) after that.
>
> Cheers,
> Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

-----------------------------------------------------------------------

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From ap3 at sanger.ac.uk Wed Apr 9 06:40:58 2008
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Wed, 9 Apr 2008 11:40:58 +0100
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <47FC7E3B.9000106@ebi.ac.uk>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk>
Message-ID: <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk>

Hi,

I like the idea of having support for multiple threads. The only thing is, when running BioJava on our compute farm, I am pretty sure our admins won't be happy if BJ used more than a single CPU, unless run on special hardware.
As such there should be a BJ-wide configuration management, which would allow setting how many CPUs are used (and the default could be all of them).

Andreas

On 9 Apr 2008, at 09:28, Andy Yates wrote:

> Lo,
>
> This is the kind of problem Java7 is attempting to solve with the fork-join framework (which really is a rip-off of Google's MapReduce). There are two ways of looking at thread safety & how to implement it:
>
> * Packages which could be threaded or want to be threaded are programmed with threading in mind using items from the util.concurrent package to split, queue & work with data points.
>
> * Packages can be created as required & have data to process passed to them for processing in a stateless manner; much in the same way servlet engines and a lot of web frameworks run
>
> The first way does mean we can support environments with useful multi-threaded support (no point in threading on a single CPU/core box) from the word go. The second way would require some plumbing on the user's behalf but this would be very easy plumbing; the majority of which we could write (like wrapping things in instances of Callables).
>
> Anyway my 2p worth :)
>
> Andy
>
> Mark Schreiber wrote:
>> Hi -
>>
>> I was just playing with threads to see how efficient they are on one of our old 4 CPU IBM servers. The following fairly naive program splits a large array of numbers and sums them all up. The multi-threaded version is 2.5 times faster even allowing for thread overhead. The program could be even better if I make more use of the java1.5 concurrent package.
>>
>> Similar tasks in biojava would include training distributions, which should see similar performance improvements. Much of the current biojava doesn't make use of threads and worse, requires the developer to manage all the thread safety themselves.
>>
>> - Mark
>>
>> /*
>>  * To change this template, choose Tools | Templates
>>  * and open the template in the editor.
>>  */
>> package concurrent;
>>
>> import java.util.concurrent.atomic.AtomicInteger;
>>
>> /**
>>  * This program demos the use of threads to sum a large array of integers.
>>  * @author Mark Schreiber
>>  */
>> public class ThreadedAdder {
>>     static int processors = Runtime.getRuntime().availableProcessors();
>>     int bigNumber = 10000000;
>>     int[] bigArray = new int[bigNumber * processors];
>>
>>     public ThreadedAdder(){
>>         //make a big array of integers (10 000 000 numbers for each processor)
>>         for(int i = 0; i < bigArray.length; i++){
>>             //random number between 0 and 99
>>             bigArray[i] = (int)(Math.random() * 100.0);
>>         }
>>     }
>>
>>     public void singleThreadedAdd(){
>>         int result = 0;
>>         //single threaded sum
>>         long start = System.currentTimeMillis();
>>         for(int number : bigArray){
>>             result += number;
>>         }
>>         long time = System.currentTimeMillis() - start;
>>         System.out.println("Calculation time = "+time+" ms");
>>         System.out.println("total = "+result);
>>     }
>>
>>     public void multiThreadedAdd() throws InterruptedException{
>>         AtomicInteger total = new AtomicInteger();
>>         long start = System.currentTimeMillis();
>>         AddingThread[] threads = new AddingThread[processors];
>>         for(int i = 0; i < threads.length; i++){
>>             threads[i] = new AddingThread("Thread "+i, i * bigNumber, total);
>>             System.out.println(threads[i].getName()+" starting");
>>             threads[i].start();
>>         }
>>         for(Thread thread : threads){
>>             //make sure everyone is finished
>>             thread.join();
>>         }
>>         long time = System.currentTimeMillis() - start;
>>         System.out.println("Calculation time = "+time+" ms");
>>         System.out.println("total = "+total);
>>     }
>>
>>     /**
>>      * @param args the command line arguments
>>      */
>>     public static void main(String[] args) throws Exception{
>>         //how many processors do I have?
>>         System.out.println("Available processors = "+processors);
>>         System.out.println("Initializing number array");
>>         ThreadedAdder adder = new ThreadedAdder();
>>         System.out.println("single thread add");
>>         adder.singleThreadedAdd();
>>         System.out.println("multi thread add");
>>         adder.multiThreadedAdd();
>>     }
>>
>>     public class AddingThread extends Thread{
>>         int internalTotal = 0;
>>         int offSet = 0;
>>         AtomicInteger callBackTotal;
>>
>>         public AddingThread(String name, int offSet, AtomicInteger callBackTotal){
>>             super(name);
>>             this.offSet = offSet;
>>             this.callBackTotal = callBackTotal;
>>         }
>>
>>         @Override
>>         public void run(){
>>             //each thread sums its own slice, then publishes the subtotal once
>>             for(int i = offSet; i < offSet + bigNumber; i++){
>>                 internalTotal += bigArray[i];
>>             }
>>             callBackTotal.addAndGet(internalTotal);
>>             System.out.println(this.getName()+" complete");
>>         }
>>     }
>> }
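A minimal sketch of how the two suggestions in this thread could fit together, offered as an illustration rather than as anything BioJava provides. The thread count comes from one configuration point and defaults to all available processors, and the work is split into Callables submitted to a java.util.concurrent thread pool instead of hand-managed Thread objects. The system property name `biojava.threads` and the class and method names are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConfigurableAdder {

    // Hypothetical property name; nothing in BioJava defines it.
    static final String THREADS_PROPERTY = "biojava.threads";

    /** Thread count from the system property, defaulting to all available processors. */
    static int configuredThreads() {
        int fallback = Runtime.getRuntime().availableProcessors();
        // Integer.getInteger returns the fallback if the property is unset or not a number.
        return Math.max(1, Integer.getInteger(THREADS_PROPERTY, fallback));
    }

    /** Sums the array by giving each pool thread one contiguous slice. */
    static long parallelSum(final int[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(data.length, t * chunk);
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long sum = 0; // local subtotal, no shared state while summing
                        for (int i = from; i < to; i++) {
                            sum += data[i];
                        }
                        return sum;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that slice is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        int threads = configuredThreads();
        System.out.println("threads = " + threads);
        System.out.println("total = " + parallelSum(data, threads));
    }
}
```

Run with something like `java -Dbiojava.threads=1 ConfigurableAdder` to pin the sketch to a single CPU on a shared farm node, or without the property to use every core. A long accumulator is used here so the total cannot overflow on machines with many cores.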
From ayates at ebi.ac.uk Wed Apr 9 07:03:19 2008
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 09 Apr 2008 12:03:19 +0100
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk>
Message-ID: <47FCA277.2020401@ebi.ac.uk>

Most of the time any kind of farm management software (like LSF, and please correct me if I'm wrong) looks at the amount of CPU time a process takes up and the number of threads it detects, not only the number of processes you have in a queue. So a multi-threaded biojava should not pose a problem to these systems. Not to mention that with the newer multiple core computers, threaded software is becoming the only way to take full advantage of the available power.

Where you would want to ignore multi-threading is if you are in a queue like LSF and your x number of Java processes all get chucked onto the same machine. Then if you've got so many processor hungry operations all trying to create threads ... well it's not going to behave as optimally as you might hope.

Personally though I'd still err on the side of caution WRT multi-threading and not have it as part of the default tools but as an Object I can instantiate to do my multi-threading work (so it's a choice at the user's level rather than the framework level). Then using the Java5 executor framework we let users submit work to pools of threads to do their work. Couple this with forcing us to pass around immutable messages between threads/callables (since values shared by threads are probably the number one cause of **** ups) and you'll have one heck of a kick-ass scalable framework ;-)

Andy

From markjschreiber at gmail.com Wed Apr 9 07:45:16 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 9 Apr 2008 19:45:16 +0800
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <47FCA277.2020401@ebi.ac.uk>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk> <47FCA277.2020401@ebi.ac.uk>
Message-ID: <93b45ca50804090445t5e04c555ue7ce8ff90d852c97@mail.gmail.com>

Andy is right on this, a JVM can use at most the available CPUs on one machine (and sometimes not even that).
Unless there is a very sophisticated farm management system that makes it look like all 100 cores are on the same machine, there is no chance that the JVM can take over more than one machine (unless you start another whole JVM from within your program).

From markjschreiber at gmail.com Wed Apr 9 07:54:06 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 9 Apr 2008 19:54:06 +0800
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <47FCA277.2020401@ebi.ac.uk>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk> <47FCA277.2020401@ebi.ac.uk> Message-ID: <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com> I'm not too sure which option I prefer, multi-threading by default (ie all handled by the packages) or stateless immutable classes and messages that can be multi-threaded. There are arguments for both. The former is recommended in a book I am currently reading on concurrency which was written by the authors of the java 1.5 concurrency package. Essentially the classes can be designed ahead of time to be thread safe and mutability (sometimes a good thing) can be done with this in mind. On the other hand stateless and immutable stuff is often safe enough to put into a thread although _only_ as long as operations are truely atomic. Take for example Servlets and stateless Session Beans. They are pretty thread safe by nescessity (use in app servers) but just because they are stateless doens't mean you can't accedentally right one that gives you stale data or a race condition. In both cases thread safety needs to be designed from the start. Currently BioJava is neither of these things and I imagine things will start getting pretty interesting if you try to multi-thread a biojava program right now. - Mark On Wed, Apr 9, 2008 at 7:03 PM, Andy Yates wrote: > > > Most the time any kind of farm management software (like LSF & please correct me if I'm wrong) looks at the amount of CPU time a process takes up and the number of threads it detects; not only the number of processes you have in a queue. So a multi-threaded biojava should not pose a problem to these systems. Not to mention with the newer multiple core computers; threaded software is becoming the only way to take full advantage of the available power. 
> > Where you would want to ignore multi-threading is if you are in a queue like LSF and your x number of Java processes all get chucked onto the same machine. Then if you've got so many processor hungry operations all trying to create threads ... well it's not going to behave as optimally as you might hope. > > Personally though I'd still air on the side of caution WRT multi-threading and not to have it as part of the default tools but as an Object I can instantiate to do my multi-threading work (so it's a choice at the user's level rather than the framework level). Then using the Java5 executor framework we let users submit work to pools of threads to do their work. Couple this with forcing us to pass around immutable messages between threads/callables (since values shared by threads are probably the number one cause of **** ups) you'll have one heck of a kick-ass scalable framework ;-) > > Andy > > > > > Andreas Prlic wrote: > > > Hi, > > > > I like the idea of having support for multiple threads. Only thing is, when running BioJava on our compute farm, I am pretty sure our admins won't be happy if BJ would use more than just a single CPU, unless run on special hardware. As such there should be a BJ wide configuration management, which would allow to determine how many CPUs to be used (and the default could be all of them). > > > > Andreas > > > > > > On 9 Apr 2008, at 09:28, Andy Yates wrote: > > > > > > > Lo, > > > > > > This is the kind of problem Java7 is attempting to solve with the fork-join framework (which really is a rip-off of Google's MapReduce). There's two ways of looking at thread safety & how to implement it: > > > > > > * Packages which could be threaded or want to be threaded are programmed with threading in mind using items from the util.concurrent package to split, queue & work with data points. 
> > > > > > * Packages can be created as required & have data to process passed to them for processing in a stateless manner; much in the same way servlet engines and a lot of web frameworks run > > > > > > The first way does mean we can support environments with useful multi-threaded support (no point in threading on a single CPU/core box) from the word go. The second way would require some plumbing on the user's behalf but this would be very easy plumbing; the majority of which we could write (like wrapping things in instances of Callables). > > > > > > Anyway my 2p worth :) > > > > > > Andy > > > > > > Mark Schreiber wrote: > > > > > > > Hi - > > > > I was just playing with threads to see how efficient they are on one of our old 4 CPU IBM servers. The following fairly naive program splits a large array of numbers and sums them all up. The multi-threaded version is 2.5 times faster even allowing for thread overhead. The program could be even better if I make more use of the java1.5 concurrent package. > > > > Similar tasks in biojava would be include training distributions which should see similar performance improvements. Much of the current biojava doesn't make use of threads and worse, requires the developer to manage all the thread safety themselves. > > > > - Mark > > > > /* > > > > * To change this template, choose Tools | Templates > > > > * and open the template in the editor. > > > > */ > > > > package concurrent; > > > > import java.util.concurrent.atomic.AtomicInteger; > > > > /** > > > > * This program demo's the use of threads to sum a large array of integers. 
> > > > * @author Mark Schreiber > > > > */ > > > > public class ThreadedAdder { > > > > static int processors = Runtime.getRuntime().availableProcessors(); > > > > int bigNumber = 10000000; > > > > int[] bigArray = new int[bigNumber * processors]; > > > > public ThreadedAdder(){ > > > > //make a big array of integers (10 000 000 numbers for each processor) > > > > for(int i = 0; i < bigArray.length; i++){ > > > > //random number between 1 and 100 > > > > bigArray[i] = (int)(Math.random() * 100.0); > > > > } > > > > } > > > > public void singleThreadedAdd(){ > > > > int result = 0; > > > > //single threaded sum > > > > long start = System.currentTimeMillis(); > > > > for(int number : bigArray){ > > > > result += number; > > > > } > > > > long time = System.currentTimeMillis() - start; > > > > System.out.println("Calculation time = "+time+" ms"); > > > > System.out.println("total = "+result); > > > > } > > > > public void multiThreadedAdd() throws InterruptedException{ > > > > AtomicInteger total = new AtomicInteger(); > > > > long start = System.currentTimeMillis(); > > > > AddingThread[] threads = new AddingThread[processors]; > > > > for(int i = 0; i < threads.length; i++){ > > > > threads[i] = new AddingThread("Thread "+i, i * bigNumber, total); > > > > System.out.println(threads[i].getName()+" starting"); > > > > threads[i].start(); > > > > } > > > > for(Thread thread : threads){ > > > > //make sure everyone is finished > > > > thread.join(); > > > > } > > > > long time = System.currentTimeMillis() - start; > > > > System.out.println("Calculation time = "+time+" ms"); > > > > System.out.println("total = "+total); > > > > } > > > > /** > > > > * @param args the command line arguments > > > > */ > > > > public static void main(String[] args) throws Exception{ > > > > //how many processors do I have? 
> > > > System.out.println("Available processors = "+processors); > > > > System.out.println("Initializing number array"); > > > > ThreadedAdder adder = new ThreadedAdder(); > > > > System.out.println("single thread add"); > > > > adder.singleThreadedAdd(); > > > > System.out.println("multi thread add"); > > > > adder.multiThreadedAdd(); > > > > } > > > > public class AddingThread extends Thread{ > > > > int internalTotal = 0; > > > > int offSet = 0; > > > > AtomicInteger callBackTotal; > > > > public AddingThread(String name, int offSet, AtomicInteger callBackTotal){ > > > > super(name); > > > > this.offSet = offSet; > > > > this.callBackTotal = callBackTotal; > > > > } > > > > @Override > > > > public void run(){ > > > > for(int i = offSet; i < offSet + bigNumber; i++){ > > > > internalTotal += bigArray[i]; > > > > } > > > > callBackTotal.addAndGet(internalTotal); > > > > System.out.println(this.getName()+" complete"); > > > > } > > > > } > > > > } > > > > > > > > > > > ----------------------------------------------------------------------- > > > > Andreas Prlic Wellcome Trust Sanger Institute > > Hinxton, Cambridge CB10 1SA, UK > > +44 (0) 1223 49 6891 > > > > ----------------------------------------------------------------------- > > > > > > > > > > > From markjschreiber at gmail.com Wed Apr 9 09:12:52 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 9 Apr 2008 21:12:52 +0800 Subject: [Biojava-dev] Why BJ3 should be multithreaded In-Reply-To: <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com> References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk> <47FCA277.2020401@ebi.ac.uk> <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com> Message-ID: <93b45ca50804090612x7ba0b3b2jbb8d1e031e030dc4@mail.gmail.com> > > Personally though I'd still air on the side of caution WRT multi-threading and not to have it as part of the default tools but 
as an Object I can instantiate to do my multi-threading work (so it's a choice at the user's level rather than the framework level). Then using the Java5 executor framework we let users submit work to pools of threads to do their work. Couple this with forcing us to pass around immutable messages between threads/callables (since values shared by threads are probably the number one cause of **** ups) you'll have one heck of a kick-ass scalable framework ;-)
> >
> > Andy

One area where you could get an interesting mixture of stateless and synchronized access to a mutable object would be threaded parsing of large sequence files. In my experience the BioJava parsers are not normally I/O bound, due to all the object building they do. Given this, a file reader could, for example, read a feature block and hand it off to a threaded stateless feature handler, which produces a Feature object and then adds it (synchronized) to the BioJava Sequence that is being built. As long as I/O is not the limiting factor, you would get improved parsing performance. It would also be a case where the threading should happen internally, as it could be pretty hard to coordinate the process from the outside.

This also highlights the difference between encapsulation and immutability. Even if access to variables is controlled by package and protected setters, the class is still mutable (but not by the user). Immutability can only be achieved by not providing any setter methods, which has obvious severe limitations. Currently BioJava Sequence objects have restricted mutability (use of Edit objects) but are certainly not immutable.

Again, messages need not be immutable as long as they have appropriate locks and/or synchronized getters and setters. Many Java frameworks work best when messages or DTOs are beans (with parameterless constructors and public getters and setters), so being able to use these is often very desirable. These beans can still be thread-safe if you code them right.
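
The threaded parsing idea described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not BioJava code: `Feature`, `SequenceBuilder`, `parseFeature`, and the "type start end" block format are hypothetical stand-ins invented for the example; only the `java.util.concurrent` classes are real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadedParseSketch {

    /** Hypothetical immutable feature; not the BioJava Feature interface. */
    static final class Feature {
        final String type;
        final int start;
        final int end;

        Feature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical mutable sequence under construction; all access is synchronized. */
    static final class SequenceBuilder {
        private final List<Feature> features = new ArrayList<Feature>();

        synchronized void addFeature(Feature feature) {
            features.add(feature);
        }

        synchronized int featureCount() {
            return features.size();
        }
    }

    /** Stateless handler: it reads only its argument, e.g. "gene 10 250". */
    static Feature parseFeature(String block) {
        String[] parts = block.trim().split("\\s+");
        return new Feature(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
    }

    /** The caller (the "file reader") hands each block to a pool of handler threads. */
    static SequenceBuilder parse(List<String> blocks, int threads) {
        final SequenceBuilder builder = new SequenceBuilder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<Future<?>>();
            for (final String block : blocks) {
                pending.add(pool.submit(new Runnable() {
                    public void run() {
                        // Parsing is stateless; only the add is synchronized.
                        builder.addFeature(parseFeature(block));
                    }
                }));
            }
            for (Future<?> done : pending) {
                done.get(); // wait for each handler and surface its failure, if any
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("parsing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("feature handler failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return builder;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("gene 1 1200", "CDS 100 900", "exon 100 450");
        SequenceBuilder sequence = parse(blocks, 2);
        System.out.println("features added: " + sequence.featureCount());
    }
}
```

Note that in a sketch like this the features reach the builder in completion order rather than file order, so a real parser would need to sort them or tolerate unordered features.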
- Mark

From ayates at ebi.ac.uk Wed Apr 9 10:00:29 2008
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 09 Apr 2008 15:00:29 +0100
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk> <47FCA277.2020401@ebi.ac.uk> <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com>
Message-ID: <47FCCBFD.1030805@ebi.ac.uk>

I admit mutability is a good thing sometimes (and, as Java programmers, it is the way we've been taught to work). Oh, I've triggered more than enough race conditions working with so-called 'stateless' services, assuming too much about how stateless they were (or, more to the point, how stateful I had made them). Anyway, yes, race conditions can occur anywhere in any bit of code, but the majority of the time I see them appearing when 'static' is used.

Yeah, I would be worried about someone making a multi-threaded app with BJ. Not impossible (far from it) but I can imagine a few edge cases coming in.

Andy

Mark Schreiber wrote:
> I'm not too sure which option I prefer, multi-threading by default (i.e.
> all handled by the packages) or stateless immutable classes and
> messages that can be multi-threaded.
>
> There are arguments for both. The former is recommended in a book I
> am currently reading on concurrency which was written by the authors
> of the java 1.5 concurrency package. Essentially the classes can be
> designed ahead of time to be thread safe and mutability (sometimes a
> good thing) can be done with this in mind.
>
> On the other hand stateless and immutable stuff is often safe enough
> to put into a thread although _only_ as long as operations are truly
> atomic. Take for example Servlets and stateless Session Beans. They
> are pretty thread safe by necessity (use in app servers) but just
> because they are stateless doesn't mean you can't accidentally write
> one that gives you stale data or a race condition.
>
> In both cases thread safety needs to be designed from the start.
>
> Currently BioJava is neither of these things and I imagine things will
> start getting pretty interesting if you try to multi-thread a biojava
> program right now.
>
> - Mark
>
> On Wed, Apr 9, 2008 at 7:03 PM, Andy Yates wrote:
>>
>> Most of the time any kind of farm management software (like LSF & please correct me if I'm wrong) looks at the amount of CPU time a process takes up and the number of threads it detects; not only the number of processes you have in a queue. So a multi-threaded biojava should not pose a problem to these systems. Not to mention with the newer multiple core computers; threaded software is becoming the only way to take full advantage of the available power.
>>
>> Where you would want to ignore multi-threading is if you are in a queue like LSF and your x number of Java processes all get chucked onto the same machine. Then if you've got so many processor hungry operations all trying to create threads ... well it's not going to behave as optimally as you might hope.
>>
>> Personally though I'd still err on the side of caution WRT multi-threading and not to have it as part of the default tools but as an Object I can instantiate to do my multi-threading work (so it's a choice at the user's level rather than the framework level). Then using the Java5 executor framework we let users submit work to pools of threads to do their work. Couple this with forcing us to pass around immutable messages between threads/callables (since values shared by threads are probably the number one cause of **** ups) you'll have one heck of a kick-ass scalable framework ;-)
>>
>> Andy
>>
>> Andreas Prlic wrote:
>>
>>> Hi,
>>>
>>> I like the idea of having support for multiple threads.
Only thing is, when running BioJava on our compute farm, I am pretty sure our admins won't be happy if BJ would use more than just a single CPU, unless run on special hardware. As such there should be a BJ-wide configuration management, which would allow one to determine how many CPUs are to be used (and the default could be all of them).
>>>
>>> Andreas
>>>
>>> On 9 Apr 2008, at 09:28, Andy Yates wrote:
>>>
>>>> Lo,
>>>>
>>>> This is the kind of problem Java7 is attempting to solve with the fork-join framework (which really is a rip-off of Google's MapReduce). There are two ways of looking at thread safety & how to implement it:
>>>>
>>>> * Packages which could be threaded or want to be threaded are programmed with threading in mind using items from the util.concurrent package to split, queue & work with data points.
>>>>
>>>> * Packages can be created as required & have data to process passed to them for processing in a stateless manner; much in the same way servlet engines and a lot of web frameworks run
>>>>
>>>> The first way does mean we can support environments with useful multi-threaded support (no point in threading on a single CPU/core box) from the word go. The second way would require some plumbing on the user's behalf but this would be very easy plumbing; the majority of which we could write (like wrapping things in instances of Callables).
>>>>
>>>> Anyway my 2p worth :)
>>>>
>>>> Andy
>>>>
>>>> Mark Schreiber wrote:
>>>>
>>>>> Hi -
>>>>> I was just playing with threads to see how efficient they are on one of our old 4 CPU IBM servers. The following fairly naive program splits a large array of numbers and sums them all up. The multi-threaded version is 2.5 times faster even allowing for thread overhead. The program could be even better if I make more use of the java1.5 concurrent package.
>>>>> Similar tasks in biojava would include training distributions which should see similar performance improvements.
Much of the current biojava doesn't make use of threads and, worse, requires the developer to manage all the thread safety themselves.
>>>>> - Mark
>>>>> /*
>>>>>  * To change this template, choose Tools | Templates
>>>>>  * and open the template in the editor.
>>>>>  */
>>>>> package concurrent;
>>>>> import java.util.concurrent.atomic.AtomicInteger;
>>>>> /**
>>>>>  * This program demos the use of threads to sum a large array of integers.
>>>>>  * @author Mark Schreiber
>>>>>  */
>>>>> public class ThreadedAdder {
>>>>>     static int processors = Runtime.getRuntime().availableProcessors();
>>>>>     int bigNumber = 10000000;
>>>>>     int[] bigArray = new int[bigNumber * processors];
>>>>>     public ThreadedAdder(){
>>>>>         //make a big array of integers (10 000 000 numbers for each processor)
>>>>>         for(int i = 0; i < bigArray.length; i++){
>>>>>             //random integer between 0 and 99
>>>>>             bigArray[i] = (int)(Math.random() * 100.0);
>>>>>         }
>>>>>     }
>>>>>     public void singleThreadedAdd(){
>>>>>         int result = 0;
>>>>>         //single threaded sum
>>>>>         long start = System.currentTimeMillis();
>>>>>         for(int number : bigArray){
>>>>>             result += number;
>>>>>         }
>>>>>         long time = System.currentTimeMillis() - start;
>>>>>         System.out.println("Calculation time = "+time+" ms");
>>>>>         System.out.println("total = "+result);
>>>>>     }
>>>>>     public void multiThreadedAdd() throws InterruptedException{
>>>>>         AtomicInteger total = new AtomicInteger();
>>>>>         long start = System.currentTimeMillis();
>>>>>         AddingThread[] threads = new AddingThread[processors];
>>>>>         for(int i = 0; i < threads.length; i++){
>>>>>             threads[i] = new AddingThread("Thread "+i, i * bigNumber, total);
>>>>>             System.out.println(threads[i].getName()+" starting");
>>>>>             threads[i].start();
>>>>>         }
>>>>>         for(Thread thread : threads){
>>>>>             //make sure everyone is finished
>>>>>             thread.join();
>>>>>         }
>>>>>         long time = System.currentTimeMillis() - start;
>>>>>         System.out.println("Calculation time = "+time+" ms");
>>>>>         System.out.println("total = "+total);
>>>>>     }
>>>>>     /**
>>>>>      * @param args the command line arguments
>>>>>      */
>>>>>     public static void main(String[] args) throws Exception{
>>>>>         //how many processors do I have?
>>>>>         System.out.println("Available processors = "+processors);
>>>>>         System.out.println("Initializing number array");
>>>>>         ThreadedAdder adder = new ThreadedAdder();
>>>>>         System.out.println("single thread add");
>>>>>         adder.singleThreadedAdd();
>>>>>         System.out.println("multi thread add");
>>>>>         adder.multiThreadedAdd();
>>>>>     }
>>>>>     public class AddingThread extends Thread{
>>>>>         int internalTotal = 0;
>>>>>         int offSet = 0;
>>>>>         AtomicInteger callBackTotal;
>>>>>         public AddingThread(String name, int offSet, AtomicInteger callBackTotal){
>>>>>             super(name);
>>>>>             this.offSet = offSet;
>>>>>             this.callBackTotal = callBackTotal;
>>>>>         }
>>>>>         @Override
>>>>>         public void run(){
>>>>>             for(int i = offSet; i < offSet + bigNumber; i++){
>>>>>                 internalTotal += bigArray[i];
>>>>>             }
>>>>>             callBackTotal.addAndGet(internalTotal);
>>>>>             System.out.println(this.getName()+" complete");
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>> -----------------------------------------------------------------------
>>>
>>> Andreas Prlic Wellcome Trust Sanger Institute
>>> Hinxton, Cambridge CB10 1SA, UK
>>> +44 (0) 1223 49 6891
>>>
>>> -----------------------------------------------------------------------
>>>

From ayates at ebi.ac.uk Wed Apr 9 10:09:33 2008
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 09 Apr 2008 15:09:33 +0100
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <93b45ca50804090612x7ba0b3b2jbb8d1e031e030dc4@mail.gmail.com>
References: <93b45ca50804090051h42632f43u6b977574c54853c7@mail.gmail.com> <47FC7E3B.9000106@ebi.ac.uk> <77FF0938-D653-490F-8933-B34306068727@sanger.ac.uk> <47FCA277.2020401@ebi.ac.uk> <93b45ca50804090454j2f0ff061gbf3ddb1a247610@mail.gmail.com> <93b45ca50804090612x7ba0b3b2jbb8d1e031e030dc4@mail.gmail.com>
Message-ID: <47FCCE1D.8050107@ebi.ac.uk>

That is an interesting bit of usage. You could queue the events out from the feature builders into the thread/callable which constructs the final Sequence object quite easily. Yeah very very true :)

The majority of objects are mutable in BJ I think. I'm not saying this is a bad thing nor suggesting everything needs to be immutable :). It's more about making sure only one thread is working on one object at a given point in the program. If there are going to be mutable objects hanging around then Queues are probably the best way to work with them.

Andy

>
> One area where you could get an interesting mixture of stateless and
> synchronized access to a mutable would be threaded parsing of large
> sequence files. In my experience the BioJava parsers are not
> normally I/O bound due to all the object building they do. Given
> this a filereader could for example read a feature block and hand it
> off to a threaded stateless feature handler which produces a Feature
> object and then adds it (synchronized) to the BioJava Sequence that
> is being built. As long as I/O doesn't limit then you would get
> improved parsing performance. It would also be a case where the
> threading should happen internally as it could be pretty hard to
> coordinate the process from the outside.
>
> This also highlights the difference between encapsulation and
> immutability. Even if access to variables is controlled by package
> and protected setters the class is still mutable (but not by the
> user). Immutability can only be achieved by not providing any setter
> methods which has obvious severe limitations. Currently BioJava
> Sequence objects have restricted mutability (use of Edit objects) but
> are certainly not immutable.
>
> Again messages need not be immutable as long as they have appropriate
> locks and or synchronized getters and setters.
Many java frameworks
> work best when messages or DTO's are beans (with parameterless
> constructors and public getters and setters), being able to use these
> is often very desirable. These beans can still be threadsafe if you
> code them right.
>
> - Mark

From heuermh at acm.org Wed Apr 9 12:34:40 2008
From: heuermh at acm.org (Michael Heuer)
Date: Wed, 9 Apr 2008 12:34:40 -0400 (EDT)
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <47FCCE1D.8050107@ebi.ac.uk>
Message-ID:

On Wed, 9 Apr 2008, Andy Yates wrote:

> That is an interesting bit of usage. You could queue the events out from
> the feature builders into the thread/callable which constructs the final
> Sequence object quite easily. Yeah very very true :)
>
> The majority of objects are mutable in BJ I think. I'm not saying this
> is a bad thing nor suggesting everything needs to be immutable :). It's
> more about making sure only one thread is working on one object at a
> given point in the program. If there are going to be mutable objects
> hanging around then Queues are probably the best way to work with them.

I am going to crib directly from the book I think Mark was referring to earlier:

 - It's the mutable state, stupid

   All concurrency issues boil down to coordinating access to mutable
   state. The less mutable state, the easier it is to ensure thread safety.

 - Make fields final unless they need to be mutable

 - Immutable objects are automatically thread-safe

   Immutable objects simplify concurrent programming tremendously. They
   are simpler and safer, and can be shared freely without locking or
   defensive copying.

"Java Concurrency in Practice", Goetz et al., 2006, p110.
http://www.javaconcurrencyinpractice.com/

The Immutable with Copy Mutators pattern provides "setter"-like methods that return copies of the immutable object:

  /**
   * Return a copy of this foo with the bar set to bar.
   *
   * Foo is immutable, so there are no set methods. Instead, this
   * method returns a new instance of Foo copied from this
   * with the value of bar changed.
   *
   * @param bar bar for the copy of this foo
   * @return a copy of this foo with the bar set to bar
   */
  public Foo withBar(final Bar bar)
  {
    Foo copy = new Foo(..., bar);
    return copy;
  }

This is used in JodaTime, JSR-310, and elsewhere. I have a template I use to generate classes in this style at

http://tinyurl.com/6n2nhp

> > Mark Schreiber wrote:
> > One area where you could get an interesting mixture of stateless and
> > synchronized access to a mutable would be threaded parsing of large
> > sequence files. In my experience the BioJava parsers are not
> > normally I/O bound due to all the object building they do. Given
> > this a filereader could for example read a feature block and hand it
> > off to a threaded stateless feature handler which produces a Feature
> > object and then adds it (synchronized) to the BioJava Sequence that
> > is being built. As long as I/O doesn't limit then you would get
> > improved parsing performance. It would also be a case where the
> > threading should happen internally as it could be pretty hard to
> > coordinate the process from the outside.
> >
> > This also highlights the difference between encapsulation and
> > immutability. Even if access to variables is controlled by package
> > and protected setters the class is still mutable (but not by the
> > user). Immutability can only be achieved by not providing any setter
> > methods which has obvious severe limitations. Currently BioJava
> > Sequence objects have restricted mutability (use of Edit objects) but
> > are certainly not immutable.
> >
> > Again messages need not be immutable as long as they have appropriate
> > locks and or synchronized getters and setters. Many java frameworks
> > work best when messages or DTO's are beans (with parameterless
> > constructors and public getters and setters), being able to use these
> > is often very desirable. These beans can still be threadsafe if you
> > code them right.

What might that look like?
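
One possible shape, as a minimal sketch only: a bean with a parameterless constructor whose getters and setters all synchronize on the bean itself. `SequenceInfoBean`, its two properties, and the `update` and `snapshot` methods are hypothetical names chosen for illustration and are not part of BioJava or any framework.

```java
class ThreadSafeBeanSketch {

    /** Hypothetical DTO-style bean; every accessor locks on the bean instance. */
    static class SequenceInfoBean {
        private String name;
        private int length;

        public SequenceInfoBean() {
            // parameterless constructor, as bean-based frameworks expect
        }

        public synchronized String getName() {
            return name;
        }

        public synchronized void setName(String name) {
            this.name = name;
        }

        public synchronized int getLength() {
            return length;
        }

        public synchronized void setLength(int length) {
            this.length = length;
        }

        /** Changes both properties under one lock acquisition. */
        public synchronized void update(String name, int length) {
            this.name = name;
            this.length = length;
        }

        /** Reads both properties under one lock acquisition. */
        public synchronized String snapshot() {
            return name + ":" + length;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final SequenceInfoBean bean = new SequenceInfoBean();
        Thread writer = new Thread(new Runnable() {
            public void run() {
                bean.update("chr1", 1000);
            }
        });
        writer.start();
        writer.join();
        System.out.println(bean.snapshot());
    }
}
```

Synchronizing each getter and setter makes every individual call atomic and visible to other threads, but a sequence of calls such as `setName` followed by `setLength` is not atomic as a whole. That is why the sketch adds `update` and `snapshot` for compound operations.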

I have to think that in most cases these (DTOs, form beans, etc.) are safe only because the container is managing the lifecycle of those beans.

Perhaps we might want to copy some of this discussion to

http://biojava.org/wiki/Talk:BioJava3_Design

or a new page about concurrency issues when we are finished.

   michael

From ayates at ebi.ac.uk Thu Apr 10 04:36:41 2008
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 10 Apr 2008 09:36:41 +0100
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To:
References:
Message-ID: <47FDD199.4010606@ebi.ac.uk>

All of that looks very reasonable to me; I really should get round to reading that book soon :). The only thing that worries me about the constructor copy is object churn, but as far as I'm aware that is a worry from the older days of Java & doesn't hold up with the later VMs.

It seems we have two use-cases for concurrency in the 'newer' biojava:

* Using concurrency to speed up a process which is not CPU limited & is part of the core API

* Using concurrency to speed up a process which is CPU limited but can be sped up on machines with more than one core

Each scenario needs a different way of 'triggering' the concurrency. For the first, as people have said, some kind of System property might be a good way to either enable multiple threads or disable them completely; this also needs to be designed with good concurrent practice in mind from the start. The second way is by user intention, i.e. they use the multi-threaded phylogenetics package.

Does that sound okay?

Andy

From markjschreiber at gmail.com Thu Apr 10 07:40:44 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 10 Apr 2008 19:40:44 +0800
Subject: [Biojava-dev] Why BJ3 should be multithreaded
In-Reply-To: <47FDD199.4010606@ebi.ac.uk>
References: <47FDD199.4010606@ebi.ac.uk>
Message-ID: <93b45ca50804100440u5afacfa0o650ed162aef6a9c1@mail.gmail.com>

> * Using concurrency to speed up a process which is not CPU limited & is part
> of the core API
>

Do you have a specific example in mind? Something blocking that needs to be non-blocking? The parsing example could be one (as I/O blocks during parsing) but I think it actually might be CPU limited as well.

> * Using concurrency to speed up a process which is CPU limited but can be
> sped up on machines with more than one core
>

Yes. It seems almost every modern machine is dual-core nowadays; we should take advantage of this.

> Each scenario needs a different way of 'triggering' the concurrency. The
> first as people have said some kind of System property might be a good way
> to either enable multiple threads or disable it completely; this also needs
> to be designed with good concurrent practice in mind from the start. The

It would be good to make it configurable via the presence of a properties file or similar. Default could be to use all available processors, which can be determined from the Runtime object. This approach would let users control how much of their machine's grunt is used for heavy lifting. This approach would also allow users to test and tune for any installation.
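
A minimal sketch of that kind of configuration, assuming a hypothetical system property named `biojava.threads` (nothing in BioJava defines this name; it is invented for the example). An unset or unparsable value falls back to `Runtime.getRuntime().availableProcessors()`, and values below 1 are raised to 1.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadConfigSketch {

    /** Hypothetical property name, chosen for illustration only. */
    static final String THREADS_PROPERTY = "biojava.threads";

    /** Pure helper: decide the thread count from a configured string and a default. */
    static int resolveThreadCount(String configured, int available) {
        if (configured == null) {
            return available; // nothing configured: use every processor
        }
        try {
            int requested = Integer.parseInt(configured.trim());
            return requested < 1 ? 1 : requested; // never fewer than one thread
        } catch (NumberFormatException e) {
            return available; // unparsable value: fall back to the default
        }
    }

    /** Reads the property and the processor count from the running JVM. */
    static int configuredThreads() {
        return resolveThreadCount(System.getProperty(THREADS_PROPERTY),
                Runtime.getRuntime().availableProcessors());
    }

    /** A pool sized according to the configuration. */
    static ExecutorService newPool() {
        return Executors.newFixedThreadPool(configuredThreads());
    }

    public static void main(String[] args) {
        System.out.println("Using " + configuredThreads() + " thread(s)");
        ExecutorService pool = newPool();
        pool.shutdown();
    }
}
```

With something along these lines, a job on a shared compute farm could be started with `java -Dbiojava.threads=1 ...` to stay on a single CPU, while a desktop run would pick up all available processors by default.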

In recent tests I have noticed that a task has to be reasonably expensive to be worth spawning more threads (to get a quicker run time). The definition of expensive really depends on the machine. One task on an old Linux 4 CPU machine got a 2-fold speed-up by using all CPUs. The exact same task on a new dual-core laptop actually slowed down, as the thread spawning was slower than the calculation. A much harder calculation on this machine did improve with threading. Control of this via a property would let you set the appropriate strategy on any deployment.

> second way is by user intention i.e. they use the multi-threaded
> phylogenetics package.
>

Some packages should be threaded even if there is only one processor, to prevent blocking. For example, parsing should spawn at least one thread that is separate from the I/O thread even on a single CPU system, much as Swing is threaded to prevent GUI blocking.

- Mark

From ap3 at sanger.ac.uk Sun Apr 13 14:02:41 2008
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Sun, 13 Apr 2008 19:02:41 +0100
Subject: [Biojava-dev] biojava 1.6 released
Message-ID: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk>

Biojava 1.6 has been released and is available from
http://biojava.org/wiki/BioJava:Download

Biojava 1.6 offers more functionality and stability than the previous official releases. BioJava now depends on Java 1.5+. We highly recommend that you upgrade as soon as possible.

In detail, the phylo package org.biojavax.bio.phylo was improved and expanded by our GSOC'07 student Boh-Yun Lee. It now contains fully-functional Nexus and Phylip parsers, and tools for calculating UPGMA and Neighbour Joining, Jukes-Cantor and Kimura Two Parameter, and MP. It uses JGraphT to represent parsed trees. The PDB file parser was improved by Jules Jacobsen to deal better with PDB header records. Andreas Draeger provided several patches for improving the Genetic Algorithm modules. Additionally, this release contains numerous bug fixes and documentation improvements.

Thanks to the entire biojava community for making this possible!

Happy Biojava-ing,

Andreas

-----------------------------------------------------------------------

Andreas Prlic Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

-----------------------------------------------------------------------

From darin.london at duke.edu Tue Apr 29 12:48:33 2008
From: darin.london at duke.edu (darin.london at duke.edu)
Date: Tue, 29 Apr 2008 12:48:33 -0400
Subject: [Biojava-dev] BOSC 2008 Announcement and Call For Submissions
Message-ID: <200804291648.m3TGmXk7020802@tenero.duhs.duke.edu>

BOSC 2008 Call for Abstracts Reminder

The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008). This is a reminder to submit your proposals for talks to the BOSC submission system before May 11.

Submission Process:

All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php). The form will ask for a small Abstract Text to be pasted into it, and a full paper. The small Abstract Text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details. Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom. The full-length abstract should include the title, authors, and affiliations. We prefer your abstract to be in PDF format, although plain t

Important Dates:

May 11: Abstract submission deadline.
June 2: Notification of accepted talks.
June 4: Early registration discount cut-off.
July 18-19: BOSC 2008! We hope to see you at BOSC 2008! Kam Dahlquist and Darin London BOSC 2008 Co-organizers From ap3 at sanger.ac.uk Wed Apr 30 06:49:21 2008 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 30 Apr 2008 11:49:21 +0100 Subject: [Biojava-dev] new uniprot file format Message-ID: <00FA5524-C0B6-4293-84B8-496934B56398@sanger.ac.uk> Hi, There is a change in the UniProt file format coming up at the beginning of July: http://ca.expasy.org/sprot/relnotes/sp_soon.html Having had a quick look at the code, I think we will need a patch to allow access to the EC numbers and other sub-category data... Cheers, Andreas From thpar at psb.ugent.be Mon Apr 14 05:28:03 2008 From: thpar at psb.ugent.be (Thomas Van Parys) Date: Mon, 14 Apr 2008 09:28:03 -0000 Subject: [Biojava-dev] [Biojava-l] biojava 1.6 released In-Reply-To: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk> References: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk> Message-ID: <48032123.7010803@psb.ugent.be> Andreas Prlic wrote: > > Biojava 1.6 has been released and is available from > http://biojava.org/wiki/BioJava:Download > Hi, Thanks for the new release, but is there any chance that there's something wrong with the download? Firefox hangs when trying to download and wget gives me a jar file that doesn't contain the source code.
http://www.biojava.org/download/bj16/all/biojava-1.6-all.jar regards, Thomas -- ================================================================== Thomas Van Parys Tel:+32 (0)9 331 36 95 fax:+32 (0)9 3313809 VIB Department of Plant Systems Biology, Ghent University Technologiepark 927, 9052 Gent, BELGIUM thomas.vanparys at psb.ugent.be http://bioinformatics.psb.ugent.be ================================================================== From Stefan.Pinkernell at awi.de Mon Apr 14 07:06:02 2008 From: Stefan.Pinkernell at awi.de (Stefan Pinkernell) Date: Mon, 14 Apr 2008 11:06:02 -0000 Subject: [Biojava-dev] biojava 1.6 released In-Reply-To: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk> References: <0A060667-C24C-4D41-8D10-ED1D449A5F62@sanger.ac.uk> Message-ID: <48033501.4050804@awi.de> Dear all, I just loaded the new Biojava 1.6 package (biojava-all.jar), but it seems the sources are missing. Where can I find them? Best regards, Stefan Andreas Prlic wrote: > > Biojava 1.6 has been released and is available from > http://biojava.org/wiki/BioJava:Download >