[Biojava-dev] Why BJ3 should be multithreaded
Andy Yates
ayates at ebi.ac.uk
Wed Apr 9 14:00:29 UTC 2008
I admit mutability is a good thing sometimes (and, as Java programmers,
it is the way we've been taught to work).
Oh I've triggered more than enough race conditions working with
so-called 'stateless' services, assuming too much about how stateless
they were (or more to the point how stateful I had made them). Anyway
yes, race conditions can occur anywhere in any bit of code, but the
majority of the time I see them appearing when 'static' is used.
Yeah I would be worried about someone making a multi-threaded app with
BJ. Not impossible (far from it) but I can imagine a few edge cases
coming in.
Andy
Mark Schreiber wrote:
> I'm not too sure which option I prefer, multi-threading by default (ie
> all handled by the packages) or stateless immutable classes and
> messages that can be multi-threaded.
>
> There are arguments for both. The former is recommended in a book I
> am currently reading on concurrency which was written by the authors
> of the java 1.5 concurrency package. Essentially the classes can be
> designed ahead of time to be thread safe and mutability (sometimes a
> good thing) can be done with this in mind.
>
> On the other hand stateless and immutable stuff is often safe enough
> to put into a thread, although _only_ as long as operations are truly
> atomic. Take for example Servlets and stateless Session Beans. They
> are pretty thread safe by necessity (use in app servers) but just
> because they are stateless doesn't mean you can't accidentally write
> one that gives you stale data or a race condition.
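
A minimal sketch of the point above, assuming a servlet-style setup where one handler instance is shared by every calling thread. The names RacyHandler, SafeHandler and RaceSketch are hypothetical and not part of BioJava or any servlet API. The first handler looks stateless to its callers but loses updates under contention; the second makes the update atomic.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Looks stateless, but one instance is shared by every calling thread. */
final class RacyHandler {
    private int hits = 0;

    void handle() {
        hits++;   // read-modify-write: two threads can read the same old value
    }

    int hits() { return hits; }
}

/** Same contract, but the update is a single atomic operation. */
final class SafeHandler {
    private final AtomicInteger hits = new AtomicInteger();

    void handle() { hits.incrementAndGet(); }

    int hits() { return hits.get(); }
}

final class RaceSketch {
    /** Calls work.run() from several threads at once and waits for all of them. */
    static void runConcurrently(final Runnable work, int threads, final int callsPerThread)
            throws InterruptedException {
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < callsPerThread; i++) work.run();
            });
            pool[t].start();
        }
        for (Thread thread : pool) thread.join();
    }

    public static void main(String[] args) throws InterruptedException {
        RacyHandler racy = new RacyHandler();
        SafeHandler safe = new SafeHandler();
        runConcurrently(racy::handle, 4, 100000);
        runConcurrently(safe::handle, 4, 100000);
        System.out.println("racy hits = " + racy.hits() + " (can be below 400000)");
        System.out.println("safe hits = " + safe.hits());
    }
}
```

The racy count can differ from run to run, which is what makes this kind of bug hard to spot in testing; the atomic version always reports every call.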
>
> In both cases thread safety needs to be designed from the start.
>
> Currently BioJava is neither of these things and I imagine things will
> start getting pretty interesting if you try to multi-thread a biojava
> program right now.
>
> - Mark
>
> On Wed, Apr 9, 2008 at 7:03 PM, Andy Yates <ayates at ebi.ac.uk> wrote:
>>
>> Most of the time, farm management software such as LSF (please correct me if I'm wrong) looks at the amount of CPU time a process takes up and the number of threads it detects, not only the number of processes you have in a queue. So a multi-threaded biojava should not pose a problem to these systems. Not to mention that, with the newer multi-core computers, threaded software is becoming the only way to take full advantage of the available power.
>>
>> Where you would want to ignore multi-threading is if you are in a queue like LSF and your x number of Java processes all get chucked onto the same machine. Then if you've got so many processor hungry operations all trying to create threads ... well it's not going to behave as optimally as you might hope.
>>
>> Personally though I'd still err on the side of caution WRT multi-threading and not have it as part of the default tools, but as an Object I can instantiate to do my multi-threading work (so it's a choice at the user's level rather than the framework level). Then, using the Java5 executor framework, we let users submit work to pools of threads. Couple this with forcing us to pass around immutable messages between threads/callables (since values shared by threads are probably the number one cause of **** ups) and you'll have one heck of a kick-ass scalable framework ;-)
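
A rough sketch of what that could look like, assuming a trivial task (counting G and C bases in a sequence string). GcRequest, GcCounter and ExecutorSketch are hypothetical names used for illustration only; none of them are BioJava classes. The caller creates and owns the ExecutorService, so threading stays a user-level choice, and each task receives only an immutable message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Immutable message: final fields, no setters, safe to hand to any thread. */
final class GcRequest {
    final String name;
    final String sequence;

    GcRequest(String name, String sequence) {
        this.name = name;
        this.sequence = sequence;
    }
}

/** Stateless worker: reads only its own message and returns a fresh result. */
final class GcCounter implements Callable<Integer> {
    private final GcRequest request;

    GcCounter(GcRequest request) { this.request = request; }

    @Override
    public Integer call() {
        int gc = 0;
        for (int i = 0; i < request.sequence.length(); i++) {
            char base = request.sequence.charAt(i);
            if (base == 'G' || base == 'C') gc++;
        }
        return gc;
    }
}

final class ExecutorSketch {
    /** The caller supplies the pool; results come back in request order. */
    static List<Integer> countAll(ExecutorService pool, List<GcRequest> requests)
            throws Exception {
        List<Future<Integer>> futures = new ArrayList<>();
        for (GcRequest request : requests) {
            futures.add(pool.submit(new GcCounter(request)));
        }
        List<Integer> counts = new ArrayList<>();
        for (Future<Integer> future : futures) {
            counts.add(future.get());   // blocks until that task has finished
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<GcRequest> requests = Arrays.asList(
                    new GcRequest("seq1", "GATTACA"),
                    new GcRequest("seq2", "GGCC"));
            System.out.println(countAll(pool, requests));
        } finally {
            pool.shutdown();
        }
    }
}
```

Because String is immutable and the message fields are final, no locking is needed anywhere in the worker.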
>>
>> Andy
>>
>>
>>
>>
>> Andreas Prlic wrote:
>>
>>> Hi,
>>>
>>> I like the idea of having support for multiple threads. The only thing is, when running BioJava on our compute farm, I am pretty sure our admins won't be happy if BJ used more than a single CPU, unless run on special hardware. As such there should be BJ-wide configuration management that lets users set how many CPUs are used (and the default could be all of them).
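
One possible shape for such a setting, as a sketch only: a single lookup that reads a JVM system property and falls back to all available processors. The property name "biojava.threads" and the ThreadConfig class are assumptions made for this example; BioJava does not define them.

```java
final class ThreadConfig {
    /** Hypothetical property name; not an actual BioJava setting. */
    static final String PROPERTY = "biojava.threads";

    /** Returns the configured thread count, or all available processors by default. */
    static int threadCount() {
        int available = Runtime.getRuntime().availableProcessors();
        String value = System.getProperty(PROPERTY);
        if (value == null) {
            return available;
        }
        try {
            int requested = Integer.parseInt(value.trim());
            // Ignore zero or negative values rather than failing at start-up.
            return requested >= 1 ? requested : available;
        } catch (NumberFormatException e) {
            return available;   // unparsable setting: fall back to the default
        }
    }

    public static void main(String[] args) {
        System.out.println("threads to use = " + threadCount());
    }
}
```

On a compute farm this could be pinned to one CPU with `java -Dbiojava.threads=1 ...` without touching any code.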
>>>
>>> Andreas
>>>
>>>
>>> On 9 Apr 2008, at 09:28, Andy Yates wrote:
>>>
>>>
>>>> Lo,
>>>>
>>>> This is the kind of problem Java7 is attempting to solve with the fork-join framework (which really is a rip-off of Google's MapReduce). There are two ways of looking at thread safety & how to implement it:
>>>>
>>>> * Packages which could be threaded or want to be threaded are programmed with threading in mind using items from the util.concurrent package to split, queue & work with data points.
>>>>
>>>> * Packages can be created as required & have data passed to them for processing in a stateless manner, much in the same way servlet engines and a lot of web frameworks run.
>>>>
>>>> The first way does mean we can support environments with useful multi-threaded support (no point in threading on a single CPU/core box) from the word go. The second way would require some plumbing on the user's behalf but this would be very easy plumbing; the majority of which we could write (like wrapping things in instances of Callables).
>>>>
>>>> Anyway my 2p worth :)
>>>>
>>>> Andy
>>>>
>>>> Mark Schreiber wrote:
>>>>
>>>>> Hi -
>>>>> I was just playing with threads to see how efficient they are on one of our old 4 CPU IBM servers. The following fairly naive program splits a large array of numbers and sums them all up. The multi-threaded version is 2.5 times faster even allowing for thread overhead. The program could be even better if I make more use of the java1.5 concurrent package.
>>>>> Similar tasks in biojava would include training distributions, which should see similar performance improvements. Much of the current biojava doesn't make use of threads and, worse, requires developers to manage all the thread safety themselves.
>>>>> - Mark
>>>>> /*
>>>>> * To change this template, choose Tools | Templates
>>>>> * and open the template in the editor.
>>>>> */
>>>>> package concurrent;
>>>>>
>>>>> import java.util.concurrent.atomic.AtomicLong;
>>>>>
>>>>> /**
>>>>>  * This program demos the use of threads to sum a large array of integers.
>>>>>  * @author Mark Schreiber
>>>>>  */
>>>>> public class ThreadedAdder {
>>>>>     static int processors = Runtime.getRuntime().availableProcessors();
>>>>>     int bigNumber = 10000000;
>>>>>     int[] bigArray = new int[bigNumber * processors];
>>>>>
>>>>>     public ThreadedAdder(){
>>>>>         //make a big array of integers (10 000 000 numbers for each processor)
>>>>>         for(int i = 0; i < bigArray.length; i++){
>>>>>             //random integer from 0 to 99
>>>>>             bigArray[i] = (int)(Math.random() * 100.0);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     public void singleThreadedAdd(){
>>>>>         //long, because the total can exceed Integer.MAX_VALUE on machines with many processors
>>>>>         long result = 0;
>>>>>         //single threaded sum
>>>>>         long start = System.currentTimeMillis();
>>>>>         for(int number : bigArray){
>>>>>             result += number;
>>>>>         }
>>>>>         long time = System.currentTimeMillis() - start;
>>>>>         System.out.println("Calculation time = "+time+" ms");
>>>>>         System.out.println("total = "+result);
>>>>>     }
>>>>>
>>>>>     public void multiThreadedAdd() throws InterruptedException{
>>>>>         AtomicLong total = new AtomicLong();
>>>>>         long start = System.currentTimeMillis();
>>>>>         AddingThread[] threads = new AddingThread[processors];
>>>>>         for(int i = 0; i < threads.length; i++){
>>>>>             //each thread sums its own slice of bigNumber elements
>>>>>             threads[i] = new AddingThread("Thread "+i, i * bigNumber, total);
>>>>>             System.out.println(threads[i].getName()+" starting");
>>>>>             threads[i].start();
>>>>>         }
>>>>>         for(Thread thread : threads){
>>>>>             //make sure everyone is finished
>>>>>             thread.join();
>>>>>         }
>>>>>         long time = System.currentTimeMillis() - start;
>>>>>         System.out.println("Calculation time = "+time+" ms");
>>>>>         System.out.println("total = "+total);
>>>>>     }
>>>>>
>>>>>     /**
>>>>>      * @param args the command line arguments
>>>>>      */
>>>>>     public static void main(String[] args) throws Exception{
>>>>>         //how many processors do I have?
>>>>>         System.out.println("Available processors = "+processors);
>>>>>         System.out.println("Initializing number array");
>>>>>         ThreadedAdder adder = new ThreadedAdder();
>>>>>         System.out.println("single thread add");
>>>>>         adder.singleThreadedAdd();
>>>>>         System.out.println("multi thread add");
>>>>>         adder.multiThreadedAdd();
>>>>>     }
>>>>>
>>>>>     public class AddingThread extends Thread{
>>>>>         long internalTotal = 0;
>>>>>         int offSet = 0;
>>>>>         AtomicLong callBackTotal;
>>>>>
>>>>>         public AddingThread(String name, int offSet, AtomicLong callBackTotal){
>>>>>             super(name);
>>>>>             this.offSet = offSet;
>>>>>             this.callBackTotal = callBackTotal;
>>>>>         }
>>>>>
>>>>>         @Override
>>>>>         public void run(){
>>>>>             //sum privately, then publish once to keep contention low
>>>>>             for(int i = offSet; i < offSet + bigNumber; i++){
>>>>>                 internalTotal += bigArray[i];
>>>>>             }
>>>>>             callBackTotal.addAndGet(internalTotal);
>>>>>             System.out.println(this.getName()+" complete");
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
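
For comparison, a hedged sketch of how the same array sum might look with more of the Java 5 concurrency package: a fixed thread pool and invokeAll replace the hand-managed threads and the shared counter. PooledAdder is a hypothetical name, lambdas are used for brevity (so this needs Java 8 or later), and the timing code is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledAdder {
    /** Sums the array by handing one index range to each worker task. */
    static long sum(final int[] numbers, int workers) throws Exception {
        List<Callable<Long>> tasks = new ArrayList<>();
        int chunk = (numbers.length + workers - 1) / workers;   // ceiling division
        for (int start = 0; start < numbers.length; start += chunk) {
            final int from = start;
            final int to = Math.min(start + chunk, numbers.length);
            // Each task only reads the shared array and keeps its own partial sum.
            tasks.add(() -> {
                long partial = 0;
                for (int i = from; i < to; i++) partial += numbers[i];
                return partial;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long total = 0;
            // invokeAll returns once every task has completed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        int[] numbers = new int[1000000];
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = (int) (Math.random() * 100.0);
        }
        System.out.println("total = " + sum(numbers, workers));
    }
}
```

Partial sums are combined on the calling thread, so no value is ever written by more than one thread.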
>>> -----------------------------------------------------------------------
>>>
>>> Andreas Prlic Wellcome Trust Sanger Institute
>>> Hinxton, Cambridge CB10 1SA, UK
>>> +44 (0) 1223 49 6891
>>>
>>> -----------------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>