[Biojava-dev] Pairwise similarity score speed

Radhouane Aniba aradwen at gmail.com
Wed Jun 30 12:01:02 UTC 2010


Hi Andy,

Thank you for your reply.
Actually, I was thinking about a parallelization method or a kind of hadoop
like implementation to do all pairwise comparison. The aim is that at the
end i would like to calculate the average pairwise similarity score within a
set of sequences.

What I am doing is something like that :

For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
  For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
     PairwiseScore
+=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
 End_For
End_For

Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)

In fact the problem is in the ((n * (n-1)) / 2) operations.

As for the solution presented in perl sorry but I dont see what you've did
inside ?! You created a 2D array ? how to achieve operations inside , I
think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?

Radwen


2010/6/30 Andy Yates <ayates at ebi.ac.uk>

> Hi Radwen,
>
> I would have said that this is more of a problem because of the type of
> algorithm you are using. It's impossible (as far as I am aware) to calculate
> the score matrices in one step for multiple sequences & even if it did I
> don't quite see where the speed increase would come from.
>
> As for the All vs. All problem don't forget that really your total number
> of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you
> are comparing so a simple 2D for loop will have you spending twice the
> amount of time on this than needs to occur. When I've done this before (in
> Perl so excuse the usage of it) the code looks like this:
>
> my @output;
> my @elements = ('some','elements','something');
> while(scalar(@elements) > 1) {
>  my $target = pop(@elements);
>  foreach my $remaining_element (@elements) {
>    push(@output, [$target, $remaining_element]);
>  }
> }
>
> So this would have emitted:
>
> [
>        ['some','elements'],
>        ['some','something'],
>        ['elements','something']
> ]
>
> Try doing something similar to this using the Java Deque objects which can
> act as a stack.
>
> Hope this helps to answer your question
>
> Andy
>
> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>
> > Hello Biojava people,
> >
> > I have a question concerning Needlman Wunsh or Smith waterman algorithms.
> > I am using Biojava 1.7 and I read sequences from proteins fasta file then
> I
> > store my sequences into an array to calculate pairwise similarity scores
> > using a for loop.
> > The problem is that it is very time consuming if we want to calculate all
> > pairwise for a big number of protein sequences. I would like to know if
> > there is way to do a kind of "All against All" comparisons in one single
> > step ?
> > Do someone have a solution for this kind of problem ?
> >
> > Thanks for help.
> >
> > Radwen
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
>
>
>
>


-- 
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys




More information about the biojava-dev mailing list