[Biojava-dev] Pairwise similarity score speed

Wed Jun 30 12:05:41 UTC 2010

It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents the elements to compare together. Really the Perl code is a way of describing the operations to occur in order to cover all possible permutations.

Andy

On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:

> Hi Andy, 
> 
> Thank you for your reply.
> Actually, I was thinking about a parallelization method or a kind of hadoop like implementation to do all pairwise comparison. The aim is that at the end i would like to calculate the average pairwise similarity score within a set of sequences.
> 
> What I am doing is something like that :
> 
> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>      PairwiseScore +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>  End_For
> End_For
> 
> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
> 
> In fact the problem is in the ((n * (n-1)) / 2) operations.
> 
> As for the solution presented in perl sorry but I dont see what you've did inside ?! You created a 2D array ? how to achieve operations inside , I think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
> 
> Radwen
> 
> 
> 2010/6/30 Andy Yates <ayates at ebi.ac.uk>
> Hi Radwen,
> 
> I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences & even if it did I don't quite see where the speed increase would come from.
> 
> As for the All vs. All problem don't forget that really your total number of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:
> 
> my @output;
> my @elements = ('some','elements','something');
> while(scalar(@elements) > 1) {
>  my $target = pop(@elements);
>  foreach my $remaining_element (@elements) {
>    push(@output, [$target, $remaining_element]);
>  }
> }
> 
> So this would have emitted:
> 
> [
>        ['some','elements'],
>        ['some','something'],
>        ['elements','something']
> ]
> 
> Try doing something similar to this using the Java Deque objects which can act as a stack.
> 
> Hope this helps to answer your question
> 
> Andy
> 
> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
> 
> > Hello Biojava people,
> >
> > I have a question concerning Needlman Wunsh or Smith waterman algorithms.
> > I am using Biojava 1.7 and I read sequences from proteins fasta file then I
> > store my sequences into an array to calculate pairwise similarity scores
> > using a for loop.
> > The problem is that it is very time consuming if we want to calculate all
> > pairwise for a big number of protein sequences. I would like to know if
> > there is way to do a kind of "All against All" comparisons in one single
> > step ?
> > Do someone have a solution for this kind of problem ?
> >
> > Thanks for help.
> >
> > Radwen
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> 
> 
> -- 
> R. ANIBA
> 
> Bioinformatics PhD
> Laboratoire de Bioinformatique et Génomique Intégrative,
> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/