From andreas at sdsc.edu Tue Jun 29 19:00:25 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 29 Jun 2010 16:00:25 -0700
Subject: [Biojava-dev] BioJava 3 current status

Hi,

The mailing lists have been a bit quiet lately, but there was a lot of communication off-list and many SVN commits. As such, here is a quick update on the current status of BioJava 3 and the Google Summer of Code projects:

The two GSoC projects are well under way and on track. If you are interested in following what is going on, check out the project pages for

Posttranslational Modification: http://biojava.org/wiki/GSoC:PTM
Multiple Sequence Alignment: http://biojava.org/wiki/GSoC:MSA

About BioJava 3: This has made great progress over the last weeks and a lot of new functionality has been committed to SVN. To make this release ready there are now two new tools:

* There is now a BioJava Maven repository, which is hosting SNAPSHOT builds from the current SVN.
  http://www.biojava.org/download/maven/

These builds are made available by the

* Automated build system (using CruiseControl) which is running at:
  http://emmy.rcsb.org:8080/cruisecontrol/ and
  http://emmy.rcsb.org:8080/dashboard/

Andreas

From aradwen at gmail.com Wed Jun 30 07:18:01 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Wed, 30 Jun 2010 13:18:01 +0200
Subject: [Biojava-dev] Pairwise similarity score speed

Hello Biojava people,

I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
I am using Biojava 1.7 and I read sequences from a protein FASTA file, then I
store my sequences into an array to calculate pairwise similarity scores
using a for loop.
The problem is that it is very time consuming if we want to calculate all
pairwise scores for a large number of protein sequences. I would like to know if
there is a way to do a kind of "All against All" comparison in one single
step?
Does someone have a solution for this kind of problem?

Thanks for help.

Radwen

From ayates at ebi.ac.uk Wed Jun 30 07:48:55 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 30 Jun 2010 12:48:55 +0100
Subject: [Biojava-dev] Pairwise similarity score speed

Hi Radwen,

I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences & even if it were possible I don't quite see where the speed increase would come from.

As for the All vs. All problem don't forget that really your total number of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:

my @output;
my @elements = ('some','elements','something');
while(scalar(@elements) > 1) {
  my $target = pop(@elements);
  foreach my $remaining_element (@elements) {
    push(@output, [$target, $remaining_element]);
  }
}

So this would have emitted:

[
  ['something','some'],
  ['something','elements'],
  ['elements','some']
]

Try doing something similar to this using the Java Deque objects which can act as a stack.

Hope this helps to answer your question

Andy

On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:

> Hello Biojava people,
>
> I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
> I am using Biojava 1.7 and I read sequences from a protein FASTA file, then I
> store my sequences into an array to calculate pairwise similarity scores
> using a for loop.
> The problem is that it is very time consuming if we want to calculate all
> pairwise scores for a large number of protein sequences. I would like to know if
> there is a way to do a kind of "All against All" comparison in one single
> step?
> Does someone have a solution for this kind of problem?
>
> Thanks for help.
>
> Radwen
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From aradwen at gmail.com Wed Jun 30 08:01:02 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Wed, 30 Jun 2010 14:01:02 +0200
Subject: [Biojava-dev] Pairwise similarity score speed

Hi Andy,

Thank you for your reply.
Actually, I was thinking about a parallelization method or a kind of Hadoop-like implementation to do all the pairwise comparisons. The aim is that, at the end, I would like to calculate the average pairwise similarity score within a set of sequences.

What I am doing is something like this:

For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
  For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
    PairwiseScore += CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
  End_For
End_For

Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)

In fact the problem is in the ((n * (n-1)) / 2) operations.

As for the solution presented in Perl, sorry but I don't see what you did inside. You created a 2D array? How are the operations performed inside it? I think this does not resolve the ((n * (n-1)) / 2) problem, does it?

Radwen

2010/6/30 Andy Yates

> Hi Radwen,
>
> I would have said that this is more of a problem because of the type of
> algorithm you are using. It's impossible (as far as I am aware) to calculate
> the score matrices in one step for multiple sequences & even if it were possible I
> don't quite see where the speed increase would come from.
>
> As for the All vs.
> All problem don't forget that really your total number
> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you
> are comparing so a simple 2D for loop will have you spending twice the
> amount of time on this than needs to occur. When I've done this before (in
> Perl so excuse the usage of it) the code looks like this:
>
> my @output;
> my @elements = ('some','elements','something');
> while(scalar(@elements) > 1) {
>   my $target = pop(@elements);
>   foreach my $remaining_element (@elements) {
>     push(@output, [$target, $remaining_element]);
>   }
> }
>
> So this would have emitted:
>
> [
>   ['something','some'],
>   ['something','elements'],
>   ['elements','some']
> ]
>
> Try doing something similar to this using the Java Deque objects which can
> act as a stack.
>
> Hope this helps to answer your question
>
> Andy
>
> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>
> > Hello Biojava people,
> >
> > I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
> > I am using Biojava 1.7 and I read sequences from a protein FASTA file, then I
> > store my sequences into an array to calculate pairwise similarity scores
> > using a for loop.
> > The problem is that it is very time consuming if we want to calculate all
> > pairwise scores for a large number of protein sequences. I would like to know if
> > there is a way to do a kind of "All against All" comparison in one single
> > step?
> > Does someone have a solution for this kind of problem?
> >
> > Thanks for help.
> >
> > Radwen
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

--
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys

From ayates at ebi.ac.uk Wed Jun 30 08:05:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 30 Jun 2010 13:05:41 +0100
Subject: [Biojava-dev] Pairwise similarity score speed

It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents the elements to compare together. Really the Perl code is a way of describing the operations to occur in order to cover all possible combinations.

Andy

On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:

> Hi Andy,
>
> Thank you for your reply.
> Actually, I was thinking about a parallelization method or a kind of Hadoop-like implementation to do all the pairwise comparisons. The aim is that, at the end, I would like to calculate the average pairwise similarity score within a set of sequences.
>
> What I am doing is something like this:
>
> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>     PairwiseScore += CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>   End_For
> End_For
>
> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>
> In fact the problem is in the ((n * (n-1)) / 2) operations.
>
> As for the solution presented in Perl, sorry but I don't see what you did inside. You created a 2D array? How are the operations performed inside it? I think this does not resolve the ((n * (n-1)) / 2) problem, does it?
>
> Radwen
>
>
> 2010/6/30 Andy Yates
> Hi Radwen,
>
> I would have said that this is more of a problem because of the type of algorithm you are using.
> It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences & even if it were possible I don't quite see where the speed increase would come from.
>
> As for the All vs. All problem don't forget that really your total number of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:
>
> my @output;
> my @elements = ('some','elements','something');
> while(scalar(@elements) > 1) {
>   my $target = pop(@elements);
>   foreach my $remaining_element (@elements) {
>     push(@output, [$target, $remaining_element]);
>   }
> }
>
> So this would have emitted:
>
> [
>   ['something','some'],
>   ['something','elements'],
>   ['elements','some']
> ]
>
> Try doing something similar to this using the Java Deque objects which can act as a stack.
>
> Hope this helps to answer your question
>
> Andy
>
> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>
> > Hello Biojava people,
> >
> > I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
> > I am using Biojava 1.7 and I read sequences from a protein FASTA file, then I
> > store my sequences into an array to calculate pairwise similarity scores
> > using a for loop.
> > The problem is that it is very time consuming if we want to calculate all
> > pairwise scores for a large number of protein sequences. I would like to know if
> > there is a way to do a kind of "All against All" comparison in one single
> > step?
> > Does someone have a solution for this kind of problem?
> >
> > Thanks for help.
> >
> > Radwen
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
>
> --
> R. ANIBA
>
> Bioinformatics PhD
> Laboratoire de Bioinformatique et Génomique Intégrative,
> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From chapman at cs.wisc.edu Wed Jun 30 16:01:47 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 30 Jun 2010 15:01:47 -0500
Subject: [Biojava-dev] Pairwise similarity score speed
Message-ID: <4C2BA2AB.1000307@cs.wisc.edu>

Hi Radwen,

I have already added this functionality to the BioJava3 alignment package. The code is available on the repository [1] and current builds are on the web site [2]. The necessary files are [3] and [4] and in the example code that follows you should only have to replace "piwi-seed-fasta.txt" with your file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL.

// Generic type parameters (ProteinSequence, AminoAcidCompound) are assumed here;
// the archived copy of this message shows raw types only.
int similars = 0, total = 0;
GapPenalty gaps = new SimpleGapPenalty();
SubstitutionMatrix<AminoAcidCompound> blosum62 = new SimpleSubstitutionMatrix<AminoAcidCompound>();
List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
try {
    // read every protein sequence from the FASTA file
    piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
            new File("piwi-seed-fasta.txt")).values());
} catch (Exception e) {
    e.printStackTrace();
}
// align all pairs, then score each alignment
for (SequencePair<ProteinSequence, AminoAcidCompound> pair : Alignments.getAllPairsAlignments(piwi,
        PairwiseAligner.GLOBAL, gaps, blosum62)) {
    PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
            new FractionalSimilarityScorer<ProteinSequence, AminoAcidCompound>(pair);
    System.out.printf("%n%s vs %s : %d / %d%n%s",
            pair.getQuery().getAccession(), pair.getTarget().getAccession(),
            scorer.getScore(), scorer.getMaxScore(), pair);
    similars += scorer.getScore();
    total += scorer.getMaxScore();
}
System.out.printf("%nAverage similarity = %d / %d = %f", similars, total, (double)similars/total);
ConcurrencyTools.shutdown();

[1] http://biojava.org/wiki/CVS_to_SVN_Migration
[2] http://biojava.org/download/maven/
[3] http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
[4] http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar

Enjoy,
Mark

On 6/30/2010 7:05 AM, Andy Yates wrote:
> It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents the elements to compare together. Really the Perl code is a way of describing the operations to occur in order to cover all possible combinations.
>
> Andy
>
> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>
>> Hi Andy,
>>
>> Thank you for your reply.
>> Actually, I was thinking about a parallelization method or a kind of Hadoop-like implementation to do all the pairwise comparisons. The aim is that, at the end, I would like to calculate the average pairwise similarity score within a set of sequences.
>>
>> What I am doing is something like this:
>>
>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>     PairwiseScore += CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>   End_For
>> End_For
>>
>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>
>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>
>> As for the solution presented in Perl, sorry but I don't see what you did inside. You created a 2D array? How are the operations performed inside it? I think this does not resolve the ((n * (n-1)) / 2) problem, does it?
>>
>> Radwen
>>
>>
>> 2010/6/30 Andy Yates
>> Hi Radwen,
>>
>> I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences & even if it were possible I don't quite see where the speed increase would come from.
>>
>> As for the All vs. All problem don't forget that really your total number of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:
>>
>> my @output;
>> my @elements = ('some','elements','something');
>> while(scalar(@elements) > 1) {
>>   my $target = pop(@elements);
>>   foreach my $remaining_element (@elements) {
>>     push(@output, [$target, $remaining_element]);
>>   }
>> }
>>
>> So this would have emitted:
>>
>> [
>>   ['something','some'],
>>   ['something','elements'],
>>   ['elements','some']
>> ]
>>
>> Try doing something similar to this using the Java Deque objects which can act as a stack.
>>
>> Hope this helps to answer your question
>>
>> Andy
>>
>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>
>>> Hello Biojava people,
>>>
>>> I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
>>> I am using Biojava 1.7 and I read sequences from a protein FASTA file, then I
>>> store my sequences into an array to calculate pairwise similarity scores
>>> using a for loop.
>>> The problem is that it is very time consuming if we want to calculate all
>>> pairwise scores for a large number of protein sequences. I would like to know if
>>> there is a way to do a kind of "All against All" comparison in one single
>>> step?
>>> Does someone have a solution for this kind of problem?
>>>
>>> Thanks for help.
>>>
>>> Radwen
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
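
The two ideas discussed in this thread, enumerating each unordered pair exactly once with a Deque used as a stack and spreading the comparisons over several threads, can be sketched in plain Java roughly as below. This is only an illustration under stated assumptions: the class AllPairsSketch, its methods, and the toyIdentity scorer are hypothetical names introduced here, and toyIdentity is a stand-in for a real alignment score, not a BioJava call. Note that the average is taken over the n * (n - 1) / 2 pairs rather than over the number of sequences.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration class; not part of BioJava.
class AllPairsSketch {

    /** Lists every unordered pair once: n * (n - 1) / 2 pairs for n items. */
    public static <T> List<List<T>> allPairs(List<T> items) {
        Deque<T> stack = new ArrayDeque<>(items);
        List<List<T>> pairs = new ArrayList<>();
        while (stack.size() > 1) {
            T target = stack.pollLast();          // like Perl's pop: take the last element
            for (T remaining : stack) {           // pair it with everything still left
                pairs.add(Arrays.asList(target, remaining));
            }
        }
        return pairs;
    }

    /** Scores all pairs on a fixed-size thread pool and returns the mean score per pair. */
    public static <T> double averageScore(List<T> items,
                                          ToDoubleBiFunction<T, T> scorer,
                                          int threads) {
        List<List<T>> pairs = allPairs(items);
        if (pairs.isEmpty()) {
            throw new IllegalArgumentException("need at least two items");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (List<T> pair : pairs) {
                Callable<Double> task = () -> scorer.applyAsDouble(pair.get(0), pair.get(1));
                futures.add(pool.submit(task));   // each comparison is independent of the others
            }
            double sum = 0.0;
            for (Future<Double> future : futures) {
                sum += future.get();              // blocks until that comparison has finished
            }
            return sum / pairs.size();            // divide by the number of pairs
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scoring", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a pairwise scoring task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Toy stand-in for an alignment score (assumption): fraction of identical positions. */
    public static double toyIdentity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int shared = Math.min(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("MKVL", "MKIL", "MRVL");
        System.out.println(allPairs(seqs).size() + " pairs");
        System.out.println("average = " + averageScore(seqs, AllPairsSketch::toyIdentity, 2));
    }
}
```

Threads do not reduce the n * (n - 1) / 2 comparisons; they only let independent comparisons run side by side. For real data the scorer argument would wrap whatever pairwise alignment call is in use, and the BioJava 3 route described in Mark's message is a ready-made alternative.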