From cur3n4 at yahoo.es  Thu Jul  1 00:23:55 2010
From: cur3n4 at yahoo.es (Sergio Alvarez)
Date: Thu, 1 Jul 2010 04:23:55 +0000 (GMT)
Subject: [Biojava-dev] Supporting BioJava
Message-ID: <940885.11936.qm@web25902.mail.ukl.yahoo.com>

Hello,

My name is Sergio Alvarez, and I am a software engineer with 11 years of
experience in the Java world. Recently, I have decided to focus my career
on bioinformatics, and as a first step I would like to contribute some of
my time to BioJava.

I suppose you have a list of improvements or new features that you would
like to have, so please let me know if you are interested in my support and
what I could work on.

Thanks a lot and best regards,
Sergio Alvarez

From aradwen at gmail.com  Thu Jul  1 07:05:34 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Thu, 1 Jul 2010 13:05:34 +0200
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <4C2BA2AB.1000307@cs.wisc.edu>
References: <4C2BA2AB.1000307@cs.wisc.edu>
Message-ID:

Mark!

Can we modify the code you gave to:

- extract the MIN pairwise similarity score
- extract the MAX pairwise similarity score
- calculate the standard deviation of the similarity scores

?

Rad

2010/6/30 Mark Chapman

> Hi Radwen,
>
> I have already added this functionality to the BioJava3 alignment package.
> The code is available on the repository [1] and current builds are on the
> web site [2]. The necessary files are [3] and [4] and in the example code
> that follows you should only have to replace "piwi-seed-fasta.txt" with your
> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL.
>
>
> int similars = 0, total = 0;
> GapPenalty gaps = new SimpleGapPenalty();
> SubstitutionMatrix<AminoAcidCompound> blosum62 =
>         new SimpleSubstitutionMatrix<AminoAcidCompound>();
>
> List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
> try {
>     piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
>             new File("piwi-seed-fasta.txt")).values());
> } catch (Exception e) {
>     e.printStackTrace();
> }
>
> for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
>         Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
>                 blosum62)) {
>     PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
>             new FractionalSimilarityScorer<ProteinSequence,
>                     AminoAcidCompound>(pair);
>     System.out.printf("%n%s vs %s : %d / %d%n%s",
>             pair.getQuery().getAccession(),
>             pair.getTarget().getAccession(), scorer.getScore(),
>             scorer.getMaxScore(),
>             pair);
>     similars += scorer.getScore();
>     total += scorer.getMaxScore();
> }
>
> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
>         (double)similars/total);
>
> ConcurrencyTools.shutdown();
>
>
> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
> [2] http://biojava.org/download/maven/
> [3] http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
> [4] http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
>
>
> Enjoy,
> Mark
>
> On 6/30/2010 7:05 AM, Andy Yates wrote:
>
>> It was more of a way of decomposing the operations into a data structure
>> where each element in the 1st dimension represents the elements to compare
>> together. Really the Perl code is a way of describing the operations to
>> occur in order to cover all possible permutations.
>>
>> Andy
>>
>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>>
>>> Hi Andy,
>>>
>>> Thank you for your reply.
>>> Actually, I was thinking about a parallelization method or a kind of
>>> Hadoop-like implementation to do all pairwise comparisons. The aim is
>>> that at the end I would like to calculate the average pairwise
>>> similarity score within a set of sequences.
>>>
>>> What I am doing is something like this:
>>>
>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>>     PairwiseScore
>>>       +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>>   End_For
>>> End_For
>>>
>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>>
>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>>
>>> As for the solution presented in Perl, sorry, but I don't see what you
>>> did inside. You created a 2D array? How do you carry out the operations
>>> inside it? I think this does not resolve the ((n * (n-1)) / 2) problem,
>>> does it?
>>>
>>> Radwen
>>>
>>>
>>> 2010/6/30 Andy Yates
>>> Hi Radwen,
>>>
>>> I would have said that this is more of a problem because of the type of
>>> algorithm you are using. It's impossible (as far as I am aware) to
>>> calculate the score matrices in one step for multiple sequences & even
>>> if it were possible I don't quite see where the speed increase would
>>> come from.
>>>
>>> As for the All vs. All problem don't forget that really your total number
>>> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you
>>> are comparing so a simple 2D for loop will have you spending twice the
>>> amount of time on this than needs to occur. When I've done this before (in
>>> Perl so excuse the usage of it) the code looks like this:
>>>
>>> my @output;
>>> my @elements = ('some','elements','something');
>>> while (scalar(@elements) > 1) {
>>>   my $target = pop(@elements);
>>>   foreach my $remaining_element (@elements) {
>>>     push(@output, [$target, $remaining_element]);
>>>   }
>>> }
>>>
>>> So this would have emitted:
>>>
>>> [
>>>   ['some','elements'],
>>>   ['some','something'],
>>>   ['elements','something']
>>> ]
>>>
>>> Try doing something similar to this using the Java Deque objects which
>>> can act as a stack.
>>>
>>> Hope this helps to answer your question
>>>
>>> Andy
>>>
>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>>
>>>> Hello Biojava people,
>>>>
>>>> I have a question concerning the Needleman-Wunsch and Smith-Waterman
>>>> algorithms.
>>>> I am using Biojava 1.7. I read sequences from a protein FASTA file,
>>>> then I store my sequences in an array to calculate pairwise similarity
>>>> scores using a for loop.
>>>> The problem is that it is very time consuming if we want to calculate
>>>> all pairwise scores for a large number of protein sequences. I would
>>>> like to know if there is a way to do a kind of "All against All"
>>>> comparison in one single step?
>>>> Does anyone have a solution for this kind of problem?
>>>>
>>>> Thanks for help.
>>>>
>>>> Radwen
>>>> _______________________________________________
>>>> biojava-dev mailing list
>>>> biojava-dev at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev

-- 
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys

From ayates at ebi.ac.uk  Thu Jul  1 07:28:01 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 1 Jul 2010 12:28:01 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To:
References: <4C2BA2AB.1000307@cs.wisc.edu>
Message-ID: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

I believe that you could use Jakarta commons-math & more specifically
org.apache.commons.math.stat.descriptive.DescriptiveStatistics, which will
give you min, max, std deviation & anything else you'd expect to be able to
use to describe a range of values.

Andy
-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From HWillis at scripps.edu  Thu Jul  1 07:39:59 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Thu, 1 Jul 2010 07:39:59 -0400
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID:

Andy

I agree we should probably include the commons statistics jar as a must-have
for anything statistical or p-score related.

Scooter
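
For the min / max / standard deviation request discussed above, here is a
minimal sketch written against the plain JDK so that it runs without extra
jars. With commons-math on the classpath, DescriptiveStatistics (addValue,
getMin, getMax, getStandardDeviation) should report the same kind of figures.
The class name ScoreStats and the sample values are made up for illustration,
and using the sample (n - 1) form of the standard deviation is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: summary statistics over pairwise similarity scores. */
final class ScoreStats {
    final double min;
    final double max;
    final double mean;
    final double stdDev; // sample standard deviation (divides by n - 1)

    private ScoreStats(double min, double max, double mean, double stdDev) {
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** One pass over the scores using Welford's running mean/variance update. */
    static ScoreStats of(List<Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("need at least one score");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        int n = 0;
        for (double s : scores) {
            n++;
            min = Math.min(min, s);
            max = Math.max(max, s);
            double delta = s - mean;
            mean += delta / n;
            m2 += delta * (s - mean);
        }
        double stdDev = n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        return new ScoreStats(min, max, mean, stdDev);
    }

    public static void main(String[] args) {
        // Illustrative fractional similarities, not real alignment results.
        ScoreStats s = ScoreStats.of(Arrays.asList(0.25, 0.50, 0.75));
        System.out.printf("min=%.2f max=%.2f mean=%.2f sd=%.2f%n",
                s.min, s.max, s.mean, s.stdDev);
    }
}
```

In the loop from Mark's example, one score per SequencePair would be appended
to a list and the list passed to ScoreStats.of once the loop has finished.
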

From aradwen at gmail.com  Thu Jul  1 08:02:32 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Thu, 1 Jul 2010 14:02:32 +0200
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To:
References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID:

Thanks Andy,

Yes, what you said is absolutely true. My question, though, was more about
the point where Mark calculates the pairwise scores: I don't know whether
there is an "elegant way" to keep the pairwise scores temporarily for further
processing. I am thinking specifically of an array of scores.

TBC ..

Rad
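
One way to keep the per-pair scores around, as asked above, is to collect
them in a list while visiting each unordered pair exactly once. The sketch
below is only an illustration: PairwiseScores, scoreAllPairs and toyIdentity
are hypothetical names, and toyIdentity is a stand-in for a real aligner, not
BioJava API. Note that the average is taken over the number of pairs,
n * (n - 1) / 2, rather than over the number of sequences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper that keeps every pairwise score for later analysis. */
final class PairwiseScores {

    /** Scores each unordered pair (i < j) once; returns n*(n-1)/2 scores. */
    static <T> List<Double> scoreAllPairs(List<T> items,
                                          ToDoubleBiFunction<T, T> scorer) {
        List<Double> scores = new ArrayList<Double>();
        for (int i = 0; i < items.size() - 1; i++) {
            for (int j = i + 1; j < items.size(); j++) {
                scores.add(scorer.applyAsDouble(items.get(i), items.get(j)));
            }
        }
        return scores;
    }

    /** Toy stand-in for an aligner: identical positions over the longer length. */
    static double toyIdentity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        int shared = Math.min(a.length(), b.length());
        int same = 0;
        for (int k = 0; k < shared; k++) {
            if (a.charAt(k) == b.charAt(k)) {
                same++;
            }
        }
        return (double) same / longer;
    }

    /** Mean over the collected pair scores (not over the sequence count). */
    static double mean(List<Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("no pairs to average");
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("ACDE", "ACDF", "ACGG");
        List<Double> scores = scoreAllPairs(seqs, PairwiseScores::toyIdentity);
        System.out.println(scores + " mean=" + mean(scores));
    }
}
```
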

From ayates at ebi.ac.uk  Thu Jul  1 09:08:16 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 1 Jul 2010 14:08:16 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To:
References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID: <207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk>

Looking at the implementation, the pairwise scores for the
FractionalSimilarityScorer come from SequencePair.getNumSimilars(), so the
pairwise scores should always be easily available.

Andy
> > http://www-igbmc.u-strasbg.fr > > http://alnitak.u-strasbg.fr/~aniba/alexsys > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -- > Andrew Yates Ensembl Genomes Engineer > EMBL-EBI Tel: +44-(0)1223-492538 > Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 > Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ > > > > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > > -- > R. ANIBA > > Bioinformatics PhD > Laboratoire de Bioinformatique et Génomique Intégrative, > Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), > 1 rue Laurent Fries, > 67404 Illkirch, France. > http://www-igbmc.u-strasbg.fr > http://alnitak.u-strasbg.fr/~aniba/alexsys -- Andrew Yates Ensembl Genomes Engineer EMBL-EBI Tel: +44-(0)1223-492538 Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ From chapman at cs.wisc.edu Thu Jul 1 19:02:48 2010 From: chapman at cs.wisc.edu (Mark Chapman) Date: Thu, 01 Jul 2010 18:02:48 -0500 Subject: [Biojava-dev] Pairwise similarity score speed In-Reply-To: <207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> Message-ID: <4C2D1E98.1020300@cs.wisc.edu> One additional note: the number of similarities is cached, so after being computed on the first call to SequencePair.getNumSimilars() or FractionalSimilarityScorer.getScore(), further calls simply read a variable. This means the first iteration over a list of sequence pairs may take a while for alignment and calculation of similarities, but later iterations will be fast so no additional array of scores is needed for storage. Mark On 7/1/2010 8:08 AM, Andy Yates wrote: > Looking at the implementation the pairwise scores for the FractionalSimilarityScorer comes from SequencePair.getNumSimilars() so the pairwise scores should always be easily available > > Andy > > On 1 Jul 2010, at 13:02, Radhouane Aniba wrote: > >> Thx Andy, >> >> Yes absolutely true what you said. But my question was much concerning when Mark is calculating pairwise scores I don't know if there is an "elegant way" to keep temporary the pairwise scores for further treatments, i'm specifically thinking about an array of scores. >> >> TBC .. >> >> Rad >> >> 2010/7/1 Scooter Willis >> Andy >> >> I agree we should probably include the commons statistics jar as a must have for anything statistical or p-score related. >> >> Scooter >> >> >> >> On 7/1/10 7:28 AM, "Andy Yates" wrote: >> >> I believe that you could use Jakarta commons-math& more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation& anything else you'd expect to be able to use to describe a range of values >> >> Andy >> >> On 1 Jul 2010, at 12:05, Radhouane Aniba wrote: >> >>> Mark ! >>> >>> Can we modify the code you gave to : >>> >>> - extract the MIN pairwise similarity score >>> - extract the MAX pairwise similarity score >>> - calculate the standard deviation of similarity scores >>> >>> ? 
>>> >>> Rad >>> >>> 2010/6/30 Mark Chapman >>> >>>> Hi Radwen, >>>> >>>> I have already added this functionality to the BioJava3 alignment package. >>>> The code is available on the repository [1] and current builds are on the >>>> web site [2]. The necessary files are [3] and [4] and in the example code >>>> that follows you should only have to replace "piwi-seed-fasta.txt" with your >>>> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just >>>> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL . >>>> >>>> >>>> int similars = 0, total = 0; >>>> GapPenalty gaps = new SimpleGapPenalty(); >>>> SubstitutionMatrix blosum62 = >>>> new SimpleSubstitutionMatrix(); >>>> >>>> List piwi = new ArrayList(); >>>> try { >>>> piwi.addAll(FastaReaderHelper.readFastaProteinSequence( >>>> new File("piwi-seed-fasta.txt")).values()); >>>> } catch (Exception e) { >>>> e.printStackTrace(); >>>> } >>>> >>>> for (SequencePair pair : >>>> Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps, >>>> blosum62)) { >>>> PairwiseSequenceScorer scorer = >>>> new FractionalSimilarityScorer>>> AminoAcidCompound>(pair); >>>> System.out.printf("%n%s vs %s : %d / %d%n%s", >>>> pair.getQuery().getAccession(), >>>> pair.getTarget().getAccession(), scorer.getScore(), >>>> scorer.getMaxScore(), >>>> pair); >>>> similars += scorer.getScore(); >>>> total += scorer.getMaxScore(); >>>> } >>>> >>>> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total, >>>> (double)similars/total); >>>> >>>> ConcurrencyTools.shutdown(); >>>> >>>> >>>> [1] http://biojava.org/wiki/CVS_to_SVN_Migration >>>> [2] http://biojava.org/download/maven/ >>>> [3] >>>> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar >>>> [4] >>>> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar >>>> >>>> >>>> Enjoy, >>>> Mark >>>> >>>> >>>> >>>> On 6/30/2010 7:05 AM, Andy Yates 
wrote: >>>> >>>>> It was more of a way of decomposing the operations into a data structure >>>>> where each element in the 1st dimension represents the elements to compare >>>>> together. Really the Perl code is a way of describing the operations to >>>>> occur in order to cover all possible permutations. >>>>> >>>>> Andy >>>>> >>>>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote: >>>>> >>>>> Hi Andy, >>>>>> >>>>>> Thank you for your reply. >>>>>> Actually, I was thinking about a parallelization method or a kind of >>>>>> hadoop like implementation to do all pairwise comparison. The aim is that at >>>>>> the end i would like to calculate the average pairwise similarity score >>>>>> within a set of sequences. >>>>>> >>>>>> What I am doing is something like that : >>>>>> >>>>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1 >>>>>> For J=I+1 to J=Length(ARRAY_OF_SEQUENCES) >>>>>> PairwiseScore >>>>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J]) >>>>>> End_For >>>>>> End_For >>>>>> >>>>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES) >>>>>> >>>>>> In fact the problem is in the ((n * (n-1)) / 2) operations. >>>>>> >>>>>> As for the solution presented in perl sorry but I dont see what you've >>>>>> did inside ?! You created a 2D array ? how to achieve operations inside , I >>>>>> think this do not resolve the ((n * (n-1)) / 2) problem ? Isn't it ? >>>>>> >>>>>> Radwen >>>>>> >>>>>> >>>>>> 2010/6/30 Andy Yates >>>>>> Hi Radwen, >>>>>> >>>>>> I would have said that this is more of a problem because of the type of >>>>>> algorithm you are using. It's impossible (as far as I am aware) to calculate >>>>>> the score matrices in one step for multiple sequences& even if it did I >>>>>> don't quite see where the speed increase would come from. >>>>>> >>>>>> As for the All vs. 
All problem don't forget that really your total number >>>>>> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you >>>>>> are comparing so a simple 2D for loop will have you spending twice the >>>>>> amount of time on this than needs to occur. When I've done this before (in >>>>>> Perl so excuse the usage of it) the code looks like this: >>>>>> >>>>>> my @output; >>>>>> my @elements = ('some','elements','something'); >>>>>> while(scalar(@elements)> 1) { >>>>>> my $target = pop(@elements); >>>>>> foreach my $remaining_element (@elements) { >>>>>> push(@output, [$target, $remaining_element]); >>>>>> } >>>>>> } >>>>>> >>>>>> So this would have emitted: >>>>>> >>>>>> [ >>>>>> ['some','elements'], >>>>>> ['some','something'], >>>>>> ['elements','something'] >>>>>> ] >>>>>> >>>>>> Try doing something similar to this using the Java Deque objects which >>>>>> can act as a stack. >>>>>> >>>>>> Hope this helps to answer your question >>>>>> >>>>>> Andy >>>>>> >>>>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote: >>>>>> >>>>>> Hello Biojava people, >>>>>>> >>>>>>> I have a question concerning Needlman Wunsh or Smith waterman >>>>>>> algorithms. >>>>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file >>>>>>> then I >>>>>>> store my sequences into an array to calculate pairwise similarity scores >>>>>>> using a for loop. >>>>>>> The problem is that it is very time consuming if we want to calculate >>>>>>> all >>>>>>> pairwise for a big number of protein sequences. I would like to know if >>>>>>> there is way to do a kind of "All against All" comparisons in one single >>>>>>> step ? >>>>>>> Do someone have a solution for this kind of problem ? >>>>>>> >>>>>>> Thanks for help. >>>>>>> >>>>>>> Radwen >>>>>>> _______________________________________________ >>>>>>> biojava-dev mailing list >>>>>>> biojava-dev at lists.open-bio.org >>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>>>>> >>>>>> >>> >>> >>> -- >>> R. 
ANIBA >>> >>> Bioinformatics PhD >>> Laboratoire de Bioinformatique et Génomique Intégrative, >>> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), >>> 1 rue Laurent Fries, >>> 67404 Illkirch, France. >>> http://www-igbmc.u-strasbg.fr >>> http://alnitak.u-strasbg.fr/~aniba/alexsys >>> >>> _______________________________________________ >>> biojava-dev mailing list >>> biojava-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> >> -- >> Andrew Yates Ensembl Genomes Engineer >> EMBL-EBI Tel: +44-(0)1223-492538 >> Wellcome Trust Genome Campus Fax: +44-(0)1223-494468 >> Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/ >> >> >> >> >> >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> >> >> >> >> -- >> R. ANIBA >> >> Bioinformatics PhD >> Laboratoire de Bioinformatique et Génomique Intégrative, >> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), >> 1 rue Laurent Fries, >> 67404 Illkirch, France. >> http://www-igbmc.u-strasbg.fr >> http://alnitak.u-strasbg.fr/~aniba/alexsys > From andreas at sdsc.edu Sat Jul 3 21:55:36 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Sat, 3 Jul 2010 18:55:36 -0700 Subject: [Biojava-dev] BioJava at ISMB, BOSC, and 3D SIG Message-ID: Hi, Next week the BOSC and ISMB conferences will be in Boston. There will be a couple of opportunities to meet BioJava-related people, even if we won't have a BioJava-specific talk at BOSC this time. The week will start with the Codefest in the days before BOSC. I have not heard too much about that, so perhaps one of the people who are going there can give us an update about what the plans are? At BOSC, there will be a talk from Jianjiong Gao, who is also one of our Google Summer of Code students. He will present on general and kinase-specific phosphorylation sites (with Musite).
I will give a talk at 3D-SIG about structure alignments (using BioJava). Any other BioJava related events that are planned? Is anybody planning to blog or twitter about the conferences ? If people are interested in a meetup in Boston, drop me a mail and we'll arrange something... Andreas From andreas at sdsc.edu Wed Jul 14 19:39:58 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 14 Jul 2010 16:39:58 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: References: Message-ID: Hi Sylvain, Cool, thanks for getting this started. There are junit tests in most of the other modules. Probably best to take a look at them as a template. Should be pretty straightforward from there. In terms of biojava-3 the RichSequence representation is still from the old biojava 1.7 design. Can you try to use the new sequence code base? Andreas On Wed, Jul 14, 2010 at 10:57 AM, Sylvain Foisy < sylvain.foisy at inflammgen.org> wrote: > Hi Andreas, > > I created a new module called biojava3-ws to collect all (wow, am I > ambitious...) stuff related to using Web services :data submission, > processing and results collection. I hope I did not break anything... > > I am starting to look into creating JUnit test units to complete this, > something that I never done before. Do you have some pointers toward some > tutorial material for this? It's my first coding in more than 2-3 years...
> > BTW, I am still using RichSequence objects to feed into the BLAST requests > but are these objects having a future with biojava3? What should be used > instead? > > Best regards and apologies if these are stupid questions... > > Sylvain > > > ================================================================== > > Sylvain Foisy, Ph. D. > Chargé de projet / Project Manager > Bio-informatique > > Adresse postale: > > Laboratoire de genetique et medecine genomique de l'inflammation > Institut de cardiologie de Montreal > 5000 Belanger > Montreal, Qc > H1T 1C8 > > T: 514-376-3330 x.2299 | F: 514-593-2539 > M: sylvain.foisy at inflammgen.org > W: http://www.inflammgen.org > > ================================================================== > > > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Wed Jul 14 19:44:42 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 14 Jul 2010 16:44:42 -0700 Subject: [Biojava-dev] automated build messages Message-ID: Hi, the automated cruisecontrol builds of biojava-svn seem to work fine and I will soon set up an auto-forward to this mailing list, so it is easier for you to follow SVN activity... Andreas -- ----------------------------------------------------------------------- Dr.
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Thu Jul 15 12:22:08 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 15 Jul 2010 09:22:08 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: References: Message-ID: > > > > I have looked into the docs for BJ3 in the wiki and it is utterly confusing > with the code in the svn: there is no FASTAReader not FASTAFileReader > classes... I see two packages with possible FASTA-pertinent material: > sequence and biojava3-core. Confusing to say the least. I agree, the wiki docu is mainly related to the 1.7 release of BioJava, we need to start adding documentation for the new code base! Andreas -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Thu Jul 15 12:52:37 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 15 Jul 2010 09:52:37 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: <51EEE556-210F-4F1E-858A-66ACBE5C3B92@scripps.edu> References: <51EEE556-210F-4F1E-858A-66ACBE5C3B92@scripps.edu> Message-ID: > > > Wouldn't be a bad idea to start a biojava3 wiki with a different URL so > that search and organization is clear. This would also increase the > motivation to add content because it would be empty. I will start writing > wiki content over the weekend for the core module. Good idea. 
I just added a new BioJava 3 specific cookbook page and updated the wiki-front page to make this more clear: All BioJava3 specific docu should go here: http://biojava.org/wiki/BioJava:CookBook3.0 Once we make BioJava 3 the official version, we can move the 1.7 cookbook page to a different location and make the v. 3.0 cookbook the default one... Andreas From jake at researchtogether.com Mon Jul 19 09:49:04 2010 From: jake at researchtogether.com (jake at researchtogether.com) Date: Mon, 19 Jul 2010 14:49:04 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <4C2D1E98.1020300@cs.wisc.edu> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> Message-ID: <20100719134904.GA20200@researchtogether.com> Hi All, I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. Any opinions? :) Cheers, Jake From holland at eaglegenomics.com Mon Jul 19 11:02:23 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 19 Jul 2010 16:02:23 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100719134904.GA20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. I don't know. There must be experts on this in the list who can help! cheers, Richard On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > Hi All, > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > Any opinions? :) > > Cheers, > Jake > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jprocter at compbio.dundee.ac.uk Mon Jul 19 10:57:30 2010 From: jprocter at compbio.dundee.ac.uk (Jim Procter) Date: Mon, 19 Jul 2010 15:57:30 +0100 Subject: [Biojava-dev] osgi-bioinformatics@googlegroups.com: new mailing list for OSGi issues in bioinformatics Message-ID: <4C4467DA.6080709@compbio.dundee.ac.uk> Hi all. 
Some of you will be aware of the OSGi plugin architecture (www.osgi.org), which is used by a number of Java applications. You may also be aware that a number of bioinformatics projects are considering adopting the OSGi plugin model, or are currently migrating their architecture to it (with or without various additional mechanisms, e.g. Spring, or equinox-p2, etc). I am involved in one such project, and I've created the osgi-bioinformatics Google group because I'd very much like to be able to discuss OSGi-related issues with others from the bioinformatics software development field who have some OSGi experience. It would also be great if we could thresh out some best-practice guidelines, and discuss the kinds of modules our projects provide - so others in the Bioinformatics-OSGi ecosystem might make use of them. Sorry to clutter up your inboxes with yet another mailing list invite, but hopefully, the discussion will be relevant to some of you working on biojava3 - if not now, then later on, when you all feel the biojava3 APIs are more mature. If you are OSGi (il)literate, or wish to be, then please join the list at http://groups.google.co.uk/group/osgi-bioinformatics Thanks for your attention ;) Jim Procter. -- ------------------------------------------------------------------- J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk The University of Dundee is a Scottish Registered Charity, No. SC015096. From andreas at sdsc.edu Mon Jul 19 20:49:20 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 19 Jul 2010 17:49:20 -0700 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100719134904.GA20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: Hi Jake, Thanks for this. Would it be possible to add a "Cookbook" page for how to use the NCBISequence reader as well? Some examples would be great... I understand the NCBI now requires scripts to provide email address etc.... Would be good to explain how to do this. We just started to work on more docu here: http://biojava.org/wiki/BioJava:CookBook3.0 Thanks, Andreas On Mon, Jul 19, 2010 at 6:49 AM, wrote: > Hi All, > > I've been drawing up a design for the work I have done on the NCBI > SequenceReader and I've talked through some things with Scooter which I have > put on the wiki at: > http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > One thing I would like to throw open for discussion is the possibility of > changing the Sequence interface so that the methods can throw a new > exception - SequenceException. > > Any opinions? :) > > Cheers, > Jake > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From HWillis at scripps.edu Mon Jul 19 21:09:32 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Mon, 19 Jul 2010 21:09:32 -0400 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: <23594ECE-E9D6-4C00-B72E-ACB0625FE85B@scripps.edu> Jake Do you have any updates to the code? I can go ahead and check it in. Scooter On Jul 19, 2010, at 8:49 PM, Andreas Prlic wrote: > Hi Jake, > > Thanks for this. Would it be possible to add a "Cookbook" page for how to > use the NCBISequence reader as well? Some examples would be great... I > understand the NCBI now requires scripts to provide email address etc.... > Would be good to explain how to do this. We just started to work on more > docu here: > > http://biojava.org/wiki/BioJava:CookBook3.0 > > Thanks, > Andreas > > > > On Mon, Jul 19, 2010 at 6:49 AM, wrote: > >> Hi All, >> >> I've been drawing up a design for the work I have done on the NCBI >> SequenceReader and I've talked through some things with Scooter which I have >> put on the wiki at: >> http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview >> >> One thing I would like to throw open for discussion is the possibility of >> changing the Sequence interface so that the methods can throw a new >> exception - SequenceException. >> >> Any opinions? :) >> >> Cheers, >> Jake >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > > > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev From markjschreiber at gmail.com Mon Jul 19 22:20:10 2010 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 20 Jul 2010 10:20:10 +0800 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: I don't think it is a great idea to hide IO exceptions but you can reduce the burden of them. You can copy the Groovy model which handles a lot of the try/catch/finally boilerplate code for you. Basically you make a helper class with methods to perform common IO operations and which will do its very best to connect, read/write and clean up. You can also think about what might actually cause an error. If you are reading from a local disk cache where the file address is known (such as a temp file) you can very nearly guarantee that the IO operation will succeed. So much so that you could rethrow an IOException as an error because there is very little that can be done about it (other than improving the cache code or getting more reliable hard-drives). Reading a file from disk? The most likely problem is an incorrect file name. Other problems can probably be turned into runtime exceptions because they are probably disk errors. Reading from a URL? Lots of things can go wrong here, so you probably need to expose all the possible exceptions. Reading from SQL? Kind of depends on the expected DB availability and latency. Also, if the query code (or JPA query) is coming from the BioJava source then an error is appropriate (the developer can't do much about the mistake). If the code is coming from the app developer then you should notify them of SQL errors. - Mark On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland wrote: > > I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. > > SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate?
Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. > > I don't know. There must be experts on this in the list who can help! > > cheers, > Richard > > > On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > > > Hi All, > > > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > > > Any opinions? :) > > > > Cheers, > > Jake > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev From jake at researchtogether.com Tue Jul 20 05:46:39 2010 From: jake at researchtogether.com (jake at researchtogether.com) Date: Tue, 20 Jul 2010 10:46:39 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: <20100720094639.GD20200@researchtogether.com> See comments inline. Thanks, Jake On Tue, Jul 20, 2010 at 10:20:10AM +0800, Mark Schreiber wrote: > I don't think it is a great idea to hide IO exceptions but you can > reduce the burden of them. I would normally agree with you, but as I shall point out later this will have a lot of knock-on effects for the interface which may not be desirable. > > You can copy the Groovy model which handles a lot of the > try/catch/finally boilerplate code for you. Basically you make a > helper class with methods to perform common IO operations and which > will do its very best to connect, read/write and clean up. > > You can also think about what might actually cause an error. If you > are reading from a local disk cache where the file address is known > (such as a temp file) you can very nearly guarantee that the IO > operation will succeed. So much so that you could rethrow an IO > Exception as an error because there is very little that can be done > about it (other than improving the cache code or getting more reliable > hard drives). And this is the issue - the Sequence interface is used by a lot of different readers: some read from disk, others from a database, and in my particular case I am reading from a URL. Also, it is possible that I will run into a lot of exceptions around XML parsing (the data from the URL) as well as HTTP errors (page not found, service unavailable, etc.). Now, normally I would want to deal with some of the errors and only log them - e.g. on a 503 I might retry a few times, and if there is a problem with the XML I might try to fetch it again. 
However, I don't fully understand how the caller will expect these SequenceReaders to behave, which is why I asked the question :) An IOException on a file is probably fatal but an IOException on a network call is possibly recoverable, or at least worth retrying.
As for what can cause errors:
1. Invalid URL
2. Page(s) unavailable (4xx, 5xx)
3. Invalid/unexpected data returned (XML badly formed, FASTA invalid)
4. Change to service (if the service has changed and the parser is effectively broken)
5. Network interruption (e.g. network timeout)
> > Reading a file from disk? The most likely problem is an incorrect file > name. Other problems can probably be turned into runtime exceptions > because they are probably disk errors. > > Reading from a URL? Lots of things can go wrong here, so you probably > need to expose all the possible exceptions. I will work on this assumption and change the interface accordingly, though I expect that the decision will be revisited. > > Reading from SQL? Kind of depends on the expected DB availability and > latency. Also, if the query code (or JPA query) is coming from the > BioJava source then an error is appropriate (the developer can't do > much about the mistake). If the code is coming from the app developer > then you should notify them of SQL errors. > > - Mark > > On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland > wrote: > > > > I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. > > > > SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. > > > > I don't know. There must be experts on this in the list who can help! 
> > > > cheers, > > Richard > > > > > > On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > > > > > Hi All, > > > > > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > > > > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > > > > > Any opinions? :) > > > > > > Cheers, > > > Jake > > > _______________________________________________ > > > biojava-dev mailing list > > > biojava-dev at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- > > Richard Holland, BSc MBCS > > Operations and Delivery Director, Eagle Genomics Ltd > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev From aman.a.gupta1989 at gmail.com Tue Jul 20 06:07:39 2010 From: aman.a.gupta1989 at gmail.com (aman gupta) Date: Tue, 20 Jul 2010 15:37:39 +0530 Subject: [Biojava-dev] Restriction site Message-ID: Respected Sir/madam, I would like to know how to implement the RestrictionSite interface under org.biojava.bio.molbio.RestrictionSite in actual program..??? kindly help and do the needfull... please -- ----------------------------- Aman.A.Gupta From HWillis at scripps.edu Tue Jul 20 11:25:36 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Tue, 20 Jul 2010 11:25:36 -0400 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100720094639.GD20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> <20100720094639.GD20200@researchtogether.com> Message-ID: <4D510D58-22F1-4F20-84D3-ED51DFDB1386@scripps.edu> Mark makes some very good points and it will be a challenge to come up with robust (appropriate) error reporting and still maintain flexibility where writing code is easy as long as everything works. Currently, you can pass a class that implements the Sequence Interface to the constructor of a DNASequence, ProteinSequence etc. If the class that implements the Sequence Interface throws an exception when it is created then that is outside the API design of the abstract sequence. In the following example UniprotProxySequenceReader upon creation would call the appropriate URL and retrieve the sequence. If an error occurs then that class should throw the appropriate exception. We don't need to force a particular exception on classes that implement an interface.

UniprotProxySequenceReader uniprotSequence = new UniprotProxySequenceReader("YA745_GIBZE", AminoAcidCompoundSet.getAminoAcidCompoundSet());
ProteinSequence proteinSequence = new ProteinSequence(uniprotSequence);

We do have an API/exception design problem if UniprotProxySequenceReader does lazy instantiation, where it doesn't retrieve the sequence data unless a call to proteinSequence.getSequence() is made. This allows us to create applications where you can load a large number of sequences without consuming memory for sequences that will never be used. If you have a web-based application where the user will query a sequence based on some event then this is a nice design element. If you are writing code to examine the GC content of every gene sequence then it is not a big memory saver. The easy solution is to have every sequence method that depends on a class implementing the Sequence Interface declare that it throws an exception. 
This would add exception handling code for users of the API, which adds complexity and can introduce a performance penalty if the try/catch is not done once for a whole block of code. The reality is that, of the X methods that depend on a Sequence Interface class, if one fails they will all fail. We could add an isInit() method to AbstractSequence, which throws an exception or returns a boolean, that is designed to force the Sequence Interface to load sequence data from external sources. The user of the API, via our contract definition, can do defensive programming and make sure the sequence is ready before using it. If it is not ready and a method is called that depends on the Sequence Interface then we simply return the appropriate null/not-defined object. The last use case that still makes this difficult is being able to define a ChromosomeSequence(new NCBISequenceReader("NC_000019.9")) where a call to get a collection of gene sequences from the chromosome sequence is done in a lazy fashion without retrieving the entire chromosome sequence. If I make a call to geneSequence.getProteinSequence().toString() then that would issue the appropriate getSubString(2000,5000) call, which maps to the gene, to the NCBISequenceReader, which then retrieves that subsequence from NCBI. To allow this option we cannot depend on isInit() to be correct. In this particular example we have three types of errors: the internet connection is not working, NCBI is not working or is refusing your connection because you went over the three-requests-per-second rule, or you have something wrong with your accession id. If the internet is down or NCBI is refusing your connection there is not a great deal the application can do to recover. An invalid accession id could be handled when you instantiate the class with new NCBISequenceReader("NC_000019.9"), by making some sort of call to NCBI to see if it is valid and throwing an exception if it is not. 
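
To make the isInit() idea concrete, here is a minimal sketch of how it might look. It is an illustration only: LazySequence, its loader argument and the UNKNOWN_LENGTH sentinel are hypothetical names made up for this example and are not part of the BioJava3 API. The loader simply stands in for whatever remote call a reader would make.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch of the isInit() idea; not BioJava3 API.
 * The loader stands in for a remote fetch (e.g. a UniProt or NCBI call).
 */
final class LazySequence {
    /** Assumed sentinel meaning "length not defined". */
    static final int UNKNOWN_LENGTH = -1;

    private final Callable<String> loader;
    private String data; // stays null until a load succeeds

    LazySequence(Callable<String> loader) {
        this.loader = loader;
    }

    /** Forces the load if needed; true means the data is ready. */
    boolean isInit() {
        if (data == null) {
            try {
                data = loader.call();
            } catch (Exception e) {
                // IO, HTTP or parsing failure: report "not ready" to the caller.
                return false;
            }
        }
        return data != null;
    }

    /** Defensive accessor: returns a sentinel instead of throwing. */
    int getLength() {
        return isInit() ? data.length() : UNKNOWN_LENGTH;
    }

    public static void main(String[] args) {
        LazySequence ok = new LazySequence(() -> "ACGT");
        LazySequence down = new LazySequence(() -> {
            throw new java.io.IOException("service unavailable");
        });
        System.out.println(ok.isInit() + " " + ok.getLength());
        System.out.println(down.isInit() + " " + down.getLength());
    }
}
```

With something like this a caller can check isInit() once before a block of work instead of wrapping every accessor in try/catch. The trade-off is that the cause of the failure is lost unless it is logged or stored, and, as noted above, a successful isInit() says nothing about later lazy sub-sequence requests.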
We do have options when a particular service is down or slow to respond. Uniprot implements a DNS-based load distribution that I did have a problem with one weekend. It was very slow and often did not respond. It turns out that if I changed my URL I could point to http://pir.uniprot.org, located in the US, and everything worked great. This could be something implemented by UniprotProxySequenceReader if it gets an IO exception or determines queries are taking a long time. In summary we probably should throw exceptions for each method that depends on the Sequence Interface and/or return a set of appropriate null/not-init objects. Given that we are working with imperfect data models and data relationships I think defensive programming on return values is not a bad option. It is a shame to have getSequenceLength() throw an exception or return a null Integer if an IO exception occurs. These are only problems when using a Sequence Interface that has a higher risk of failure because it is remote, and that would be the "exception" not the rule. For hard-core developers we can resolve these issues when they occur. If the Biojava-core code makes its way into an end-user application then we need to give the application developer a way to deal with error conditions. Using the NCBI chromosome example I think we can create a very powerful API to work with large amounts of sequence data, but at the expense of making the API very exception happy! We have also begun the very exciting step of doing wiki docs specific to Biojava3. It is a work in progress: http://biojava.org/wiki/BioJava:CookBook3.0 Thanks Scooter On Jul 20, 2010, at 5:46 AM, > wrote: See comments in line. Thanks, Jake On Tue, Jul 20, 2010 at 10:20:10AM +0800, Mark Schreiber wrote: I don't think it is a great idea to hide IO exceptions but you can reduce the burden of them. I would normally agree with you, but as I shall point out later this will have a lot of knock on effects for the interface which may not be desirable. 
You can copy the Groovy model which handles a lot of the try/catch/finally boiler plate code for you. Basically you make a helper class with methods to perform common IO operations and which will do it's very best to connect, read/write and clean up. You can also think about what might actually cause an error. If you are reading from a local disk cache where the file address is known (such as a temp file) you can very nearly guarantee that the IO operation will succeed. So much so that you could rethrow an IO Exception as an error because there is very little that can be done about it (other than improving the cache code or getting more reliable hard-drives). And this is the issue - the Sequence interface is used by a lot of different readers, some are reading from disk, others from database and in my particular case I am reading it from a URL. Also, it is possible that I will run into a lot of exceptions around XML parsing (the data from the URL) as well as HTTP errors (page not found, service unavailable etc.) Now, normally I would want to deal with some of the errors and only log them - e.g. a 503 I might retry a few times and if there is a problem with the XML I might try and fetch it again. However, I don't fully understand how the caller will expect these SequenceReaders to behave which I why I asked the question :) An IOException on a file is probably fatal but IOException on a network call is possibly recoverable, or at least wort re-trying. As for what can cause errors: 1. Invalid URL 2. Page(s) unavailable (4xx, 5xx) 3. Invalid/unexpected data returned (XML badly formed, FASTA invalid) 4. Change to service (if the service has changed and the parser is effectively broken) 5. Network interuptance (i.e. network timeout) Reading a file from disk? The most likely problem is a incorrect file name. Other problems can probably be turned into runtime exceptions cause other problems are probably disk errors. 
Reading from a URL, lots of things can go wrong here so you probably need to expose all the possible exceptions. I will work on this assumption and change the interface accordingly, though I expect that the decision will be re-visited. Reading from SQL? Kind of depends on the expected DB availability and latency. Also, if the query code (or JPA query) is coming from the BioJava source then an error is appropriate (the developer can't do much about the mistake). If the code is coming from the app developer then you should notify them of SQL errors. - Mark On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland > wrote: I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. I don't know. There must be experts on this in the list who can help! cheers, Richard On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: Hi All, I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. Any opinions? 
:) Cheers, Jake _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev _______________________________________________ biojava-dev mailing list biojava-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biojava-dev From andreas at sdsc.edu Sat Jul 24 05:24:26 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Sat, 24 Jul 2010 02:24:26 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.27 Build Fixed Message-ID: <980155567.1279963466731.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: From andreas at sdsc.edu Wed Jul 28 20:37:35 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Wed, 28 Jul 2010 17:37:35 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.35 Build Fixed Message-ID: <512820011.1280363855648.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: From andreas at sdsc.edu Sat Jul 31 01:55:57 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Fri, 30 Jul 2010 22:55:57 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.39 Build Fixed Message-ID: <960385312.1280555757389.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: From cur3n4 at yahoo.es Thu Jul 1 04:23:55 2010 From: cur3n4 at yahoo.es (Sergio Alvarez) Date: Thu, 1 Jul 2010 04:23:55 +0000 (GMT) Subject: [Biojava-dev] Supporting BioJava Message-ID: <940885.11936.qm@web25902.mail.ukl.yahoo.com> Hello, My name is Sergio Alvarez, and I am a software engineer with 11 years of experience in the Java world. 
Recently, I have decided to focus my career towards bioinformatics, and as a first step, I would like to be able to contribute some of my time to Bio Java. I suppose you have a list of improvements or new features, that you would like to have, so please let me know if you are interested in my support and what I could work on. Thanks a lot and Best Regards Sergio Alvarez From aradwen at gmail.com Thu Jul 1 11:05:34 2010 From: aradwen at gmail.com (Radhouane Aniba) Date: Thu, 1 Jul 2010 13:05:34 +0200 Subject: [Biojava-dev] Pairwise similarity score speed In-Reply-To: <4C2BA2AB.1000307@cs.wisc.edu> References:

<4C2BA2AB.1000307@cs.wisc.edu> Message-ID: Mark ! Can we modify the code you gave to : - extract the MIN pairwise similarity score - extract the MAX pairwise similarity score - calculate the standard deviation of similarity scores ? Rad 2010/6/30 Mark Chapman > Hi Radwen, > > I have already added this functionality to the BioJava3 alignment package. > The code is available on the repository [1] and current builds are on the > web site [2]. The necessary files are [3] and [4] and in the example code > that follows you should only have to replace "piwi-seed-fasta.txt" with your > file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just > change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL . > > > int similars = 0, total = 0; > GapPenalty gaps = new SimpleGapPenalty(); > SubstitutionMatrix blosum62 = > new SimpleSubstitutionMatrix(); > > List piwi = new ArrayList(); > try { > piwi.addAll(FastaReaderHelper.readFastaProteinSequence( > new File("piwi-seed-fasta.txt")).values()); > } catch (Exception e) { > e.printStackTrace(); > } > > for (SequencePair pair : > Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps, > blosum62)) { > PairwiseSequenceScorer scorer = > new FractionalSimilarityScorer AminoAcidCompound>(pair); > System.out.printf("%n%s vs %s : %d / %d%n%s", > pair.getQuery().getAccession(), > pair.getTarget().getAccession(), scorer.getScore(), > scorer.getMaxScore(), > pair); > similars += scorer.getScore(); > total += scorer.getMaxScore(); > } > > System.out.printf("%nAverage similarity = %d / %d = %f", similars, total, > (double)similars/total); > > ConcurrencyTools.shutdown(); > > > [1] http://biojava.org/wiki/CVS_to_SVN_Migration > [2] http://biojava.org/download/maven/ > [3] > http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar > [4] > http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar > > > Enjoy, > Mark > > 
> > On 6/30/2010 7:05 AM, Andy Yates wrote: > >> It was more of a way of decomposing the operations into a data structure >> where each element in the 1st dimension represents the elements to compare >> together. Really the Perl code is a way of describing the operations to >> occur in order to cover all possible permutations. >> >> Andy >> >> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote: >> >> Hi Andy, >>> >>> Thank you for your reply. >>> Actually, I was thinking about a parallelization method or a kind of >>> hadoop like implementation to do all pairwise comparison. The aim is that at >>> the end i would like to calculate the average pairwise similarity score >>> within a set of sequences. >>> >>> What I am doing is something like that : >>> >>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1 >>> For J=I+1 to J=Length(ARRAY_OF_SEQUENCES) >>> PairwiseScore >>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J]) >>> End_For >>> End_For >>> >>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES) >>> >>> In fact the problem is in the ((n * (n-1)) / 2) operations. >>> >>> As for the solution presented in perl sorry but I dont see what you've >>> did inside ?! You created a 2D array ? how to achieve operations inside , I >>> think this do not resolve the ((n * (n-1)) / 2) problem ? Isn't it ? >>> >>> Radwen >>> >>> >>> 2010/6/30 Andy Yates >>> Hi Radwen, >>> >>> I would have said that this is more of a problem because of the type of >>> algorithm you are using. It's impossible (as far as I am aware) to calculate >>> the score matrices in one step for multiple sequences& even if it did I >>> don't quite see where the speed increase would come from. >>> >>> As for the All vs. All problem don't forget that really your total number >>> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you >>> are comparing so a simple 2D for loop will have you spending twice the >>> amount of time on this than needs to occur. 
When I've done this before (in >>> Perl so excuse the usage of it) the code looks like this: >>> >>> my @output; >>> my @elements = ('some','elements','something'); >>> while(scalar(@elements) > 1) { >>> my $target = pop(@elements); >>> foreach my $remaining_element (@elements) { >>> push(@output, [$target, $remaining_element]); >>> } >>> } >>> >>> So this would have emitted: >>> >>> [ >>> ['something','some'], >>> ['something','elements'], >>> ['elements','some'] >>> ] >>> >>> Try doing something similar to this using the Java Deque objects which >>> can act as a stack. >>> >>> Hope this helps to answer your question >>> >>> Andy >>> >>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote: >>> >>> Hello Biojava people, >>>> >>>> I have a question concerning Needleman-Wunsch or Smith-Waterman >>>> algorithms. >>>> I am using Biojava 1.7 and I read sequences from proteins fasta file >>>> then I >>>> store my sequences into an array to calculate pairwise similarity scores >>>> using a for loop. >>>> The problem is that it is very time consuming if we want to calculate >>>> all >>>> pairwise for a big number of protein sequences. I would like to know if >>>> there is a way to do a kind of "All against All" comparisons in one single >>>> step ? >>>> Does someone have a solution for this kind of problem ? >>>> >>>> Thanks for help. >>>> >>>> Radwen >>>> _______________________________________________ >>>> biojava-dev mailing list >>>> biojava-dev at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>> >>> -- R. ANIBA Bioinformatics PhD Laboratoire de Bioinformatique et Génomique Intégrative, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), 1 rue Laurent Fries, 67404 Illkirch, France. 
http://www-igbmc.u-strasbg.fr http://alnitak.u-strasbg.fr/~aniba/alexsys From ayates at ebi.ac.uk Thu Jul 1 11:28:01 2010 From: ayates at ebi.ac.uk (Andy Yates) Date: Thu, 1 Jul 2010 12:28:01 +0100 Subject: [Biojava-dev] Pairwise similarity score speed In-Reply-To: References:

<4C2BA2AB.1000307@cs.wisc.edu> Message-ID: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk> I believe that you could use Jakarta commons-math & more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation & anything else you'd expect to be able to use to describe a range of values Andy On 1 Jul 2010, at 12:05, Radhouane Aniba wrote: > Mark ! > > Can we modify the code you gave to : > > - extract the MIN pairwise similarity score > - extract the MAX pairwise similarity score > - calculate the standard deviation of similarity scores > > ? > > Rad > > 2010/6/30 Mark Chapman > >> Hi Radwen, >> >> I have already added this functionality to the BioJava3 alignment package. >> The code is available on the repository [1] and current builds are on the >> web site [2]. The necessary files are [3] and [4] and in the example code >> that follows you should only have to replace "piwi-seed-fasta.txt" with your >> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just >> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL . 
>> >> >> int similars = 0, total = 0; >> GapPenalty gaps = new SimpleGapPenalty(); >> SubstitutionMatrix blosum62 = >> new SimpleSubstitutionMatrix(); >> >> List piwi = new ArrayList(); >> try { >> piwi.addAll(FastaReaderHelper.readFastaProteinSequence( >> new File("piwi-seed-fasta.txt")).values()); >> } catch (Exception e) { >> e.printStackTrace(); >> } >> >> for (SequencePair pair : >> Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps, >> blosum62)) { >> PairwiseSequenceScorer scorer = >> new FractionalSimilarityScorer> AminoAcidCompound>(pair); >> System.out.printf("%n%s vs %s : %d / %d%n%s", >> pair.getQuery().getAccession(), >> pair.getTarget().getAccession(), scorer.getScore(), >> scorer.getMaxScore(), >> pair); >> similars += scorer.getScore(); >> total += scorer.getMaxScore(); >> } >> >> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total, >> (double)similars/total); >> >> ConcurrencyTools.shutdown(); >> >> >> [1] http://biojava.org/wiki/CVS_to_SVN_Migration >> [2] http://biojava.org/download/maven/ >> [3] >> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar >> [4] >> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar >> >> >> Enjoy, >> Mark >> >> >> >> On 6/30/2010 7:05 AM, Andy Yates wrote: >> >>> It was more of a way of decomposing the operations into a data structure >>> where each element in the 1st dimension represents the elements to compare >>> together. Really the Perl code is a way of describing the operations to >>> occur in order to cover all possible permutations. >>> >>> Andy >>> >>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote: >>> >>> Hi Andy, >>>> >>>> Thank you for your reply. >>>> Actually, I was thinking about a parallelization method or a kind of >>>> hadoop like implementation to do all pairwise comparison. 
The aim is that at >>>> the end i would like to calculate the average pairwise similarity score >>>> within a set of sequences. >>>> >>>> What I am doing is something like that : >>>> >>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1 >>>> For J=I+1 to J=Length(ARRAY_OF_SEQUENCES) >>>> PairwiseScore >>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J]) >>>> End_For >>>> End_For >>>> >>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES) >>>> >>>> In fact the problem is in the ((n * (n-1)) / 2) operations. >>>> >>>> As for the solution presented in perl sorry but I dont see what you've >>>> did inside ?! You created a 2D array ? how to achieve operations inside , I >>>> think this do not resolve the ((n * (n-1)) / 2) problem ? Isn't it ? >>>> >>>> Radwen >>>> >>>> >>>> 2010/6/30 Andy Yates >>>> Hi Radwen, >>>> >>>> I would have said that this is more of a problem because of the type of >>>> algorithm you are using. It's impossible (as far as I am aware) to calculate >>>> the score matrices in one step for multiple sequences& even if it did I >>>> don't quite see where the speed increase would come from. >>>> >>>> As for the All vs. All problem don't forget that really your total number >>>> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you >>>> are comparing so a simple 2D for loop will have you spending twice the >>>> amount of time on this than needs to occur. 
When I've done this before (in >>>> Perl so excuse the usage of it) the code looks like this: >>>> >>>> my @output; >>>> my @elements = ('some','elements','something'); >>>> while(scalar(@elements) > 1) { >>>> my $target = pop(@elements); >>>> foreach my $remaining_element (@elements) { >>>> push(@output, [$target, $remaining_element]); >>>> } >>>> } >>>> >>>> So this would have emitted: >>>> >>>> [ >>>> ['something','some'], >>>> ['something','elements'], >>>> ['elements','some'] >>>> ] >>>> >>>> Try doing something similar to this using the Java Deque objects which >>>> can act as a stack. >>>> >>>> Hope this helps to answer your question >>>> >>>> Andy >>>> >>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote: >>>> >>>> Hello Biojava people, >>>>> >>>>> I have a question concerning Needleman-Wunsch or Smith-Waterman >>>>> algorithms. >>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file >>>>> then I >>>>> store my sequences into an array to calculate pairwise similarity scores >>>>> using a for loop. >>>>> The problem is that it is very time consuming if we want to calculate >>>>> all >>>>> pairwise for a big number of protein sequences. I would like to know if >>>>> there is a way to do a kind of "All against All" comparisons in one single >>>>> step ? >>>>> Does someone have a solution for this kind of problem ? >>>>> >>>>> Thanks for help. >>>>> >>>>> Radwen >>>>> _______________________________________________ >>>>> biojava-dev mailing list >>>>> biojava-dev at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev >>>>> >>>> > > > -- > R. ANIBA > > Bioinformatics PhD > Laboratoire de Bioinformatique et Génomique Intégrative, > Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), > 1 rue Laurent Fries, > 67404 Illkirch, France. 
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev

--
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

From HWillis at scripps.edu Thu Jul 1 11:39:59 2010
From: HWillis at scripps.edu (Scooter Willis)
Date: Thu, 1 Jul 2010 07:39:59 -0400
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID:

Andy

I agree we should probably include the commons statistics jar as a must-have
for anything statistical or p-score related.

Scooter

On 7/1/10 7:28 AM, "Andy Yates" wrote:

I believe that you could use Jakarta commons-math & more specifically
org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will
give you min, max, std deviation & anything else you'd expect to be able to
use to describe a range of values

Andy

From aradwen at gmail.com Thu Jul 1 12:02:32 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Thu, 1 Jul 2010 14:02:32 +0200
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To:
References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID:

Thx Andy,

Yes, what you said is absolutely true. But my question was more about the
step where Mark calculates the pairwise scores: I don't know if there is an
"elegant way" to temporarily keep the pairwise scores for further
processing. I'm specifically thinking about an array of scores.

TBC ..

Rad
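The min, max and standard deviation asked about in this thread can be sketched in plain Java once the per-pair scores are kept in a list. This is only an illustration under stated assumptions: the class name ScoreStats and its methods are hypothetical and not part of BioJava, and the sample values are made up. With commons-math on the classpath, DescriptiveStatistics (as suggested in the thread) would presumably replace most of this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a BioJava class): summary statistics over pairwise scores. */
public class ScoreStats {

    /** Smallest score; the list must not be empty. */
    public static double min(List<Double> scores) {
        double m = Double.POSITIVE_INFINITY;
        for (double s : scores) {
            m = Math.min(m, s);
        }
        return m;
    }

    /** Largest score; the list must not be empty. */
    public static double max(List<Double> scores) {
        double m = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            m = Math.max(m, s);
        }
        return m;
    }

    /** Arithmetic mean of the scores. */
    public static double mean(List<Double> scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two scores. */
    public static double stdDev(List<Double> scores) {
        double mean = mean(scores);
        double sumSq = 0.0;
        for (double s : scores) {
            sumSq += (s - mean) * (s - mean);
        }
        return Math.sqrt(sumSq / (scores.size() - 1));
    }

    public static void main(String[] args) {
        // Made-up fractional similarities. In the thread's loop, each value would
        // be something like (double) scorer.getScore() / scorer.getMaxScore().
        List<Double> scores = new ArrayList<Double>();
        scores.add(0.25);
        scores.add(0.50);
        scores.add(0.75);
        System.out.printf("min=%.2f max=%.2f sd=%.2f%n",
                min(scores), max(scores), stdDev(scores));
    }
}
```

Collecting one double per pair costs little memory next to the alignments themselves, so a list of scores is a reasonable place to start before reaching for a statistics library.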
--
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys

From ayates at ebi.ac.uk Thu Jul 1 13:08:16 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Thu, 1 Jul 2010 14:08:16 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To:
References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>
Message-ID: <207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk>

Looking at the implementation, the pairwise scores for the
FractionalSimilarityScorer come from SequencePair.getNumSimilars(), so the
pairwise scores should always be easily available.

Andy

From chapman at cs.wisc.edu Thu Jul 1 23:02:48 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Thu, 01 Jul 2010 18:02:48 -0500
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk>
References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

 <207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk>
Message-ID: <4C2D1E98.1020300@cs.wisc.edu>

One additional note: the number of similarities is cached, so after being
computed on the first call to SequencePair.getNumSimilars() or
FractionalSimilarityScorer.getScore(), further calls simply read a variable.
This means the first iteration over a list of sequence pairs may take a
while for alignment and calculation of similarities, but later iterations
will be fast so no additional array of scores is needed for storage.

Mark
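The caching behaviour described in this thread (compute on the first call, then just read a variable) can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: CachedPair is a hypothetical class, not the BioJava SequencePair implementation, and it counts identical characters at the same position as a simplified stand-in for "similar" residues.

```java
/** Hypothetical stand-in for an aligned pair; not the BioJava SequencePair API. */
public class CachedPair {
    private final String query;
    private final String target;
    private int numSimilars = -1;  // -1 marks "not computed yet"
    private int computations = 0;  // how often the expensive path actually ran

    public CachedPair(String query, String target) {
        this.query = query;
        this.target = target;
    }

    /** Counts matching positions on the first call; later calls return the cached value. */
    public int getNumSimilars() {
        if (numSimilars < 0) {
            computations++;
            int count = 0;
            int n = Math.min(query.length(), target.length());
            for (int i = 0; i < n; i++) {
                if (query.charAt(i) == target.charAt(i)) {
                    count++;
                }
            }
            numSimilars = count;
        }
        return numSimilars;
    }

    /** Number of times the count was really computed (for demonstration only). */
    public int getComputations() {
        return computations;
    }

    public static void main(String[] args) {
        CachedPair pair = new CachedPair("HEAGAWGHEE", "HEAGPWGHEE");
        pair.getNumSimilars();  // first call does the work
        pair.getNumSimilars();  // second call reads the cached field
        System.out.println(pair.getNumSimilars() + " matching positions, computed "
                + pair.getComputations() + " time(s)");
    }
}
```

With a design like this, a second loop over the same list of pairs (for example to find the min and max after the average has been printed) would not repeat the expensive work, which is the reason given above for not needing a separate array of scores.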
From andreas at sdsc.edu Sun Jul 4 01:55:36 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Sat, 3 Jul 2010 18:55:36 -0700
Subject: [Biojava-dev] BioJava at ISMB, BOSC, and 3D SIG
Message-ID:

Hi,

Next week the BOSC and ISMB conferences will be in Boston. There will be a
couple of opportunities to meet BioJava-related people, even if we won't
have a BioJava-specific talk at BOSC this time.

The week will start with the Codefest in the days before BOSC. I have not
heard too much about that, so perhaps one of the people who are going there
can give us an update about what the plans are?

At BOSC, there will be a talk from Jianjiong Gao, who is also one of our
Google Summer of Code students. He will present on general and
kinase-specific phosphorylation sites (with Musite).
I will give a talk at 3D-SIG about structure alignments (using BioJava). Any other BioJava related events that are planned? Is anybody planning to blog or twitter about the conferences ? If people are interested in a meetup in Boston, drop me a mail and we'll arrange something... Andreas From member at linkedin.com Tue Jul 13 15:59:31 2010 From: member at linkedin.com (abdul qaddus via LinkedIn) Date: Tue, 13 Jul 2010 08:59:31 -0700 (PDT) Subject: [Biojava-dev] abdul qaddus wants to stay in touch on LinkedIn Message-ID: <888510157.1044264.1279036771955.JavaMail.app@ech3-cdn05.prod> LinkedIn ------------ I'd like to add you to my professional network on LinkedIn. - abdul qaddus abdul qaddus Owner at Exbizsol Pakistan Confirm that you know abdul qaddus https://www.linkedin.com/e/8hm30y-gbkxgxgx-19/isd/1463080869/5k7VZpD8/ ------ (c) 2010, LinkedIn Corporation From andreas at sdsc.edu Wed Jul 14 23:39:58 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 14 Jul 2010 16:39:58 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: References: Message-ID: Hi Sylvain, Cool, thanks for getting this started. There are junit tests in most of the other modules. Probably best to take a look at them as a template. Should be pretty straightforward from there. In terms of biojava-3 the RichSequence representation is still from the old biojava 1.7 design. Can you try to use the new sequence code base? Andreas On Wed, Jul 14, 2010 at 10:57 AM, Sylvain Foisy < sylvain.foisy at inflammgen.org> wrote: > Hi Andreas, > > I created a new module called biojava3-ws to collect all (wow, am I > ambitious...) stuff related to using Web services :data submission, > processing and results collection. I hope I did not break anything... > > I am starting to look into creating JUnit test units to complete this, > something that I never done before. Do you have some pointers toward some > tutorial material for this? It's my first coding in more than 2-3 years... 
> > BTW, I am still using RichSequence objects to feed into the BLAST requests > but do these objects have a future in biojava3? What should be used > instead? > > Best regards and apologies if these are stupid questions... > > Sylvain > > > ================================================================== > > Sylvain Foisy, Ph. D. > Chargé de projet / Project Manager > Bio-informatique > > Adresse postale: > > Laboratoire de genetique et medecine genomique de l'inflammation > Institut de cardiologie de Montreal > 5000 Belanger > Montreal, Qc > H1T 1C8 > > T: 514-376-3330 x.2299 | F: 514-593-2539 > M: sylvain.foisy at inflammgen.org > W: http://www.inflammgen.org > > ================================================================== > > > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Wed Jul 14 23:44:42 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Wed, 14 Jul 2010 16:44:42 -0700 Subject: [Biojava-dev] automated build messages Message-ID: Hi, the automated CruiseControl builds of biojava-svn seem to work fine and I will soon set up an auto-forward to this mailing list, so it is easier for you to follow SVN activity... Andreas -- ----------------------------------------------------------------------- Dr. 
Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Thu Jul 15 16:22:08 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 15 Jul 2010 09:22:08 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: References: Message-ID: > > > > I have looked into the docs for BJ3 in the wiki and they are utterly confusing > when compared with the code in the svn: there are neither FASTAReader nor FASTAFileReader > classes... I see two packages with possible FASTA-pertinent material: > sequence and biojava3-core. Confusing to say the least. I agree, the wiki documentation mainly covers the 1.7 release of BioJava; we need to start adding documentation for the new code base! Andreas -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From andreas at sdsc.edu Thu Jul 15 16:52:37 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Thu, 15 Jul 2010 09:52:37 -0700 Subject: [Biojava-dev] Creating Junit test units In-Reply-To: <51EEE556-210F-4F1E-858A-66ACBE5C3B92@scripps.edu> References: <51EEE556-210F-4F1E-858A-66ACBE5C3B92@scripps.edu> Message-ID: > > > Wouldn't be a bad idea to start a biojava3 wiki with a different URL so > that search and organization are clear. This would also increase the > motivation to add content because it would be empty. I will start writing > wiki content over the weekend for the core module. Good idea. 
I just added a new BioJava 3 specific cookbook page and updated the wiki-front page to make this more clear: All BioJava3 specific docu should go here: http://biojava.org/wiki/BioJava:CookBook3.0 Once we make BioJava 3 the official version, we can move the 1.7 cookbook page to a different location and make the v. 3.0 cookbook the default one... Andreas From jake at researchtogether.com Mon Jul 19 13:49:04 2010 From: jake at researchtogether.com (jake at researchtogether.com) Date: Mon, 19 Jul 2010 14:49:04 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <4C2D1E98.1020300@cs.wisc.edu> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> Message-ID: <20100719134904.GA20200@researchtogether.com> Hi All, I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. Any opinions? :) Cheers, Jake From holland at eaglegenomics.com Mon Jul 19 15:02:23 2010 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 19 Jul 2010 16:02:23 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100719134904.GA20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. I don't know. There must be experts on this in the list who can help! cheers, Richard On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > Hi All, > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > Any opinions? :) > > Cheers, > Jake > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jprocter at compbio.dundee.ac.uk Mon Jul 19 14:57:30 2010 From: jprocter at compbio.dundee.ac.uk (Jim Procter) Date: Mon, 19 Jul 2010 15:57:30 +0100 Subject: [Biojava-dev] osgi-bioinformatics@googlegroups.com: new mailing list for OSGi issues in bioinformatics Message-ID: <4C4467DA.6080709@compbio.dundee.ac.uk> Hi all. 
Some of you will be aware of the OSGi plugin architecture (www.osgi.org), which is used by a number of Java applications. You may also be aware that a number of bioinformatics projects are considering, or are currently migrating their architecture to adopt, the OSGi plugin model (with or without various additional mechanisms, e.g. Spring, or equinox-p2, etc). I am involved in one such project, and I've created the osgi-bioinformatics Google group because I'd very much like to be able to discuss OSGi-related issues with others from the bioinformatics software development field who have some OSGi experience. It would also be great if we could thrash out some best-practice guidelines, and discuss the kinds of modules our projects provide, so that others in the Bioinformatics-OSGi ecosystem might make use of them. Sorry to clutter up your inboxes with yet another mailing list invite, but hopefully the discussion will be relevant to some of you working on biojava3 - if not now, then later on, when you all feel the biojava3 APIs are more mature. If you are OSGi (il)literate, or wish to be, then please join the list at http://groups.google.co.uk/group/osgi-bioinformatics Thanks for your attention ;) Jim Procter. -- ------------------------------------------------------------------- J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk The University of Dundee is a Scottish Registered Charity, No. SC015096. From andreas at sdsc.edu Tue Jul 20 00:49:20 2010 From: andreas at sdsc.edu (Andreas Prlic) Date: Mon, 19 Jul 2010 17:49:20 -0700 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100719134904.GA20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: Hi Jake, Thanks for this. Would it be possible to add a "Cookbook" page for how to use the NCBISequence reader as well? Some examples would be great... I understand the NCBI now requires scripts to provide email address etc.... Would be good to explain how to do this. We just started to work on more docu here: http://biojava.org/wiki/BioJava:CookBook3.0 Thanks, Andreas On Mon, Jul 19, 2010 at 6:49 AM, wrote: > Hi All, > > I've been drawing up a design for the work I have done on the NCBI > SequenceReader and I've talked through some things with Scooter which I have > put on the wiki at: > http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > One thing I would like to throw open for discussion is the possibility of > changing the Sequence interface so that the methods can throw a new > exception - SequenceException. > > Any opinions? :) > > Cheers, > Jake > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev > -- ----------------------------------------------------------------------- Dr. Andreas Prlic Senior Scientist, RCSB PDB Protein Data Bank University of California, San Diego (+1) 858.246.0526 ----------------------------------------------------------------------- From HWillis at scripps.edu Tue Jul 20 01:09:32 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Mon, 19 Jul 2010 21:09:32 -0400 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: <23594ECE-E9D6-4C00-B72E-ACB0625FE85B@scripps.edu> Jake Do you have any updates to the code? I can go ahead and check it in. Scooter On Jul 19, 2010, at 8:49 PM, Andreas Prlic wrote: > Hi Jake, > > Thanks for this. Would it be possible to add a "Cookbook" page for how to > use the NCBISequence reader as well? Some examples would be great... I > understand the NCBI now requires scripts to provide email address etc.... > Would be good to explain how to do this. We just started to work on more > docu here: > > http://biojava.org/wiki/BioJava:CookBook3.0 > > Thanks, > Andreas > > > > On Mon, Jul 19, 2010 at 6:49 AM, wrote: > >> Hi All, >> >> I've been drawing up a design for the work I have done on the NCBI >> SequenceReader and I've talked through some things with Scooter which I have >> put on the wiki at: >> http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview >> >> One thing I would like to throw open for discussion is the possibility of >> changing the Sequence interface so that the methods can throw a new >> exception - SequenceException. >> >> Any opinions? :) >> >> Cheers, >> Jake >> _______________________________________________ >> biojava-dev mailing list >> biojava-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biojava-dev >> > > > > -- > ----------------------------------------------------------------------- > Dr. 
Andreas Prlic > Senior Scientist, RCSB PDB Protein Data Bank > University of California, San Diego > (+1) 858.246.0526 > ----------------------------------------------------------------------- > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev From markjschreiber at gmail.com Tue Jul 20 02:20:10 2010 From: markjschreiber at gmail.com (Mark Schreiber) Date: Tue, 20 Jul 2010 10:20:10 +0800 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: I don't think it is a great idea to hide IO exceptions but you can reduce the burden of them. You can copy the Groovy model which handles a lot of the try/catch/finally boilerplate code for you. Basically you make a helper class with methods to perform common IO operations and which will do its very best to connect, read/write and clean up. You can also think about what might actually cause an error. If you are reading from a local disk cache where the file address is known (such as a temp file) you can very nearly guarantee that the IO operation will succeed. So much so that you could rethrow an IOException as an error because there is very little that can be done about it (other than improving the cache code or getting more reliable hard-drives). Reading a file from disk? The most likely problem is an incorrect file name. Other problems can probably be turned into runtime exceptions because they are probably disk errors. Reading from a URL? Lots of things can go wrong here so you probably need to expose all the possible exceptions. Reading from SQL? Kind of depends on the expected DB availability and latency. Also, if the query code (or JPA query) is coming from the BioJava source then an error is appropriate (the developer can't do much about the mistake). If the code is coming from the app developer then you should notify them of SQL errors. - Mark On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland wrote: > > I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. > > SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? 
Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. > > I don't know. There must be experts on this in the list who can help! > > cheers, > Richard > > > On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > > > Hi All, > > > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > > > Any opinions? :) > > > > Cheers, > > Jake > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > > _______________________________________________ > biojava-dev mailing list > biojava-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-dev From jake at researchtogether.com Tue Jul 20 09:46:39 2010 From: jake at researchtogether.com (jake at researchtogether.com) Date: Tue, 20 Jul 2010 10:46:39 +0100 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> Message-ID: <20100720094639.GD20200@researchtogether.com> See comments in line. Thanks, Jake On Tue, Jul 20, 2010 at 10:20:10AM +0800, Mark Schreiber wrote: > I don't think it is a great idea to hide IO exceptions but you can > reduce the burden of them. I would normally agree with you, but as I shall point out later this will have a lot of knock-on effects for the interface which may not be desirable. > > You can copy the Groovy model which handles a lot of the > try/catch/finally boilerplate code for you. Basically you make a > helper class with methods to perform common IO operations and which > will do its very best to connect, read/write and clean up. > > You can also think about what might actually cause an error. If you > are reading from a local disk cache where the file address is known > (such as a temp file) you can very nearly guarantee that the IO > operation will succeed. So much so that you could rethrow an > IOException as an error because there is very little that can be done > about it (other than improving the cache code or getting more reliable > hard-drives). And this is the issue - the Sequence interface is used by a lot of different readers: some read from disk, others from a database, and in my particular case I am reading from a URL. Also, it is possible that I will run into a lot of exceptions around XML parsing (the data from the URL) as well as HTTP errors (page not found, service unavailable etc.). Now, normally I would want to deal with some of the errors myself and only log them - e.g. for a 503 I might retry a few times, and if there is a problem with the XML I might try to fetch it again. 
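The retry behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions: RetrySketch, RemoteCall and withRetries are hypothetical names, not part of BioJava, the remote service is simulated, and the sketch retries on any IOException rather than inspecting HTTP status codes.

```java
import java.io.IOException;

// Hypothetical helper, not part of BioJava: retries a remote call a few times.
class RetrySketch {

    /** A remote operation that may fail with an IOException (hypothetical interface). */
    interface RemoteCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the call up to maxAttempts times, pausing between failed attempts.
     * Rethrows the last IOException if every attempt fails.
     */
    static <T> T withRetries(int maxAttempts, long pauseMillis, RemoteCall<T> call)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // e.g. a 503 or a timeout: remember it and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws IOException {
        final int[] calls = {0};
        // Simulated service: fails twice, then succeeds.
        String result = withRetries(3, 10, () -> {
            if (++calls[0] < 3) {
                throw new IOException("503 Service Unavailable (simulated)");
            }
            return "ACGT";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Whether a reader should retry silently like this, or surface the first failure to the caller, is exactly the open design question in this thread; the sketch only shows that the retry policy can live in one helper rather than in every reader.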
However, I don't fully understand how the caller will expect these SequenceReaders to behave, which is why I asked the question :) An IOException on a file is probably fatal but an IOException on a network call is possibly recoverable, or at least worth retrying. As for what can cause errors: 1. Invalid URL 2. Page(s) unavailable (4xx, 5xx) 3. Invalid/unexpected data returned (XML badly formed, FASTA invalid) 4. Change to service (if the service has changed and the parser is effectively broken) 5. Network interruption (i.e. network timeout) > > Reading a file from disk? The most likely problem is an incorrect file > name. Other problems can probably be turned into runtime exceptions > because they are probably disk errors. > > Reading from a URL? Lots of things can go wrong here so you probably > need to expose all the possible exceptions. I will work on this assumption and change the interface accordingly, though I expect that the decision will be revisited. > > Reading from SQL? Kind of depends on the expected DB availability and > latency. Also, if the query code (or JPA query) is coming from the > BioJava source then an error is appropriate (the developer can't do > much about the mistake). If the code is coming from the app developer > then you should notify them of SQL errors. > > - Mark > > On Mon, Jul 19, 2010 at 11:02 PM, Richard Holland > wrote: > > > > I often wonder what the best way of handling multiple possible internal exceptions is - particularly in cases like this when you've got HTTP and IO and many other types of exceptions which could be thrown. > > > > SequenceException maybe if there's something wrong with the sequence itself - but possibly otherwise a form of IOException may be more appropriate? Trouble is that then almost every BioJava3 method would throw it, as all of them potentially have IO exposure. > > > > I don't know. There must be experts on this in the list who can help! 
> > > > cheers, > > Richard > > > > > > On 19 Jul 2010, at 14:49, jake at researchtogether.com wrote: > > > > > Hi All, > > > > > > I've been drawing up a design for the work I have done on the NCBI SequenceReader and I've talked through some things with Scooter which I have put on the wiki at: http://www.biojava.org/wiki/BioJava3_NCBISequenceReader_Design#Design_Overview > > > > > > One thing I would like to throw open for discussion is the possibility of changing the Sequence interface so that the methods can throw a new exception - SequenceException. > > > > > > Any opinions? :) > > > > > > Cheers, > > > Jake > > > _______________________________________________ > > > biojava-dev mailing list > > > biojava-dev at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biojava-dev > > > > -- > > Richard Holland, BSc MBCS > > Operations and Delivery Director, Eagle Genomics Ltd > > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > > http://www.eaglegenomics.com/ > > > > > > _______________________________________________ > > biojava-dev mailing list > > biojava-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biojava-dev From aman.a.gupta1989 at gmail.com Tue Jul 20 10:07:39 2010 From: aman.a.gupta1989 at gmail.com (aman gupta) Date: Tue, 20 Jul 2010 15:37:39 +0530 Subject: [Biojava-dev] Restriction site Message-ID: Respected Sir/Madam, I would like to know how to implement the RestrictionSite interface (org.biojava.bio.molbio.RestrictionSite) in an actual program. Kindly help and do the needful, please. -- ----------------------------- Aman.A.Gupta From HWillis at scripps.edu Tue Jul 20 15:25:36 2010 From: HWillis at scripps.edu (Scooter Willis) Date: Tue, 20 Jul 2010 11:25:36 -0400 Subject: [Biojava-dev] Sequence interface - exceptions In-Reply-To: <20100720094639.GD20200@researchtogether.com> References: <1412E21C-AD59-407E-BD13-ACF850EBF2C7@ebi.ac.uk>

<207BE958-357B-4A66-9897-42F05A6AA189@ebi.ac.uk> <4C2D1E98.1020300@cs.wisc.edu> <20100719134904.GA20200@researchtogether.com> <20100720094639.GD20200@researchtogether.com> Message-ID: <4D510D58-22F1-4F20-84D3-ED51DFDB1386@scripps.edu> Mark makes some very good points and it will be a challenge to come up with robust (appropriate) error reporting and still maintain flexibility where writing code is easy as long as everything works. Currently, you can pass a class that implements the Sequence Interface to the constructor of a DNASequence, ProteinSequence etc. If the class that implements the sequence interface throws an exception when it is created then that is outside the api design of the abstract sequence. In the following example UniprotProxySequenceReader upon creation would call the appropriate URL and retrieve the sequence. If an error occurs then that class should throw the appropriate exception. We don't need to force a particular exception on classes that implement an interface. UniprotProxySequenceReader uniprotSequence = new UniprotProxySequenceReader("YA745_GIBZE", AminoAcidCompoundSet.getAminoAcidCompoundSet()); ProteinSequence proteinSequence = new ProteinSequence(uniprotSequence); We do have an api/exception design problem if UniprotProxySequenceReader does lazy instantiation, where it doesn't retrieve the sequence data unless a call to proteinSequence.getSequence() is made. This allows us to create applications where you can load a large number of sequences without consuming memory for sequences that will never be used. If you have a web-based application where the user will query a sequence based on some event then this is a nice design element. If you are writing code to examine the GC content of every gene sequence then it is not a big memory saver. The easy solution is to have every sequence method that has a dependency on a class with the sequence interface declare that it throws an exception. 
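To make the lazy-loading trade-off above concrete, here is a minimal sketch of a reader that defers its fetch and therefore has to declare a checked exception on every accessor. All names in it (LazyReaderSketch, SequenceSource, LazySequenceReader) are hypothetical and do not reflect the actual BioJava3 Sequence interface; the remote lookup is simulated.

```java
import java.io.IOException;

// Hypothetical sketch, not the BioJava3 API: a reader that defers the fetch
// until the sequence is first needed, so every accessor may fail.
class LazyReaderSketch {

    /** Stands in for a remote lookup by accession (hypothetical interface). */
    interface SequenceSource {
        String fetch(String accession) throws IOException;
    }

    static class LazySequenceReader {
        private final String accession;
        private final SequenceSource source;
        private String cached; // null until the first successful fetch

        LazySequenceReader(String accession, SequenceSource source) {
            this.accession = accession; // no I/O happens in the constructor
            this.source = source;
        }

        // Every accessor that touches sequence data must declare the exception.
        String getSequenceAsString() throws IOException {
            if (cached == null) {
                cached = source.fetch(accession);
            }
            return cached;
        }

        int getLength() throws IOException {
            return getSequenceAsString().length();
        }
    }

    public static void main(String[] args) {
        final int[] fetches = {0};
        LazySequenceReader reader = new LazySequenceReader("YA745_GIBZE", acc -> {
            fetches[0]++;
            return "MKVLA"; // simulated sequence data
        });
        try {
            // The caller pays for laziness with a try/catch around each use.
            System.out.println(reader.getLength());
            System.out.println(reader.getSequenceAsString());
        } catch (IOException e) {
            System.err.println("Could not load sequence: " + e.getMessage());
        }
        System.out.println("fetches: " + fetches[0]);
    }
}
```

The constructor performs no I/O, so an unused sequence costs nothing, but the failure point moves from construction to whichever accessor happens to be called first.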
This would add additional exception handling code for the users of the api, which can add to the complexity and introduce a performance penalty if the try/catch is not done generally for a block of code. The reality is that for the X number of methods that have a dependency on a Sequence Interface class, if one fails they will all fail. We could add an isInit() method to AbstractSequence, which throws an exception or returns a boolean, that is designed to force the Sequence Interface to load sequence data from external sources. The user of the API, via our contract definition, can do defensive programming and make sure the sequence is ready before using it. If it is not ready and a method is called that depends on the Sequence Interface then we simply return the appropriate null/not defined object. The last use case that still makes this difficult is being able to define a ChromosomeSequence(new NCBISequenceReader("NC_000019.9")) where a call to get a collection of gene sequences from the chromosome sequence is done in a lazy fashion without retrieving the entire chromosome sequence. If I make a call to geneSequence.getProteinSequence().toString() then that would make the appropriate getSubString(2000,5000) call that maps to the gene on the NCBISequenceReader, which then retrieves that sub sequence from NCBI. To allow this option we cannot depend on isInit() being correct. In this particular example we have three types of errors: the internet connection is not working, NCBI is not working or is refusing your connection because you went over the three-requests-per-second rule, or you have something wrong with your accession id. If the internet is down or NCBI is refusing your connection there is not a great deal the application can do to recover. An invalid accession id could be handled when you instantiate the class with new NCBISequenceReader("NC_000019.9"), by some sort of call to NCBI to see if it is valid and, if not, throwing an exception. 
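As a rough illustration of the isInit() idea above, the following sketch forces the load in one place and lets the other accessors return "not defined" values instead of throwing. InitCheckSketch, SequenceSource and CheckedSequence are hypothetical names, not BioJava3 classes; using -1 and an empty string as the "not defined" values is an assumption made only for this example, and the remote source is simulated.

```java
import java.io.IOException;

// Hypothetical sketch of the isInit() idea; names are illustrative, not BioJava3 API.
class InitCheckSketch {

    /** Stands in for a remote lookup by accession (hypothetical interface). */
    interface SequenceSource {
        String fetch(String accession) throws IOException;
    }

    static class CheckedSequence {
        private final String accession;
        private final SequenceSource source;
        private String data; // stays null until isInit() has loaded it

        CheckedSequence(String accession, SequenceSource source) {
            this.accession = accession;
            this.source = source;
        }

        /** Forces the load; returns false instead of throwing if it fails. */
        boolean isInit() {
            if (data == null) {
                try {
                    data = source.fetch(accession);
                } catch (IOException e) {
                    return false;
                }
            }
            return data != null;
        }

        /** Never throws: returns -1 as the "not defined" value when unloaded. */
        int getLength() {
            return data == null ? -1 : data.length();
        }

        /** Never throws: returns an empty string when unloaded. */
        String getSequenceAsString() {
            return data == null ? "" : data;
        }
    }

    public static void main(String[] args) {
        CheckedSequence ok = new CheckedSequence("NC_000019.9", acc -> "ACGTACGT");
        CheckedSequence down = new CheckedSequence("NC_000019.9", acc -> {
            throw new IOException("connection refused (simulated)");
        });
        for (CheckedSequence s : new CheckedSequence[] {ok, down}) {
            // Defensive programming: check readiness once, then use plain accessors.
            if (s.isInit()) {
                System.out.println("length = " + s.getLength());
            } else {
                System.out.println("sequence not available");
            }
        }
    }
}
```

This keeps call sites free of try/catch, but as noted above it does not cover lazy sub-sequence retrieval, where a later request can still fail after isInit() has returned true.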
We do have options when a particular service is down or slow to respond. Uniprot implements a DNS-based load distribution that I did have a problem with one weekend. It was very slow and often did not respond. It turns out that if I changed my URL I could point to http://pir.uniprot.org, located in the US, and everything worked great. This could be something implemented by UniprotProxySequenceReader if it gets an IO exception or determines queries are taking a long time. In summary we probably should throw exceptions for each method that depends on the Sequence Interface and/or return a set of appropriate null/not-init objects. Given that we are working with imperfect data models and data relationships I think defensive programming on return values is not a bad option. It is a shame to have getSequenceLength() throw an exception or return a null Integer if an IO exception occurs. These are only problems when using a Sequence Interface that has a higher risk of failure because it is remote, and would be the "exception" not the rule. For hard-core developers we can resolve these issues when they occur. If the Biojava-core code makes its way into an end-user application then we need to give the application developer a way to deal with error conditions. Using the NCBI chromosome example I think we can create a very powerful api to work with large amounts of sequence data, but at the expense of making the api very exception happy! We have also begun the very exciting step of doing wiki docs specific to Biojava3. It is a work in progress: http://biojava.org/wiki/BioJava:CookBook3.0 Thanks Scooter From andreas at sdsc.edu Sat Jul 24 09:24:26 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Sat, 24 Jul 2010 02:24:26 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.27 Build Fixed Message-ID: <980155567.1279963466731.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: From andreas at sdsc.edu Thu Jul 29 00:37:35 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Wed, 28 Jul 2010 17:37:35 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.35 Build Fixed Message-ID: <512820011.1280363855648.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: From andreas at sdsc.edu Sat Jul 31 05:55:57 2010 From: andreas at sdsc.edu (andreas at sdsc.edu) Date: Fri, 30 Jul 2010 22:55:57 -0700 (PDT) Subject: [Biojava-dev] biojava-svn build.39 Build Fixed Message-ID: <960385312.1280555757389.JavaMail.andreas@emmy.rcsb.org> An HTML attachment was scrubbed... URL: