[Biojava-dev] Pairwise similarity score speed

Andy Yates ayates at ebi.ac.uk
Thu Jul 1 13:08:16 UTC 2010


Looking at the implementation the pairwise scores for the FractionalSimilarityScorer comes from SequencePair.getNumSimilars() so the pairwise scores should always be easily available

Andy

On 1 Jul 2010, at 13:02, Radhouane Aniba wrote:

> Thx Andy,
> 
> Yes absolutely true what you said. But my question was much concerning when Mark is calculating pairwise scores I don't know if there is an "elegant way" to keep temporary the pairwise scores for further treatments, i'm specifically thinking about an array of scores.
> 
> TBC ..
> 
> Rad
> 
> 2010/7/1 Scooter Willis <HWillis at scripps.edu>
> Andy
> 
> I agree we should probably include the commons statistics jar as a must have for anything statistical or p-score related. 
> 
> Scooter
> 
> 
> 
> On 7/1/10 7:28 AM, "Andy Yates" <ayates at ebi.ac.uk> wrote:
> 
> I believe that you could use Jakarta commons-math & more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation & anything else you'd expect to be able to use to describe a range of values
> 
> Andy
> 
> On 1 Jul 2010, at 12:05, Radhouane Aniba wrote:
> 
> > Mark !
> >
> > Can we modify the code you gave to :
> >
> > - extract the MIN pairwise similarity score
> > - extract the MAX pairwise similarity score
> > - calculate the standard deviation of similarity scores
> >
> > ?
> >
> > Rad
> >
> > 2010/6/30 Mark Chapman <chapman at cs.wisc.edu>
> >
> >> Hi Radwen,
> >>
> >> I have already added this functionality to the BioJava3 alignment package.
> >> The code is available on the repository [1] and current builds are on the
> >> web site [2].  The necessary files are [3] and [4] and in the example code
> >> that follows you should only have to replace "piwi-seed-fasta.txt" with your
> >> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
> >> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .
> >>
> >>
> >> int similars = 0, total = 0;
> >> GapPenalty gaps = new SimpleGapPenalty();
> >> SubstitutionMatrix<AminoAcidCompound> blosum62 =
> >>   new SimpleSubstitutionMatrix<AminoAcidCompound>();
> >>
> >> List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
> >> try {
> >> piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
> >>     new File("piwi-seed-fasta.txt")).values());
> >> } catch (Exception e) {
> >> e.printStackTrace();
> >> }
> >>
> >> for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
> >>   Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
> >>       blosum62)) {
> >> PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
> >>     new FractionalSimilarityScorer<ProteinSequence,
> >> AminoAcidCompound>(pair);
> >> System.out.printf("%n%s vs %s : %d / %d%n%s",
> >> pair.getQuery().getAccession(),
> >>     pair.getTarget().getAccession(), scorer.getScore(),
> >> scorer.getMaxScore(),
> >>     pair);
> >> similars += scorer.getScore();
> >> total += scorer.getMaxScore();
> >> }
> >>
> >> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
> >>   (double)similars/total);
> >>
> >> ConcurrencyTools.shutdown();
> >>
> >>
> >> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
> >> [2] http://biojava.org/download/maven/
> >> [3]
> >> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
> >> [4]
> >> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
> >>
> >>
> >> Enjoy,
> >> Mark
> >>
> >>
> >>
> >> On 6/30/2010 7:05 AM, Andy Yates wrote:
> >>
> >>> It was more of a way of decomposing the operations into a data structure
> >>> where each element in the 1st dimension represents the elements to compare
> >>> together. Really the Perl code is a way of describing the operations to
> >>> occur in order to cover all possible permutations.
> >>>
> >>> Andy
> >>>
> >>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
> >>>
> >>> Hi Andy,
> >>>>
> >>>> Thank you for your reply.
> >>>> Actually, I was thinking about a parallelization method or a kind of
> >>>> hadoop like implementation to do all pairwise comparison. The aim is that at
> >>>> the end i would like to calculate the average pairwise similarity score
> >>>> within a set of sequences.
> >>>>
> >>>> What I am doing is something like that :
> >>>>
> >>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
> >>>>  For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
> >>>>     PairwiseScore
> >>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
> >>>> End_For
> >>>> End_For
> >>>>
> >>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
> >>>>
> >>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
> >>>>
> >>>> As for the solution presented in perl sorry but I dont see what you've
> >>>> did inside ?! You created a 2D array ? how to achieve operations inside , I
> >>>> think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
> >>>>
> >>>> Radwen
> >>>>
> >>>>
> >>>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
> >>>> Hi Radwen,
> >>>>
> >>>> I would have said that this is more of a problem because of the type of
> >>>> algorithm you are using. It's impossible (as far as I am aware) to calculate
> >>>> the score matrices in one step for multiple sequences&  even if it did I
> >>>> don't quite see where the speed increase would come from.
> >>>>
> >>>> As for the All vs. All problem don't forget that really your total number
> >>>> of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you
> >>>> are comparing so a simple 2D for loop will have you spending twice the
> >>>> amount of time on this than needs to occur. When I've done this before (in
> >>>> Perl so excuse the usage of it) the code looks like this:
> >>>>
> >>>> my @output;
> >>>> my @elements = ('some','elements','something');
> >>>> while(scalar(@elements)>  1) {
> >>>> my $target = pop(@elements);
> >>>> foreach my $remaining_element (@elements) {
> >>>>   push(@output, [$target, $remaining_element]);
> >>>> }
> >>>> }
> >>>>
> >>>> So this would have emitted:
> >>>>
> >>>> [
> >>>>       ['some','elements'],
> >>>>       ['some','something'],
> >>>>       ['elements','something']
> >>>> ]
> >>>>
> >>>> Try doing something similar to this using the Java Deque objects which
> >>>> can act as a stack.
> >>>>
> >>>> Hope this helps to answer your question
> >>>>
> >>>> Andy
> >>>>
> >>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
> >>>>
> >>>> Hello Biojava people,
> >>>>>
> >>>>> I have a question concerning Needlman Wunsh or Smith waterman
> >>>>> algorithms.
> >>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file
> >>>>> then I
> >>>>> store my sequences into an array to calculate pairwise similarity scores
> >>>>> using a for loop.
> >>>>> The problem is that it is very time consuming if we want to calculate
> >>>>> all
> >>>>> pairwise for a big number of protein sequences. I would like to know if
> >>>>> there is way to do a kind of "All against All" comparisons in one single
> >>>>> step ?
> >>>>> Do someone have a solution for this kind of problem ?
> >>>>>
> >>>>> Thanks for help.
> >>>>>
> >>>>> Radwen
> >>>>> _______________________________________________
> >>>>> biojava-dev mailing list
> >>>>> biojava-dev at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>>>>
> >>>>
> >
> >
> > --
> > R. ANIBA
> >
> > Bioinformatics PhD
> > Laboratoire de Bioinformatique et Génomique Intégrative,
> > Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> > 1 rue Laurent Fries,
> > 67404 Illkirch, France.
> > http://www-igbmc.u-strasbg.fr
> > http://alnitak.u-strasbg.fr/~aniba/alexsys
> >
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> 
> 
> 
> -- 
> R. ANIBA
> 
> Bioinformatics PhD
> Laboratoire de Bioinformatique et Génomique Intégrative,
> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys 

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/








More information about the biojava-dev mailing list