[Biojava-dev] Pairwise similarity score speed

Thu Jul 1 11:28:01 UTC 2010

I believe that you could use Jakarta commons-math & more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation & anything else you'd expect to be able to use to describe a range of values

Andy

On 1 Jul 2010, at 12:05, Radhouane Aniba wrote:

> Mark !
> 
> Can we modify the code you gave to :
> 
> - extract the MIN pairwise similarity score
> - extract the MAX pairwise similarity score
> - calculate the standard deviation of similarity scores
> 
> ?
> 
> Rad
> 
> 2010/6/30 Mark Chapman <chapman at cs.wisc.edu>
> 
>> Hi Radwen,
>> 
>> I have already added this functionality to the BioJava3 alignment package.
>> The code is available on the repository [1] and current builds are on the
>> web site [2].  The necessary files are [3] and [4] and in the example code
>> that follows you should only have to replace "piwi-seed-fasta.txt" with your
>> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
>> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .
>> 
>> 
>> int similars = 0, total = 0;
>> GapPenalty gaps = new SimpleGapPenalty();
>> SubstitutionMatrix<AminoAcidCompound> blosum62 =
>>   new SimpleSubstitutionMatrix<AminoAcidCompound>();
>> 
>> List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
>> try {
>> piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
>>     new File("piwi-seed-fasta.txt")).values());
>> } catch (Exception e) {
>> e.printStackTrace();
>> }
>> 
>> for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
>>   Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
>>       blosum62)) {
>> PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
>>     new FractionalSimilarityScorer<ProteinSequence,
>> AminoAcidCompound>(pair);
>> System.out.printf("%n%s vs %s : %d / %d%n%s",
>> pair.getQuery().getAccession(),
>>     pair.getTarget().getAccession(), scorer.getScore(),
>> scorer.getMaxScore(),
>>     pair);
>> similars += scorer.getScore();
>> total += scorer.getMaxScore();
>> }
>> 
>> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
>>   (double)similars/total);
>> 
>> ConcurrencyTools.shutdown();
>> 
>> 
>> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
>> [2] http://biojava.org/download/maven/
>> [3]
>> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
>> [4]
>> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
>> 
>> 
>> Enjoy,
>> Mark
>> 
>> 
>> 
>> On 6/30/2010 7:05 AM, Andy Yates wrote:
>> 
>>> It was more of a way of decomposing the operations into a data structure
>>> where each element in the 1st dimension represents the elements to compare
>>> together. Really the Perl code is a way of describing the operations to
>>> occur in order to cover all possible permutations.
>>> 
>>> Andy
>>> 
>>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>>> 
>>> Hi Andy,
>>>> 
>>>> Thank you for your reply.
>>>> Actually, I was thinking about a parallelization method or a kind of
>>>> hadoop like implementation to do all pairwise comparison. The aim is that at
>>>> the end i would like to calculate the average pairwise similarity score
>>>> within a set of sequences.
>>>> 
>>>> What I am doing is something like that :
>>>> 
>>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>>>  For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>>>     PairwiseScore
>>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>>> End_For
>>>> End_For
>>>> 
>>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>>> 
>>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>>> 
>>>> As for the solution presented in perl sorry but I dont see what you've
>>>> did inside ?! You created a 2D array ? how to achieve operations inside , I
>>>> think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
>>>> 
>>>> Radwen
>>>> 
>>>> 
>>>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
>>>> Hi Radwen,
>>>> 
>>>> I would have said that this is more of a problem because of the type of
>>>> algorithm you are using. It's impossible (as far as I am aware) to calculate
>>>> the score matrices in one step for multiple sequences&  even if it did I
>>>> don't quite see where the speed increase would come from.
>>>> 
>>>> As for the All vs. All problem don't forget that really your total number
>>>> of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you
>>>> are comparing so a simple 2D for loop will have you spending twice the
>>>> amount of time on this than needs to occur. When I've done this before (in
>>>> Perl so excuse the usage of it) the code looks like this:
>>>> 
>>>> my @output;
>>>> my @elements = ('some','elements','something');
>>>> while(scalar(@elements)>  1) {
>>>> my $target = pop(@elements);
>>>> foreach my $remaining_element (@elements) {
>>>>   push(@output, [$target, $remaining_element]);
>>>> }
>>>> }
>>>> 
>>>> So this would have emitted:
>>>> 
>>>> [
>>>>       ['some','elements'],
>>>>       ['some','something'],
>>>>       ['elements','something']
>>>> ]
>>>> 
>>>> Try doing something similar to this using the Java Deque objects which
>>>> can act as a stack.
>>>> 
>>>> Hope this helps to answer your question
>>>> 
>>>> Andy
>>>> 
>>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>>> 
>>>> Hello Biojava people,
>>>>> 
>>>>> I have a question concerning Needlman Wunsh or Smith waterman
>>>>> algorithms.
>>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file
>>>>> then I
>>>>> store my sequences into an array to calculate pairwise similarity scores
>>>>> using a for loop.
>>>>> The problem is that it is very time consuming if we want to calculate
>>>>> all
>>>>> pairwise for a big number of protein sequences. I would like to know if
>>>>> there is way to do a kind of "All against All" comparisons in one single
>>>>> step ?
>>>>> Do someone have a solution for this kind of problem ?
>>>>> 
>>>>> Thanks for help.
>>>>> 
>>>>> Radwen
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>> 
>>>> 
> 
> 
> -- 
> R. ANIBA
> 
> Bioinformatics PhD
> Laboratoire de Bioinformatique et Génomique Intégrative,
> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys
> 
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/