[Biojava-dev] Pairwise similarity score speed

Wed Jun 30 20:01:47 UTC 2010

Hi Radwen,

I have already added this functionality to the BioJava3 alignment package.  The 
code is available on the repository [1] and current builds are on the web site 
[2].  The necessary files are [3] and [4] and in the example code that follows 
you should only have to replace "piwi-seed-fasta.txt" with your file name. 
Also, to switch from Needleman-Wunsch to Smith-Waterman, just change 
PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .

int similars = 0, total = 0;
GapPenalty gaps = new SimpleGapPenalty();
SubstitutionMatrix<AminoAcidCompound> blosum62 =
     new SimpleSubstitutionMatrix<AminoAcidCompound>();

List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
try {
   piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
       new File("piwi-seed-fasta.txt")).values());
} catch (Exception e) {
   e.printStackTrace();
}

for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
     Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
         blosum62)) {
   PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
       new FractionalSimilarityScorer<ProteinSequence, AminoAcidCompound>(pair);
   System.out.printf("%n%s vs %s : %d / %d%n%s", pair.getQuery().getAccession(),
       pair.getTarget().getAccession(), scorer.getScore(), scorer.getMaxScore(),
       pair);
   similars += scorer.getScore();
   total += scorer.getMaxScore();
}

System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
     (double)similars/total);

ConcurrencyTools.shutdown();

[1] http://biojava.org/wiki/CVS_to_SVN_Migration
[2] http://biojava.org/download/maven/
[3] 
http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
[4] 
http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar

Enjoy,
Mark

On 6/30/2010 7:05 AM, Andy Yates wrote:
> It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents the elements to compare together. Really the Perl code is a way of describing the operations to occur in order to cover all possible permutations.
>
> Andy
>
> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>
>> Hi Andy,
>>
>> Thank you for your reply.
>> Actually, I was thinking about a parallelization method or a kind of hadoop like implementation to do all pairwise comparison. The aim is that at the end i would like to calculate the average pairwise similarity score within a set of sequences.
>>
>> What I am doing is something like that :
>>
>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>    For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>       PairwiseScore +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>   End_For
>> End_For
>>
>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>
>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>
>> As for the solution presented in perl sorry but I dont see what you've did inside ?! You created a 2D array ? how to achieve operations inside , I think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
>>
>> Radwen
>>
>>
>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
>> Hi Radwen,
>>
>> I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences&  even if it did I don't quite see where the speed increase would come from.
>>
>> As for the All vs. All problem don't forget that really your total number of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:
>>
>> my @output;
>> my @elements = ('some','elements','something');
>> while(scalar(@elements)>  1) {
>>   my $target = pop(@elements);
>>   foreach my $remaining_element (@elements) {
>>     push(@output, [$target, $remaining_element]);
>>   }
>> }
>>
>> So this would have emitted:
>>
>> [
>>         ['some','elements'],
>>         ['some','something'],
>>         ['elements','something']
>> ]
>>
>> Try doing something similar to this using the Java Deque objects which can act as a stack.
>>
>> Hope this helps to answer your question
>>
>> Andy
>>
>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>
>>> Hello Biojava people,
>>>
>>> I have a question concerning Needlman Wunsh or Smith waterman algorithms.
>>> I am using Biojava 1.7 and I read sequences from proteins fasta file then I
>>> store my sequences into an array to calculate pairwise similarity scores
>>> using a for loop.
>>> The problem is that it is very time consuming if we want to calculate all
>>> pairwise for a big number of protein sequences. I would like to know if
>>> there is way to do a kind of "All against All" comparisons in one single
>>> step ?
>>> Do someone have a solution for this kind of problem ?
>>>
>>> Thanks for help.
>>>
>>> Radwen
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev