[Biojava-dev] Pairwise similarity score speed
Andy Yates
ayates at ebi.ac.uk
Thu Jul 1 11:28:01 UTC 2010
I believe that you could use Jakarta commons-math & more specifically org.apache.commons.math.stat.descriptive.DescriptiveStatistics which will give you min, max, std deviation & anything else you'd expect to be able to use to describe a range of values
Andy
On 1 Jul 2010, at 12:05, Radhouane Aniba wrote:
> Mark !
>
> Can we modify the code you gave to :
>
> - extract the MIN pairwise similarity score
> - extract the MAX pairwise similarity score
> - calculate the standard deviation of similarity scores
>
> ?
>
> Rad
>
> 2010/6/30 Mark Chapman <chapman at cs.wisc.edu>
>
>> Hi Radwen,
>>
>> I have already added this functionality to the BioJava3 alignment package.
>> The code is available on the repository [1] and current builds are on the
>> web site [2]. The necessary files are [3] and [4] and in the example code
>> that follows you should only have to replace "piwi-seed-fasta.txt" with your
>> file name. Also, to switch from Needleman-Wunsch to Smith-Waterman, just
>> change PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL .
>>
>>
>> int similars = 0, total = 0;
>> GapPenalty gaps = new SimpleGapPenalty();
>> SubstitutionMatrix<AminoAcidCompound> blosum62 =
>> new SimpleSubstitutionMatrix<AminoAcidCompound>();
>>
>> List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
>> try {
>> piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
>> new File("piwi-seed-fasta.txt")).values());
>> } catch (Exception e) {
>> e.printStackTrace();
>> }
>>
>> for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
>> Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
>> blosum62)) {
>> PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
>> new FractionalSimilarityScorer<ProteinSequence,
>> AminoAcidCompound>(pair);
>> System.out.printf("%n%s vs %s : %d / %d%n%s",
>> pair.getQuery().getAccession(),
>> pair.getTarget().getAccession(), scorer.getScore(),
>> scorer.getMaxScore(),
>> pair);
>> similars += scorer.getScore();
>> total += scorer.getMaxScore();
>> }
>>
>> System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
>> (double)similars/total);
>>
>> ConcurrencyTools.shutdown();
>>
>>
>> [1] http://biojava.org/wiki/CVS_to_SVN_Migration
>> [2] http://biojava.org/download/maven/
>> [3]
>> http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
>> [4]
>> http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar
>>
>>
>> Enjoy,
>> Mark
>>
>>
>>
>> On 6/30/2010 7:05 AM, Andy Yates wrote:
>>
>>> It was more of a way of decomposing the operations into a data structure
>>> where each element in the 1st dimension represents the elements to compare
>>> together. Really the Perl code is a way of describing the operations to
>>> occur in order to cover all possible permutations.
>>>
>>> Andy
>>>
>>> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>>>
>>> Hi Andy,
>>>>
>>>> Thank you for your reply.
>>>> Actually, I was thinking about a parallelization method or a kind of
>>>> hadoop like implementation to do all pairwise comparison. The aim is that at
>>>> the end i would like to calculate the average pairwise similarity score
>>>> within a set of sequences.
>>>>
>>>> What I am doing is something like that :
>>>>
>>>> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>>>> For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>>>> PairwiseScore
>>>> +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>>>> End_For
>>>> End_For
>>>>
>>>> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
>>>>
>>>> In fact the problem is in the ((n * (n-1)) / 2) operations.
>>>>
>>>> As for the solution presented in perl sorry but I dont see what you've
>>>> did inside ?! You created a 2D array ? how to achieve operations inside , I
>>>> think this do not resolve the ((n * (n-1)) / 2) problem ? Isn't it ?
>>>>
>>>> Radwen
>>>>
>>>>
>>>> 2010/6/30 Andy Yates<ayates at ebi.ac.uk>
>>>> Hi Radwen,
>>>>
>>>> I would have said that this is more of a problem because of the type of
>>>> algorithm you are using. It's impossible (as far as I am aware) to calculate
>>>> the score matrices in one step for multiple sequences& even if it did I
>>>> don't quite see where the speed increase would come from.
>>>>
>>>> As for the All vs. All problem don't forget that really your total number
>>>> of comparisons is ((n * (n-1)) / 2) where n is the number of sequences you
>>>> are comparing so a simple 2D for loop will have you spending twice the
>>>> amount of time on this than needs to occur. When I've done this before (in
>>>> Perl so excuse the usage of it) the code looks like this:
>>>>
>>>> my @output;
>>>> my @elements = ('some','elements','something');
>>>> while(scalar(@elements)> 1) {
>>>> my $target = pop(@elements);
>>>> foreach my $remaining_element (@elements) {
>>>> push(@output, [$target, $remaining_element]);
>>>> }
>>>> }
>>>>
>>>> So this would have emitted:
>>>>
>>>> [
>>>> ['some','elements'],
>>>> ['some','something'],
>>>> ['elements','something']
>>>> ]
>>>>
>>>> Try doing something similar to this using the Java Deque objects which
>>>> can act as a stack.
>>>>
>>>> Hope this helps to answer your question
>>>>
>>>> Andy
>>>>
>>>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>>>
>>>> Hello Biojava people,
>>>>>
>>>>> I have a question concerning Needlman Wunsh or Smith waterman
>>>>> algorithms.
>>>>> I am using Biojava 1.7 and I read sequences from proteins fasta file
>>>>> then I
>>>>> store my sequences into an array to calculate pairwise similarity scores
>>>>> using a for loop.
>>>>> The problem is that it is very time consuming if we want to calculate
>>>>> all
>>>>> pairwise for a big number of protein sequences. I would like to know if
>>>>> there is way to do a kind of "All against All" comparisons in one single
>>>>> step ?
>>>>> Do someone have a solution for this kind of problem ?
>>>>>
>>>>> Thanks for help.
>>>>>
>>>>> Radwen
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>
>>>>
>
>
> --
> R. ANIBA
>
> Bioinformatics PhD
> Laboratoire de Bioinformatique et Génomique Intégrative,
> Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
--
Andrew Yates Ensembl Genomes Engineer
EMBL-EBI Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK http://www.ensemblgenomes.org/
More information about the biojava-dev
mailing list