From andreas at sdsc.edu  Tue Jun 29 19:00:25 2010
From: andreas at sdsc.edu (Andreas Prlic)
Date: Tue, 29 Jun 2010 16:00:25 -0700
Subject: [Biojava-dev] BioJava 3 current status
Message-ID: <AANLkTikSqrRA2QTozTJFxW7-q5y9aLm0rNHYBhkY4blm@mail.gmail.com>

Hi,

The mailing lists have been a bit quiet lately, but there has been a lot of
communication off-list and many SVN commits. So here is a quick
update on the current status of BioJava 3 and the Google Summer of
Code projects:

The two GSoC projects are well under way and on track. If you are
interested in following what is going on, check out the project pages
for

Posttranslational Modification: http://biojava.org/wiki/GSoC:PTM

Multiple Sequence Alignment: http://biojava.org/wiki/GSoC:MSA


About BioJava 3: It has made great progress over the last few weeks and
a lot of new functionality has been committed to SVN. To help get this
release ready, there are now two new tools:

* There is now a BioJava Maven repository, which is hosting SNAPSHOT
builds from the current SVN.

http://www.biojava.org/download/maven/


* These builds are produced by an automated build system (using
CruiseControl), which is running at:

http://emmy.rcsb.org:8080/cruisecontrol/ and
http://emmy.rcsb.org:8080/dashboard/

Andreas

From aradwen at gmail.com  Wed Jun 30 07:18:01 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Wed, 30 Jun 2010 13:18:01 +0200
Subject: [Biojava-dev] Pairwise similarity score speed
Message-ID: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>

Hello Biojava people,

I have a question concerning the Needleman-Wunsch and Smith-Waterman algorithms.
I am using BioJava 1.7. I read sequences from a protein FASTA file, then
store them in an array and calculate pairwise similarity scores
using a for loop.
The problem is that this is very time consuming when calculating all
pairwise scores for a large number of protein sequences. I would like to know if
there is a way to do a kind of "all against all" comparison in one single
step.
Does anyone have a solution for this kind of problem?

Thanks for your help.

Radwen

From ayates at ebi.ac.uk  Wed Jun 30 07:48:55 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 30 Jun 2010 12:48:55 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>
Message-ID: <A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>

Hi Radwen,

I would say this is more of a problem with the type of algorithm you are using. It is impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences, and even if it were possible I don't quite see where the speed increase would come from.

As for the all vs. all problem, don't forget that your total number of comparisons is really ((n * (n-1)) / 2), where n is the number of sequences you are comparing, so a simple 2D for loop over every (i, j) will have you spending twice as much time on this as necessary. When I've done this before (in Perl, so excuse the usage of it) the code looked like this:

my @output;
my @elements = ('some','elements','something');
while(scalar(@elements) > 1) {
  my $target = pop(@elements);
  foreach my $remaining_element (@elements) {
    push(@output, [$target, $remaining_element]);
  }
}

So this would have emitted:

[
	['something','some'],
	['something','elements'],
	['elements','some']
]

Try doing something similar to this using the Java Deque objects which can act as a stack.
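
A minimal Java sketch of that suggestion, assuming Java 9 or later (for List.of). The names AllPairs and allPairs are illustrative only and are not part of BioJava. pollLast() plays the role of Perl's pop, so each element is paired only with the elements still left on the stack, which gives n * (n - 1) / 2 pairs in total.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class AllPairs {

    /** Returns every unordered pair of elements exactly once. */
    public static <T> List<List<T>> allPairs(List<T> elements) {
        // Copy into a deque so the caller's list is left untouched.
        Deque<T> stack = new ArrayDeque<>(elements);
        List<List<T>> pairs = new ArrayList<>();
        while (stack.size() > 1) {
            // Take the last element off the stack (like Perl's pop).
            T target = stack.pollLast();
            // Pair it with everything that remains, first to last.
            for (T remaining : stack) {
                pairs.add(List.of(target, remaining));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints [[something, some], [something, elements], [elements, some]]
        System.out.println(allPairs(List.of("some", "elements", "something")));
    }
}
```

Each inner list can then be handed to whatever pairwise scoring routine is in use.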

Hope this helps to answer your question

Andy

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From aradwen at gmail.com  Wed Jun 30 08:01:02 2010
From: aradwen at gmail.com (Radhouane Aniba)
Date: Wed, 30 Jun 2010 14:01:02 +0200
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com> 
	<A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>
Message-ID: <AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>

Hi Andy,

Thank you for your reply.
Actually, I was thinking about a parallelization method or a kind of
Hadoop-like implementation to do all the pairwise comparisons. The aim is
that, in the end, I would like to calculate the average pairwise similarity
score within a set of sequences.

What I am doing is something like this:

N = Length(ARRAY_OF_SEQUENCES)
For I = 0 to N-2
  For J = I+1 to N-1
     PairwiseScore += CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I], ARRAY_OF_SEQUENCES[J])
  End_For
End_For

Average_Score = PairwiseScore / (N * (N-1) / 2)

In fact the problem is in the ((n * (n-1)) / 2) operations.
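
Since every pair is scored independently, the averaging loop above can be spread over a thread pool. The following is only a sketch in plain Java, assuming Java 9 or later. AveragePairwiseScore, average and the toy scorer are hypothetical names, and the scorer stands in for a real Needleman-Wunsch or Smith-Waterman score. The sum is divided by the number of pairs, n * (n - 1) / 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleBiFunction;

public class AveragePairwiseScore {

    /** Scores every unordered pair once, in parallel, and returns the mean score. */
    public static <S> double average(List<S> seqs, ToDoubleBiFunction<S, S> scorer, int threads) {
        int n = seqs.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two sequences");
        }
        // One task per unordered pair (i < j).
        List<Callable<Double>> tasks = new ArrayList<>();
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {
                final S a = seqs.get(i);
                final S b = seqs.get(j);
                tasks.add(() -> scorer.applyAsDouble(a, b));
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            double sum = 0.0;
            // invokeAll blocks until all tasks have completed.
            for (Future<Double> f : pool.invokeAll(tasks)) {
                sum += f.get();
            }
            return sum / tasks.size(); // tasks.size() == n * (n - 1) / 2
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scoring", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("scoring failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy scorer (assumption, not a real similarity measure): length difference.
        List<String> seqs = List.of("A", "AB", "ABCD");
        double avg = average(seqs, (a, b) -> Math.abs(a.length() - b.length()), 2);
        System.out.println(avg); // prints 2.0
    }
}
```

This does not reduce the n * (n - 1) / 2 comparisons, it only runs them concurrently. For very large sets, one task per pair is probably too fine-grained, and grouping pairs into batches would be a reasonable next step.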

As for the solution presented in Perl, sorry, but I don't see what you did
there. You created a 2D array? How do I perform the operations with it? I
think this does not resolve the ((n * (n-1)) / 2) problem, does it?

Radwen


-- 
R. ANIBA

Bioinformatics PhD
Laboratoire de Bioinformatique et Génomique Intégrative,
Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC),
1 rue Laurent Fries,
67404 Illkirch, France.
http://www-igbmc.u-strasbg.fr
http://alnitak.u-strasbg.fr/~aniba/alexsys


From ayates at ebi.ac.uk  Wed Jun 30 08:05:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 30 Jun 2010 13:05:41 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>
	<A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>
	<AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
Message-ID: <E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>

It was more a way of decomposing the operations into a data structure, where each element in the first dimension holds a pair of elements to compare. Really, the Perl code is a way of describing the operations needed to cover all possible pairwise combinations.

Andy

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From chapman at cs.wisc.edu  Wed Jun 30 16:01:47 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 30 Jun 2010 15:01:47 -0500
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>	<A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>	<AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
	<E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>
Message-ID: <4C2BA2AB.1000307@cs.wisc.edu>

Hi Radwen,

I have already added this functionality to the BioJava3 alignment package. The
code is available in the repository [1] and current builds are on the web site
[2]. The necessary files are [3] and [4]. In the example code that follows,
you should only have to replace "piwi-seed-fasta.txt" with your file name.
Also, to switch from Needleman-Wunsch to Smith-Waterman, just change
PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL.


int similars = 0, total = 0;
GapPenalty gaps = new SimpleGapPenalty();
SubstitutionMatrix<AminoAcidCompound> blosum62 =
     new SimpleSubstitutionMatrix<AminoAcidCompound>();

List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
try {
   piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
       new File("piwi-seed-fasta.txt")).values());
} catch (Exception e) {
   e.printStackTrace();
}

for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
     Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
         blosum62)) {
   PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
       new FractionalSimilarityScorer<ProteinSequence, AminoAcidCompound>(pair);
   System.out.printf("%n%s vs %s : %d / %d%n%s", pair.getQuery().getAccession(),
       pair.getTarget().getAccession(), scorer.getScore(), scorer.getMaxScore(),
       pair);
   similars += scorer.getScore();
   total += scorer.getMaxScore();
}

System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
     (double)similars/total);

ConcurrencyTools.shutdown();


[1] http://biojava.org/wiki/CVS_to_SVN_Migration
[2] http://biojava.org/download/maven/
[3] http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
[4] http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar


Enjoy,
Mark


From ayates at ebi.ac.uk  Wed Jun 30 12:05:41 2010
From: ayates at ebi.ac.uk (Andy Yates)
Date: Wed, 30 Jun 2010 13:05:41 +0100
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>
	<A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>
	<AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
Message-ID: <E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>

It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents the elements to compare together. Really the Perl code is a way of describing the operations to occur in order to cover all possible permutations.

Andy

On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:

> Hi Andy, 
> 
> Thank you for your reply.
> Actually, I was thinking about a parallelization method or a kind of hadoop like implementation to do all pairwise comparison. The aim is that at the end i would like to calculate the average pairwise similarity score within a set of sequences.
> 
> What I am doing is something like that :
> 
> For I = 0 to I = Length(ARRAY_OF_SEQUENCES)-1
>   For J=I+1 to J=Length(ARRAY_OF_SEQUENCES)
>      PairwiseScore +=CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I],ARRAY_OF_SEQUENCES[J])
>  End_For
> End_For
> 
> Average_Score = PairwiseScore/length(ARRAY_OF_SEQUENCES)
> 
> In fact the problem is in the ((n * (n-1)) / 2) operations.
> 
> As for the solution presented in perl sorry but I dont see what you've did inside ?! You created a 2D array ? how to achieve operations inside , I think this do not resolve the ((n * (n-1)) / 2)  problem ? Isn't it ?
> 
> Radwen
> 
> 
> 2010/6/30 Andy Yates <ayates at ebi.ac.uk>
> Hi Radwen,
> 
> I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences & even if it did I don't quite see where the speed increase would come from.
> 
> As for the All vs. All problem don't forget that really your total number of comparisons is  ((n * (n-1)) / 2) where n is the number of sequences you are comparing so a simple 2D for loop will have you spending twice the amount of time on this than needs to occur. When I've done this before (in Perl so excuse the usage of it) the code looks like this:
> 
> my @output;
> my @elements = ('some','elements','something');
> while(scalar(@elements) > 1) {
>  my $target = pop(@elements);
>  foreach my $remaining_element (@elements) {
>    push(@output, [$target, $remaining_element]);
>  }
> }
> 
> So this would have emitted:
> 
> [
>        ['some','elements'],
>        ['some','something'],
>        ['elements','something']
> ]
> 
> Try doing something similar to this using the Java Deque objects which can act as a stack.
> 
> Hope this helps to answer your question
> 
> Andy
> 
> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
> 
> > Hello Biojava people,
> >
> > I have a question concerning Needlman Wunsh or Smith waterman algorithms.
> > I am using Biojava 1.7 and I read sequences from proteins fasta file then I
> > store my sequences into an array to calculate pairwise similarity scores
> > using a for loop.
> > The problem is that it is very time consuming if we want to calculate all
> > pairwise for a big number of protein sequences. I would like to know if
> > there is way to do a kind of "All against All" comparisons in one single
> > step ?
> > Do someone have a solution for this kind of problem ?
> >
> > Thanks for help.
> >
> > Radwen
> > _______________________________________________
> > biojava-dev mailing list
> > biojava-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
> 
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/
> 
> 
> 
> 
> 
> 
> 
> -- 
> R. ANIBA
> 
> Bioinformatics PhD
> Laboratoire de Bioinformatique et G?nomique Int?grative,
> Institut de G?n?tique et de Biologie Mol?culaire et Cellulaire (IGBMC),
> 1 rue Laurent Fries,
> 67404 Illkirch, France.
> http://www-igbmc.u-strasbg.fr
> http://alnitak.u-strasbg.fr/~aniba/alexsys 
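
The Perl pairing loop quoted above could be sketched in Java roughly as
follows. This is only an illustration under stated assumptions: the class
name AllPairs and the method allPairs are made up for this example and are
not part of BioJava. The set of pairs is the same as in the Perl version;
only their order may differ, because ArrayDeque.pop() removes the first
element while Perl's pop removes the last.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper class, for illustration only (not a BioJava API).
class AllPairs {
    // Pops one element at a time and pairs it with everything still on the
    // stack, so each unordered pair is produced exactly once:
    // n * (n - 1) / 2 pairs in total.
    static <T> List<List<T>> allPairs(List<T> items) {
        Deque<T> stack = new ArrayDeque<>(items);
        List<List<T>> pairs = new ArrayList<>();
        while (stack.size() > 1) {
            T target = stack.pop(); // removes the head of the deque
            for (T remaining : stack) {
                pairs.add(Arrays.asList(target, remaining));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList("some", "elements", "something");
        for (List<String> pair : allPairs(elements)) {
            System.out.println(pair);
        }
    }
}
```

Each pair in the returned list is then one alignment to run. The number of
alignments is still n * (n - 1) / 2; the loop only avoids doing any pair
twice.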

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/


From chapman at cs.wisc.edu  Wed Jun 30 20:01:47 2010
From: chapman at cs.wisc.edu (Mark Chapman)
Date: Wed, 30 Jun 2010 15:01:47 -0500
Subject: [Biojava-dev] Pairwise similarity score speed
In-Reply-To: <E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>
References: <AANLkTimFPjaU203rIO6LFUwRN6w6WaIiDVF4UHIL4Y7X@mail.gmail.com>	<A5D0F697-B933-493E-8F73-D5EE0A4E50BD@ebi.ac.uk>	<AANLkTil3-JMDD4tWx1m1r6HaqisR8USr-DYfFe_zXyxN@mail.gmail.com>
	<E91D377C-7388-4B5B-89D0-A316AE8E1339@ebi.ac.uk>
Message-ID: <4C2BA2AB.1000307@cs.wisc.edu>

Hi Radwen,

I have already added this functionality to the BioJava3 alignment package.  The
code is available in the repository [1] and current builds are on the web site
[2].  The necessary files are [3] and [4].  In the example code that follows,
you should only have to replace "piwi-seed-fasta.txt" with your file name.
Also, to switch from Needleman-Wunsch to Smith-Waterman, just change
PairwiseAligner.GLOBAL to PairwiseAligner.LOCAL.


// Running totals of the similarity scores over all pairs
int similars = 0, total = 0;
// Default gap penalty and default substitution matrix
GapPenalty gaps = new SimpleGapPenalty();
SubstitutionMatrix<AminoAcidCompound> blosum62 =
     new SimpleSubstitutionMatrix<AminoAcidCompound>();

// Read every protein sequence from the FASTA file
List<ProteinSequence> piwi = new ArrayList<ProteinSequence>();
try {
   piwi.addAll(FastaReaderHelper.readFastaProteinSequence(
       new File("piwi-seed-fasta.txt")).values());
} catch (Exception e) {
   e.printStackTrace();
}

// Align every pair of sequences globally, then score each alignment
for (SequencePair<ProteinSequence, AminoAcidCompound> pair :
     Alignments.getAllPairsAlignments(piwi, PairwiseAligner.GLOBAL, gaps,
         blosum62)) {
   PairwiseSequenceScorer<ProteinSequence, AminoAcidCompound> scorer =
       new FractionalSimilarityScorer<ProteinSequence, AminoAcidCompound>(pair);
   System.out.printf("%n%s vs %s : %d / %d%n%s", pair.getQuery().getAccession(),
       pair.getTarget().getAccession(), scorer.getScore(), scorer.getMaxScore(),
       pair);
   similars += scorer.getScore();
   total += scorer.getMaxScore();
}

// Overall similarity: summed scores over summed maximum scores
System.out.printf("%nAverage similarity = %d / %d = %f", similars, total,
     (double)similars/total);

// Shut down the shared thread pool once all alignments are done
ConcurrencyTools.shutdown();
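
For readers who want to see the general pattern without the BioJava3
snapshot, here is a minimal sketch using only the standard library: every
unordered pair is submitted to a thread pool, and the scores are averaged
over the n * (n - 1) / 2 pairs. It is not BioJava's implementation. The class
AllPairsAverage and the method score are hypothetical stand-ins; score simply
counts identical positions, where real code would call an aligner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class (not a BioJava API).
class AllPairsAverage {
    // Stand-in for a real alignment score: counts identical characters at
    // the same position, over the length of the shorter string.
    static int score(String a, String b) {
        int n = Math.min(a.length(), b.length()), same = 0;
        for (int k = 0; k < n; k++) {
            if (a.charAt(k) == b.charAt(k)) same++;
        }
        return same;
    }

    // Scores every unordered pair (i < j) on a fixed-size thread pool and
    // returns the mean over the n * (n - 1) / 2 pairs.
    static double averagePairwiseScore(List<String> seqs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < seqs.size() - 1; i++) {
                for (int j = i + 1; j < seqs.size(); j++) {
                    final String a = seqs.get(i), b = seqs.get(j);
                    Callable<Integer> task = () -> score(a, b);
                    futures.add(pool.submit(task));
                }
            }
            long sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get(); // blocks until that pair has been scored
            }
            return futures.isEmpty() ? 0.0 : (double) sum / futures.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> seqs = Arrays.asList("MKVL", "MKIL", "MRVL");
        System.out.println(averagePairwiseScore(seqs, 4));
    }
}
```

The pairs are independent of each other, so the work spreads across cores
without any locking; the total amount of work stays at n * (n - 1) / 2
comparisons.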


[1] http://biojava.org/wiki/CVS_to_SVN_Migration
[2] http://biojava.org/download/maven/
[3] 
http://biojava.org/download/maven/org/biojava/biojava3-core/3.0-SNAPSHOT/biojava3-core-3.0-SNAPSHOT.jar
[4] 
http://biojava.org/download/maven/org/biojava/biojava3-alignment/3.0-SNAPSHOT/biojava3-alignment-3.0-SNAPSHOT.jar


Enjoy,
Mark


On 6/30/2010 7:05 AM, Andy Yates wrote:
> It was more of a way of decomposing the operations into a data structure where each element in the 1st dimension represents a pair of elements to compare. Really the Perl code is a way of describing the operations to occur in order to cover all possible pairwise combinations.
>
> Andy
>
> On 30 Jun 2010, at 13:01, Radhouane Aniba wrote:
>
>> Hi Andy,
>>
>> Thank you for your reply.
>> Actually, I was thinking about a parallelization method or a kind of Hadoop-like implementation to do all the pairwise comparisons. The aim is that, at the end, I would like to calculate the average pairwise similarity score within a set of sequences.
>>
>> What I am doing is something like this:
>>
>> PairwiseScore = 0
>> For I = 0 to Length(ARRAY_OF_SEQUENCES)-2
>>    For J = I+1 to Length(ARRAY_OF_SEQUENCES)-1
>>       PairwiseScore += CALCULATE_PAIRWISE(ARRAY_OF_SEQUENCES[I], ARRAY_OF_SEQUENCES[J])
>>    End_For
>> End_For
>>
>> N = Length(ARRAY_OF_SEQUENCES)
>> Average_Score = PairwiseScore / ((N * (N-1)) / 2)
>>
>> In fact the problem is the ((n * (n-1)) / 2) operations.
>>
>> As for the solution presented in Perl, sorry, but I don't see what you did inside. You created a 2D array? How are the operations carried out inside it? I think this does not resolve the ((n * (n-1)) / 2) problem, does it?
>>
>> Radwen
>>
>>
>> 2010/6/30 Andy Yates <ayates at ebi.ac.uk>
>> Hi Radwen,
>>
>> I would have said that this is more of a problem because of the type of algorithm you are using. It's impossible (as far as I am aware) to calculate the score matrices in one step for multiple sequences, and even if it were possible I don't quite see where the speed increase would come from.
>>
>> As for the All vs. All problem, don't forget that the total number of comparisons you actually need is (n * (n-1)) / 2, where n is the number of sequences you are comparing, so a simple 2D for loop over all n * n combinations will have you spending about twice as much time on this as necessary. When I've done this before (in Perl, so excuse the usage of it) the code looked like this:
>>
>> my @output;
>> my @elements = ('some','elements','something');
>> while(scalar(@elements) > 1) {
>>   my $target = pop(@elements);
>>   foreach my $remaining_element (@elements) {
>>     push(@output, [$target, $remaining_element]);
>>   }
>> }
>>
>> So this would have emitted:
>>
>> [
>>         ['something','some'],
>>         ['something','elements'],
>>         ['elements','some']
>> ]
>>
>> Try doing something similar to this using the Java Deque objects which can act as a stack.
>>
>> Hope this helps to answer your question
>>
>> Andy
>>
>> On 30 Jun 2010, at 12:18, Radhouane Aniba wrote:
>>
>>> Hello Biojava people,
>>>
>>> I have a question concerning the Needleman-Wunsch or Smith-Waterman algorithms.
>>> I am using BioJava 1.7. I read sequences from a protein FASTA file, then I
>>> store my sequences in an array to calculate pairwise similarity scores
>>> using a for loop.
>>> The problem is that it is very time-consuming if we want to calculate all
>>> pairwise scores for a large number of protein sequences. I would like to know if
>>> there is a way to do a kind of "All against All" comparison in one single
>>> step?
>>> Does someone have a solution for this kind of problem?
>>>
>>> Thanks for help.
>>>
>>> Radwen