[Biojava-dev] Pairwise Alignment methods

Felipe Albrecht felipe.albrecht at gmail.com
Fri Jan 25 06:39:45 UTC 2008


Okay, I agree with what you said.

I was looking at the SequenceAlignment source and noticed something strange.
In the formatOutput method the editDistance is multiplied by -1, but if you
call the NeedlemanWunsch pairwiseAlignment method, the editDistance is
returned without any multiplication.
That is, the score/editDistance shown by formatOutput differs in sign from
the one returned by NeedlemanWunsch pairwiseAlignment.

Which one is correct?

Thank you again

Felipe Albrecht
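
One possible reading of the sign difference described above, offered as an
assumption and not as a statement about the BioJava source: a value that is
minimised (an edit distance) and a value that is maximised (a similarity
score) can describe the same alignment and differ only by a factor of -1.
The sketch below uses plain Strings and unit penalties; ScoreSignDemo,
editDistance and score are hypothetical names, not BioJava API.

```java
final class ScoreSignDemo {

    /**
     * Global alignment cost with unit penalties (match 0, mismatch 1, gap 1).
     * This is a distance: smaller means more similar.
     */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // a prefix aligned against gaps
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // b prefix aligned against gaps
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitute = d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int gapInB = d[i - 1][j] + 1;
                int gapInA = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitute, Math.min(gapInB, gapInA));
            }
        }
        return d[a.length()][b.length()];
    }

    /** The same result reported as a similarity score: larger means more similar. */
    static int score(String a, String b) {
        return -editDistance(a, b);
    }

    public static void main(String[] args) {
        System.out.println("editDistance = " + editDistance("ACGT", "AGT"));
        System.out.println("score        = " + score("ACGT", "AGT"));
    }
}
```

Under this reading neither value is wrong by itself; what matters is that
every caller knows which of the two conventions a given method reports.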

On Jan 25, 2008 3:40 AM, Mark Schreiber <markjschreiber at gmail.com> wrote:

> On Jan 25, 2008 12:25 PM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > Hello again :-)
> >
> >
> > On Jan 25, 2008 1:43 AM, Mark Schreiber <markjschreiber at gmail.com> wrote:
> > >
> > > On Jan 25, 2008 10:06 AM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > > > Hi,
> > > >
> > > > > Would it be possible to add something like the following to the
> > > > > SequenceAlignment interface?
> > > > > "double doAlignmentAndGetTheScore(SymbolList symbolList1, SymbolList
> > > > > symbolList2)"
> > > > > Okay, the name is horrible, but you know what it means.
> > > >
>
> > If I'm correct, SequenceAlignment is an abstract class, so we can
> > define the method there with an empty implementation and let SmithWaterman
> > and the other classes implement it. Anyone who has already implemented
> > SequenceAlignment will not see anything different.
>
> OK in that case adding the method would be OK, even desirable.
> Probably this would be the best way to merge in your code.
>
> > Okay, now I understand: biojava is not a library for bioinformatics
> > applications, but for interconnecting bioinformatics applications. So biojava
>
> Actually it is a library for bioinformatics that you use to build
> bioinformatics applications.  It is possibly not as loosely coupled as
> you might like for your purpose. It is definitely not as loosely
> coupled as the Unix collection of executables or an SOA system.  Due
> to heavy use of interfaces and abstract classes there is some
> possibility for custom code.  For example you can recode the
> SmithWaterman object to be optimal for your needs and then create an
> application where you use your class in place of the normal biojava
> SmithWaterman.
>
> > in its current form is not appropriate for the application that I am
> > developing. I will develop some "optimized" classes and functions for my
> > own use, and when they are ready I will announce them on this mailing list
> > and ask whether you want to merge them into biojava. If the biojava team
> > needs somebody to improve some biojava functions, especially sequences and
> > sequence IO, just ask me.
>
> Code improvements and optimizations are always welcome especially if
> current interfaces can be preserved (that way the end user gets the
> improvement without having to change their code).  I always advise
> potential optimizers to use a profiler because it is sometimes hard to
> predict how the JVM will behave, for example JIT compiling may mean
> parts of code that are theoretically CPU intensive may not be the CPU
> bottleneck when the JVM compiles them.
>
> - Mark
>
> >
> > Thank you
> >
> > Felipe Albrecht
> >
> >
> >
> >
> >
> > >
> > >
> > >
> > >
> > >
> > > - Mark
> > >
> > > > > On Jan 24, 2008 11:26 PM, Mark Schreiber <markjschreiber at gmail.com> wrote:
> > > > > Hi Felipe -
> > > > >
> > > > > > I agree your method is more efficient but I think it violates the
> > > > > > SequenceAlignment interface, which would cause compatibility
> > > > > > problems. I also wonder what should happen if a user calls the
> > > > > > getAlignment() method if you have only calculated a score.
> > > > >
> > > > > instanceof is potentially expensive but it is nothing compared to
> > > > > actually performing the SmithWaterman.
> > > > >
> > > > > > Biojava is somewhat memory heavy but this is largely because it is
> > > > > > object oriented. Certainly something in C would be lighter and
> > > > > > faster but the whole point in using Java is the relative benefits of
> > > > > > object oriented design.  While ultra optimized algorithms were once
> > > > > > a major feature of bioinformatics this is becoming less necessary as
> > > > > > standard desktops are now equivalent to the super computers of 5
> > > > > > years ago.
> > > > >
> > > > > > I actually find the SW and NW to be reasonably fast. This is
> > > > > > because all the heavy lifting is done in loops that the JVM
> > > > > > presumably compiles and executes natively.
> > > > >
> > > > > - Mark
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Jan 25, 2008 3:40 AM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > > I saw the commit and I don't think this solution is the best one,
> > > > > > > because you are creating two Sequences internally, and the
> > > > > > > programmer will probably not use the other alignment information,
> > > > > > > only the score.
> > > > > >
> > > > > > > Because of this, I think that if you have 2 SymbolLists, the method
> > > > > > > should just do the alignment and return the score, as I did.
> > > > > > > Otherwise, if the programmer wants the "visual alignment", he should
> > > > > > > create the SimpleSequences externally; that is, the method should
> > > > > > > not do it.
> > > > > >
> > > > > > > IMHO, one [serious] problem in biojava is memory consumption: it
> > > > > > > has no "lightweight" classes or methods that do things quickly.
> > > > > > > Because of this, it may be a good choice to have a method that
> > > > > > > simply gives the alignment score and does not do the other things,
> > > > > > > like backtracking. Another thing: the cost of "instanceof" is high.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > Felipe Albrecht
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Jan 24, 2008 11:35 AM, Mark Schreiber <markjschreiber at gmail.com> wrote:
> > > > > > > Hi -
> > > > > > >
> > > > > > > > I have just committed changes that let you use SymbolLists in all
> > > > > > > > parts of the NW and SW SequenceAlignment objects.
> > > > > > > >
> > > > > > > > As you suggested I made the matrix a method local variable. I
> > > > > > > > also removed calls to the garbage collector.
> > > > > > >
> > > > > > > This can be checked out from SVN.
> > > > > > >
> > > > > > > - Mark
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Jan 24, 2008 9:05 PM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > > > > > > > > If you prefer, I can send a diff. Should I do the same thing in
> > > > > > > > > the SequenceAlignment and NeedlemanWunsch classes?
> > > > > > > >
> > > > > > > > Thank  you,
> > > > > > > >
> > > > > > > > Felipe Albrecht
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Jan 24, 2008 5:50 AM, Mark Schreiber <markjschreiber at gmail.com> wrote:
> > > > > > > > > Hi Felipe -
> > > > > > > > >
> > > > > > > > > > Thanks for the input on this. As a general rule the GC should
> > > > > > > > > > never be called from code. Generally this degrades performance
> > > > > > > > > > of the JVM. Unless there is a very good reason I will remove
> > > > > > > > > > this. Probably you are right, a method parameter may work better.
> > > > > > > > >
> > > > > > > > > - Mark
> > > > > > > > >
> > > > > > > > > > On Jan 24, 2008 1:47 PM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > I think it can be solved in a simple way:
> > > > > > > > > > > implement (or just copy and cut down) a pairwiseAlignment that
> > > > > > > > > > > takes SymbolLists as parameters and does not create an alignment,
> > > > > > > > > > > but just calculates it and returns the value.
> > > > > > > > > >
> > > > > > > > > > > Another thing that is a bit strange to me is the direct use of the
> > > > > > > > > > > garbage collector. The field "scoreMatrix" is a class field, so why
> > > > > > > > > > > is it set to null at the end of pairwiseAlignment and the garbage
> > > > > > > > > > > collector run? Isn't it better (and simpler) to use scoreMatrix as
> > > > > > > > > > > a method variable?
> > > > > > > > > >
> > > > > > > > > > > I'm attaching the class code with my changes, which handles well
> > > > > > > > > > > the (4^8) * (4^8) SymbolList pairwise alignments that I need :-)
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > Felipe Albrecht
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >  On Jan 23, 2008 6:50 AM, Mark Schreiber <
> > > > markjschreiber at gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > Hi Felipe -
> > > > > > > > > > >
> > > > > > > > > > > > I agree this is a barrier to ease of use. Even if Sequences are
> > > > > > > > > > > > required internally for some obscure reason there is no reason
> > > > > > > > > > > > why dummy Sequences cannot be made inside the aligner.  These
> > > > > > > > > > > > sequences could be given names like 'query' and 'subject' or
> > > > > > > > > > > > even 'seq1' and 'seq2'.
> > > > > > > > > > >
> > > > > > > > > > > I will take a look at adding some methods.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > >
> > > > > > > > > > > - Mark
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > On Jan 23, 2008 2:58 PM, Felipe Albrecht <felipe.albrecht at gmail.com> wrote:
> > > > > > > > > > > > Hello all,
> > > > > > > > > > > >
> > > > > > > > > > > > > I have a simple question about the pairwise alignment classes
> > > > > > > > > > > > > (SmithWaterman and NeedlemanWunsch):
> > > > > > > > > > > > > why does an alignment require two Sequences and not two
> > > > > > > > > > > > > SymbolLists?
> > > > > > > > > > > >
> > > > > > > > > > > > > For example, I have a collection of SymbolLists to align with
> > > > > > > > > > > > > each other, so I need to create some "dummy" Sequences just to
> > > > > > > > > > > > > do the alignment.
> > > > > > > > > > > >
> > > > > > > > > > > > > Reading the source, I saw that the only field that is exclusive
> > > > > > > > > > > > > to Sequence is the name, used for the alignment output,
> > > > > > > > > > > > > but if I need only the alignment result, it is useless.
> > > > > > > > > > > >
> > > > > > > > > > > > > Would it be possible to overload pairwiseAlignment to accept
> > > > > > > > > > > > > SymbolLists, or perhaps add a new method that takes 2
> > > > > > > > > > > > > SymbolLists as parameters and returns the alignment score?
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you
> > > > > > > > > > > >
> > > > > > > > > > > > Felipe Albrecht
> > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > biojava-dev mailing list
> > > > > > > > > > > > biojava-dev at lists.open-bio.org
> > > > > > > > > > > >
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
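
The design discussed in the thread above (a score-only method added to an
abstract base class with a default body, a score matrix kept in method-local
variables, and no explicit garbage collector calls) could be sketched roughly
as follows. This is an illustration under stated assumptions, not the BioJava
code: it uses plain Strings in place of SymbolList, fixed match/mismatch/gap
values, and the hypothetical names Aligner, LegacyAligner, GlobalScoreAligner
and scoreOnly.

```java
final class ScoreOnlyDemo {

    /** Hypothetical stand-in for an abstract aligner base class. */
    abstract static class Aligner {
        /** Existing contract: produce a printable alignment. */
        abstract String align(String query, String subject);

        /** New method with a default body, so existing subclasses still compile. */
        double scoreOnly(String query, String subject) {
            throw new UnsupportedOperationException("score-only alignment not supported");
        }
    }

    /** An older subclass that knows nothing about scoreOnly. */
    static class LegacyAligner extends Aligner {
        @Override
        String align(String query, String subject) {
            return query + "\n" + subject; // placeholder, no real alignment
        }
    }

    /** A subclass that overrides scoreOnly with a global alignment score. */
    static class GlobalScoreAligner extends LegacyAligner {
        static final int MATCH = 1, MISMATCH = -1, GAP = -1; // assumed scoring values

        @Override
        double scoreOnly(String query, String subject) {
            // Both rows are method-local: no field to null out, no System.gc() call.
            int[] prev = new int[subject.length() + 1];
            int[] curr = new int[subject.length() + 1];
            for (int j = 0; j <= subject.length(); j++) prev[j] = j * GAP;
            for (int i = 1; i <= query.length(); i++) {
                curr[0] = i * GAP;
                for (int j = 1; j <= subject.length(); j++) {
                    int diag = prev[j - 1]
                            + (query.charAt(i - 1) == subject.charAt(j - 1) ? MATCH : MISMATCH);
                    int up = prev[j] + GAP;
                    int left = curr[j - 1] + GAP;
                    curr[j] = Math.max(diag, Math.max(up, left));
                }
                int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
            }
            return prev[subject.length()]; // no backtracking, only the score
        }
    }

    public static void main(String[] args) {
        Aligner aligner = new GlobalScoreAligner();
        System.out.println(aligner.scoreOnly("ACGT", "AGT"));
    }
}
```

Keeping only two rows means memory grows with the length of one sequence
and not with the product of both lengths, which matters when many pairs are
scored. The trade-off is that the alignment itself cannot be recovered
afterwards, which is the getAlignment() concern raised in the thread.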


