[Biopython-dev] Follow up to bug incoming/75

Jeffrey Chang jchang at smi.stanford.edu
Sat Jul 20 19:26:41 EDT 2002


Hi Daishi,

I'm afraid that the algorithm I implemented for Bio.Align.fastpairwise
just doesn't handle local alignments very well.  There are a lot of
boundary cases, and you have found one of them.

To address this, I have implemented a new dynamic programming
algorithm based on the Needleman-Wunsch algorithm that was implemented
in Bio.Align.pairwise.  I saved this as the module
Bio.Align.pairwise2.

Bio.Align.pairwise2 caches some gap scores, so that for affine gap
penalties, it has the same asymptotic running time as
Bio.Align.fastpairwise.  Plus, it maintains more information in its
data structure so it makes the traceback much easier to understand.  I
believe the code will be easier to maintain.

I did some profiling, and it seems that the for Bio.Align.pairwise2 is
slower than Bio.Align.fastpairwise, but the traceback is faster, which
evens things out.  Since this module replaces the functionality of
Bio.Align.pairwise and Bio.Align.fastpairwise, I am going to deprecate
them in future releases of Biopython.

Please take a look at Bio.Align.pairwise2 and let me know if you find
any problems!

Thanks,
Jeff




On Tue, Jul 16, 2002 at 03:23:46PM -0700, daishi at egcrc.net wrote:
> Jeffrey Chang wrote:
> > Hi Daishi,
> > 
> > Thanks for the bug report and patch.  I will look into this soon.
> > 
> > Jeff
> 
> Please ignore the previous patch; it contains a bug.
> I believe I now understand the purpose of L332-333;
> I have reincorporated the functionality in a slightly
> different way which works with my other modifications.
> The new patch follows:
> 
> *** /usr/local/src/biopython/Bio/Align/fastpairwise.py	Tue Jul  9 15:08:53 2002
> --- fastpairwise.py	Tue Jul 16 15:15:50 2002
> ***************
> *** 302,308 ****
>            else:
>                # Since the length of the alignment is unknown, set an
>                # offset from the end of the alignment.
> !             begin, end = 0, -len(seqA)
>                if not end:
>                    end = None
>            x = seqA, seqB, score, begin, end, row, col, []
> --- 302,308 ----
>            else:
>                # Since the length of the alignment is unknown, set an
>                # offset from the end of the alignment.
> !             begin, end = None, -len(seqA)
>                if not end:
>                    end = None
>            x = seqA, seqB, score, begin, end, row, col, []
> ***************
> *** 326,341 ****
>                    elif col < 0:
>                        begin = row + 1
>                for cseqA, cseqB, cbegin, taillen in path_cache[(row, col)]:
> !                 nbegin = begin
>                    nseqA = cseqA[:-taillen] + seqA
>                    nseqB = cseqB[:-taillen] + seqB
> !                 if nbegin < cbegin:
> !                     nbegin = cbegin
>                    x = nseqA, nseqB, score, nbegin, end, -1, -1, path
>                    in_process.append(x)
>            elif row < 0 and col < 0:
>                # This one is done.  Put it into the tracebacks list and
>                # continue.
>                tracebacks.append((seqA, seqB, score, begin, end))
>                path.pop()   # Don't cache the last step.
>                # Update the cache.
> --- 326,346 ----
>                    elif col < 0:
>                        begin = row + 1
>                for cseqA, cseqB, cbegin, taillen in path_cache[(row, col)]:
> !                 if begin:
> !                     nbegin = begin
> !                 else:
> !                     nbegin = cbegin
>                    nseqA = cseqA[:-taillen] + seqA
>                    nseqB = cseqB[:-taillen] + seqB
> ! #                if nbegin < cbegin:
> ! #                    nbegin = cbegin
>                    x = nseqA, nseqB, score, nbegin, end, -1, -1, path
>                    in_process.append(x)
>            elif row < 0 and col < 0:
>                # This one is done.  Put it into the tracebacks list and
>                # continue.
> +             if not begin:
> +                 begin = 0
>                tracebacks.append((seqA, seqB, score, begin, end))
>                path.pop()   # Don't cache the last step.
>                # Update the cache.
> ***************
> *** 347,360 ****
>            elif row < 0:
>                nseqA = gap_char + seqA
>                nseqB = sequenceB[col:col+1] + seqB
> !             if not global_alignment:
>                    begin = col + 1
>                x = nseqA, nseqB, score, begin, end, row, col-1, path
>                in_process.append(x)
>            elif col < 0:
>                nseqA = sequenceA[row:row+1] + seqA
>                nseqB = gap_char + seqB
> !             if not global_alignment:
>                    begin = row + 1
>                x = nseqA, nseqB, score, begin, end, row-1, col, path
>                in_process.append(x)
> --- 352,365 ----
>            elif row < 0:
>                nseqA = gap_char + seqA
>                nseqB = sequenceB[col:col+1] + seqB
> !             if not (global_alignment or begin):
>                    begin = col + 1
>                x = nseqA, nseqB, score, begin, end, row, col-1, path
>                in_process.append(x)
>            elif col < 0:
>                nseqA = sequenceA[row:row+1] + seqA
>                nseqB = gap_char + seqB
> !             if not (global_alignment or begin):
>                    begin = row + 1
>                x = nseqA, nseqB, score, begin, end, row-1, col, path
>                in_process.append(x)



More information about the Biopython-dev mailing list