[Biopython] Looking for a way to apply pairwise2 but really fast

Fri Jul 12 13:10:32 UTC 2013

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads or processes and/or a cluster, e.g. look at
Python's subprocess or multiprocessing modules, or simply run 300
parallel jobs, one for each subject.
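
A minimal sketch of the parallel idea, assuming Biopython is installed and
that only the best-scoring subject per query is needed. The helper names
(pairwise2_score, best_subject, best_subjects) are made up for illustration,
and the work is split across queries rather than across subjects because
queries are the larger set. Passing score_only=True to pairwise2 skips the
traceback and returns just the score, which by itself usually saves a lot
of time.

```python
from functools import partial
from multiprocessing import Pool


def pairwise2_score(query, subject):
    """Global alignment score with the parameters from the original post."""
    from Bio import pairwise2  # imported lazily; needs Biopython installed
    # score_only=True returns only the score, no alignment objects.
    return pairwise2.align.globalms(
        query, subject, 2, -1, -0.5, -0.1, score_only=True
    )


def best_subject(query, subjects, score_fn):
    """Return (query, index of the best-scoring subject, its score)."""
    scores = [score_fn(query, s) for s in subjects]
    best = max(range(len(scores)), key=scores.__getitem__)
    return query, best, scores[best]


def best_subjects(queries, subjects, score_fn=pairwise2_score,
                  workers=1, chunksize=1000):
    """Score every query against every subject, optionally in parallel."""
    task = partial(best_subject, subjects=subjects, score_fn=score_fn)
    if workers == 1:
        # Serial fallback, handy for debugging and small inputs.
        return list(map(task, queries))
    with Pool(workers) as pool:
        # Large chunks keep inter-process overhead low for many queries.
        return pool.map(task, queries, chunksize=chunksize)
```

When using workers > 1, call best_subjects() from under an
`if __name__ == "__main__":` guard so that it also works on platforms
where multiprocessing starts fresh interpreters.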

Alternatively, use a specialised tool, perhaps one with heuristic
matching such as BLAST, or EMBOSS needle or needleall:
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html
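
As a rough sketch, needleall could be driven from Python via subprocess
along these lines. This assumes EMBOSS is installed and on the PATH, and
that queries and subjects are already in FASTA files; the function names,
file names and gap penalties here are placeholders, and the qualifiers
should be checked against `needleall -help` for your EMBOSS version.

```python
import subprocess


def needleall_command(a_fasta, b_fasta, out_path,
                      gapopen=10.0, gapextend=0.5):
    """Build the argument list for an all-against-all needleall run."""
    return [
        "needleall",
        "-asequence", a_fasta,
        "-bsequence", b_fasta,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", out_path,
        "-auto",  # do not prompt for any missing values
    ]


def run_needleall(a_fasta, b_fasta, out_path, **penalties):
    """Run needleall and raise CalledProcessError if it fails."""
    cmd = needleall_command(a_fasta, b_fasta, out_path, **penalties)
    subprocess.run(cmd, check=True)
    return out_path
```

Passing the arguments as a list (rather than one shell string) avoids
quoting problems with unusual file names.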

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter