[Biopython] Looking for a way to apply pairwise2 but really fast

Peter Cock p.j.a.cock at googlemail.com
Fri Jul 12 13:10:32 UTC 2013


On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple processes and/or a cluster, e.g. look at the subprocess
or multiprocessing modules, or simply run 300 parallel jobs, one for each
subject.
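
As a rough illustration of the parallel approach, here is a minimal sketch
that spreads the query/subject pairs over worker processes. The helper names
(pairwise2_score, score_all) and the chunksize default are made up for this
example; the scoring parameters are the ones from the question. It assumes
only the score is needed, so it passes score_only=True, which skips building
the alignments themselves.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def pairwise2_score(pair):
    """Score one (query, subject) pair with the parameters from the question."""
    # Imported lazily so this module still loads where Biopython is absent.
    from Bio import pairwise2

    query, subject = pair
    # score_only=True returns just the best score, without alignments.
    return pairwise2.align.globalms(query, subject, 2, -1, -.5, -.1,
                                    score_only=True)


def score_all(queries, subjects, score_fn=pairwise2_score, workers=1,
              chunksize=1000):
    """Return {(query_index, subject_index): score} for every pair.

    Hypothetical helper: with workers <= 1 it runs serially, otherwise it
    hands the pairs to a pool of worker processes in chunks.
    """
    index_pairs = list(product(range(len(queries)), range(len(subjects))))
    seq_pairs = [(queries[i], subjects[j]) for i, j in index_pairs]
    if workers <= 1:
        scores = list(map(score_fn, seq_pairs))
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            scores = list(pool.map(score_fn, seq_pairs, chunksize=chunksize))
    return dict(zip(index_pairs, scores))
```

Something like score_all(queries, subjects, workers=8) would then be called
from under an `if __name__ == "__main__":` guard, which process pools need on
some platforms. This version builds the whole list of pairs in memory, so for
a million queries you would want to feed it in batches.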

Use a specialised tool, perhaps with heuristic matching, e.g. BLAST,
or EMBOSS needle or needleall:
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html
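
For the external-tool route, a minimal sketch of calling needleall from
Python might look like the following. It assumes EMBOSS is installed and
needleall is on the PATH; the function names and file names are hypothetical,
and the gap penalties shown are EMBOSS-style values, not a translation of the
pairwise2 scores in the question.

```python
import subprocess


def needleall_command(query_fasta, subject_fasta, out_path,
                      gapopen=10.0, gapextend=0.5):
    """Build the argument list for an EMBOSS needleall run."""
    return [
        "needleall",
        "-asequence", str(query_fasta),    # first set of sequences
        "-bsequence", str(subject_fasta),  # second set of sequences
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", str(out_path),
        "-auto",                           # do not prompt for missing options
    ]


def run_needleall(query_fasta, subject_fasta, out_path, **kwargs):
    """Run needleall and return the output path; raises if the tool fails."""
    cmd = needleall_command(query_fasta, subject_fasta, out_path, **kwargs)
    subprocess.run(cmd, check=True)
    return out_path
```

Keeping the command construction separate from the call makes it easy to
print and check the command line before launching a long run.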

> Note: I tried BLASTing from within Python, and although it works, for a
> large number of sequences it is only a matter of time before a BLAST
> output bug shows up and stalls the analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter


