[Biopython] Looking for a way to apply pairwise2 but really fast

Michiel de Hoon mjldehoon at yahoo.com
Sat Jul 13 01:31:50 UTC 2013


I have also noticed that Bio.pairwise2 is extremely slow. I am preparing an alternative to it, but it is not yet ready for inclusion in Biopython. See my branch here: https://github.com/mdehoon/biopython/blob/aligner/Bio/Align/algorithms.py

Are you primarily interested in the score of the best alignment, or do you need the best alignment itself?
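
To illustrate why the distinction matters: if only the score is needed, no traceback has to be stored, so the dynamic programming can keep just one row per state. The following is a minimal pure-Python sketch of a score-only global alignment with affine gaps, not the code from the branch above. The function name global_score and its defaults are hypothetical; the defaults copy the 2, -1, -.5, -.1 parameters from the globalms call quoted below, and a gap of length k is assumed to cost gap_open + (k - 1) * gap_extend. Bio.pairwise2 itself accepts score_only=True, which is likely the first thing to try.

```python
def global_score(a, b, match=2.0, mismatch=-1.0, gap_open=-0.5, gap_extend=-0.1):
    """Best global alignment score of a and b (score only, no traceback).

    Assumption: a gap of length k costs gap_open + (k - 1) * gap_extend,
    and end gaps are penalised like internal gaps.
    """
    neg_inf = float("-inf")
    n = len(b)
    # Three states per cell, one row kept for each:
    #   m[j]: alignment ends with a[i-1] paired with b[j-1]
    #   x[j]: alignment ends with a[i-1] paired with a gap
    #   y[j]: alignment ends with b[j-1] paired with a gap
    m = [0.0] + [neg_inf] * n
    x = [neg_inf] * (n + 1)
    y = [neg_inf] + [gap_open + (j - 1) * gap_extend for j in range(1, n + 1)]

    for i in range(1, len(a) + 1):
        new_m = [neg_inf] * (n + 1)
        new_x = [neg_inf] * (n + 1)
        new_y = [neg_inf] * (n + 1)
        # First column: a[:i] aligned entirely against gaps.
        new_x[0] = gap_open + (i - 1) * gap_extend
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Extend any state diagonally with a match or mismatch.
            new_m[j] = max(m[j - 1], x[j - 1], y[j - 1]) + s
            # Open a gap from the row above, or extend an existing one.
            new_x[j] = max(m[j] + gap_open, y[j] + gap_open, x[j] + gap_extend)
            # Open a gap from the cell to the left, or extend an existing one.
            new_y[j] = max(
                new_m[j - 1] + gap_open,
                new_x[j - 1] + gap_open,
                new_y[j - 1] + gap_extend,
            )
        m, x, y = new_m, new_x, new_y

    return max(m[n], x[n], y[n])
```

Memory use is proportional to len(b) rather than len(a) * len(b). The inner loop is still interpreted Python, so this sketch shows the idea but is not expected to be fast.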

Best,
-Michiel.



________________________________
From: Peter Cock <p.j.a.cock at googlemail.com>
To: Ivan Gregoretti <ivangreg at gmail.com> 
Cc: Biopython Mailing List <biopython at lists.open-bio.org> 
Sent: Friday, July 12, 2013 10:10 PM
Subject: Re: [Biopython] Looking for a way to apply pairwise2 but really fast
 

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads and/or a cluster, e.g. look at subprocessing
or simply do 300 parallel jobs, one for each subject.
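
As a rough sketch of the parallel approach, the work can be split with the standard library's concurrent.futures. Everything named here (toy_score, best_hit, best_hits, parallel_best_hits) is hypothetical and for illustration only. This version splits the work by query rather than by subject, and toy_score is a deliberately trivial placeholder that would be replaced by a real aligner call, for example pairwise2.align.globalms with score_only=True.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def toy_score(query, subject):
    """Placeholder scorer: count identical positions, no gaps allowed.

    Stand-in only; swap in a real alignment score in practice.
    """
    return sum(q == s for q, s in zip(query, subject))


def best_hit(query, subjects, score_fn=toy_score):
    """Return (subject_index, score) of the best-scoring subject for one query."""
    scores = [score_fn(query, s) for s in subjects]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def best_hits(queries, subjects, map_fn=map):
    """Score every query against all subjects.

    map_fn defaults to the built-in map (serial); a pool's map method
    can be passed in to distribute the queries over worker processes.
    """
    worker = partial(best_hit, subjects=subjects)
    return list(map_fn(worker, queries))


def parallel_best_hits(queries, subjects, workers=None, chunksize=256):
    """Same as best_hits, but spread over a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Larger chunks mean less inter-process overhead per query.
        return best_hits(queries, subjects, map_fn=partial(pool.map, chunksize=chunksize))
```

When run as a script, parallel_best_hits should be called from under an if __name__ == "__main__": guard so that worker processes can import the module safely. The speed-up is bounded by the number of cores, so for a million queries a faster scorer probably matters more than the pool.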

Or use a specialised tool, perhaps one with heuristic matching, e.g. BLAST,
or EMBOSS needle or needleall:
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter