[Biopython] pairwise sequence alignment programs in biopython

Wed Jul 11 11:18:44 UTC 2018

Dear John,

I don't know exactly what you need, but if you only need the score (and 
no alignment), or if *only one* alignment is enough for you, pairwise2 
offers the two parameters "score_only=True" and 
"one_alignment_only=True", respectively. Both should improve speed and 
reduce memory consumption.
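
As a minimal sketch of these two parameters (the sequences and the scoring 
values passed to globalms are made up for illustration, not taken from 
your tests):

```python
from Bio import pairwise2

# Toy sequences and scores, chosen only for illustration.
seq_a = "ACGTACGT"
seq_b = "ACGTACGT"

# globalms arguments: match, mismatch, gap open, gap extend.
# score_only=True returns just the best score; no alignments are built.
score = pairwise2.align.globalms(seq_a, seq_b, 2, -1, -3, -1, score_only=True)
print(score)

# one_alignment_only=True returns a list holding a single alignment.
alignments = pairwise2.align.globalms(
    seq_a, seq_b, 2, -1, -3, -1, one_alignment_only=True
)
print(pairwise2.format_alignment(*alignments[0]))
```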

In the new PairwiseAligner you can use the method "score" (instead of 
"align") if you are only interested in the score; this should also be 
faster and use less memory. Since the new PairwiseAligner only generates 
the alignments when you ask for them, there is no need to restrict the 
number of alignments.
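
A rough sketch of both calls follows. The sequences and scores are again 
invented, and the attribute names (match_score, mismatch_score, 
open_gap_score, extend_gap_score) are an assumption about your Biopython 
version; they may be spelled differently in older releases.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
# Illustrative scoring values; attribute names assume a recent Biopython.
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -3
aligner.extend_gap_score = -1

seq_a = "ACGTACGT"
seq_b = "ACGTACGT"

# score() computes only the best score, without building any alignment.
best_score = aligner.score(seq_a, seq_b)
print(best_score)

# align() returns an iterable; alignments are generated as you step through it,
# so taking just the first one does not build all the others.
alignments = aligner.align(seq_a, seq_b)
first = next(iter(alignments))
print(first.score)
```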

And, as written by Michiel, if the gap penalties and match/mismatch 
scores are the same, then the results of pairwise2 and PairwiseAligner 
should be identical.
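
If you want to check this on your own data, something along these lines 
should do. The sequences and scoring values are placeholders, and the 
PairwiseAligner attribute names are an assumption about your Biopython 
version.

```python
from Bio import Align, pairwise2

# Placeholder scoring scheme, used identically for both implementations.
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 2, -1, -3, -1

seq_a = "GATTACAGATTACA"
seq_b = "GATCACAGTTACA"

# pairwise2: global alignment, score only.
score_pw2 = pairwise2.align.globalms(
    seq_a, seq_b, MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND, score_only=True
)

# PairwiseAligner: same scores and gap penalties.
aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = MATCH
aligner.mismatch_score = MISMATCH
aligner.open_gap_score = GAP_OPEN
aligner.extend_gap_score = GAP_EXTEND
score_new = aligner.score(seq_a, seq_b)

# With identical parameters the two scores are expected to agree.
print(score_pw2, score_new)
```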

Still wondering about the 2500 limit for pairwise2 on your machine... It 
would be nice if you could give me an exact example: the sequences (you 
can refer to the sequences in your GitHub branch) and the line of code 
calling pairwise2. I have some low-memory computers at home and would 
like to investigate this.

Best,
Markus

On 11.07.2018 at 10:47, John Berrisford wrote:
> Dear Markus and Peter
>
> I'm writing a program that will be run on lots of different machines, over whose spec (OS, RAM etc.) I will have no control.
> My test machine is an 8GB 64-bit Windows 10 laptop.
>
> My tests are a work in progress on GitHub
> https://github.com/berrisfordjohn/adding_stats_to_mmcif/blob/master/tests/test_seq_align.py
>
> All I'm doing is taking a long sequence and aligning varying lengths of it against itself, i.e. take a 5500-residue sequence and then align the first 2000 residues against the first 2000 residues.
> In my tests on my machine 2000 residues is OK with pairwise2, but 2500 residues fails. As this appears to be machine-specific, your results may vary.
>
> However, I am pleased to report that PairwiseAligner is working with large sequences, and I'm glad to hear that its alignment results are similar to those of pairwise2. The next check is to ensure that the alignments are what I expect.
>
> Thanks
>
> John
>
> -----Original Message-----
> From: Peter Cock <p.j.a.cock at googlemail.com>
> Sent: 11 July 2018 08:52
> To: John Berrisford <jmb at ebi.ac.uk>
> Cc: Biopython Mailing List <biopython at mailman.open-bio.org>
> Subject: Re: [Biopython] pairwise sequence alignment programs in biopython
>
> To clarify on length of sequences, I had forgotten the details, see:
>
> https://github.com/biopython/biopython/pull/1655#issuecomment-390180240
>
> If you just want the alignment lengths, the new Align.PairwiseAligner wins; if you want the alignments themselves, then pairwise2 wins.
>
> On the other hand, with random sequences of 5000bp, Michiel reported his new Align.PairwiseAligner was faster.
>
> How much memory (RAM) do you have, and are you using a 32-bit operating system? It is likely memory limits that are stopping you from aligning sequences longer than about 2000 residues.
>
> Peter
>
> On Wed, Jul 11, 2018 at 12:12 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> Hi John,
>>
>> The Align.PairwiseAligner code is new in Biopython 1.72, and better
>> support for longer sequences was one of the improvements.
>>
>> You would probably find it useful to read over the pull request:
>> https://github.com/biopython/biopython/pull/1655
>>
>>
>> Peter
>>
>> On Tue, Jul 10, 2018 at 7:51 PM, John Berrisford <jmb at ebi.ac.uk> wrote:
>>> Hi
>>>
>>>
>>>
>>> I’m looking at performing pairwise alignments of polymer sequences in
>>> biopython.
>>>
>>> These will be protein or nucleotide sequences. They may include
>>> non-standard residues which will be denoted as X.
>>>
>>> The sequences will be of varying length from around 20 residues up to
>>> several thousand residues – put simply the range of sequences in the PDB.
>>>
>>>
>>>
>>> I’m looking for the best tool to use to do this in biopython
>>>
>>>
>>>
>>> So far I have performed tests with pairwise2 and Align.PairwiseAligner.
>>>
>>>  From my tests it seems that pairwise2 has a limit of ~2000 residues – i.e.
>>> if I give it a sequence of 2500 residues to compare against itself it
>>> crashes. PairwiseAligner seems to be able to handle much longer
>>> sequences without issue.
>>>
>>>
>>>
>>> I need to be able to set gap penalties – which is possible in both of
>>> these programs.
>>>
>>>
>>>
>>> So my questions are:
>>>
>>> Are these the only options in biopython? – I would prefer a python
>>> implementation rather than something that requires external compilation, e.g.
>>> EMBOSS Needle
>>>
>>> Are these the best options?
>>>
>>> Are they both maintained / stable?
>>>
>>> Are they comparable in their results?
>>>
>>> Is the limitation in sequence length in pairwise2 a known issue? A
>>> quick google search suggests most people use pairwise2, which is
>>> strange given its sequence length limitation.
>>>
>>>
>>>
>>> Thank you
>>>
>>>
>>>
>>> John
>>>
>>>
>>>
>>> --
>>>
>>> John Berrisford
>>>
>>> PDBe
>>>
>>> European Bioinformatics Institute (EMBL-EBI)
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Wellcome Genome Campus
>>>
>>> Hinxton
>>>
>>> Cambridge CB10 1SD UK
>>>
>>> Tel: +44 1223 492529
>>>
>>>
>>>
>>> https://www.pdbe.org
>>>
>>> https://www.facebook.com/proteindatabank
>>>
>>> https://twitter.com/PDBeurope
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>>> http://mailman.open-bio.org/mailman/listinfo/biopython

-- 
_________________________________
Dr. Markus Piotrowski
Privatdozent/Akademischer Rat
Lehrstuhl für Molekulargenetik und Physiologie der Pflanzen
ND 3/49
Universitätsstr. 150
44801 Bochum

Tel. xx49-(0)234-3224290
Fax. xx49-(0)234-3214187

http://www.ruhr-uni-bochum.de/pflaphy/Seiten_dt/PG_Piotrowski_d.html
http://homepage.ruhr-uni-bochum.de/Markus.Piotrowski/Index.html