[Biopython] Biopython function for operating multiple homologous sequences in a single file

Thu Jan 23 09:44:35 UTC 2014

On Tue, Jan 21, 2014 at 11:42 AM, Edson Ishengoma
<ishengomae at nm-aist.ac.tz> wrote:
> Hi all,
>
> I have a single large file containing many (thousands) coding sequence
> pairs according to their homologs as so:
>
>> >ENSBTAT00000048342_species1
>> sequences
>> >ENSBTAT00000048342_species2
>> sequences
>> >ENSBTAT00000009085_species1
>> sequences
>> >ENSBTAT00000009085_species2
>> sequences
>> >ENSBTAT00000009212_species1
>> sequences
>> >ENSBTAT00000009212_species2
>> sequences
>> ......
>> ......
>> ......
>>
>
> Now I want to produce a clustalw alignment for each cds pair.

Why do you want to do that?

A pairwise alignment tool might be better, such as EMBOSS needle or
water, depending on whether you want a global (full sequence) or local
(partial sequence) alignment. In particular, look at needleall, which
is for many-against-many pairwise alignments:
http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html
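
As a rough illustration of the global versus local choice, here is a
minimal Python sketch that builds an EMBOSS command line for one pair.
It assumes needle and water are installed and on the PATH. The helper
names emboss_pairwise_cmd and run_emboss_pairwise are made up for this
example, and the gap penalties shown are simply the usual EMBOSS
defaults, not a recommendation.

```python
import subprocess


def emboss_pairwise_cmd(seq_a, seq_b, outfile, mode="global",
                        gapopen=10.0, gapextend=0.5):
    """Build the argument list for EMBOSS needle (global) or water (local).

    Hypothetical helper: seq_a and seq_b are paths to single-sequence
    FASTA files, outfile is where EMBOSS should write the alignment.
    """
    tools = {"global": "needle", "local": "water"}
    if mode not in tools:
        raise ValueError("mode must be 'global' or 'local'")
    return [
        tools[mode],
        "-asequence", str(seq_a),
        "-bsequence", str(seq_b),
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", str(outfile),
    ]


def run_emboss_pairwise(seq_a, seq_b, outfile, mode="global"):
    """Run the alignment; raises CalledProcessError if the tool fails."""
    subprocess.run(emboss_pairwise_cmd(seq_a, seq_b, outfile, mode),
                   check=True)
```

For example, run_emboss_pairwise("gene_sp1.fasta", "gene_sp2.fasta",
"gene.needle") would request a global alignment of the two files (the
file names here are placeholders).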

> Is there a
> way to use the biopython commandline function for clustalw to treat each
> gene pair separately for all pairs, run alignment and produce an output
> (alignments + trees file)?

If you really want to run lots of pairwise alignments with clustalw, you
would need a big loop over all the pairs, calling clustalw again and
again (once for each pair). I would think something like needleall
would be better.
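
The loop described above could be sketched in Python roughly as
follows. This is only an outline under some assumptions: record names
look like GENEID_speciesN as in your example, so the gene ID is taken
to be everything before the last underscore; the executable is called
clustalw2 (adjust for your installation); and the function names are
invented for illustration. It uses plain Python and subprocess rather
than any particular Biopython wrapper.

```python
import subprocess
from pathlib import Path


def read_fasta(handle):
    """Yield (name, sequence) tuples from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_by_gene(records):
    """Group records by the text before the last underscore in the name."""
    groups = {}
    for name, seq in records:
        gene_id = name.rsplit("_", 1)[0]
        groups.setdefault(gene_id, []).append((name, seq))
    return groups


def align_pairs(fasta_path, out_dir, clustalw_exe="clustalw2"):
    """Write one FASTA file per gene pair and call clustalw once per pair."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(fasta_path) as handle:
        groups = group_by_gene(read_fasta(handle))
    alignments = []
    for gene_id, members in groups.items():
        if len(members) != 2:
            continue  # skip anything that is not a clean pair
        pair_file = out_dir / (gene_id + ".fasta")
        aln_file = out_dir / (gene_id + ".aln")
        with open(pair_file, "w") as out:
            for name, seq in members:
                out.write(">%s\n%s\n" % (name, seq))
        subprocess.run(
            [clustalw_exe, "-INFILE=%s" % pair_file, "-ALIGN",
             "-OUTFILE=%s" % aln_file],
            check=True,
        )
        alignments.append(aln_file)
    return alignments
```

Calling align_pairs("all_pairs.fasta", "alignments") would then leave
one .aln file per gene in the output directory (both names are
placeholders). With thousands of pairs this starts one clustalw process
per pair, which is part of why a dedicated pairwise tool may be the
better fit.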

Also, you shouldn't use the guide tree from clustalw for any serious
analysis, and anyway if you are doing pairwise alignments the trees
will always be trivial, since each contains only two sequences.

Regards,

Peter