[Biopython-dev] Multiple alignment - Clustalw etc...

Tue Mar 31 07:25:27 EDT 2009

Hi Peter,

2009/3/30 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org> wrote:
> >
> > Hi Folks,
> >
> > I've been trying to formalize a bunch of randomly scattered bits of code
> to
> > support the use of the alignment programme Muscle
> > (http://www.drive5.com/muscle/). I prefer to use this software in
> preference
> > to Clustalw - subjectively, it seems to give the most accurate
> alignments.
> > (Whether Biopython would want to support a second alignment programme
> > /external dependency is another matter...)
>
> A wrapper for MUSCLE wouldn't hurt - although there is scope for some
> rearrangement of our command line tool wrappers rather than adding more
> and more top level modules.  Maybe under Bio.Align, and move the Clustalw
> wrapper there too.

Agreed - it would seem more appropriate to have the alignment interfaces in
Bio.Align.

> > Anyway, while doing so, I realised just how awkward the current interface
> to
> > Clustalw is, which doesn't fit the SeqIO/AlignIO paradigm well.
>
> What I typically do fits pretty nicely with the SeqIO/AlignIO paradigm:
> (1) use SeqIO to prepare the FASTA input file.
> (2) run the command line tool (e.g. MUSCLE).
> (3) use AlignIO (or SeqIO) to read the alignment output file.

Well, yes - we can always not use the biopython interface.

> Actually I think that Bio.Clustalw interface is now a bit out of place,
> as it hides some of this from you. (Note that Bio.Clustalw predates
> Bio.AlignIO, and that by working with handles Bio.AlignIO is fairly
> tool neutral).
>
> > Currently, if we have a bunch of SeqRecords, say after downloading from
> > GenBank or being pulled from a BioSQL db, we have to write them to disk
> > and call clustalw on the file:
> >
> >>>> from Bio import Clustalw
> >>>> from Bio.Clustalw import MultipleAlignCL
> >>>> cline = MultipleAlignCL("f002", command="clustalw")
> >>>> align = Clustalw.do_alignment(cline)
>
> Well yes. Typically for any alignment tool you'd have to write the
> unaligned records in FASTA format.  Some tools may let handle
> this via standard input, so you may be able to use a pipe instead
> of a file - but the issues are similar.
>
> > It seems to me more appropriate to be able to call clustalw directly on a
> > bunch of SeqRecords:
> >
> > eg (suggested implementation)
> >>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
> >>>> from Bio.Align import MultipleAlignment
> >>>> align = MultipleAlignment(records, executable="clustalw")
>
> i.e. Have a Biopython wrapper use a temp file to record the
> given records to in a format appropriate for the command line
> tool selected, and capturing the output?  In the case of
> ClustalW or MUSCLE this means making a temp FASTA input
> file.  For ClustalW we'd then have to open the output file, read
> it, and then delete it.

Yes, that's what I'm suggesting.

Here's my reasoning: it seems to me the input and output formats of the data
required by a particular alignment tool are incidental and should be hidden
from the user. At present the Clustalw interface forces you to write a fasta
formatted file of your records to disk, and then has Clustalw write an
aligned matrix to disk in a format specified by the user. If the latter is
Clustal format, then the record is parsed and an alignment object is
returned, else None is returned. In either case, an output file(s) remains
on disk.

So, say we have a bunch of sequences in pir format and we'd like them
aligned and saved in stockholm format:

from Bio import SeqIO
from Bio import AlignIO
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
records = SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir")
AlignIO.write([records], open("temp.fasta", "w"), "fasta")
cline = MultipleAlignCL("temp.fasta", command="clustalw")
align = Clustalw.do_alignment(cline)
AlignIO.write([align], open("temp.sth", "w"), "stockholm")

we end up with 4 output files on disk: temp.aln,  temp.dnd,  temp.fasta,
temp.sth - 3 of which are incidental.

(BTW, using the above procedure on the files "B_nuc.pir" and "Cw_prot.pir"
in Tests/NBRF hangs on RH and Ubuntu linux: it seems to be waiting for the
subprocess to return, which it never does: pid, sts = os.waitpid(self.pid,
0))

As I say, I'd like to see this:
>>> from Bio.Align import MultipleAlignment
>>> records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir"))
>>> align = MultipleAlignment(records, executable="clustalw")
>>> AlignIO.write([align], open("temp.sth", "w"), "stockholm")

ie resulting in one file temp.sth, which we've explicitly written to disk.

>  For other tools we may be able to just
> capture its output on stdout and not have to clean up a temp
> output file.
>
> All the possible command line tools have their own arguments,
> range of file formats, behaviour with respect to default filenames
> etc.  Trying to capture all this in a single wrapper seems rather
> ambitious.  For example, how would you handle gap penalties?
> Keep in mind that different tools may use the same name for
> a gap extension penalty but interpret the values differently.

Sorry, I wasn't very clear about what I intended:

MultipleAlignment(records, executable="clustalw", <keyword args>)
returns Clustalw.do_alignment(records, <keyword args>)
and
MultipleAlignment(records, executable="muscle", <keyword args>)
returns Muscle.do_alignments(records, <keyword args>)

I'm not suggesting unifying all programme options into a single interface,
just wrap the individual alignment tool modules in a common call,
MulitpleAlignment(), align_records(), or whatever...

As for the keyword options, at present the Clustalw interface supports the
manual setting of some attributes to the MultipleAlignCL instance, but there
is no type or value checking. I think as many options as possible should be
supported through keyword arguments - tedious, but doable.

Also, while I can see this might be nice for short alignments
> (which are quick to run), its rather implicit or magic.

Not sure what you mean here? Why would the size of alignment matter? And as
for it being magic, its seems to me it does, and only does, what it says on
the label - aligns the data.

>  I personally
> prefer to have to deal with the files explicitly myself - but then I
> have been dealing with large alignments which I want to keep
> on disk.

I tend to build many (small - <100 taxa) single gene alignments - in one
use-case, 280 of them...

> Secondly, the biopython interface does not support calling
> > Clustalw to perform profile alignments,
> >
> > (suggested implementation)
> > # The scaffold alignment:
> >>>> align = AlignIO.read(open("blah.nex", "r"), "nexus")
> > # The sequences we want to add to it:
> >>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
> >>>> from Bio.Align import ProfileAlignment
> >>>> align = ProfileAlignment(align, records, executable="clustalw")
> >
> > Calls to MultipleAlignment and ProfileAlignment would take a
> > **options parameter to collect any additional command line options.
>

I'm very keen to see profile alignments supported - be it either in Clustalw
or Muscle, or both.

>
> > Thirdly, should an alignment object have a
> > Alignment.refine_alignment(executable="clustalw")
> > method?
> >
> > Any thoughts?
>
> I may have misunderstood you, but the ideas you've sketched out
> seem very very broad/ambitious - and actually take us further away
>
from the SeqIO/AlignIO interface by hiding all the filenames and
> handles from the user.  I think these should be kept explicit.

OK, well having had my say, I'm quite happy to write the Muscle module in
the style of the current Clustalw interface, or whatever style is most
appropriate for exposing the filename handles. But I'm not sure what that
would be - perhaps you could elaborate on this a bit...

Cheers, C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77