[Biopython-dev] Multiple alignment - Clustalw etc...
Peter
biopython at maubp.freeserve.co.uk
Mon Mar 30 14:37:18 UTC 2009
On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org> wrote:
>
> Hi Folks,
>
> I've been trying to formalize a bunch of randomly scattered bits of code to
> support the use of the alignment programme Muscle
> (http://www.drive5.com/muscle/). I prefer to use this software in preference
> to Clustalw - subjectively, it seems to give the most accurate alignments.
> (Whether Biopython would want to support a second alignment programme
> /external dependency is another matter...)
A wrapper for MUSCLE wouldn't hurt - although there is scope for some
rearrangement of our command line tool wrappers rather than adding more
and more top level modules. Maybe under Bio.Align, and move the Clustalw
wrapper there too.
> Anyway, while doing so, I realised just how awkward the current interface to
> Clustalw is, which doesn't fit the SeqIO/AlignIO paradigm well.
What I typically do fits pretty nicely with the SeqIO/AlignIO paradigm:
(1) use SeqIO to prepare the FASTA input file.
(2) run the command line tool (e.g. MUSCLE).
(3) use AlignIO (or SeqIO) to read the alignment output file.
Actually I think that Bio.Clustalw interface is now a bit out of place,
as it hides some of this from you. (Note that Bio.Clustalw predates
Bio.AlignIO, and that by working with handles Bio.AlignIO is fairly
tool neutral).
> Currently, if we have a bunch of SeqRecords, say after downloading from
> GenBank or being pulled from a BioSQL db, we have to write them to disk
> and call clustalw on the file:
>
>>>> from Bio import Clustalw
>>>> from Bio.Clustalw import MultipleAlignCL
>>>> cline = MultipleAlignCL("f002", command="clustalw")
>>>> align = Clustalw.do_alignment(cline)
Well yes. Typically for any alignment tool you'd have to write the
unaligned records in FASTA format. Some tools may let handle
this via standard input, so you may be able to use a pipe instead
of a file - but the issues are similar.
> It seems to me more appropriate to be able to call clustalw directly on a
> bunch of SeqRecords:
>
> eg (suggested implementation)
>>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
>>>> from Bio.Align import MultipleAlignment
>>>> align = MultipleAlignment(records, executable="clustalw")
i.e. Have a Biopython wrapper use a temp file to record the
given records to in a format appropriate for the command line
tool selected, and capturing the output? In the case of
ClustalW or MUSCLE this means making a temp FASTA input
file. For ClustalW we'd then have to open the output file, read
it, and then delete it. For other tools we may be able to just
capture its output on stdout and not have to clean up a temp
output file.
All the possible command line tools have their own arguments,
range of file formats, behaviour with respect to default filenames
etc. Trying to capture all this in a single wrapper seems rather
ambitious. For example, how would you handle gap penalties?
Keep in mind that different tools may use the same name for
a gap extension penalty but interpret the values differently.
Also, while I can see this might be nice for short alignments
(which are quick to run), its rather implicit or magic. I personally
prefer to have to deal with the files explicitly myself - but then I
have been dealing with large alignments which I want to keep
on disk.
> Secondly, the biopython interface does not support calling
> Clustalw to perform profile alignments,
>
> (suggested implementation)
> # The scaffold alignment:
>>>> align = AlignIO.read(open("blah.nex", "r"), "nexus")
> # The sequences we want to add to it:
>>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
>>>> from Bio.Align import ProfileAlignment
>>>> align = ProfileAlignment(align, records, executable="clustalw")
>
> Calls to MultipleAlignment and ProfileAlignment would take a
> **options parameter to collect any additional command line options.
>
> Thirdly, should an alignment object have a
> Alignment.refine_alignment(executable="clustalw")
> method?
>
> Any thoughts?
I may have misunderstood you, but the ideas you've sketched out
seem very very broad/ambitious - and actually take us further away
from the SeqIO/AlignIO interface by hiding all the filenames and
handles from the user. I think these should be kept explicit.
Peter
More information about the Biopython-dev
mailing list