[Biopython-dev] Multiple alignment - Clustalw etc...
Peter
biopython at maubp.freeserve.co.uk
Tue Mar 31 11:24:32 EDT 2009
On Tue, Mar 31, 2009 at 3:49 PM, Cymon Cox <cy at cymon.org> wrote:
>>>
>>> If the latter is Clustal format, then the record is parsed and an
>>> alignment object is returned, else None is returned. In either
>>> case, an output file(s) remains on disk.
>>
>> It should be a fairly simple enhancement to look at the arguments
>> to see if another output format we can parse was selected (e.g.
>> PHYLIP?) and also parse that. Do you think that would be a
>> sensible addition to Bio.Clustalw.do_alignment?
>
> No - I don't think there should be any output file (of any format) at all; an
> alignment object should always be returned and the user should explicitly write
> to the format they want using AlignIO. (But I think this becomes clearer below...)
Well there must be an output file, since ClustalW won't write its output
alignment to stdout. Of course, you would have a wrapper which
deletes the output file after it has been parsed into an Alignment object.
However, we shouldn't change the existing Bio.Clustalw.do_alignment
function to do this (or to delete the .dnd guide tree), since people may
be using the call for these "side effects".
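To make the idea of such a wrapper concrete, here is a rough sketch. It is not existing Biopython code: run_and_parse and its two callable arguments are names made up for illustration, and it only cleans up the alignment file (a real ClustalW wrapper would also have to decide what to do about the .dnd guide tree).

```python
import os
import subprocess
import tempfile


def run_and_parse(make_command, parse):
    """Run a tool that writes to a file, parse the file, then delete it.

    make_command: hypothetical callable taking the output path and
                  returning the argument list to execute.
    parse:        callable taking an open handle and returning the result,
                  e.g. lambda handle: AlignIO.read(handle, "clustal").
    """
    # Reserve a temporary filename for the tool to write its output to.
    fd, out_path = tempfile.mkstemp(suffix=".aln")
    os.close(fd)
    try:
        # Raises CalledProcessError if the tool exits with an error.
        subprocess.check_call(make_command(out_path))
        with open(out_path) as handle:
            return parse(handle)
    finally:
        # Remove the temporary output whether or not parsing worked.
        if os.path.exists(out_path):
            os.remove(out_path)
```

The point of keeping this separate from Bio.Clustalw.do_alignment is that anyone relying on the files left on disk is unaffected.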
>> It's never been
>> an issue for me as if you are using the Bio.Clustalw.do_alignment
>> interface you probably don't care about the output file format.
>
> Quite. (Unless you are trying to write to a format not supported by
> biopython e.g. GCG, GDE, of course.)
What I was saying was Bio.Clustalw.do_alignment knows the requested
output format, and if it is ClustalW it automatically parses the output file
and returns the alignment. Since this code was written, Bio.AlignIO was
added and could potentially be used to parse PHYLIP (etc) output from
the Clustalw tool. And one day maybe GCG etc too.
i.e. Right now Bio.Clustalw.do_alignment will return an alignment if it is in
ClustalW format, or None if it isn't. I'm suggesting Bio.Clustalw.do_alignment
could return an alignment when Bio.AlignIO can parse the requested file
format, or None if it can't.
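Roughly what I have in mind, as a sketch only: the table and function names below are invented for this email, the list of formats is deliberately short and should be checked against your ClustalW version, and I am assuming Clustal format is what you get when no output type is requested.

```python
# ClustalW output type -> Bio.AlignIO format name. Illustrative only:
# formats AlignIO cannot read (e.g. GCG, GDE) are simply left out.
ALIGNIO_FORMATS = {
    "CLUSTAL": "clustal",
    "PHYLIP": "phylip",
    "NEXUS": "nexus",
}


def alignio_format(requested=None):
    """Return the AlignIO format name for an output type, or None."""
    if requested is None:
        return "clustal"  # assumed default when no output type is given
    return ALIGNIO_FORMATS.get(requested.upper())


def parse_output(filename, requested=None):
    """Return an alignment if AlignIO can read the file, else None."""
    fmt = alignio_format(requested)
    if fmt is None:
        return None  # same behaviour as now for unparseable formats
    from Bio import AlignIO  # imported lazily; needs Biopython installed
    with open(filename) as handle:
        return AlignIO.read(handle, fmt)
```

So a request for GCG output would still give None, while PHYLIP would give an alignment.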
This would only be a small enhancement, and may not be worth bothering
with if we are thinking about deprecating Bio.Clustalw with a replacement
under Bio.Align.
>> Size of alignment influences the compute time, and therefore is an issue
>> for anyone doing things at the python prompt. Moreover, if the alignments
>> are big and slow, you generally want to make sure the output file is kept
>> on disk, as you'll probably want to read it more than once.
>
> Agreed, but should the call to align the data (ie to clustalw) be writing
> the output to disk or should the user be making an explicit call using
> AlignIO?
The command line tool ClustalW will itself write the output to disk. I don't
recall off hand, but other tools like Muscle may give the option of writing
to a file or to stdout. In either case, the tool writes to a handle, and the
user may want to *read* this handle using Bio.AlignIO.
If I want the tool's output to go straight to a file, I'd get the tool to do it.
The only reason I can see to be *writing* the alignment with Bio.AlignIO
would be for file conversion (or after manipulating the alignment), and that
would be done by the user's Python code.
If you are talking about the data preparation (i.e. the input file rather than
the output file), then I think it is up to the user's code to prepare a suitable
input FASTA file (e.g. from SeqRecord objects with Bio.SeqIO) before
calling the command line tool.
>>> And as for it being magic, it seems to me it does, and only does, what
>>> it says on the label - aligns the data.
>>
>> The magic is the behind the scenes creation/deletion of the input/output
>> files, and the conversion between file formats.
>
> Fair enough - then magic it be... :)
:)
>> > OK, well having had my say, I'm quite happy to write the Muscle module in
>> > the style of the current Clustalw interface, or whatever style is most
>> > appropriate for exposing the filename handles. But I'm not sure what that
>> > would be - perhaps you could elaborate on this a bit...
>>
>> I've elaborated, ...
>
> Thanks for your thoughts on this, it helps clarify some things...
Oh good. If you don't agree with any of that, do say so by the way.
>> So, I would suggest we think about adding new wrappers under Bio.Align
>> (e.g. Bio.Align.Clustalw, Bio.Align.Muscle, Bio.Align.TCoffee - or
>> perhaps all together in Bio.Align.Applications or something) based on
>> the Bio.Application module as used in Bio.EMBOSS. We could then
>> deprecate Bio.Clustalw, which should also help tidy up the top level
>> name space. Initially at least, I wouldn't include any clever wrapper
>> code at all.
>
> OK, I'll aim for this with the Muscle code...
That sounds good. Now can I tempt you into trying out github at the same
time, so we can see your proposed code evolve in public?
Could I add at this point that I don't think the wrapper should set any default
arguments - leave that up to the command line tool itself. Otherwise you can
get the situation where the Biopython defaults get out of sync with the tool's
own default values (an issue with our online qblast wrapper, since the NCBI
change their default settings over time).
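A minimal sketch of what "no defaults" could look like in practice. build_command is a hypothetical helper, not the Bio.Application API, and the -NAME=value style is borrowed from ClustalW (other tools differ). Anything the caller leaves unset never reaches the command line, so the tool's own defaults apply.

```python
def build_command(program, **options):
    """Build an argument list from explicitly supplied options only.

    Options left as None (or False) are omitted entirely, so the
    wrapper never imposes a default of its own.
    """
    args = [program]
    # Sorted only so the resulting command line is reproducible.
    for name, value in sorted(options.items()):
        if value is None or value is False:
            continue  # not set by the user: say nothing, the tool decides
        if value is True:
            args.append("-%s" % name.upper())  # bare switch, e.g. -ALIGN
        else:
            args.append("-%s=%s" % (name.upper(), value))
    return args
```

For example, build_command("clustalw", infile="in.fasta", output=None) passes only the input file and leaves the output format to ClustalW.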
As an aside, I have used Muscle with Biopython thanks to its option for
strict Clustal output, which can be parsed by Bio.AlignIO fine. For this I
just generated my own command line on the fly, but I was only using a
couple of the command line arguments.
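Something along these lines, reconstructed as a sketch rather than the code I actually ran. The function names are made up, the options shown (-in, -out, -clwstrict) are from Muscle 3.x so check them against your version, and I am assuming Muscle writes the alignment to stdout when no output file is given.

```python
import subprocess
from io import StringIO


def muscle_command(in_file, out_file=None):
    """Argument list for Muscle with strict Clustal output."""
    cmd = ["muscle", "-in", in_file, "-clwstrict"]
    if out_file is not None:
        cmd += ["-out", out_file]  # otherwise the output goes to stdout
    return cmd


def align_with_muscle(in_file):
    """Run Muscle and parse the Clustal alignment from its stdout."""
    from Bio import AlignIO  # imported lazily; needs Biopython installed

    child = subprocess.Popen(muscle_command(in_file),
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    # communicate() drains both pipes, avoiding a deadlock on big output.
    stdout, stderr = child.communicate()
    if child.returncode != 0:
        raise RuntimeError("muscle failed: %s" % stderr)
    return AlignIO.read(StringIO(stdout), "clustal")
```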
Peter