[Biopython-dev] Multiple alignment - Clustalw etc...

Cymon Cox cy at cymon.org
Tue Mar 31 10:49:20 EDT 2009


Hi Peter,

2009/3/31 Peter <biopython at maubp.freeserve.co.uk>

> On Tue, Mar 31, 2009 at 12:25 PM, Cymon Cox <cy at cymon.org> wrote:#
>

> > At present the Clustalw interface forces you to write a fasta
> > formatted file of your records to disk, and then has Clustalw
> > write an aligned matrix to disk in a format specified by the user.
>
> The Clustalw tool only takes FASTA formatted input, so if you have
> a bunch of sequences in memory you are forced to convert them
> into FASTA format to use them as input.  The question is where
> does this conversion take place - explicitly by the user, or implicitly
> by a wrapper.


Agreed - that's the question...


> > If the latter is Clustal format, then the record is parsed and an
> alignment
>  > object is returned, else None is returned. In either case, an output
> file(s)
> > remains on disk.
>
> It should be a fairly simple enhancement to look at the arguments
> to see if another output format we can parse was selected, e.g.
> PHYLIP?) and also parse that.  Do you think that would be a
> sensible addition to Bio.Clustalw.do_alignment?


No - I dont think there should be any output file (of any format) at all, an
alignment object should always be returned and the user explicitly write to
format they want using AlignIO. (But I think this becomes clearer below...)


>  Its never been
> an issue for me as if you are using the Bio.Clustalw.do_alignment
> interface you probably don't care about the output file format.


Quite. (Unless you are trying to write to a format not supported by
biopython e.g. GCG, GDE, of course.)


> > So, say we have a bunch of sequences in pir format and we'd like them
> > aligned and saved in stockholm format:
> >
> > from Bio import SeqIO
> > from Bio import AlignIO
> > from Bio import Clustalw
> > from Bio.Clustalw import MultipleAlignCL
> > records = SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir")
> > AlignIO.write([records], open("temp.fasta", "w"), "fasta")
>
> The above line is wrong


Doh! Grrr...

Yeah, perhaps it should have raised an error - I'll follow this up elsewhere
- but even with the corrected line and explicitly opening and closing the
file handles, I still can get clustalw to align this file... (later...)

> we end up with 4 output files on disk: temp.aln,  temp.dnd,  temp.fasta,
> > temp.sth - 3 of which are incidental.
>
> Yes - but as the ClustalW doesn't read in PIR files, and doesn't output
> Stockholm files on its own, so this has to happen.  It's just a question
> of who does it (the user, or the wrapper code).


Yep...


> > As I say, I'd like to see this:
>  >>>> from Bio.Align import MultipleAlignment
> >>>> records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"),
> "pir"))
> >>>> align = MultipleAlignment(records, executable="clustalw")
> >>>> AlignIO.write([align], open("temp.sth", "w"), "stockholm")
> >
> > ie resulting in one file temp.sth, which we've explicitly written to
> disk.
>
> So you'd like the wrapper to take care of creating and deleting the
> temp input FASTA file, and also deleting the temp output ClustalW
> file after parsing it.  This can probably be done quite cleanly using
> python's NamedTemporaryFile object.
>

Yep.


> >> Also, while I can see this might be nice for short alignments
>  >> (which are quick to run), its rather implicit or magic.
> >
> > Not sure what you mean here? Why would the size of alignment matter?
>
> Size of alignment influences the compute time, and therefore is an issue
> for
> anyone doing things at the python prompt.  Moreover, if the alignments are
> big and slow, you generally want to make sure the output file is kept on
> disk,
> as you'll probably want to read it more than once.


Agreed, but should the call to align the data (ie to clustalw) be writing
the output to disk or should the user be making an explicit call using
AlignIO?


> > And as for it being magic, its seems to me it does, and only does, what
> > it says on the label - aligns the data.
>
> The magic is the behind the scenes creation/deletion of the input/output
> files, and the conversion between file formats.


Fair enough - then magic it be... :)


> > OK, well having had my say, I'm quite happy to write the Muscle module in
> > the style of the current Clustalw interface, or whatever style is most
> > appropriate for exposing the filename handles. But I'm not sure what that
> > would be - perhaps you could elaborate on this a bit...
>
> I've elaborated, perhaps too much? ;)
>
> Basically you seem to be thinking about a high level abstraction for
> multiple alignment tools (dependent on the Bio.SeqIO and Bio.AlignIO
> module), while I am more focused on the low level abstraction for
> wrapping any command line tool.  This isn't to say we can't have both,
> but to me it makes sense to start with the low level stuff first.
>
> We (unfortunately) have several styles of command line tool wrappers
> in Biopython already - this is a wart that has been on my mental to do
> list for some time.  I think we should focus on dealing with command
> line strings, and keep this separate from how the tools are invoked
> (e.g. subprocess or os.system), preparation of input files, and how
> any output is parsed.  As long as this core is in place, more advanced
> wrappers are possible on top of this basic infrastructure (Tiago may
> have some comments here from his Bio.PopGen work).
>
> Essentially all our command line wrappers start by building a command
> line string.  In some cases this command line string is exposed to the
> user (e.g. Bio.EMBOSS), and they can choose how they want to invoke
> it.  For example, they can explicitly opt to use the Python subprocess
> module and pipes if they want to - or use a standard invocation from
> Bio.Applications (we may want to add a couple of variations to this
> module).
>
> Other wrappers (e.g. Bio.Blast.NCBIStandalone) instead call the tool
> for you. In the case of Bio.Blast.NCBIStandalone, if you don't want
> the handles because you've told Blast to save its output to a file,
> our wrapper still returns the standard output and standard error
> handles - it is forced on you (see Bug 2654).   Also, there is no easy
> way to see what the actual command line string was, which can make
> debugging hard, and also prevents certain things (e.g. submitting the
> command line as a task to a cluster of workstations).  At least
> Bio.Clustalw offers a command line string object (MultipleAlignCL),
> its just the do_alignment helper function I'm not so keen on.
>
> The Bio.Clustalw.do_alignment wrapper is rather unusual in that it
> automatically parses the output - while most of our wrappers don't.
> Decoupling the parsing is more modular - it makes it easy for the user
> to use any parser for the output from a command line tool (either
> using stdout, or by reading an output file).  I like this, and it fits
> with the handle based approach in most of our parsers.


Thanks for your thoughts on this, it helps clarify some things...


> So, I would suggest we think about adding new wrappers under Bio.Align
> (e.g. Bio.Align.Clustalw, Bio.Align.Muscle, Bio.Align.TCoffee - or
> perhaps all together in Bio.Align.Applications or something) based on
> the Bio.Application module as used in Bio.EMBOSS.  We could then
> deprecate Bio.Clustalw, which should also help tidy up the top level
> name space.  Initially at least, I wouldn't include any clever wrapper
> code at all.


OK, I'll aim for this with the Muscle code...

Cheers, C.
-- 
____________________________________________________________________

Cymon J. Cox

Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal

Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77


More information about the Biopython-dev mailing list