[Biopython-dev] Multiple alignment - Clustalw etc...

Brad Chapman chapmanb at 50mail.com
Mon Mar 30 21:00:07 UTC 2009


Cymon;
I wrote a bunch of the Clustalw stuff a long while ago, and it
sounds like Peter has a good handle on integrating it with AlignIO
so I will leave that to him.

On the choosing aligners side of things, have you tried MAFFT?

http://align.bmr.kyushu-u.ac.jp/mafft/software/

It's updated regularly and seems to have good buzz in the community.
I haven't had to do lots of multiple alignments recently, but it's
worked well for the few I've done.

Having support for multiple aligners is good stuff; I second Peter's
suggestion of having these live under Bio.Align.

Brad

> On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org> wrote:
> >
> > Hi Folks,
> >
> > I've been trying to formalize a bunch of randomly scattered bits of code to
> > support the use of the alignment programme Muscle
> > (http://www.drive5.com/muscle/). I prefer to use this software in preference
> > to Clustalw - subjectively, it seems to give the most accurate alignments.
> > (Whether Biopython would want to support a second alignment programme
> > /external dependency is another matter...)
> 
> A wrapper for MUSCLE wouldn't hurt - although there is scope for some
> rearrangement of our command line tool wrappers rather than adding more
> and more top level modules.  Maybe under Bio.Align, and move the Clustalw
> wrapper there too.
> 
> > Anyway, while doing so, I realised just how awkward the current interface to
> > Clustalw is, which doesn't fit the SeqIO/AlignIO paradigm well.
> 
> What I typically do fits pretty nicely with the SeqIO/AlignIO paradigm:
> (1) use SeqIO to prepare the FASTA input file.
> (2) run the command line tool (e.g. MUSCLE).
> (3) use AlignIO (or SeqIO) to read the alignment output file.
> 
> Actually I think that Bio.Clustalw interface is now a bit out of place,
> as it hides some of this from you.  (Note that Bio.Clustalw predates
> Bio.AlignIO, and that by working with handles Bio.AlignIO is fairly
> tool neutral).
> 
> > Currently, if we have a bunch of SeqRecords, say after downloading from
> > GenBank or being pulled from a BioSQL db, we have to write them to disk
> > and call clustalw on the file:
> >
> >>>> from Bio import Clustalw
> >>>> from Bio.Clustalw import MultipleAlignCL
> >>>> cline = MultipleAlignCL("f002", command="clustalw")
> >>>> align = Clustalw.do_alignment(cline)
> 
> Well yes. Typically for any alignment tool you'd have to write the
> unaligned records in FASTA format.  Some tools may let handle
> this via standard input, so you may be able to use a pipe instead
> of a file - but the issues are similar.
> 
> > It seems to me more appropriate to be able to call clustalw directly on a
> > bunch of SeqRecords:
> >
> > eg (suggested implementation)
> >>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
> >>>> from Bio.Align import MultipleAlignment
> >>>> align = MultipleAlignment(records, executable="clustalw")
> 
> i.e. Have a Biopython wrapper use a temp file to record the
> given records to in a format appropriate for the command line
> tool selected, and capturing the output?  In the case of
> ClustalW or MUSCLE this means making a temp FASTA input
> file.  For ClustalW we'd then have to open the output file, read
> it, and then delete it.  For other tools we may be able to just
> capture its output on stdout and not have to clean up a temp
> output file.
> 
> All the possible command line tools have their own arguments,
> range of file formats, behaviour with respect to default filenames
> etc.  Trying to capture all this in a single wrapper seems rather
> ambitious.  For example, how would you handle gap penalties?
> Keep in mind that different tools may use the same name for
> a gap extension penalty but interpret the values differently.
> 
> Also, while I can see this might be nice for short alignments
> (which are quick to run), its rather implicit or magic.  I personally
> prefer to have to deal with the files explicitly myself - but then I
> have been dealing with large alignments which I want to keep
> on disk.
> 
> > Secondly, the biopython interface does not support calling
> > Clustalw to perform profile alignments,
> >
> > (suggested implementation)
> > # The scaffold alignment:
> >>>> align = AlignIO.read(open("blah.nex", "r"), "nexus")
> > # The sequences we want to add to it:
> >>>> records = list(SeqIO.parse(open("f002", "r"), "fasta"))
> >>>> from Bio.Align import ProfileAlignment
> >>>> align = ProfileAlignment(align, records, executable="clustalw")
> >
> > Calls to MultipleAlignment and ProfileAlignment would take a
> > **options parameter to collect any additional command line options.
> >
> > Thirdly, should an alignment object have a
> > Alignment.refine_alignment(executable="clustalw")
> > method?
> >
> > Any thoughts?
> 
> I may have misunderstood you, but the ideas you've sketched out
> seem very very broad/ambitious - and actually take us further away
> from the SeqIO/AlignIO interface by hiding all the filenames and
> handles from the user.  I think these should be kept explicit.
> 
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev



More information about the Biopython-dev mailing list