[Biopython-dev] Clustal alignment format header line

Peter biopython at maubp.freeserve.co.uk
Tue May 12 15:28:35 UTC 2009


On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> Both Muscle (-clw) and Probcons (-clustalw) output a programme-specific
> header line for the clustal format alignment:
>
> "MUSCLE (3.7) multiple sequence alignment
>
>
> AK1H_ECOLI/1-378      CPDSINAALICRGEKMSIAIMAGVLEAR etc"
>
> "PROBCONS version 1.12 multiple sequence alignment
>
> AK1H_ECOLI/1-378    CPDSINAALICRGEKMSIAIMA
>
> "
>
> Bio.AlignIO will not read these alignments, because of this check at
> Bio/AlignIO/ClustalIO.py:94
>     if line[:7] != 'CLUSTAL':
>         raise ValueError("Did not find CLUSTAL header")
>
> Muscle does have a -clwstrict flag but ProbCons doesn't.
>
> Would it be a good idea to relax the header parsing?
>
> C.

Maybe.  Up until now the only example of this I had personally come
across was MUSCLE, but they helpfully provide the -clwstrict argument
so the issue wasn't important.

There are also of course the official variants like:

CLUSTAL W (1.81) multiple sequence alignment
CLUSTAL 2.0.9 multiple sequence alignment

How would you code this?  A flexible option would be to accept any file
whose first line ends with "multiple sequence alignment", but this risks
letting a lot of non-Clustal files through.  Those would then (hopefully)
fail later on, but probably with a much more cryptic error message.  A
whitelist of known variants like "MUSCLE" and "PROBCONS" would be
safest.
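
A minimal sketch of the whitelist idea, for illustration only.  The names
check_clustal_header and KNOWN_HEADER_PREFIXES are hypothetical and are not
existing Biopython code; the only assumption is that the check is done on
the first line of the file, as in the ClustalIO.py snippet quoted above.

```python
# Hypothetical helper, not part of Biopython: accept the official CLUSTAL
# header plus a short whitelist of tools known to write Clustal-like output.
KNOWN_HEADER_PREFIXES = ("CLUSTAL", "MUSCLE", "PROBCONS")


def check_clustal_header(line):
    """Return the matching program prefix, or raise ValueError."""
    stripped = line.strip()
    for prefix in KNOWN_HEADER_PREFIXES:
        if stripped.startswith(prefix):
            return prefix
    # Name the accepted variants so the failure is not cryptic.
    raise ValueError(
        "Did not find a known header (expected one of %s), got %r"
        % (", ".join(KNOWN_HEADER_PREFIXES), stripped[:40])
    )
```

With this, "MUSCLE (3.7) multiple sequence alignment" and "PROBCONS version
1.12 multiple sequence alignment" would be accepted alongside the official
CLUSTAL variants, while a FASTA file (first line starting with ">") would
still be rejected straight away with a clear message.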

Also I have a vague memory of some tool using something like "CLUSTAL
... from ToolX" but I don't recall the details.

Peter



