[Biopython-dev] Clustal alignment format header line
Peter
biopython at maubp.freeserve.co.uk
Tue May 12 15:28:35 UTC 2009
On Tue, May 12, 2009 at 12:07 PM, Cymon Cox <cy at cymon.org> wrote:
> Both Muscle (-clw) and Probcons (-clustalw) output a programme-specific
> header line for Clustal-format alignments:
>
> "MUSCLE (3.7) multiple sequence alignment
>
>
> AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMAGVLEAR etc"
>
> "PROBCONS version 1.12 multiple sequence alignment
>
> AK1H_ECOLI/1-378 CPDSINAALICRGEKMSIAIMA
>
> "
>
> Bio.AlignIO will not read these alignments
> Bio/AlignIO/ClustalIO.py:94
>     if line[:7] != 'CLUSTAL':
>         raise ValueError("Did not find CLUSTAL header")
>
> Muscle does have a -clwstrict flag but ProbCons doesn't.
>
> Would it be a good idea to relax the header parsing?
>
> C.
Maybe. Up until now the only example of this I had personally come
across was MUSCLE, but they helpfully provide the -clwstrict argument
so the issue wasn't important.
There are also of course the official variants like:
CLUSTAL W (1.81) multiple sequence alignment
CLUSTAL 2.0.9 multiple sequence alignment
How would you code this? A flexible option would be to accept any file
whose first line ends with "multiple sequence alignment", but this
risks letting through a lot of non-Clustal files, which would then
(hopefully) fail, but probably with a much more cryptic error message.
A white list of safe variants like "MUSCLE" and "PROBCONS" would be
safest.
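
As a rough sketch of the white list idea (not the actual ClustalIO
code): the helper name check_clustal_header and the constant
KNOWN_HEADER_PREFIXES are made up for illustration, and the list only
contains the tool names mentioned in this thread.

```python
# Hypothetical white list of header prefixes; only the names discussed
# in this thread. Not part of Bio.AlignIO.
KNOWN_HEADER_PREFIXES = ("CLUSTAL", "MUSCLE", "PROBCONS")


def check_clustal_header(line):
    """Return the stripped header line, or raise ValueError.

    Accepts the first line of a file if it starts with one of the
    white-listed tool names, otherwise fails early with a clear message.
    """
    header = line.strip()
    # str.startswith accepts a tuple of candidate prefixes.
    if not header.startswith(KNOWN_HEADER_PREFIXES):
        raise ValueError(
            "Did not find CLUSTAL, MUSCLE or PROBCONS header: %r" % header
        )
    return header
```

With this, "MUSCLE (3.7) multiple sequence alignment" and "PROBCONS
version 1.12 multiple sequence alignment" would be accepted along with
the official CLUSTAL variants, while anything else is rejected up front
rather than failing later with a cryptic message.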
Also I have a vague memory of some tool using something like "CLUSTAL
... from ToolX" but I don't recall the details.
Peter