[BioPython] Bio.SeqIO and Clustal aka Clustalw files

Peter biopython at maubp.freeserve.co.uk
Sun Feb 4 20:13:41 UTC 2007


Michiel De Hoon wrote:
>> Clustalw 1.83 will reject any file where the first 30 characters of the 
>> identifier are not unique (regardless of the file format).
>>
>> However, there is nothing in the clustal file format which prevents 
>> this.  For example, BioEdit 5.0.7 will happily read and write clustal 
>> format alignments with repeated entries.
>>
>> Should Bio.SeqIO also be tolerant like this?
> 
> Yes, I think so. Some users may want to write a file in the Clustal format to
> use it with some program other than Clustal.

I was hoping you would agree.  Done.

> Also, assuming that clustal gives a clear error message when the file
> contains longer identifiers, that should be sufficient to enable the
> user to fix the problem.

The clustal programs do seem to be able to read clustal files with 
identifiers longer than 30 characters (I tried a hand-made file with 
identifiers 55 characters long).  This is good.

Regardless of the input file format, if your input sequences have long 
identifiers, the clustal programs silently truncate them to 30 characters.

In addition, any colons in the identifier are silently converted into 
underscores on loading.
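
As a rough illustration of the two behaviours above, here is a minimal 
Python sketch.  The helper name clustalw_style_name is hypothetical (it 
is not part of Biopython or ClustalW), and it only mimics what I observed 
rather than reproducing ClustalW's actual code:

```python
# Assumption: 30 significant characters, as observed with ClustalW 1.83.
CLUSTALW_NAME_LIMIT = 30


def clustalw_style_name(identifier, limit=CLUSTALW_NAME_LIMIT):
    """Approximate the name ClustalW ends up using (hypothetical helper).

    Colons become underscores and the result is cut to `limit` characters.
    """
    return identifier.replace(":", "_")[:limit]


# Example with a made-up identifier longer than 30 characters.
print(clustalw_style_name("gi:12345:example_sequence_with_a_very_long_name"))
```

Since the colon replacement is one character for one character, it makes 
no difference whether it happens before or after the truncation.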

Both the command line ClustalW 1.83 and the GUI tool ClustalX 1.83 then 
give a very explicit error message if there are non-unique identifiers:

ERROR: Multiple sequences found with same name, XXXX (first 30 chars are 
significant)

(where XXXX is the repeated identifier).
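
If you want to spot such clashes before handing a file to ClustalW, 
something along these lines might help.  This is only a sketch: the 
function name find_name_clashes is made up for illustration, and it 
assumes that names are compared after the colon replacement and the 
30 character truncation described above:

```python
from collections import defaultdict


def find_name_clashes(identifiers, limit=30):
    """Group identifiers that collapse to the same shortened name.

    Hypothetical helper: returns a dict mapping each shortened name to the
    list of original identifiers, keeping only names shared by 2 or more.
    """
    groups = defaultdict(list)
    for identifier in identifiers:
        # Assumed normalisation: colons to underscores, then truncate.
        key = identifier.replace(":", "_")[:limit]
        groups[key].append(identifier)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up identifiers: the first two differ only after character 30.
ids = ["A" * 30 + "_first", "A" * 30 + "_second", "short_name"]
for name, originals in find_name_clashes(ids).items():
    print("%s <- %s" % (name, ", ".join(originals)))
```

An empty result means none of the identifiers would clash under these 
assumptions.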

Peter



