[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Thu Aug 17 20:09:20 UTC 2006

Marc Colosimo wrote:
>> Nice quick work on that. For Clustal, I think it should NOT be an  
>> Iterator, but there should be SequenceDict or SequenceList for it.  
>> There are other alignment filetypes out there that could use a  
>> SequenceIterator (those that are not interlaced).  From looking over  
>> your code, it seems like it would be easy to add a check in  
>> File2SequenceDict/List to check for Clustal types and do something  
>> "special"

Peter (BioPython Dev) wrote:
> Yes, I was wondering about that too.
> 
> For interlaced file formats (such as clustalw, NEXUS multiple alignment 
> format) we have to load the whole file into memory anyway - so using a 
> SequenceIterator was a bit odd.
> 
> What I was trying to do was use a SequenceIterator as the lowest common 
> denominator - the ClustalIterator shows that this can be done for 
> interlaced files, and seems to work.

There are two and a half examples done this way now...
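
To make the idea concrete, here is a minimal sketch of an iterator over
an interlaced file. It is not the code attached to the bug: the function
name interleaved_iterator and the simplified Clustal-like layout (name,
whitespace, sequence chunk; consensus lines starting with whitespace)
are assumptions for illustration only. The whole file is read into
memory first, and only then are records handed out one at a time.

```python
from io import StringIO


def interleaved_iterator(handle):
    """Yield (name, sequence) pairs from a simplified interleaved alignment.

    Hypothetical helper: every block must be read before any sequence is
    complete, so everything is collected in memory before yielding.
    """
    order = []   # names in the order they first appear
    chunks = {}  # name -> list of sequence fragments, one per block
    for line in handle:
        if not line.strip():
            continue  # blank line between blocks
        if line.startswith("CLUSTAL") or line[0].isspace():
            continue  # header line, or consensus line under a block
        parts = line.split()
        if len(parts) < 2:
            continue  # not a "name fragment" line
        name, fragment = parts[0], parts[1]
        if name not in chunks:
            order.append(name)
            chunks[name] = []
        chunks[name].append(fragment)
    for name in order:
        yield name, "".join(chunks[name])


# Invented two-block example, not real data.
example = """CLUSTAL W (1.83) multiple sequence alignment

alpha      ACGT-ACG
beta       ACGTTACG
           **** ***

alpha      TTGA
beta       TTGC
           ***
"""

records = list(interleaved_iterator(StringIO(example)))
```

The caller still sees a plain iterator, which is the "lowest common
denominator" point above, but nothing is saved in memory terms.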

> I was planning to add a Stockholm parser (where the sequences themselves 
> are non-interleaved).  The PFAM database alignments use this, and are 
> the largest alignments I am aware of.
> 
> ...
> 
> It looks like a subclassed version could be written to handle the PFAM 
> annotations nicely.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3

Changes to the clustal parser, plus the addition of a parser for
Stockholm alignments and a subclassed version to handle the PFAM style
annotation strings.

I have included basic handling of the sequence-specific meta-data [I
still need to have a look at real PFAM data to sort out the database
cross references], but currently ignore the file-level information
(#=GF lines) and the per-column information (#=GC lines).
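
As a rough illustration of that split (again a sketch, not the parser
attached to the bug), the following assumes the usual Stockholm layout:
"#=GS <seqname> <feature> <text>" for per-sequence annotation, plain
"<seqname> <sequence>" lines, and "//" as the terminator. The function
name stockholm_iterator is hypothetical, and the accessions in the
example are invented. Per-residue #=GR lines are skipped here as well,
purely to keep the sketch short.

```python
from io import StringIO


def stockholm_iterator(handle):
    """Yield (name, sequence, annotations) from one Stockholm alignment.

    Hypothetical helper: keeps #=GS (per-sequence) annotation and
    ignores #=GF, #=GC and #=GR lines.
    """
    order = []        # sequence names in first-seen order
    sequences = {}    # name -> aligned sequence string
    annotations = {}  # name -> {feature: [text, ...]}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":
            break  # end of this alignment
        if line.startswith("#=GS "):
            parts = line.split(None, 3)
            if len(parts) == 4:
                _, name, feature, text = parts
                per_seq = annotations.setdefault(name, {})
                per_seq.setdefault(feature, []).append(text)
        elif line.startswith("#"):
            continue  # #=GF, #=GC, #=GR and comments are ignored
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # malformed sequence line
            name, fragment = parts
            if name not in sequences:
                order.append(name)
                sequences[name] = ""
            sequences[name] += fragment.strip()
    for name in order:
        yield name, sequences[name], annotations.get(name, {})


# Invented example; names and accessions are made up.
example = """# STOCKHOLM 1.0
#=GF ID EXAMPLE
#=GS seq1/1-8 AC P00001
#=GS seq2/1-8 AC P00002
#=GS seq2/1-8 DR PDB; 1ABC A; 1-8;
seq1/1-8   ACDE-GHI
seq2/1-8   ACDEFGHI
#=GC SS_cons   HHHH.EEE
//
"""

records = list(stockholm_iterator(StringIO(example)))
```

Each feature maps to a list because a sequence can carry several lines
with the same tag (several DR cross references, for instance).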

Maybe reading sequences out of multiple alignment files should be done
as a special case of loading multiple alignments?  Is this what you
meant by "something special", Marc?

Peter