[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Thu Aug 17 13:25:07 UTC 2006

Marc Colosimo wrote:
> Peter,
> 
> Nice quick work on that. For Clustal, I think it should NOT be an  
> Iterator, but there should be SequenceDict or SequenceList for it.  
> There are other alignment filetypes out there that could use a  
> SequenceIterator (those that are not interlaced).  From looking over  
> your code, it seem like it would be easy to add a check in  
> File2SequenceDict/List to check for Clustal types and do something  
> "special"

Yes, I was wondering about that too.

For interlaced file formats (such as clustalw, NEXUS multiple alignment 
format) we have to load the whole file into memory anyway - so using a 
SequenceIterator was a bit odd.

What I was trying to do was use a SequenceIterator as the lowest common 
denominator - the ClustalIterator shows that this can be done for 
interlaced files, and seems to work.

It's trivial to "upgrade" the ClustalIterator to a SequenceDict or 
SequenceList if that's what is needed.
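
As a rough sketch of what I mean by "upgrading" (the helper names and the 
Record stand-in below are hypothetical, not the actual SeqIO code), any 
iterator of records can be drained into a list or an ID-keyed dictionary:

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord, so the sketch runs without Biopython.
Record = namedtuple("Record", ["id", "seq"])


def to_sequence_list(records):
    """Drain any iterator of records into a plain list."""
    return list(records)


def to_sequence_dict(records, key=lambda rec: rec.id):
    """Drain any iterator of records into a dict keyed on the record ID."""
    result = {}
    for rec in records:
        k = key(rec)
        if k in result:
            # Silently overwriting would hide duplicate IDs in the file.
            raise ValueError("Duplicate key %r" % (k,))
        result[k] = rec
    return result
```

Either helper consumes the iterator, so the whole file ends up in memory 
at that point regardless of how the iterator was implemented.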

The way I wrote the ClustalIterator it actually reads the whole file and 
stores a list of IDs and a dictionary mapping the ID to the sequence 
string.  It creates SeqRecord objects only on request.  This should use 
less memory than a full list of every SeqRecord (but I have not measured 
this).
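
To illustrate the idea (this is a simplified sketch, not the code I 
checked in; the class name and the Record stand-in are made up for the 
example, and it assumes the usual layout of a "CLUSTAL" header line, 
blocks of "name fragment" lines, and consensus lines that start with 
whitespace):

```python
from collections import namedtuple

# Hypothetical stand-in for a SeqRecord, so the sketch runs without Biopython.
Record = namedtuple("Record", ["id", "seq"])


class LazyClustalIterator:
    """Read an interlaced Clustal-style file up front, build records on request."""

    def __init__(self, handle):
        self.ids = []    # IDs in order of first appearance
        self.seqs = {}   # ID -> concatenated sequence string
        for line in handle:
            # Skip the header, blank lines, and consensus lines
            # (the latter start with whitespace).
            if line.startswith("CLUSTAL") or not line.strip() or line[0].isspace():
                continue
            fields = line.split()
            if len(fields) < 2:
                continue
            name, fragment = fields[0], fields[1]
            if name not in self.seqs:
                self.ids.append(name)
                self.seqs[name] = ""
            self.seqs[name] += fragment

    def __iter__(self):
        # Record objects are only created here, one at a time.
        for name in self.ids:
            yield Record(name, self.seqs[name])
```

Only the ID list and the plain strings are held for the lifetime of the 
iterator; each record object exists only while the caller keeps it.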

Note that I would also want to add an easy way to turn any 
SequenceIterator, SequenceList or SequenceDict into a multiple alignment 
object.

Out of interest, what are the largest alignments you deal with?

I was planning to add a Stockholm parser (where the sequences themselves 
are non-interleaved).  The PFAM database alignments use this, and are 
the largest alignments I am aware of.

However, the format supports per-sequence annotation information and 
this information can be rather spread out.  Looking at a real example 
from PFAM, there were blocks of such data both before and after the 
sequences.  The format suggests that such annotation might also be found 
next to each sequence.

I.e. an annotation-free Stockholm iterator would be easy, but including 
the metadata would in general require loading the whole file.
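
A minimal sketch of the annotation-free case (the function name is 
hypothetical, and it assumes one line per sequence, markup lines 
starting with "#", and "//" ending the alignment):

```python
def stockholm_sequences(handle):
    """Yield (name, sequence) pairs from a non-interleaved Stockholm file.

    All markup lines (#=GF, #=GS, #=GR, #=GC) are ignored, so nothing
    needs to be held in memory beyond the current line.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, header, or annotation
        if line == "//":
            break     # end of this alignment
        name, seq = line.split(None, 1)
        yield name, seq.strip()
```

Keeping the #=GS and #=GR lines would mean either buffering records 
until the "//" is seen or making a second pass over the file.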

http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html

It looks like a subclassed version could be written to handle the PFAM 
annotations nicely.

Peter