[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Thu Aug 17 20:09:20 UTC 2006
Marc Colosimo wrote:
>> Nice quick work on that. For Clustal, I think it should NOT be an
>> Iterator, but there should be SequenceDict or SequenceList for it.
>> There are other alignment filetypes out there that could use a
>> SequenceIterator (those that are not interlaced). From looking over
>> your code, it seems like it would be easy to add a check in
>> File2SequenceDict/List to check for Clustal types and do something
>> "special"
Peter (BioPython Dev) wrote:
> Yes, I was wondering about that too.
>
> For interlaced file formats (such as clustalw, NEXUS multiple alignment
> format) we have to load the whole file into memory anyway - so using a
> SequenceIterator was a bit odd.
>
> What I was trying to do was use a SequenceIterator as the lowest common
> denominator - the ClustalIterator shows that this can be done for
> interlaced files, and seems to work.
There are two and a half examples done this way now...
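To make the "iterator as lowest common denominator" idea concrete for an
interlaced format, here is a minimal sketch. It is not the ClustalIterator
from the patch; the function name iter_clustal_records and the
(name, sequence) tuple return type are made up for illustration. It has to
read the whole file before it can yield the first complete record, which is
the oddity discussed above.

```python
def iter_clustal_records(handle):
    """Yield (name, sequence) pairs from an interlaced Clustal-style file.

    Hypothetical helper for illustration only. The whole file is read
    before the first record is yielded, because each sequence is spread
    over several blocks.
    """
    lines = iter(handle)
    header = next(lines, "")
    if not header.startswith("CLUSTAL"):
        raise ValueError("Does not look like a Clustal file")

    order = []    # sequence names in first-seen order
    chunks = {}   # name -> list of sequence fragments, one per block
    for line in lines:
        # Skip blank lines and consensus lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[2], if present, is an optional residue count; ignore it.
        name, fragment = fields[0], fields[1]
        if name not in chunks:
            order.append(name)
            chunks[name] = []
        chunks[name].append(fragment)

    # Only now is every sequence complete, so only now can we yield.
    for name in order:
        yield name, "".join(chunks[name])
```

A caller can still loop over it like any other iterator, or pass it to
dict() or list() to get the SequenceDict/SequenceList style behaviour Marc
suggested.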
> I was planning to add a Stockholm parser (where the sequences themselves
> are non-interleaved). The PFAM database alignments use this, and are
> the largest alignments I am aware of.
>
> ...
>
> It looks like a subclassed version could be written to handle the PFAM
> annotations nicely.
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3
Changes to the clustal parser, the addition of a parser for Stockholm
alignments, and a subclassed version to handle the PFAM-style
annotation strings.
I have included basic handling of the sequence-specific meta-data [I
still need to have a look at real PFAM data to sort out the database
cross-references], but currently ignore the whole-file-level information
(#=GF lines) and the per-column information (#=GC lines).
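As a rough illustration of that behaviour (not the code attached to the bug),
here is a minimal sketch of a Stockholm reader that keeps the per-sequence
#=GS annotation and skips the #=GF and #=GC lines. The function name
parse_stockholm and the tuple layout are assumptions for the example, and it
also skips #=GR lines to keep things short.

```python
def parse_stockholm(handle):
    """Yield (name, sequence, annotations) from one Stockholm alignment.

    Hypothetical helper for illustration only. Per-sequence #=GS lines are
    collected; #=GF, #=GC and #=GR lines are ignored.
    """
    order = []   # sequence names in first-seen order
    seqs = {}    # name -> list of sequence fragments
    gs = {}      # name -> {feature: [values]}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":
            break  # end of this alignment
        if line.startswith("#=GS"):
            # Format: #=GS <seqname> <feature> <free text>
            parts = line.split(None, 3)
            if len(parts) == 4:
                _, name, feature, text = parts
                gs.setdefault(name, {}).setdefault(feature, []).append(text)
            continue
        if line.startswith("#"):
            continue  # #=GF, #=GC, #=GR and plain comments
        parts = line.split()
        if len(parts) != 2:
            continue
        name, fragment = parts
        if name not in seqs:
            order.append(name)
            seqs[name] = []
        seqs[name].append(fragment)
    for name in order:
        yield name, "".join(seqs[name]), gs.get(name, {})
```

Joining the fragments per name means the same loop copes with both the
non-interleaved layout and a file where the sequences are split over several
blocks.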
Maybe reading sequences out of multiple alignment files should be done
as a special case of loading multiple alignments? Is this what you
meant by "something special", Marc?
Peter