[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Tue Aug 22 16:46:39 UTC 2006

Albert Krewinkel wrote:
> I'd like to seriously start working on an EMBL parser, but ...

As the de-facto GenBank module owner, I'm also interested in getting EMBL 
and GenBank working nicely together.  The big question BEFORE you/we 
start any serious coding on EMBL support is how it fits into Biopython.

Do we (a) add a new module like the existing Bio.Fasta and Bio.GenBank, 
or (b) use a new framework like the one I've put forward here:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

 > ... there are some things I'm concerned about: It surely would be a
 > good thing to build the SequenceIO and Parser stuff upon some base
 > classes and agree on using certain tools which are (or will be) used
 > in the whole project.

What I was proposing was that all the new sequence file format parsers
should be implemented as subclasses of my SequenceIterator class - 
either directly (e.g. FastaIterator) or indirectly (e.g. the 
PfamStockholmIterator) and they should return SeqRecord objects.
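
To make the idea concrete, here is a minimal sketch of that shape.  It is
not the code attached to the bug report: the SeqRecord below is a
simplified, hypothetical stand-in for Bio.SeqRecord.SeqRecord, and the
method layout of SequenceIterator and FastaIterator is assumed purely for
illustration.

```python
import io
from dataclasses import dataclass


@dataclass
class SeqRecord:
    """Simplified, hypothetical stand-in for Bio.SeqRecord.SeqRecord."""
    id: str
    seq: str
    description: str = ""


class SequenceIterator:
    """Assumed base class: wraps a handle and yields SeqRecord objects."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        return self

    def __next__(self):
        raise NotImplementedError("Subclasses must implement __next__")


class FastaIterator(SequenceIterator):
    """Direct subclass: returns one SeqRecord per FASTA entry."""

    def __init__(self, handle):
        super().__init__(handle)
        self._pending = None  # header line read ahead by the previous call

    def __next__(self):
        line = self._pending or self.handle.readline()
        self._pending = None
        # Skip blank lines or junk before the next ">" header.
        while line and not line.startswith(">"):
            line = self.handle.readline()
        if not line:
            raise StopIteration
        description = line[1:].strip()
        parts = []
        while True:
            line = self.handle.readline()
            if not line or line.startswith(">"):
                # Remember the next header (if any) for the following call.
                self._pending = line or None
                break
            parts.append(line.strip())
        record_id = description.split()[0] if description else ""
        return SeqRecord(id=record_id, seq="".join(parts),
                         description=description)
```

Any file-like handle works, e.g. iterating over
FastaIterator(io.StringIO(">a first\nACGT\n")) gives one record at a time,
so the caller never needs the whole file in memory.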

I am open to discussion about how interlaced file formats should be
handled, but I think I have shown how the SequenceIterator-based scheme 
could work using the Clustalw and Stockholm formats as examples.
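
The difficulty with interlaced formats is that no sequence is complete
until the last block has been read.  The sketch below shows one way to
live with that: read everything, then hand back records one by one.  It
assumes a toy layout of "name chunk" lines in blocks separated by blank
lines, not the real Clustalw or Stockholm syntax, and SeqRecord here is a
hypothetical stand-in for Bio.SeqRecord.SeqRecord.

```python
import io
from collections import namedtuple

# Hypothetical, simplified stand-in for Bio.SeqRecord.SeqRecord.
SeqRecord = namedtuple("SeqRecord", ["id", "seq"])


def interlaced_iterator(handle):
    """Yield one SeqRecord per sequence from a toy interlaced alignment.

    Assumed toy layout: blocks of "name chunk" lines separated by blank
    lines; any line without exactly two fields is ignored.
    """
    chunks = {}  # dicts preserve insertion order (Python 3.7+)
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank line, or something this toy format ignores
        name, chunk = fields
        chunks.setdefault(name, []).append(chunk)
    # Only now, with the whole file read, is any sequence complete.
    for name, parts in chunks.items():
        yield SeqRecord(id=name, seq="".join(parts))
```

The caller still sees an iterator of SeqRecord objects, the same as for a
sequential format; the cost is that the whole file is held in memory
before the first record comes out.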

> Since I never received any education/training on software 
> development, I would appreciate it if someone could tell me what the
> code's structure should look like -- the current Scanner/Consumer code
> isn't any help.

I agree that the current Scanner/Consumer code won't be much help.

The current Bio.GenBank parser uses the Scanner/Consumer model because 
I rewrote (in Python) what had been done using Martel/Mindy.  That is 
my one excuse for the state of that code ;)

I don't think the flexibility of the Scanner/Consumer model is needed
just to turn EMBL/GenBank data into SeqRecord objects (and only into 
SeqRecord objects).

> How about using reStructuredText in docstrings?  IMO it leaves the 
> .__doc__ string very readable but improves epydoc generated 
> descriptions.

I'm not familiar with how any existing API documentation is extracted
from the source code...

Peter