[Biopython-dev] EMBL flatfile parsing

Peter biopython-dev at maubp.freeserve.co.uk
Tue Feb 6 15:16:42 UTC 2007


Albert Krewinkel wrote:
>> I am trying to parse an EMBL-formatted file with Biopython, but I
>> couldn't find any working parser for this. When I try to use the
>> Martel-based parser as described in one of the mailing list threads, I
>> get the following error...

Peter wrote:
> OK, we have the following files in Biopython:
>
> Bio/formatdefs/embl.py (wrapper)
> Bio/expressions/embl/__init__.py (dummy file)
> Bio/expressions/embl/embl65.py (contains Martel definition)
>
> ...
>
> It does look like an out-of-date [Martel] file format definition in
> Biopython (assuming that example code from Jeff Chang is fine).

I haven't touched the Martel file format definition, but I have been
looking at EMBL parsing for Bio.SeqIO.

Based on my experience with the poor performance of the old Martel
GenBank parser on large files, I would expect the same issue to apply
to the Martel EMBL parser (even if it were updated).

So, rather than updating the Martel definition, I have been looking at
rewriting my Python-based GenBank parser (in Bio.GenBank) instead.

Notes and an attachment showing the idea are here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c14

I am thinking of sticking with the current scanner/consumer model in 
Bio/GenBank/__init__.py but simply replacing the (GenBank only) _Scanner 
class with a "GenBank scanner" and an "EMBL scanner" (based on a common 
base class which will handle the feature table).

These new scanners would both feed into the existing consumers, in
particular the "Feature Consumer", which builds a SeqRecord with
SeqFeature objects.  I have this more or less working.
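
As a rough illustration of the scanner/consumer split described above
(this is not the code attached to the bug report), here is a minimal,
self-contained sketch.  All of the class and function names in it
(ListConsumer, BaseScanner, GenBankScanner, EmblScanner, feature_line)
are hypothetical and are not the classes in Bio/GenBank/__init__.py,
and the two tiny records are made-up toy data.  It assumes only that
both formats lay out a feature line the same way once the line prefix
is ignored, which is what lets a common base class handle the feature
table.

```python
class ListConsumer:
    """Toy stand-in for a consumer: it just records the events it is sent."""

    def __init__(self):
        self.events = []

    def locus(self, name):
        self.events.append(("locus", name))

    def feature(self, key, location):
        self.events.append(("feature", key, location))


class BaseScanner:
    """Shared logic; subclasses only know how to split up their own format."""

    def split_record(self, lines):
        """Return (record name, list of feature-table lines)."""
        raise NotImplementedError

    def feed(self, lines, consumer):
        name, feature_lines = self.split_record(lines)
        consumer.locus(name)
        for line in feature_lines:
            # Both formats put the feature key in columns 6-20 and the
            # location from column 22 onwards (counting from 1).
            key = line[5:20].strip()
            location = line[21:].strip()
            if key:  # this toy version skips qualifier/continuation lines
                consumer.feature(key, location)


class GenBankScanner(BaseScanner):
    def split_record(self, lines):
        name, features, in_features = None, [], False
        for line in lines:
            if line.startswith("LOCUS"):
                name = line.split()[1]
            elif line.startswith("FEATURES"):
                in_features = True
            elif line.startswith("ORIGIN"):
                in_features = False
            elif in_features:
                features.append(line)
        return name, features


class EmblScanner(BaseScanner):
    def split_record(self, lines):
        name, features = None, []
        for line in lines:
            if line.startswith("ID "):
                name = line.split()[1].rstrip(";")
            elif line.startswith("FT "):
                features.append(line)
        return name, features


def feature_line(prefix, key, location):
    """Build a fixed-width line: 5-character prefix, 16-character key field."""
    return f"{prefix:<5}{key:<16}{location}"


# Made-up toy records describing the same two features in each format.
GENBANK_LINES = [
    "LOCUS       TOY0001     100 bp    DNA     linear   BCT 01-JAN-2007",
    "FEATURES             Location/Qualifiers",
    feature_line("", "source", "1..100"),
    feature_line("", "gene", "10..90"),
    feature_line("", "", '/gene="abc"'),
    "ORIGIN",
]
EMBL_LINES = [
    "ID   TOY0001; SV 1; linear; genomic DNA; STD; PRO; 100 BP.",
    "FH   Key             Location/Qualifiers",
    feature_line("FT", "source", "1..100"),
    feature_line("FT", "gene", "10..90"),
    feature_line("FT", "", '/gene="abc"'),
]


def scan(scanner, lines):
    consumer = ListConsumer()
    scanner.feed(lines, consumer)
    return consumer.events


genbank_events = scan(GenBankScanner(), GENBANK_LINES)
embl_events = scan(EmblScanner(), EMBL_LINES)
print(genbank_events == embl_events)
```

The point of the sketch is that the consumer never learns which format
it was fed: only split_record differs between the two scanners, so an
existing consumer could be reused unchanged.  A real scanner would also
need to handle qualifiers, multi-line locations, and the sequence data,
all of which this toy version ignores.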

Does this sound like a sensible way to include EMBL support?

While it would be possible to use the new EMBL parser in much the same
way as the current GenBank parser, I would recommend that most users
simply invoke either parser via Bio.SeqIO for normal work.

I could put most of the new code in Bio/GenBank and create a new 
module/directory called Bio/EMBL, or just stick everything in 
Bio/GenBank - I'm not that fussed either way given I want to push 
Bio.SeqIO as the main interface.

(Once that is settled I can rearrange the new code to slot in as 
appropriate.)

Michiel - how does this plan sound?  And should I try and get these 
changes done and tested in time for the next release - or wait until 
afterwards?

Peter


