[Biopython-dev] EMBL flatfile parsing
Peter
biopython-dev at maubp.freeserve.co.uk
Tue Feb 6 15:16:42 UTC 2007
Albert Krewinkel wrote:
>> I am trying to parse an EMBL-formatted file with biopython, but I
>> couldn't find any working parser for this. When I try to use the
>> Martel-based parser as described in one of the mailing list threads, I
>> get the following error...
Peter wrote:
> OK, we have the following files in BioPython:
>
> Bio/formatdefs/embl.py (wrapper)
> Bio/expressions/embl/__init__.py (dummy file)
> Bio/expressions/embl/embl65.py (contains Martel definition)
>
> ...
>
> It does look like an out of date [Martel] file format definition in
> BioPython (assuming that example code from Jeff Chang is fine).
I haven't touched the Martel file format definition, but I have been
looking at EMBL parsing for Bio.SeqIO.
Based on my experience with the poor performance of the old Martel
GenBank parser on large files, I would expect the same issue to apply to
the Martel EMBL parser (even if it was updated).
So, I have been looking at re-writing my Python-based GenBank parser (in
Bio.GenBank) instead.
Notes and attachment showing the idea here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c14
I am thinking of sticking with the current scanner/consumer model in
Bio/GenBank/__init__.py but simply replacing the (GenBank only) _Scanner
class with a "GenBank scanner" and an "EMBL scanner" (based on a common
base class which will handle the feature table).
These new scanners would both feed into the existing consumers. In
particular, the "Feature Consumer" which builds a SeqRecord with
SeqFeature objects. I have this more or less working.
Does this sound like a sensible way to include EMBL support?
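
To make the proposal concrete, here is a minimal sketch of the
scanner/consumer split described above. It is not the code attached to
the bug report: every class and method name below (FeatureConsumer,
BaseScanner, GenBankScanner, EmblScanner, feed, locus, feature) is
hypothetical, the sample lines are invented, and only the record name
and feature keys/locations are handled. It assumes the usual layout in
which both formats put the feature key in columns 6-20 and the location
from column 22, which is what lets the base class own the feature table.

```python
# Illustrative sketch only: these class and method names are hypothetical
# and are not the actual Bio.GenBank API.


class FeatureConsumer:
    """Receives scanner events; stand-in for the consumer that builds a SeqRecord."""

    def __init__(self):
        self.name = None
        self.features = []  # list of (key, location) tuples

    def locus(self, name):
        self.name = name

    def feature(self, key, location):
        self.features.append((key, location))


class BaseScanner:
    """Shared logic, including the feature table handling."""

    HEADER_TAG = None      # line tag carrying the record name
    FEATURE_PREFIX = None  # five characters that start every feature line
    SEQUENCE_TAG = None    # line tag that starts the sequence block

    def parse_name(self, line):
        raise NotImplementedError

    def feed(self, lines, consumer):
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(self.SEQUENCE_TAG):
                break  # this toy scanner ignores the sequence itself
            if line.startswith(self.HEADER_TAG):
                consumer.locus(self.parse_name(line))
            elif (line.startswith(self.FEATURE_PREFIX)
                  and len(line) > 21 and line[5] != " "):
                # A new feature; qualifier lines have a blank in column 6.
                consumer.feature(line[5:20].strip(), line[21:].strip())


class GenBankScanner(BaseScanner):
    HEADER_TAG = "LOCUS"
    FEATURE_PREFIX = "     "
    SEQUENCE_TAG = "ORIGIN"

    def parse_name(self, line):
        return line.split()[1]


class EmblScanner(BaseScanner):
    HEADER_TAG = "ID"
    FEATURE_PREFIX = "FT   "
    SEQUENCE_TAG = "SQ"

    def parse_name(self, line):
        return line.split()[1].rstrip(";")


def feature_line(prefix, key, location):
    """Build a feature table line with the key padded to column 21."""
    return prefix + key.ljust(16) + location


# Invented sample records, not real database entries.
embl_lines = [
    "ID   DEMO01; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "FH   Key             Location/Qualifiers",
    feature_line("FT   ", "source", "1..1859"),
    feature_line("FT   ", "", '/note="made-up qualifier"'),
    feature_line("FT   ", "CDS", "14..1495"),
    "SQ   Sequence 1859 BP;",
]
genbank_lines = [
    "LOCUS       DEMO01       1859 bp    mRNA    linear   PLN 01-JAN-2000",
    "FEATURES             Location/Qualifiers",
    feature_line("     ", "source", "1..1859"),
    feature_line("     ", "", '/note="made-up qualifier"'),
    feature_line("     ", "CDS", "14..1495"),
    "ORIGIN",
]

embl_consumer = FeatureConsumer()
EmblScanner().feed(embl_lines, embl_consumer)

genbank_consumer = FeatureConsumer()
GenBankScanner().feed(genbank_lines, genbank_consumer)

print(embl_consumer.name, embl_consumer.features)
print(genbank_consumer.name, genbank_consumer.features)
```

The point of the design is that the two subclasses differ only in their
line tags and in how the name is pulled out of the header line, while
the same consumer sees identical events from either format.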
While it would be possible to use the new EMBL parser in much the same
way as the current GenBank parser, I would recommend most users simply
invoke them via Bio.SeqIO for normal work.
I could put most of the new code in Bio/GenBank and create a new
module/directory called Bio/EMBL, or just stick everything in
Bio/GenBank - I'm not that fussed either way given I want to push
Bio.SeqIO as the main interface.
(Once that is settled I can rearrange the new code to slot in as
appropriate.)
Michiel - how does this plan sound? And should I try to get these
changes done and tested in time for the next release - or wait until
afterwards?
Peter