[Biopython-dev] EMBL flatfile parsing

Wed Apr 19 11:24:04 UTC 2006

Albert Krewinkel wrote:
> Hello,
> 
> I am trying to parse an EMBL-formatted file with Biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailing list threads, I
> get the following error...

OK, we have the following files in Biopython:

Bio/formatdefs/embl.py (wrapper)
Bio/expressions/embl/__init__.py (dummy file)
Bio/expressions/embl/embl65.py (contains Martel definition)

According to the comments, this definition is meant to read EMBL files
in the format used by EMBL Nucleotide Sequence Database Release 65
(December 2000).

EMBL is now on release 86, and the file format has changed since then:

http://www.ebi.ac.uk/embl/Documentation/changesdetails.html

For example, the layout of the ID line has changed, and the separate SV
(sequence version) line has been removed.
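
To make the ID line change concrete, here is a minimal sketch that tells
the two layouts apart by splitting on semicolons. This is not part of
Biopython or Martel; the function name describe_id_line is hypothetical,
and the two sample lines follow the style of the EMBL documentation as I
understand it (four fields in the old layout, seven in the new one, with
the sequence version as the second field), not the poster's file. Please
check the details against the release notes linked above.

```python
def describe_id_line(line):
    """Classify an EMBL ID line as old or new layout (illustrative only)."""
    if not line.startswith("ID   "):
        raise ValueError("not an ID line: %r" % line)
    # Drop the "ID   " prefix and the trailing full stop, then split on ";".
    body = line[5:].rstrip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]

    # Assumed new layout: accession; SV n; topology; molecule; class; division; length
    if len(fields) == 7 and fields[1].startswith("SV "):
        return {
            "layout": "new",
            "name": fields[0],
            "version": int(fields[1][3:]),
            "length": int(fields[6].split()[0]),
        }

    # Assumed old layout: "name  dataclass"; molecule; division; length
    if len(fields) == 4:
        name, _, dataclass = fields[0].partition(" ")
        return {
            "layout": "old",
            "name": name,
            "dataclass": dataclass.strip(),
            "length": int(fields[3].split()[0]),
        }

    return {"layout": "unknown"}


old_line = "ID   TRBG361    standard; RNA; PLN; 1859 BP."
new_line = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
print(describe_id_line(old_line)["layout"])
print(describe_id_line(new_line)["layout"])
```

A definition written for the old four-field layout would have no way to
match the seven-field line, which fits a failure on the very first line
of a current file.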

> 
> Python 2.4.1 (#1, Oct 22 2005, 16:20:11)
> [GCC 4.0.0 20041026 (Apple Computer, Inc. build 4061)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> 
> >>> filename = '/Users/krewinkel/tmp/embltest.embl'
> >>> from Bio.formatdefs.embl import embl65
> >>> from xml.sax import saxutils
> >>> parser = embl65.make_parser()
> >>> parser.setContentHandler(saxutils.XMLGenerator())
> >>> parser.parse(open(filename))

That looks like it is based on Jeff Chang's email dated 23 July 2003, one
of the few mentions of EMBL that I could spot in the archives.

http://lists.open-bio.org/pipermail/biopython-dev/2003-July/001351.html

> <?xml version="1.0" encoding="iso-8859-1"?>
> <dataset format="embl/65">Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 482, in parse
>     self.parseFile(source.getCharacterStream() or source.getByteStream())
>   File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 468, in parseFile
>     self._err_handler.error(result)
>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/xml/sax/handler.py", line 34, in error
>     raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0

Same here, using your example file.

The fact that it fails right at the beginning ("character 0") suggests
that the change to the ID line is causing the problem, since the ID line
is line one of the example file.
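
For errors reported further into a file, it can help to translate the
character position into a line number. The following is a minimal sketch,
assuming the number in the error message is a 0-based character offset
from the start of the file (that is my reading of the message, not
something I have confirmed in the Martel source). The helper name
locate_offset is hypothetical, and the sample text is made up for
illustration.

```python
def locate_offset(text, offset):
    """Map a 0-based character offset to (line number, column, line text).

    Line numbers and columns are reported 1-based.
    """
    if offset < 0 or offset > len(text):
        raise ValueError("offset %d is outside the text" % offset)
    # Start of the line: just after the last newline before the offset.
    line_start = text.rfind("\n", 0, offset) + 1
    # End of the line: the next newline at or after the offset, if any.
    line_end = text.find("\n", offset)
    if line_end == -1:
        line_end = len(text)
    line_number = text.count("\n", 0, offset) + 1
    column = offset - line_start + 1
    return line_number, column, text[line_start:line_end]


# Made-up file content, for illustration only.
sample = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.\nXX\nAC   X56734;\n"
print(locate_offset(sample, 0))
```

For a real file you would pass in open(filename).read() together with the
number from the error message; offset 0 always maps to line 1, column 1.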

> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?

It does look like an out-of-date file format definition in Biopython
(assuming that the example code from Jeff Chang is fine).

Peter