[Biopython-dev] EMBL flatfile parsing
Peter (BioPython-dev)
biopython-dev at maubp.freeserve.co.uk
Wed Apr 19 07:24:04 EDT 2006
Albert Krewinkel wrote:
> Hello,
>
> I am trying to parse an EMBL-formatted file with biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailing list threads, I
> get the following error...
OK, we have the following files in BioPython:
Bio/formatdefs/embl.py (wrapper)
Bio/expressions/embl/__init__.py (dummy file)
Bio/expressions/embl/embl65.py (contains Martel definition)
According to the comments, this should read EMBL files in the format
from EMBL Nucleotide Sequence Database Release 65, December 2000.
They are now on release 86, and there have been changes to the file format:
http://www.ebi.ac.uk/embl/Documentation/changesdetails.html
For example, the ID lines have changed, and the SV (sequence version)
line has been removed.
>
> Python 2.4.1 (#1, Oct 22 2005, 16:20:11)
> [GCC 4.0.0 20041026 (Apple Computer, Inc. build 4061)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> filename = '/Users/krewinkel/tmp/embltest.embl'
> >>> from Bio.formatdefs.embl import embl65
> >>> from xml.sax import saxutils
> >>> parser = embl65.make_parser()
> >>> parser.setContentHandler(saxutils.XMLGenerator())
> >>> parser.parse(open(filename))
That looks like it's based on Jeff Chang's email dated 23 July 2003, one
of the few mentions of EMBL that I could spot in the archives.
http://lists.open-bio.org/pipermail/biopython-dev/2003-July/001351.html
> <?xml version="1.0" encoding="iso-8859-1"?>
> <dataset format="embl/65">Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 482, in parse
> self.parseFile(source.getCharacterStream() or source.getByteStream())
> File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 468, in parseFile
> self._err_handler.error(result)
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/xml/sax/handler.py", line 34, in error
> raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0
Same here, using your example file.
The fact that it seems to be failing right at the beginning suggests it is
the change to the ID line that is causing the problem (line one in the
example file).
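One quick way to test that theory is to look at the first line of the file
and see which ID layout it uses. The following is only a rough sketch, not
part of BioPython: classify_id_line is a hypothetical helper, and the two
sample ID lines are my assumption of the old and new layouts as shown in the
EMBL user manual, so check them against the documentation linked above.

```python
# Assumed ID line layouts (illustrative only, verify against the EMBL docs):
#   old: ID   X56734     standard; RNA; PLN; 1859 BP.
#   new: ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.

def classify_id_line(line):
    """Guess whether an EMBL ID line is 'old' or 'new' style.

    Hypothetical helper: it only counts the semicolon-separated fields
    and does not validate their contents beyond the 'SV' field.
    """
    if not line.startswith("ID "):
        return "unknown"
    # Drop the 'ID' tag, surrounding whitespace and the trailing full stop.
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) == 7 and fields[1].startswith("SV "):
        return "new"
    if len(fields) == 4:
        return "old"
    return "unknown"


old_style = "ID   X56734     standard; RNA; PLN; 1859 BP."
new_style = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
print(classify_id_line(old_style))
print(classify_id_line(new_style))
```

If the first line of the example file comes back as anything other than
"old", that would be consistent with the embl65 definition rejecting it at
character 0.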
> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?
It does look like an out-of-date file format definition in BioPython
(assuming that the example code from Jeff Chang is fine).
Peter