[Biopython-dev] RE: Parsing Protein GenBank Records

Tue Oct 30 14:17:29 EST 2001

Hello,

Thanks for your help. The updated parser now works well for most REFSEQ
proteins. However, I came across several REFSEQ protein records where the
parser still fails on a UNIX machine. The error message is:

Traceback (most recent call last):
    entry = parser.parse(gb_handle)
  File "/usr/.../Bio/GenBank/__init__.py", line 281, in parse
    self._scanner.feed(handle, self_consumer)
  File "/usr/.../Bio/GenBank/__init__.py", line 1143, in feed
    self._parser.parseFile(handle)
  File "/usr/.../Martel/Parser.py", line 226, in parseFile
    self.parseString(fileobj.read())
  File "/usr/.../Martel/Parser.py", line 254, in parseString
    self._err_handler.fatalError(result)
  File "/usr/.../python2.1/xml/sax/handler.py", line 38, in fatalError
    raise exception
ParserPositionException: error parsing at or beyond character 2889
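
The error reports only a character offset, so one way to narrow it down
is to look at the raw record text around that position. The following is
a minimal sketch in plain Python that makes no Biopython or Martel calls.
The helper name show_context and the window size are hypothetical, and
record_text is assumed to be the same string that was handed to the
parser. Since the message says "at or beyond", the real problem may sit
somewhat after the reported offset.

```python
def show_context(record_text, offset, window=80):
    """Return (line_number, snippet) for a character offset in a record.

    Hypothetical helper: `offset` is the position reported in the parser
    error, and `window` is how many characters to keep on each side.
    """
    start = max(0, offset - window)
    end = min(len(record_text), offset + window)
    # Line numbers are 1-based: count the newlines before the offset.
    line_number = record_text.count("\n", 0, offset) + 1
    return line_number, record_text[start:end]
```

Calling show_context(record_text, 2889) on the failing record would give
the line number and the surrounding text to compare against a record
that parses cleanly.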

Any help will be greatly appreciated.

Thank You,
Jeong

-----Original Message-----
From: Brad Chapman [mailto:chapmanb at arches.uga.edu]
Sent: Tuesday, September 18, 2001 9:26 PM
To: Jeong Joung
Cc: biopython-dev at biopython.org
Subject: Re: Parsing Protein GenBank Records

Hi Joung;
(ccing this to biopython-dev since this is relevant to everyone)

> I'm having trouble parsing GenBank records obtained from the protein
> database. The parser works fine for nucleotide GenBank records, but not
> for protein records. I would appreciate it very much if you can guide me
> in the right direction for parsing such records.
>
> Here is the code and the error that I get back.
>
> >>> parser = GenBank.RecordParser()
> >>> ncbi = GenBank.NCBIDictionary(database='Protein')
> >>> rec = ncbi['6754304']

The parser does work for proteins in general, but it fails badly on
this particular REFSEQ sequence. In the past, REFSEQ records have been
only "sort of" GenBank format, and this record is no exception. It
has a lot of formatting problems: no identifier for the sequence
type in the LOCUS line, an extra DBSOURCE tag, and non-standard
feature table types and keys (Protein, Region, region_name).
All in all, it is a big non-standard formatting mess.
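
As a rough illustration of the deviations listed above, a plain-Python
sketch like the one below could flag them in a record's raw text. This
is not the change made to Bio.GenBank. The function name
find_refseq_quirks is hypothetical, and the layout it assumes (feature
keys indented five spaces, qualifiers starting with "/", nucleotide
LOCUS lines naming a molecule type containing DNA or RNA) is the usual
GenBank flat-file convention rather than anything taken from the parser.

```python
import re

# Non-standard feature keys named in the message above.
NONSTANDARD_KEYS = ("Protein", "Region")
# Assumption: feature keys start after exactly five spaces of indentation.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s")


def find_refseq_quirks(record_text):
    """Return short notes on the non-standard parts found in a record."""
    quirks = []
    lines = record_text.splitlines()
    if lines and lines[0].startswith("LOCUS"):
        fields = lines[0].split()
        # Assumption: nucleotide records name a type such as DNA or mRNA.
        if not any("DNA" in f or "RNA" in f for f in fields[1:]):
            quirks.append("LOCUS line has no sequence type")
    for line in lines:
        if line.startswith("DBSOURCE"):
            quirks.append("DBSOURCE tag present")
        match = FEATURE_KEY.match(line)
        if match and match.group(1) in NONSTANDARD_KEYS:
            quirks.append("feature key %s" % match.group(1))
        if line.strip().startswith("/region_name="):
            quirks.append("qualifier region_name")
    return quirks
```

An empty list means none of these particular deviations were seen. It
does not mean the record is otherwise well formed.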

I've fixed the GenBank parser to be able to handle this, and checked
the changes into CVS. Diffs to the relevant files (Record.py,
__init__.py and genbank_format.py in Bio.GenBank) are also attached
to this message in case you don't have CVS access.

Thanks for the bug report. Hope this works for you!

Brad
--
PGP public key available from http://pgp.mit.edu/