[Biopython-dev] RE: Refseq Data
Peter Wilkinson
pewilkinson at informaxinc.com
Tue Nov 6 17:29:36 EST 2001
Hi Brad,
I tried your update (the most recent changes you said are in CVS), but it is
not working with the nucleotide records either right now. I get the following
error:
  File "D:\Program Files\Python21\Bio\GenBank\__init__.py", line 1205, in feed
    self._parser.parseFile(handle)
  File "D:\Program Files\Python21\Martel\Parser.py", line 226, in parseFile
    self.parseString(fileobj.read())
  File "D:\Program Files\Python21\Martel\Parser.py", line 254, in parseString
    self._err_handler.fatalError(result)
  File "D:\Program Files\Python21\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 379
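For anyone chasing a similar error, one way to narrow it down is to look at
the text around the reported character offset. A minimal sketch, assuming the
record text is available as a string; `show_context` is a hypothetical helper
(not part of Biopython or Martel), and the sample text is made up for
illustration:

```python
def show_context(text, offset, width=60):
    """Return (line number, text before offset, text after offset)."""
    # Clamp the offset so an out-of-range value still yields a usable slice.
    offset = max(0, min(offset, len(text)))
    start = max(0, offset - width)
    end = min(len(text), offset + width)
    # 1-based line number on which the offset falls.
    line_no = text.count("\n", 0, offset) + 1
    return line_no, text[start:offset], text[offset:end]


# Illustrative input only; a real record would be read from a saved file.
sample = "LOCUS       EXAMPLE\nDEFINITION  placeholder record\n"
line_no, before, after = show_context(sample, 20, width=10)
print("offset falls on line", line_no)
print(repr(before), "<-- offset -->", repr(after))
```

Because the message says "at or beyond", the actual problem may sit somewhat
after the printed position rather than exactly at it.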
I am puzzled about something, though. I just downloaded the 'latest' files
for the Bio/GenBank directory from the web. However, the web CVS viewer
shows that the files are 3 weeks old. I will try to go in with the
command line and see what I can find ...
Did you commit the changes to the CVS tree, or is the web CVS viewer doing
something funky?
Peter
> --__--__--
>
> Message: 2
> From: "Jeong Joung" <j.joung at AptusGenomics.com>
> To: "Brad Chapman" <chapmanb at arches.uga.edu>
> Cc: <biopython-dev at biopython.org>
> Date: Tue, 30 Oct 2001 15:48:19 -0500
> Subject: [Biopython-dev] Parsing Protein GenBank Records
>
> Hi,
>
> I just found out that this problem occurs on some REFSEQ nucleotide
> records as well.
>
> Thank You,
> Jeong
>
> -----Original Message-----
> From: Jeong Joung [mailto:j.joung at AptusGenomics.com]
> Sent: Tuesday, October 30, 2001 2:17 PM
> To: Brad Chapman
> Cc: biopython-dev at biopython.org
> Subject: RE: Parsing Protein GenBank Records
>
>
> Hello,
>
> Thanks for your help. The updated parser now works well for most REFSEQ
> proteins. I came across several REFSEQ protein records where the parser
> still fails on a UNIX machine. The following is the error message:
>
> Traceback (most recent call last):
>     entry = parser.parse(gb_handle)
>   File "/usr/.../Bio/GenBank/__init__.py", line 281, in parse
>     self._scanner.feed(handle, self_consumer)
>   File "/usr/.../Bio/GenBank/__init__.py", line 1143, in feed
>     self._parser.parseFile(handle)
>   File "/usr/.../Martel/Parser.py", line 226, in parseFile
>     self.parseString(fileobj.read())
>   File "/usr/.../Martel/Parser.py", line 254, in parseString
>     self._err_handler.fatalError(result)
>   File "/usr/.../python2.1/xml/sax/handler.py", line 38, in fatalError
>     raise exception
> ParserPositionException: error parsing at or beyond character 2889
>
> Any help will be greatly appreciated.
>
> Thank You,
> Jeong
>
> -----Original Message-----
> From: Brad Chapman [mailto:chapmanb at arches.uga.edu]
> Sent: Tuesday, September 18, 2001 9:26 PM
> To: Jeong Joung
> Cc: biopython-dev at biopython.org
> Subject: Re: Parsing Protein GenBank Records
>
>
> Hi Joung;
> (ccing this to biopython-dev since this is relevant to everyone)
>
> > I'm having trouble parsing GenBank records obtained from the protein
> > database. The parser works fine for nucleotide GenBank records, but
> > not for protein records. I would appreciate it very much if you can
> > guide me in the right direction for parsing such records.
> >
> > Here is the code and the error that I get back.
> >
> > >>> parser = GenBank.RecordParser()
> > >>> ncbi = GenBank.NCBIDictionary(database='Protein')
> > >>> rec = ncbi['6754304']
>
> The parser does work for proteins in general, but does fail badly on
> this particular REFSEQ sequence. In the past, REFSEQ stuff has been
> only "sort of" GenBank format, and this record is no exception. It
> has a lot of formatting problems (has no identifier for the sequence
> type in the LOCUS line, has extra DBSOURCE tag, has non-standard
> feature table types and keys (Protein, Region, region_name)).
> Anyways, it is a big non-standard formatting mess.
>
> I've fixed the GenBank parser to be able to handle this, and checked
> the changes into CVS. Diffs to the relevant files (Record.py,
> __init__.py and genbank_format.py in Bio.GenBank) are also attached
> to this file in case you don't have CVS access.
>
> Thanks for the bug report. Hope this works for you!
>
> Brad
> --
> PGP public key available from http://pgp.mit.edu/
>
>
>
> --__--__--
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev
>
>
> End of Biopython-dev Digest
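The formatting problems Brad lists in the quoted message (no sequence type on
the LOCUS line, a DBSOURCE tag, the Protein and Region feature keys, and the
region_name qualifier) can be checked for with a quick pre-scan before handing
a record to the parser. A rough sketch, not anything from Bio.GenBank:
`refseq_quirks`, the molecule-type list, and the sample record are all
assumptions made up for illustration.

```python
# Molecule-type tokens looked for on the LOCUS line. This is an illustrative,
# not exhaustive, list chosen for this sketch.
MOLECULE_TYPES = {"DNA", "RNA", "mRNA", "tRNA", "rRNA"}
# Feature keys named in the thread as non-standard.
NONSTANDARD_KEYS = {"Protein", "Region"}


def refseq_quirks(record_text):
    """Return a list of the thread's quirks found in a flat-file record."""
    found = []
    for line in record_text.splitlines():
        if line.startswith("LOCUS"):
            tokens = line.split()[1:]
            # Strand prefixes such as "ss-" are dropped before comparing.
            if not any(t.split("-")[-1] in MOLECULE_TYPES for t in tokens):
                found.append("LOCUS line has no sequence type")
        elif line.startswith("DBSOURCE"):
            found.append("DBSOURCE line present")
        elif line[:5] == " " * 5 and line[5:6].strip():
            # Exactly five leading spaces: treated here as a feature-key line.
            key = line.split()[0]
            if key in NONSTANDARD_KEYS:
                found.append("feature key " + key)
        elif line.strip().startswith("/region_name="):
            found.append("qualifier /region_name")
    return found


# Made-up record fragment, not a real database entry.
sample = "\n".join([
    "LOCUS       EXAMPLE_1     120 aa                    PRI       01-JAN-2001",
    "DBSOURCE    REFSEQ: accession EXAMPLE_2",
    "FEATURES             Location/Qualifiers",
    " " * 5 + "Protein".ljust(16) + "1..120",
    " " * 5 + "Region".ljust(16) + "10..50",
    " " * 21 + '/region_name="example"',
])
for quirk in refseq_quirks(sample):
    print(quirk)
```

A scan like this only flags the specific items mentioned in the thread; it
does not validate the record in any broader sense.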