[Biopython-dev] GenBank parser fails (on large files?)
Brad Chapman
chapmanb at arches.uga.edu
Thu Sep 27 16:05:54 EDT 2001
Hi Michel, Andrew;
Michel:
> >ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk
>
> >This fails with:
>
> >Martel.Parser.ParserPositionException: error parsing at or
> > beyond character 42
Andrew:
> I've found the problem. Here's the format definition
[...]
> In this record, the locus line is
>
> LOCUS AL123456 4411529 bp circular BCT 07-JUL-1998
> ^^^^^^^^^^ all spaces
>
> so there is no residue type. The 'blank_space' in 'locus_line'
> eats up all those spaces, leaving the parser at the word 'circular'.
Thanks for looking at this Andrew -- I've also been checking it out
concurrently and came to the same conclusion. Wow, I never would
have expected to have circular without the residue type :-).
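To illustrate the failure mode outside of Martel, here is a minimal sketch
using only Python's re module. It is not the Biopython code: the run()
helper, the step lists, the residue prefixes/types, and the column spacing of
the sample line are all assumptions made up for the example. Each step is
matched in order with no backtracking across steps, which is the behaviour
Andrew describes: the blank-space step swallows every space, so an optional
"spaces + circular" step can no longer match.

```python
import re

BLANK = r"[ ]+"

def run(steps, text):
    """Match (pattern, required) steps in order, never backtracking across steps.

    Returns (ok, pos), where pos is how far matching got.
    """
    pos = 0
    for pattern, required in steps:
        m = re.compile(pattern).match(text, pos)
        if m:
            pos = m.end()
        elif required:
            return False, pos
        # an optional step that does not match is simply skipped
    return True, pos

def locus_steps(circular_pattern):
    # Hypothetical, simplified stand-in for the locus line definition.
    return [
        (r"LOCUS", True),
        (BLANK, True),
        (r"\S+", True),               # locus name
        (BLANK, True),
        (r"\d+ bp", True),            # sequence length
        (BLANK, True),                # eats ALL spaces up to the next word
        (r"(?:ss-|ds-|ms-)", False),  # optional residue prefix (assumed values)
        (r"(?:DNA|RNA)", False),      # optional residue type (assumed values)
        (circular_pattern, False),    # optional "circular"
        (BLANK, True),
        (r"[A-Z]{3}", True),          # division, e.g. BCT
        (BLANK, True),
        (r"[-\w]+", True),            # date
    ]

OLD = locus_steps(r"[ ]+circular")  # spaces required before "circular"
NEW = locus_steps(r"[ ]*circular")  # spaces optional, as in the fix

# Sample line with no residue type; the exact spacing is illustrative only.
LINE = "LOCUS       AL123456   4411529 bp              circular  BCT       07-JUL-1998"

ok_old, pos_old = run(OLD, LINE)
ok_new, pos_new = run(NEW, LINE)
print(ok_old, LINE[pos_old:pos_old + 8])  # False circular
print(ok_new, pos_new == len(LINE))       # True True
```

With the old step list the matcher stops at the word "circular"; making the
leading spaces optional lets the same line go through.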
I've fixed this, along with a second problem in this file: the
version line has no GI.
VERSION AL123456
I've added these examples to the GenBankFormat test so that we
should be able to catch them in the future.
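As a rough picture of the VERSION change, the sketch below uses plain Python
regular expressions rather than Martel. The two patterns and the second
sample line (accession and GI number) are invented for illustration; only the
"VERSION AL123456" case comes from the file discussed here. The idea is just
that the "GI:" part moves into an optional group.

```python
import re

# Assumed stand-ins for the old and new VERSION line definitions.
OLD_VERSION = re.compile(r"VERSION +(?P<version>\S+) +GI:(?P<gi>\d+)$")
NEW_VERSION = re.compile(r"VERSION +(?P<version>\S+)(?: +GI:(?P<gi>\d+))?$")

no_gi = "VERSION     AL123456"
with_gi = "VERSION     X00000.1  GI:12345"  # made-up accession and GI

for line in (no_gi, with_gi):
    old = OLD_VERSION.match(line)
    new = NEW_VERSION.match(line)
    print(line)
    print("  old matches:", old is not None)
    print("  new version:", new.group("version"), "gi:", new.group("gi"))
```

The old pattern rejects the line without a GI, while the new one accepts both
forms and reports the GI as None when it is absent.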
For Michel, the fixes are in CVS and the patches to
GenBank/__init__.py and GenBank/genbank_format.py are attached. With
these I can parse your file without problems. I've also added a
couple of changes that should (hopefully) speed up handling of large
sequences somewhat. Thanks for the bug report; let us know
if you come across anything else that fails.
> I've not tested this, since I think the format definition needs
> to be revisited first because I've now more experience in writing
> these things, and second because the LOCUS line definition is
> changing in the next couple months, according to
>
> ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
Yeah, I had read about this previously and I _think_ the format will
handle them (after some modifications I made a while back). In
test_GenBankFormat.py there are a couple of example locus lines with
this new format that it'll parse okay. We'll see if it will hold up
when the full-scale change comes on, though.
But, you are still more than welcome to attack the locus line
parsing anytime you feel up to it -- you are definitely the master
o' Martel :-).
Brad
--
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
Index: genbank_format.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v
retrieving revision 1.8
diff -c -r1.8 genbank_format.py
*** genbank_format.py 2001/09/19 01:15:52 1.8
--- genbank_format.py 2001/09/27 20:02:47
***************
*** 85,92 ****
residue_type = Martel.Group("residue_type",
Martel.Opt(Martel.Alt(*residue_prefixes)) +
Martel.Opt(Martel.Alt(*residue_types)) +
! Martel.Opt(blank_space +
Martel.Str("circular")))
date = Martel.Group("date",
Martel.Re("[-\w]+"))
--- 85,93 ----
residue_type = Martel.Group("residue_type",
Martel.Opt(Martel.Alt(*residue_prefixes)) +
Martel.Opt(Martel.Alt(*residue_types)) +
! Martel.Opt(Martel.Opt(blank_space) +
Martel.Str("circular")))
+
date = Martel.Group("date",
Martel.Re("[-\w]+"))
***************
*** 163,171 ****
Martel.Str("VERSION") +
blank_space +
version +
! blank_space +
! Martel.Str("GI:") +
! gi +
Martel.AnyEol())
# DBSOURCE REFSEQ: accession NM_010510.1
--- 164,172 ----
Martel.Str("VERSION") +
blank_space +
version +
! Martel.Opt(blank_space +
! Martel.Str("GI:") +
! gi) +
Martel.AnyEol())
# DBSOURCE REFSEQ: accession NM_010510.1