[Biopython-dev] GenBank parser fails (on large files?)

Brad Chapman chapmanb at arches.uga.edu
Thu Sep 27 16:05:54 EDT 2001


Hi Michel, Andrew;

Michel:
> >ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk
> 
> >This fails with:
> 
> >Martel.Parser.ParserPositionException: error parsing at or
> > beyond character 42

Andrew:
> I've found the problem.  Here's the format definition
[...]
> In this record, the locus line is
> 
> LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998
>                                 ^^^^^^^^^^ all spaces
> 
> so there is no residue type.   The 'blank_space' in 'locus_line'
> eats up all those spaces, leaving the parser at the word 'circular'.

Thanks for looking at this Andrew -- I've also been checking it out
concurrently and came to the same conclusion. Wow, I never would
have expected to have circular without the residue type :-).
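
To make the failure concrete, here is a rough sketch in plain Python
(not Martel), assuming the matcher consumes blank space greedily and
never gives any of it back. The helper names are made up for
illustration; circular_old mimics the rule with a required leading
blank space, and circular_new mimics the rule with that blank space
made optional, as in the attached patch.

```python
LINE = "LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998"

def eat_blank_space(text, pos):
    """Consume one or more spaces; return the new position, or None if there are none."""
    end = pos
    while end < len(text) and text[end] == " ":
        end += 1
    return end if end > pos else None

def circular_old(text, pos):
    """Mimics Opt(blank_space + Str("circular")): the spaces are required."""
    after = eat_blank_space(text, pos)
    if after is not None and text.startswith("circular", after):
        return after + len("circular")
    return pos  # the optional part matched nothing

def circular_new(text, pos):
    """Mimics Opt(Opt(blank_space) + Str("circular")): the spaces are optional."""
    after = eat_blank_space(text, pos)
    start = pos if after is None else after
    if text.startswith("circular", start):
        return start + len("circular")
    return pos

# The blank space after "bp" in the locus line swallows every space up to "circular".
pos = eat_blank_space(LINE, LINE.index(" bp") + len(" bp"))

old_end = circular_old(LINE, pos)
new_end = circular_new(LINE, pos)
print(LINE[pos:old_end] or "(nothing matched)")
print(LINE[pos:new_end])
```

With no spaces left to consume, the old rule matches nothing and the
parser is stuck in front of 'circular'; the new rule picks it up.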

I've fixed this and also a second problem with this file, where the
version line has no GI:

VERSION     AL123456

I've added these examples to the GenBankFormat test so that we
should be able to catch them in the future.
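
For anyone who wants to see the VERSION case outside of Martel, here
is a small regex sketch of the idea: the GI part becomes optional.
The pattern and the parse_version helper are my own stand-ins, not
the actual format definition, and the version suffix and GI number
in the second example are invented.

```python
import re

# Rough regex stand-in for a VERSION line with an optional GI part.
VERSION_LINE = re.compile(
    r"^VERSION +(?P<version>\S+)"  # accession, possibly with a .N suffix
    r"(?: +GI:(?P<gi>\d+))?"       # the GI part may be missing
    r" *$"
)

def parse_version(line):
    """Return (version, gi) for a VERSION line; gi is None when absent."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError("not a VERSION line: %r" % line)
    return match.group("version"), match.group("gi")

print(parse_version("VERSION     AL123456"))
print(parse_version("VERSION     AL123456.1  GI:12345"))  # made-up values
```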

For Michel, the fixes are in CVS and the patches to
GenBank/__init__.py and GenBank/genbank_format.py are attached. With
these I can parse your file without problems. I've also added a
couple of things which will (hopefully) speed up dealing with large
sequences some. Thanks for the bug report on this; let us know
if you come across anything else that fails.

> I've not tested this, since I think the format definition needs
> to be revisited first because I've now more experience in writing
> these things, and second because the LOCUS line definition is
> changing in the next couple months, according to
> 
>   ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt

Yeah, I had read about this previously and I _think_ the format will
handle the new locus lines (after some modifications I made a while
back). In
test_GenBankFormat.py there are a couple of example locus lines with
this new format that it'll parse okay. We'll see if it will hold up
when the full-scale change comes on, though.

But you are still more than welcome to attack the locus line
parsing anytime you feel up to it -- you are definitely the master
o' Martel :-).

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
Index: genbank_format.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v
retrieving revision 1.8
diff -c -r1.8 genbank_format.py
*** genbank_format.py	2001/09/19 01:15:52	1.8
--- genbank_format.py	2001/09/27 20:02:47
***************
*** 85,92 ****
  residue_type = Martel.Group("residue_type",
                              Martel.Opt(Martel.Alt(*residue_prefixes)) +
                              Martel.Opt(Martel.Alt(*residue_types)) +
!                             Martel.Opt(blank_space +
                                         Martel.Str("circular")))
  date = Martel.Group("date",
                      Martel.Re("[-\w]+"))
  
--- 85,93 ----
  residue_type = Martel.Group("residue_type",
                              Martel.Opt(Martel.Alt(*residue_prefixes)) +
                              Martel.Opt(Martel.Alt(*residue_types)) +
!                             Martel.Opt(Martel.Opt(blank_space) + 
                                         Martel.Str("circular")))
+ 
  date = Martel.Group("date",
                      Martel.Re("[-\w]+"))
  
***************
*** 163,171 ****
                              Martel.Str("VERSION") +
                              blank_space +
                              version +
!                             blank_space +
!                             Martel.Str("GI:") +
!                             gi +
                              Martel.AnyEol())
  
  # DBSOURCE    REFSEQ: accession NM_010510.1
--- 164,172 ----
                              Martel.Str("VERSION") +
                              blank_space +
                              version +
!                             Martel.Opt(blank_space +
!                                        Martel.Str("GI:") +
!                                        gi) +
                              Martel.AnyEol())
  
  # DBSOURCE    REFSEQ: accession NM_010510.1

