[Biopython-dev] GenBank parser fails (on large files?)

Andrew Dalke adalke at mindspring.com
Thu Sep 27 15:35:50 EDT 2001


>Full_Name: Michel Kerszberg
>ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk

>This fails with:

>Martel.Parser.ParserPositionException: error parsing at or
> beyond character 42
>
>This is in the first line of the record, which seems
>correctly formatted. No amount of massaging of the
>file seems to help.
>
>I have seen this problem reported with other large
>GenBank records.

I've found the problem.  Here's the relevant part of the format definition:

locus_line = Martel.Group("locus_line",
                          ...
                          blank_space +
                          Martel.Opt(residue_type +

residue_type = Martel.Group("residue_type",
                            Martel.Opt(Martel.Alt(*residue_prefixes)) +
                            Martel.Opt(Martel.Alt(*residue_types)) +
                            Martel.Opt(blank_space +
                                       Martel.Str("circular")))

In this record, the locus line is

LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998
                                ^^^^^^^^^^ all spaces

so there is no residue type.  The 'blank_space' in 'locus_line'
eats up all those spaces, leaving the parser at the word 'circular',
which starts at character 42, consistent with the reported error
position.  'circular' doesn't match any of the residue_prefixes or
residue_types.  There's no " " left in front of it either, so the
'blank_space' inside 'residue_type' can't match, and the residue_type
fails.
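
To make the failure concrete, here is a small sketch in plain Python.  It
is a toy greedy, non-backtracking matcher, not Martel itself, and the
prefix and type lists are assumed, illustrative values (the real lists
aren't shown in this message).  The helper names are made up for the
example.

```python
# Toy model only: a greedy, non-backtracking matcher, not Martel itself.
RESIDUE_PREFIXES = ["ss-", "ds-", "ms-"]   # assumed, illustrative values
RESIDUE_TYPES = ["DNA", "RNA", "mRNA"]     # assumed, illustrative subset


def eat_spaces(text, pos):
    """Consume one or more spaces; return the new position, or None."""
    end = pos
    while end < len(text) and text[end] == " ":
        end += 1
    return end if end > pos else None


def eat_any(text, pos, words):
    """Consume the first of `words` found at `pos`; return new position, or None."""
    for word in words:
        if text.startswith(word, pos):
            return pos + len(word)
    return None


def residue_type_original(text, pos):
    """Mirror the three Opt(...) parts of the original 'residue_type'."""
    after = eat_any(text, pos, RESIDUE_PREFIXES)        # Opt(Alt(prefixes))
    if after is not None:
        pos = after
    after = eat_any(text, pos, RESIDUE_TYPES)           # Opt(Alt(types))
    if after is not None:
        pos = after
    after_spaces = eat_spaces(text, pos)                # Opt(blank_space + "circular")
    if after_spaces is not None:
        after_circular = eat_any(text, after_spaces, ["circular"])
        if after_circular is not None:
            pos = after_circular
    return pos


# The LOCUS line from the record, built piece by piece to keep the spacing exact.
line = ("LOCUS" + " " * 7 + "AL123456" + "  " + "4411529" + " bp"
        + " " * 10 + "circular" + "  BCT" + " " * 7 + "07-JUL-1998")

after_bp = line.index("bp") + 2        # position just past "bp"
start = eat_spaces(line, after_bp)     # locus_line's blank_space eats every space
end = residue_type_original(line, start)
print(start, end, repr(line[end:end + 8]))   # 'circular' is left unconsumed

# For contrast: when a residue type is present, 'circular' is consumed.
sample = "DNA" + " " * 5 + "circular  BCT"
print(residue_type_original(sample, 0))
```

In this model the outer blank_space stops at offset 42, 'residue_type'
consumes nothing, and 'circular' is left sitting in the input, so whatever
comes next in 'locus_line' has nothing it can match.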

Here's a likely solution - let 'circular' match on its own, without a
leading 'blank_space', when there is no residue type:

residue_type = Martel.Group("residue_type",
    Martel.Alt(
         Martel.Opt(Martel.Alt(*residue_prefixes)) +
           Martel.Alt(*residue_types) +
           Martel.Opt(blank_space + Martel.Str("circular")),
         Martel.Opt(Martel.Str("circular"))))

I've not tested this, since I think the format definition needs
to be revisited: first, because I now have more experience in writing
these things, and second, because the LOCUS line definition is
changing in the next couple of months, according to

  ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
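
Short of a real test, the shape of the restructured 'residue_type' can at
least be sanity-checked with Python's re module standing in for Martel.
This is only a sketch: the prefix and type lists are assumed, illustrative
values, 'blank_space' is assumed to mean one or more spaces, and regex
matching is not guaranteed to behave exactly like the Martel parser.

```python
import re

RESIDUE_PREFIXES = ["ss-", "ds-", "ms-"]   # assumed, illustrative values
RESIDUE_TYPES = ["DNA", "RNA", "mRNA"]     # assumed, illustrative subset
BLANK_SPACE = r" +"                        # assumed meaning of 'blank_space'


def alt(words):
    """Build a non-capturing alternation, like Martel.Alt(*words)."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


# Same two-branch structure as the proposed definition:
#   [prefix] type [blank_space "circular"]   OR   ["circular"]
residue_type = re.compile(
    "(?:"
    + alt(RESIDUE_PREFIXES) + "?"
    + alt(RESIDUE_TYPES)
    + "(?:" + BLANK_SPACE + "circular)?"
    + "|"
    + "(?:circular)?"
    + ")"
)

# Each sample starts where the parser would be after locus_line's blank_space.
samples = [
    "circular  BCT",           # no residue type, as in AL123456
    "DNA     circular  BCT",   # residue type followed by 'circular'
    "ss-RNA    VRL",           # prefixed type, no 'circular'
    "BCT",                     # neither type nor 'circular'
]
for text in samples:
    matched = residue_type.match(text).group(0)
    print(repr(text), "->", repr(matched))
```

The second branch is what picks up a bare 'circular'; the first branch
still handles records that do carry a residue type.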

                    Andrew







