[Biopython-dev] GenBank parser fails (on large files?)
Andrew Dalke
adalke at mindspring.com
Thu Sep 27 15:35:50 EDT 2001
>Full_Name: Michel Kerszberg
>ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk
>This fails with:
>Martel.Parser.ParserPositionException: error parsing at or
> beyond character 42
>
>This is in the first line of the record, which seems
>correctly formatted. No amount of massaging of the
>file seems to help.
>
>I have seen this problem reported with other large
>GenBank records.
I've found the problem. Here's the relevant part of the format definition:
locus_line = Martel.Group("locus_line",
                          ...
                          blank_space +
                          Martel.Opt(residue_type +

residue_type = Martel.Group("residue_type",
                            Martel.Opt(Martel.Alt(*residue_prefixes)) +
                            Martel.Opt(Martel.Alt(*residue_types)) +
                            Martel.Opt(blank_space +
                                       Martel.Str("circular")))
In this record, the locus line is
LOCUS AL123456 4411529 bp          circular BCT 07-JUL-1998
                         ^^^^^^^^^^ all spaces
so there is no residue type. The 'blank_space' in 'locus_line'
eats up all those spaces, leaving the parser at the word 'circular'.
That doesn't match the residue_prefixes or the residue_types. There's
no " " so it doesn't match the 'blank_space', so the residue_type
fails.
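
The failure can be reproduced in miniature with Python's re module. This
is only an analogy, not Martel code. It assumes, as the explanation above
implies, that 'blank_space' consumes the whole run of spaces and never
gives one back (modelled here with a negative lookahead). The prefix and
type lists and the simplified tail of the LOCUS line are hypothetical
stand-ins, since the full definition is elided above.

```python
import re

# Hypothetical stand-ins for the real residue_prefixes / residue_types lists.
residue_prefixes = ["ss-", "ds-", "ms-"]
residue_types = ["DNA", "RNA"]

PREFIX = "(?:%s)" % "|".join(re.escape(p) for p in residue_prefixes)
RTYPE = "(?:%s)" % "|".join(re.escape(t) for t in residue_types)

# A run of blanks that must swallow every blank (assumed Martel behaviour).
BLANK = r"[ ]+(?![ ])"

# Same shape as the current residue_type: every part is optional, and
# "circular" is only accepted after at least one blank.
OLD_RESIDUE_TYPE = "(?:%s)?(?:%s)?(?:%scircular)?" % (PREFIX, RTYPE, BLANK)

# Simplified, assumed tail of the LOCUS line:
# "bp", blanks, residue info, blanks, three-letter division.
old_tail = re.compile("bp" + BLANK + OLD_RESIDUE_TYPE + BLANK + "[A-Z]{3}")

with_type = "4411529 bp    DNA   circular BCT"
without_type = "4411529 bp          circular BCT"

print(old_tail.search(with_type) is not None)     # True
print(old_tail.search(without_type) is not None)  # False
```

In the second case the blanks after "bp" are all consumed, nothing in the
residue pattern can start at "circular", and the match fails there.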
Here's a likely solution - move 'blank_space' to occur after the
residue type, and accept a bare "circular" when there is no residue type:
residue_type = Martel.Group("residue_type",
                   Martel.Alt(
                       Martel.Opt(Martel.Alt(*residue_prefixes)) + \
                       Martel.Alt(*residue_types) + \
                       Martel.Opt(blank_space + Martel.Str("circular")),
                       Martel.Opt(Martel.Str("circular"))))
I've not tested this, since I think the format definition needs
to be revisited anyway: first, because I now have more experience in
writing these things, and second, because the LOCUS line definition is
changing in the next couple of months, according to
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
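
For what it's worth, the shape of the proposed change can be checked with
a rough analogy in Python's re module. This is not Martel code and says
nothing about how Martel itself behaves. It assumes 'blank_space' always
consumes a whole run of spaces (modelled with a negative lookahead), and
the prefix and type lists and the simplified tail of the LOCUS line are
hypothetical stand-ins.

```python
import re

# Hypothetical stand-ins for the real residue_prefixes / residue_types lists.
residue_prefixes = ["ss-", "ds-", "ms-"]
residue_types = ["DNA", "RNA"]

PREFIX = "(?:%s)" % "|".join(re.escape(p) for p in residue_prefixes)
RTYPE = "(?:%s)" % "|".join(re.escape(t) for t in residue_types)

# A run of blanks that must swallow every blank (assumed Martel behaviour).
BLANK = r"[ ]+(?![ ])"

# Same shape as the proposed residue_type: either a residue type with an
# optional blank-separated "circular", or an optional bare "circular".
NEW_RESIDUE_TYPE = "(?:(?:%s)?%s(?:%scircular)?|(?:circular)?)" % (
    PREFIX, RTYPE, BLANK)

# Simplified, assumed tail of the LOCUS line:
# "bp", blanks, residue info, blanks, three-letter division.
new_tail = re.compile("bp" + BLANK + NEW_RESIDUE_TYPE + BLANK + "[A-Z]{3}")

lines = [
    "4411529 bp    DNA   circular BCT",    # residue type and "circular"
    "4411529 bp    DNA   BCT",             # residue type only
    "4411529 bp          circular BCT",    # the failing case from the report
]
for line in lines:
    print(new_tail.search(line) is not None)  # True for each line
```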
Andrew