[Biopython-dev] More relaxed parsing of wonky GenBank files

Tue Jan 8 08:27:20 EST 2013

On Tuesday, January 8, 2013, Kai Blin wrote:

> Hi folks,
>
> I've recently pushed into production use a new version of my software
> that uses BioPython parsers instead of our own hand-written parsers.
>
> One big thing we noticed is that BioPython is waaay more picky as to
> what a proper GenBank file is supposed to look like. Sadly, many of
> our users seem to be creating their GenBank files with programs that
> only have a rough understanding of what the file format is supposed to
> look like. Most of the invalid input can safely be ignored, and I
> would propose to extend the GenBank parser to cope with the most
> common errors I'm seeing in day to day use.
>
> I'm happy to provide the patches, but before starting this work I'd
> like to make sure that they would be acceptable in principle. So, any
> reason to rather blow up in our users' faces than to try and cope with
> invalid input?
>
> Cheers,
> Kai
>

We already try to be tolerant, and issue warnings where it seems
safe to take a broken file (e.g. an unrecognised first line, or a
mismatch between the length given in the first line and the actual
sequence), but in these cases not all the malformed data will or
can be parsed. Sometimes a file is broken to the point that it is
unwise to attempt to parse it any further, and an exception is the
best course of action.

Clearly you've found a whole load more dodgy files. If you
can work out which buggy tools are producing them, please
do try and report the issues to the tool authors. I know that
BioEdit is one source, but maintenance of that popular
free Windows tool stopped many years ago.

If you can prepare some (small) example files illustrating the
rule breaking (for testing), with patches too if you like,
I will certainly review them for inclusion.

Note that if the user wants an exception, they can use the warnings
module to catch and upgrade our parser warnings. As Michael
pointed out, other bits of Biopython, such as the Entrez and PDB
parsers, have an explicit validation or strict mode. In the case of
the PDB parser this just toggles between issuing warnings and
raising exceptions. I'm not sure the GenBank parser (or any other
SeqIO parser) needs a validate/permissive option, given this
can already be achieved with the warnings module. After all,
broken GenBank files should be in the minority.
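
As a rough sketch of the warnings approach, using only the standard
library: the names ParserWarning, parse_record and parse_strict below
are made up for illustration and are not Biopython's API. With the real
parser you would filter on whatever warning category it actually issues.

```python
import warnings


class ParserWarning(UserWarning):
    """Hypothetical stand-in for a parser's warning category."""


def parse_record(lines):
    """Toy tolerant parser: warn, rather than fail, on an odd first line."""
    if not lines[0].startswith("LOCUS"):
        warnings.warn("Unrecognised first line", ParserWarning)
    # Pretend to parse; just report how many lines were consumed.
    return len(lines)


def parse_strict(lines):
    """Run the tolerant parser with its warnings upgraded to exceptions."""
    with warnings.catch_warnings():
        # "error" makes any matching warning raise instead of being printed.
        warnings.simplefilter("error", ParserWarning)
        return parse_record(lines)
```

Calling parse_strict on a file with a bad first line raises
ParserWarning as an exception, while parse_record on the same input
only warns and carries on. The filter is restored on leaving the
with block, so the strictness stays local to that call.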

(My understanding is that the Entrez setting is also about dealing
with missing DTD files and cases where the NCBI has a
bug and their XML and DTD disagree.)

Peter