[Biopython-dev] More relaxed parsing of wonky GenBank files

Kai Blin kai.blin at biotech.uni-tuebingen.de
Tue Jan 8 08:55:42 EST 2013



On 2013-01-08 14:27, Peter Cock wrote:

> We already try to be tolerant, and issue warnings where it seems 
> safe to take a broken file (e.g. Unrecognised first line, mismatch 
> between length given in first line and actual sequence), but in 
> these cases not all the mis-formed data will or can be parsed. 
> Sometimes a file is broken to the point it is unwise to attempt to
> parse it any further and an exception is the best course of
> action.

Yeah, I started looking into the code and realized that it already
tries to handle a lot of special cases.

> Clearly you've found a whole load more dodgy files. If you can work
> out which buggy tools are producing them, please do try and report
> the issues to the tool authors. I know that BioEdit is one source,
> but maintenance of that popular free Windows tool stopped many
> years ago.

Unfortunately I often have no way to contact the uploaders of the
broken sequence files, unless they chose to provide an email address.

> If you can prepare some (small) example files illustrating the 
> rule-breaking files (for testing), and with patches too if you
> like, I will certainly review them for inclusion.

The two most common things I saw in the last week are single record
files without the '//' end-of-record marker, and files where the
sequence lines are indented by one space more than expected (my
favourite).
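
To make the two cases concrete, here is a rough sketch of a
pre-processing step that would tidy up such files before parsing.
This is only an illustration, not the patch itself and not how the
Biopython parser works internally; the function name
normalise_genbank is made up for this example, and it assumes the
sequence block starts at an ORIGIN line.

```python
def normalise_genbank(lines):
    """Return a tidied copy of the lines of a GenBank file.

    Hypothetical helper for illustration only. It fixes two problems:
    sequence lines with extra indentation, and a missing '//' marker
    at the end of the file.
    """
    cleaned = []
    in_sequence = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("ORIGIN"):
            # Everything up to the next '//' is sequence data.
            in_sequence = True
            cleaned.append(line)
        elif line.strip() == "//":
            in_sequence = False
            cleaned.append("//")
        elif in_sequence and line.strip():
            # Re-justify "<offset> <block> <block> ..." so the offset is
            # right-aligned in 9 columns, whatever the original indent.
            offset, *blocks = line.split()
            cleaned.append(("%9s %s" % (offset, " ".join(blocks))).rstrip())
        else:
            cleaned.append(line)

    # Drop trailing blank lines, then add the end-of-record marker
    # if the file stops without one.
    while cleaned and not cleaned[-1].strip():
        cleaned.pop()
    if cleaned and cleaned[-1] != "//":
        cleaned.append("//")
    return cleaned
```

The cleaned lines could then be joined with newlines and handed to
the normal parser through a StringIO handle. Rewriting the input like
this is the blunt approach; making the scanner itself more tolerant
and issuing a warning is probably nicer for users.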

I've added two sample files for these issues, and I'm currently working
on patches that make them pass the tests.

Thanks for the comments. I'll push to my github fork once I've got
something.

Cheers,
Kai

-- 
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-Universität Tübingen
Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
D-72076 Tübingen                        Fax :   ++49 7071 29-5979
Germany
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben

