[Biopython-dev] [Biopython] skipping a bad record read in SeqIO

Sun Jun 7 20:10:33 UTC 2009

On 6/7/09, Iddo Friedberg <idoerg at gmail.com> wrote:
> Thanks Peter.
>
>  OK, it's a genbank file, but the point is not hacking around that problem
>  (which I did), it's more of a biopython policy question.

Could you report a bug with this particular GenBank file (or at least
the entry)? I think Biopython should try to cope with all valid GenBank
files.

It has been a long time since I personally found a GenBank file
Biopython couldn't parse - the only cases I can remember recently from
the mailing list have been invalid files from 3rd party scripts or tools.

Sometimes for out-of-spec files, issuing a warning but continuing may
be OK (we already do this for some LOCUS line variants, e.g. some
GenBank files output from EMBOSS), but for anything unexpected I
think the only safe option is to raise an exception.

>  Biopython cannot handle every record format variant (==error) out there,
>  and we should probably have a method for skipping over illegible records.
>  The records skipped should be noted, of course, e.g. by writing to stderr.
>  If the record cannot be read, then the preceding record ID and / or the
>  record serial number should be written.
>
>  Does that sound like something we should be doing?

No, not really.

I'm not 100% sure this is what you meant, but I would oppose any
suggestion that the default behaviour should be to completely skip bad
records (with only a warning or output to stderr to signal this).

In some cases (e.g. GenBank and SwissProt files) the start and end of
records are well defined, so for a corrupt record we may be able to
recover by issuing a warning and skipping ahead to the next record
boundary. In other file formats this could be impossible (or at least,
risky). So as a general policy for Bio.SeqIO, I don't think we can
offer any way to skip bad records.
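
As a rough illustration only of what record-boundary recovery could look
like for GenBank, here is a minimal sketch that splits on the "//"
terminator line and skips records that fail to parse. It is not something
Bio.SeqIO does; split_genbank_records, parse_tolerantly and toy_parse are
hypothetical names invented for this example, and toy_parse merely stands
in for a real record parser.

```python
import io
import warnings


def split_genbank_records(handle):
    """Yield one list of lines per record, ending at the '//' line."""
    record = []
    for line in handle:
        record.append(line)
        if line.rstrip() == "//":
            yield record
            record = []
    if any(text.strip() for text in record):
        yield record  # trailing text with no terminator


def parse_tolerantly(handle, parse_record):
    """Parse each record; warn about and skip those raising ValueError."""
    for number, lines in enumerate(split_genbank_records(handle), start=1):
        try:
            yield parse_record(lines)
        except ValueError as err:
            warnings.warn("Skipping record %i: %s" % (number, err))


def toy_parse(lines):
    """Hypothetical stand-in parser: return the name on the LOCUS line."""
    fields = lines[0].split()
    if len(fields) < 2 or fields[0] != "LOCUS":
        raise ValueError("record does not start with a LOCUS line")
    return fields[1]


data = "LOCUS A\n//\nGARBAGE\n//\nLOCUS C\n//\n"
with warnings.catch_warnings():
    warnings.simplefilter("always")
    names = list(parse_tolerantly(io.StringIO(data), toy_parse))
```

This only works because the record terminator is unambiguous; a format
without such a marker gives the loop nothing safe to resynchronise on.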

Perhaps I am biased as most GenBank files I personally use are single
records (i.e. genomes).

Peter

P.S. I would use the warnings module rather than writing to stderr, as
this would allow the user to filter warnings, upgrade them to
exceptions etc.
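
A minimal sketch of that idea, using only the standard library warnings
module. OutOfSpecWarning and read_locus_name are hypothetical names made
up for this example (they are not Biopython classes or functions), and
the "fewer than four fields" check is a toy rule, not the real LOCUS
line specification.

```python
import warnings


class OutOfSpecWarning(UserWarning):
    """Hypothetical category for this sketch; not a Biopython class."""


def read_locus_name(line):
    """Toy stand-in for one parser step: return the name on a LOCUS line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "LOCUS":
        # Anything unexpected: fail loudly.
        raise ValueError("Not a LOCUS line: %r" % line)
    if len(fields) < 4:
        # Recognisable but incomplete: warn and carry on.
        warnings.warn("Short LOCUS line: %r" % line, OutOfSpecWarning)
    return fields[1]


# A user who trusts the input can silence this category...
with warnings.catch_warnings():
    warnings.simplefilter("ignore", OutOfSpecWarning)
    print(read_locus_name("LOCUS X12345"))

# ...while a strict user can upgrade it to an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", OutOfSpecWarning)
    try:
        read_locus_name("LOCUS X12345")
    except OutOfSpecWarning as err:
        print("Rejected:", err)
```

Writing directly to stderr would give the caller neither option; a
dedicated warning category lets each user pick the policy per script.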