[Biopython-dev] [Biopython] skipping a bad record read in SeqIO
Peter
biopython at maubp.freeserve.co.uk
Sun Jun 7 20:10:33 UTC 2009
On 6/7/09, Iddo Friedberg <idoerg at gmail.com> wrote:
> Thanks Peter.
>
> OK, it's a genbank file, but the point is not hacking around that problem
> (which I did), it's more of a biopython policy question.
Could you file a bug report with this particular GenBank file (or at least
the offending entry)? I think Biopython should try to cope with all valid
GenBank files.
It has been a long time since I personally found a GenBank file
Biopython couldn't parse - the only cases I can remember recently from
the mailing list have been invalid files from 3rd party scripts or tools.
Sometimes, for out-of-spec files, issuing a warning but continuing may
be OK (we already do this for some LOCUS line variants, e.g. some
GenBank files output by EMBOSS), but for anything unexpected I
think the only safe option is to raise an exception.
> Biopython cannot handle every record format variant (==error) out there,
> and we should probably have a method for skipping over illegible records.
> The records skipped should be noted, of course, e.g. by writing to stderr.
> If the record cannot be read, then the preceding record ID and / or the
> record serial number should be written.
>
> Does that sound like something we should be doing?
No, not really.
I'm not 100% sure this is what you meant, but I would oppose any
suggestion that the default behaviour should be to completely skip bad
records (with only a warning or output to stderr to signal this).
In some cases (e.g. GenBank and SwissProt files) the start and end of
records are well defined, so for a corrupt record we may be able to
recover by issuing a warning and skipping ahead to the next record
boundary. In other file formats this could be impossible (or at least,
risky). So as a general policy for Bio.SeqIO, I don't think we can
offer any way to skip bad records.
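
As a rough illustration of the boundary-based recovery described above,
here is a minimal sketch using only the standard library. It assumes
GenBank-style records that end with a "//" line. The names split_records,
toy_parse and parse_skipping_bad are hypothetical and are not part of
Bio.SeqIO; toy_parse is only a stand-in for a real parser.

```python
import io
import warnings


def split_records(handle, terminator="//"):
    """Yield each record as a list of lines, ending at a terminator line."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == terminator:
            yield lines
            lines = []
    if any(item.strip() for item in lines):
        # Trailing data with no terminator: pass it on so the parser can reject it.
        yield lines


def toy_parse(lines):
    """Hypothetical stand-in for a real parser: require a LOCUS line first."""
    fields = lines[0].split()
    if len(fields) < 2 or fields[0] != "LOCUS":
        raise ValueError("record does not start with a usable LOCUS line")
    return fields[1]  # the record name


def parse_skipping_bad(handle):
    """Yield parsed records; warn about and skip any record that fails."""
    for number, lines in enumerate(split_records(handle), start=1):
        try:
            yield toy_parse(lines)
        except ValueError as err:
            warnings.warn("skipping record %d: %s" % (number, err))


data = io.StringIO(
    "LOCUS       AAA\n//\n"
    "LOKUS       BBB\n//\n"  # deliberately corrupt second record
    "LOCUS       CCC\n//\n"
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = list(parse_skipping_bad(data))
```

This only works because the "//" terminator makes it possible to
resynchronise after a bad record. In a format without such a marker the
splitting step has nothing reliable to key on, which is the risk noted
above.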
Perhaps I am biased as most GenBank files I personally use are single
records (i.e. genomes).
Peter
P.S. I would use the warnings module rather than writing to stderr, as
this would allow the user to filter warnings, upgrade them to
exceptions, etc.
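
To make the P.S. concrete, here is a small sketch of what the standard
warnings module gives the caller. BadRecordWarning and read_record are
hypothetical names invented for this example, and the lower-case "locus"
keyword is a made-up quirk, not a real EMBOSS variant.

```python
import warnings


class BadRecordWarning(UserWarning):
    """Hypothetical warning category for records that look out of spec."""


def read_record(text):
    """Toy reader: warn on a lower-case 'locus' keyword, but carry on."""
    if text.startswith("locus"):
        warnings.warn("non-standard LOCUS line", BadRecordWarning)
    return text.split()[1]


# 1. A user who does not care can silence the whole category.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", BadRecordWarning)
    name_ignored = read_record("locus AAA")

# 2. A user who wants strict parsing can upgrade it to an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", BadRecordWarning)
    try:
        read_record("locus AAA")
        raised = False
    except BadRecordWarning:
        raised = True
```

Neither choice is available to the caller if the parser simply writes
to stderr.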