[Biopython-dev] skipping a bad record read in SeqIO

Sun Jun 7 21:14:10 UTC 2009

On Sun, Jun 7, 2009 at 1:10 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On 6/7/09, Iddo Friedberg <idoerg at gmail.com> wrote:
> > Thanks Peter.
> >
> >  OK, it's a genbank file, but the point is not hacking around that
> problem
> >  (which I did), it's more of a biopython policy question.
>
> Could you report a bug with this particular GenBank file (or at least, the
> entry). I think Biopython should try and cope with all valid GenBank files.
>
> It has been a long time since I personally found a GenBank file
> Biopython couldn't parse - the only cases I can remember recently from
> the mailing list have been invalid files from 3rd party scripts or tools.
>
> Sometimes for out of spec files issuing a warning but continuing may
> be OK (we do already this on some LOCUS line variants, e.g. some
> GenBank files output from EMBOSS), but for anything unexpected I
> think the only safe option is to raise an exception.
>
> >  Biopython cannot handle every record format variant (==error) out there,
> >  and we should probably have a method for skipping over illegible
> records.
> >  The records skipped should be noted, of course, e.g. by writing to
> stderr.
> >  If the record cannot be read, then the preceding record ID and / or the
> >  record serial number should be written.
> >
> >  Does that sound like something we should be doing?
>
> No, not really.
>
> I'm not 100% sure this is what you meant, but I would oppose any
> suggestion that the default behaviour should be to completely skip bad
> records (with only a warning or output to stderr to signal this).
>
> In some cases (e.g. GenBank and SwissProt files) the start and end of
> records are well defined, so for a corrupt record we may be able to
> recover by issuing a warning and skipping ahead to the next record
> boundary. In other file formats this could be impossible (or at least,
> risky). So as a general policy for Bio.SeqIO, I don't think we can
> offer any way to skip bad records.
>
> Perhaps I am biased as most GenBank files I personally use are single
> records (i.e. genomes).

No, I am not suggesting that it should be the default behavior, but that an
argument (skip_bad_records=True or somesuch) could be passed to the parser
to make this possible for users who would like to do that. I work with
millions of sequences at a time, and if 5,000 or 50,000 are badly formatted
(or problematic due to a parser bug), I would rather make a note of it and
move on, coming back later to fix the problem. The alternative would be --
well, and ugly hack, which will cause loss of time and research momentum.

Also, I am not suggesting an exact implementation (yet). Warnings do sound
better than stderr.

There are a few million genbank (the format) files out there that did not
originate with NCBI genbank (the database). Mostly in metagenomics. Some are
meta-file that contain no sequence but only LOCUS fields.

It used to be that any format was strictly adhered to, simply because files
in that format would always originate from the same source, and FASTA was
the universal format used for exchange, since it is very hard to mess up a
fasta format. That is not the case any more. For that reason I think we
should consider how to handle unparse-able records.

>
>
> Peter
>
> P.S. I would use the warnings module rather than writing to stderr, as
> this would allow the user to filter warnings, upgrade them to
> exceptions etc.
>

-- 
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org

-- 
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org