[Biopython-dev] Bio.Entrez catching more errors

Sun Mar 22 06:44:42 EDT 2009

On Sat, Mar 21, 2009 at 4:47 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> I think it is good if we catch more errors in Bio.Entrez, but I think
> the error catching should be done by the parser, not when
> retrieving.

We could do that - maybe some common functions for checking
the first line to see if it looks like HTML or XML would help.  It means
lots of changes to lots of parsers, but would help outside the use
case of Bio.Entrez - so this is perhaps worth doing anyway.
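
As a rough sketch of what such a common function might look like
(the name sniff_markup and its return values are invented here for
illustration, nothing like it is claimed to exist in Biopython):

```python
def sniff_markup(first_line):
    """Guess if a first line looks like XML, HTML, or neither.

    Hypothetical helper, not part of Biopython.
    Returns "xml", "html" or None.
    """
    # Ignore leading whitespace and case, e.g. "<HTML>" vs "<html>".
    text = first_line.lstrip().lower()
    if text.startswith("<?xml"):
        return "xml"
    if text.startswith("<!doctype html") or text.startswith("<html"):
        return "html"
    # Anything else (GenBank, FASTA, plain text...) is left alone.
    return None
```

A parser for a plain text format could call this on the first line
and raise an error for anything other than None.  It only looks at
the very start of the data, so it is a heuristic and not a validator.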

What about the fairly common situation (at least, it's something
I've done fairly often) where Bio.Entrez.efetch() is used to fetch
records which are saved directly to file without verification - e.g.
to be parsed by another program?  Unless the error is caught in
Bio.Entrez.efetch() it may be out of our control.

> As you show, NCBI Entrez returns error messages in various
> different formats: plain text, HTML, incorrect XML, broken XML.
> Since there are many ways to access NCBI Entrez, there may
> be other styles of error messages that we don't know about.
> Then there is the added complication of accessing NCBI Entrez
> to get information in formats other than XML, e.g. GenBank files.
> And all this may be changed over time by NCBI.
>
> Since the error message is ill-defined, code trying to identify
> error messages won't be robust.

All very true.  But the main point in my original email was on
something slightly different...

> On the other hand, the format of files expected by a given
> parser is well-defined: Either the file agrees with the format
> expected by the parser, or it doesn't; if it doesn't, then that's
> an error.

It's not that simple - we are often dealing with loosely defined
file formats, and you may be able to reasonably interpret one
file in several different formats (giving different/incorrect data).

Some parsers are very tolerant at the moment, for example
GenBank files can have a legitimate free format comment
before the records, so the parser skips anything until it
recognizes a GenBank locus id line.
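
To illustrate why this tolerance makes error detection hard, here
is a deliberately simplified sketch of the skipping behaviour.  It
is not the actual Biopython GenBank code, and the function name is
made up:

```python
import io


def skip_to_first_locus(handle):
    """Return the first line starting with "LOCUS", or None.

    Simplified illustration only, not the real GenBank parser.
    """
    for line in handle:
        if line.startswith("LOCUS"):
            return line
    # Reached the end of the data without seeing a record.
    return None


# An HTML error page has no LOCUS line, so a parser working this way
# would see no records at all rather than anything obviously wrong.
error_page = io.StringIO("<html>\n<body>Some error</body>\n</html>\n")
print(skip_to_first_locus(error_page))
```

With this approach an error page is indistinguishable from a long
free format comment followed by no records.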

> We may not be able to extract the exact error message
> returned by NCBI, but a parser for format XYZ can tell
> you that the file is not in format XYZ.

Some parsers may be able to do this, but not all.

> Maybe the XML parser can say it doesn't look like an
> XML file, but that's about it.

This is an easy case because XML is so strictly defined.
Spotting a non-XML file is pretty trivial.
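
For example, a minimal check using only the Python standard library
might look like this (is_well_formed_xml is an invented name, and
this only tests well-formedness, not whether the XML is the kind of
record that was asked for):

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if text parses as XML, False otherwise.

    Hypothetical helper for illustration.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Plain text, HTML with unclosed tags, broken XML, etc.
        return False
    return True
```

Note that a well-formed XHTML page would also pass, so this catches
plain text and broken XML but not every HTML error page.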

> Once NCBI Entrez starts to return errors in a uniform
> format, we can modify our parsers to find out the
> exact error message. Until that happens, trying to do
> so on our side will not be robust.

I agree that pulling out error messages (the second half
of my original email in the thread) is error prone.  You
might argue that catching any errors is still worthwhile,
as long as there are no false positives.

The first half of the email (the main point) was based
on a special case: HTML and XML are pretty easy to
identify.  If you ask for HTML and don't get it, it is an
error (and vice versa).  If you ask for XML and don't
get it, it is an error (and vice versa).  The fact that
the NCBI currently often return an HTML or XML error
page when a plain text format was requested is then
easily detected as an error (simply from the file type).
This will still work even if the NCBI do change their
error formats or wording - it should be pretty robust.
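
A minimal sketch of this file type check, assuming XML replies start
with an "<?xml" declaration and HTML replies with "<html" or an HTML
doctype.  The function names and the "xml"/"html"/"text" labels are
invented for illustration and are not part of Bio.Entrez:

```python
def detect_type(text):
    """Classify data as "xml", "html" or "text" from how it starts."""
    start = text.lstrip()[:100].lower()
    if start.startswith("<?xml"):
        return "xml"
    if start.startswith("<!doctype html") or start.startswith("<html"):
        return "html"
    return "text"


def check_type(requested, text):
    """Raise ValueError if the data is not of the requested type.

    requested should be "xml", "html" or "text" (hypothetical labels).
    Returns the text unchanged if the types agree.
    """
    found = detect_type(text)
    if found != requested:
        raise ValueError("Asked for %s but got %s" % (requested, found))
    return text
```

The check never looks at the wording of the error message, only at
the kind of data returned, which is why it should survive changes to
the error pages themselves.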

Peter