[Biopython-dev] Bio.Entrez catching more errors

Wed Mar 25 08:15:21 EDT 2009

On Wed, Mar 25, 2009 at 11:47 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
>> What about the fairly common situation (at least, it's something
>> I've done fairly often) where Bio.Entrez.efetch() is used
>> to fetch records which are saved directly to file without
>> verification - e.g. to be parsed by another program?
>> Unless the error is caught in Bio.Entrez.efetch()
>> it may be out of our control.
>
> That is easy: just run the output returned by NCBI through
> the appropriate parser. If the parser is happy, proceed to
> save the NCBI output in a file.

Possible, but you'd need to cache the handle's data in order
to be able to save it after parsing.  The UndoHandle doesn't
do this.
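
To make the caching idea concrete, here is a rough sketch.  The
function name fetch_validate_save and its arguments are made up
for illustration (nothing like this exists in Bio.Entrez), and
it assumes the handle yields text and that the parser returns an
iterable of records and raises an exception on bad input:

```python
import io


def fetch_validate_save(handle, parse, filename):
    """Read all of handle, check that parse() accepts it, then save it.

    Hypothetical helper: parse is any callable taking a file-like
    object and returning an iterable of records (an assumption).
    """
    data = handle.read()  # cache the whole response in memory
    # Parse an in-memory copy.  list() forces lazy (iterator style)
    # parsers to consume everything, so errors surface before saving.
    records = list(parse(io.StringIO(data)))
    with open(filename, "w") as out:
        out.write(data)  # only reached if the parser was happy
    return len(records)
```

The obvious cost is holding the whole download in memory, which
may matter for large fetches.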

You could save the data to a file, and then check the parser
can read it back - however, this would be complicated if you
are downloading data in batches to go into a single file.

>> The first half of the email (the main point) was based
>> on a special case: HTML and XML are pretty easy to
>> identify.  If you ask for HTML and don't get it, it is
>> an error (and vice versa).  If you ask for XML and don't
>> get it, it is an error (and vice versa).  The fact that
>> the NCBI currently often return an HTML or XML error
>> page when a plain text format was requested is then
>> easily detected as an error (simply from the file type).
>> This will still work even if the NCBI do change their
>> error formats or wording - it should be pretty robust.
>
> Have a look at serialset.xml in the Bio.Entrez test cases ... this
> is the output obtained from NCBI using efetch from the journals
> database with retmode='xml'. The file looks like XML, but it
> doesn't start with "<?xml". However, Bio.Entrez.read parses it
> correctly, so while it's not pretty to me this would not count as
> an error.

I do concede my sample code for detecting XML or HTML could
be improved, and this provides a good test case for a difficult
XML file.  Maybe when we expect XML (or HTML), all we should
check is the file starts with "<"?  e.g.

   elif "retmode" in params and params["retmode"].lower()=="html" \
   and not data.lower().startswith("<") :
       raise TypeError("Requested HTML, but didn't get it: %s..." % data)
   elif "retmode" in params and params["retmode"].lower()=="xml" \
   and not data.lower().startswith("<") :
       raise TypeError("Requested XML, but didn't get it: %s..." % data)
   elif "retmode" in params and params["retmode"] \
   and params["retmode"].lower()!="xml" \
   and data.lower().startswith("<?xml") :
       raise TypeError("Didn't request XML, but got it: %s..." % data)
   elif "retmode" in params and params["retmode"] \
   and params["retmode"].lower()!="html" \
   and (data.lower().startswith("<html") or \
        data.lower().startswith("<!doctype html")):
       #Expected for some error pages (e.g. the Bad Gateway caught above)
       raise TypeError("Didn't request HTML, but got it: %s..." % data)

The above code isn't expected to catch all possible errors - just the
most common ones.  One thing this version won't catch is a mix-up
between XML and HTML (e.g. requested XML, given HTML error page)
but the two do overlap somewhat anyway.
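
For anyone wanting to try this outside the rest of the function,
the same checks can be sketched as a standalone helper.  The name
check_retmode is hypothetical (it is not part of Bio.Entrez), and
two details are additions not in the fragment above: leading
whitespace is skipped before testing, and the error message only
quotes the first 50 characters of the data:

```python
def check_retmode(params, data):
    """Raise TypeError if data obviously mismatches params['retmode'].

    Hypothetical helper, illustration only.  params is a dict of
    the query parameters, data is the start of the returned text.
    """
    retmode = (params.get("retmode") or "").lower()
    # Assumption: ignore leading whitespace before looking at the data.
    start = data[:100].lstrip().lower()
    snippet = data[:50]  # assumption: keep the error message short
    if retmode in ("xml", "html"):
        # Markup was requested, so the data should at least open a tag.
        if not start.startswith("<"):
            raise TypeError("Requested %s, but didn't get it: %s..."
                            % (retmode.upper(), snippet))
    elif retmode:
        # Some other format was requested, so markup signals an error page.
        if start.startswith("<?xml"):
            raise TypeError("Didn't request XML, but got it: %s..." % snippet)
        if start.startswith("<html") or start.startswith("<!doctype html"):
            raise TypeError("Didn't request HTML, but got it: %s..." % snippet)
```

As with the fragment, nothing is checked when retmode is missing
or empty, and an HTML page returned for an XML request still gets
through.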

Peter