[Biopython-dev] Bio.Entrez catching more errors

Sat Mar 21 04:47:07 UTC 2009

I agree it would be good to catch more errors in Bio.Entrez, but I think the error checking should be done by the parser, not at retrieval time.

As you show, NCBI Entrez returns error messages in several formats: plain text, HTML, incorrect XML, broken XML. Since there are many ways to access NCBI Entrez, there may be other styles of error messages that we don't know about. Then there is the added complication of accessing NCBI Entrez to get information in formats other than XML, e.g. GenBank files. And NCBI may change any of this over time.

Since the format of the error messages is ill-defined, code trying to identify error messages won't be robust. On the other hand, the format of files expected by a given parser is well-defined: either the file agrees with the format expected by the parser, or it doesn't; if it doesn't, then that's an error. We may not be able to extract the exact error message returned by NCBI, but a parser for format XYZ can tell you that the file is not in format XYZ. Maybe the XML parser can say it doesn't look like an XML file, but that's about it.
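
To illustrate the parser-side approach, here is a minimal sketch using the expat bindings from the Python standard library. It is not the actual Bio.Entrez parser: the names NotXMLError, parse_xml_or_fail and expected_root are made up for this example, and it is written for Python 3.

```python
import io
import xml.parsers.expat


class NotXMLError(ValueError):
    """Hypothetical exception: the data is not in the expected XML format."""


def parse_xml_or_fail(handle, expected_root=None):
    """Parse XML from a binary file-like handle.

    Returns the list of element names in document order. Raises
    NotXMLError if the data is not well-formed XML, or if expected_root
    is given and the root element has a different name.
    """
    parser = xml.parsers.expat.ParserCreate()
    elements = []
    # Record every start tag; a real parser would build records here.
    parser.StartElementHandler = lambda name, attrs: elements.append(name)
    try:
        parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError as err:
        # The parser cannot say what NCBI meant, only that this is not XML.
        raise NotXMLError("Data is not well-formed XML: %s" % err) from err
    if expected_root is not None and elements[0] != expected_root:
        raise NotXMLError(
            "Expected root element %r, found %r" % (expected_root, elements[0])
        )
    return elements


if __name__ == "__main__":
    # Plain-text error message where XML was expected.
    bad = io.BytesIO(b"1: id: 123456 Error occurred: cannot get document summary\n")
    try:
        parse_xml_or_fail(bad)
    except NotXMLError as err:
        print("Rejected:", err)
```

An HTML error page that happens to be well-formed XML would pass the well-formedness check alone, which is why the sketch also lets the caller name the root element it expects.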

Once NCBI Entrez starts to return errors in a uniform format, we can modify our parsers to find out the exact error message. Until that happens, trying to do so on our side will not be robust.

--Michiel

--- On Tue, 3/10/09, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Bio.Entrez catching more errors
> To: "BioPython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> Date: Tuesday, March 10, 2009, 7:40 PM
> Hi All,
> 
> It occurred to me that the Bio.Entrez._open function can look at the
> retmode argument (if present) and spot if there is a mismatch between
> the requested format (e.g. XML, HTML, text or asn.1) and the actual
> data the NCBI returned.  Something along the following lines could be
> added to the end of the _open function in Bio/Entrez/__init__.py to
> achieve this:
> 
>     elif "retmode" in params and params["retmode"].lower()=="html" \
>     and not data.lower().startswith("<html") \
>     and not data.lower().startswith("<!doctype html") :
>         raise TypeError("Requested HTML, but didn't get it: %s..." % data)
>     elif "retmode" in params and params["retmode"].lower()=="xml" \
>     and not data.lower().startswith("<?xml") :
>         raise TypeError("Requested XML, but didn't get it: %s..." % data)
>     elif "retmode" in params and params["retmode"] \
>     and params["retmode"].lower()!="xml" \
>     and data.lower().startswith("<?xml") :
>         raise TypeError("Didn't request XML, but got it: %s..." % data)
>     elif "retmode" in params and params["retmode"] \
>     and params["retmode"].lower()!="html" \
>     and (data.lower().startswith("<html") or \
>          data.lower().startswith("<!doctype html")):
>         #Expected for some error pages (e.g. the Bad Gateway caught above)
>         raise TypeError("Didn't request HTML, but got it: %s..." % data)
> 
> I'm sure my XML/HTML detection could be made more robust here - I hope
> the principle is clear.  My motivation is that I have noticed the NCBI
> can return HTML error pages, and while we do catch some of these
> explicitly (e.g. Bad Gateway, or Service Unavailable), I think any
> HTML page when the user asked for XML, text or asn.1 should be
> treated as an error.  Similarly, not getting XML when you ask for it etc.
> 
> Note that by raising the exception including the message text it
> should be much easier to diagnose these failures.  As a tiny
> refinement to the above code, we should only add the "..." if there is
> more text to follow - this isn't always the case.
> 
> e.g. The following give an HTML error page (while some databases like
> "protein" are better behaved in this respect):
> >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="text").read()
> >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="asn.1").read()
> 
> Similarly, these give an XML-like fragment (which is not a valid XML
> file in itself - arguably an NCBI bug; some databases like "protein"
> are better behaved in this respect):
> >>> print Entrez.efetch(db="pubmed", id="nonexistant", retmode="xml").read()
> >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="xml").read()
> >>> print Entrez.efetch(db="cdd", id="nonexistant", retmode="xml").read()
> >>> print Entrez.efetch(db="taxonomy", id="nonexistant", retmode="xml").read()
> 
> My suggested change to Bio.Entrez would also catch the following
> examples (using an invalid database) where the NCBI ignore the retmode
> and return an HTML help page:
> >>> print Entrez.efetch(db="nonexistant", id="123456", retmode="xml").read()
> >>> print Entrez.efetch(db="nonexistant", id="123456", retmode="text").read()
> 
> In a less clear-cut example, this would flag the following as an error
> as the NCBI seem to return ASN.1 text instead of HTML here:
> >>> print Entrez.efetch(db="nucleotide", retmode="html", id="123456").read()
> 
> Overall, I think this change should catch lots of errors which
> otherwise may not be detected until later (e.g. while trying to parse
> the file).
> 
> --------------------------------------------------------------------------------------------------
> 
> On another point, should we catch these responses as errors?
> 
> >>> efetch(db="snp", id="123456").read()
> '<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
> >>> efetch(db="snp", id="123456", retmode="html").read()
> '<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
> >>> efetch(db="snp", id="123456", retmode="xml").read()
> '<?xml version="1.0"?>\n<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\nxmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"\nxsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum\nhttp://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">1: id: 123456 Error occurred: cannot get document summary\n\n</ExchangeSet>'
> >>> efetch(db="snp", id="123456", retmode="text").read()
> '1: id: 123456 Error occurred: cannot get document summary\n'
> 
> and,
> >>> print efetch(db="homologene", retmode="html", id="fake").read()
> <html>
> <body>
> <br/><h2>Error occurred: Empty id list - nothing todo</h2>...
> 
> Looking for the string "Error occurred: " looks fairly safe here, and
> should cover a range of entries.  Of course, you can imagine false
> positives too, e.g. a valid PUBMED plain text record for a tutorial
> article with a title like "Yikes! An Error Occurred: A beginner's
> Guide To Defensive Programming." could match.
> 
> Peter