[Biopython-dev] Bio.Entrez catching more errors

Tue Mar 10 19:40:29 EDT 2009

Hi All,

It occurred to me that the Bio.Entrez._open function can look at the
retmode argument (if present) and spot if there is a mismatch between
the requested format (e.g. XML, HTML, text or asn.1) and the actual
data the NCBI returned.  Something along the following lines could be
added to the end of the _open function in Bio/Entrez/__init__.py to
achieve this:

    elif "retmode" in params and params["retmode"].lower()=="html" \
    and not data.lower().startswith("<html") \
    and not data.lower().startswith("<!doctype html") :
        raise TypeError("Requested HTML, but didn't get it: %s..." % data)
    elif "retmode" in params and params["retmode"].lower()=="xml" \
    and not data.lower().startswith("<?xml") :
        raise TypeError("Requested XML, but didn't get it: %s..." % data)
    elif "retmode" in params and params["retmode"] \
    and params["retmode"].lower()!="xml" \
    and data.lower().startswith("<?xml") :
        raise TypeError("Didn't request XML, but got it: %s..." % data)
    elif "retmode" in params and params["retmode"] \
    and params["retmode"].lower()!="html" \
    and (data.lower().startswith("<html") or \
         data.lower().startswith("<!doctype html")):
        #Expected for some error pages (e.g. the Bad Gateway caught above)
        raise TypeError("Didn't request HTML, but got it: %s..." % data)

I'm sure my XML/HTML detection could be made more robust here, but I
hope the principle is clear.  My motivation is that I have noticed the
NCBI can return HTML error pages, and while we do catch some of these
explicitly (e.g. Bad Gateway, or Service Unavailable), I think any
HTML page returned when the user asked for XML, text or asn.1 should
be treated as an error.  Similarly, not getting XML when you asked for
it should be treated as an error.
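
To illustrate the principle outside of _open, here is a minimal
standalone sketch of the same four checks.  The names looks_like_html,
looks_like_xml and check_retmode are made up for this example and are
not part of Bio.Entrez.  It assumes data is the start of the response
as a text string, and (unlike the snippet above) it strips leading
whitespace before testing, which is my own assumption about what
"more robust" might mean.

```python
def looks_like_html(data):
    """Return True if the text appears to start an HTML document."""
    start = data.lstrip().lower()
    return start.startswith("<html") or start.startswith("<!doctype html")


def looks_like_xml(data):
    """Return True if the text appears to start with an XML declaration."""
    return data.lstrip().lower().startswith("<?xml")


def check_retmode(params, data):
    """Raise TypeError if data does not match the requested retmode.

    Hypothetical helper: params is the dict of E-utility arguments and
    data is the beginning of the text returned by the NCBI.
    """
    retmode = (params.get("retmode") or "").lower()
    if not retmode:
        return  # no retmode given, so there is nothing to compare against
    if retmode == "html" and not looks_like_html(data):
        raise TypeError("Requested HTML, but didn't get it: %s" % data)
    if retmode == "xml" and not looks_like_xml(data):
        raise TypeError("Requested XML, but didn't get it: %s" % data)
    if retmode != "xml" and looks_like_xml(data):
        raise TypeError("Didn't request XML, but got it: %s" % data)
    if retmode != "html" and looks_like_html(data):
        raise TypeError("Didn't request HTML, but got it: %s" % data)
```

With this sketch, check_retmode({"retmode": "text"}, "<html><body>...")
raises TypeError, while a call with no retmode in params returns
without checking anything.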

Note that by raising the exception including the message text it
should be much easier to diagnose these failures.  As a tiny
refinement to the above code, we should only add the "..." if there is
more text to follow - this isn't always the case.
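
One possible way to do that refinement is a small helper which
truncates the data and appends "..." only when something was actually
cut off.  This is just a sketch: the name preview and the default
limit of 100 characters are arbitrary choices for illustration, not
anything currently in Bio.Entrez.

```python
def preview(data, limit=100):
    """Return at most `limit` characters of data for an error message.

    Hypothetical helper: "..." is appended only if text was cut off.
    """
    if len(data) > limit:
        return data[:limit] + "..."
    return data
```

The exceptions could then be raised along the lines of
TypeError("Requested XML, but didn't get it: %s" % preview(data)).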

For example, the following give an HTML error page (while some databases like
"protein" are better behaved in this respect):
>>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="text").read()
>>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="asn.1").read()

Similarly, these give an XML-like fragment (which is not a valid XML
file in itself - arguably an NCBI bug; some databases like "protein"
are better behaved in this respect):
>>> print Entrez.efetch(db="pubmed", id="nonexistant", retmode="xml").read()
>>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="xml").read()
>>> print Entrez.efetch(db="cdd", id="nonexistant", retmode="xml").read()
>>> print Entrez.efetch(db="taxonomy", id="nonexistant", retmode="xml").read()

My suggested change to Bio.Entrez would also catch the following
examples (using an invalid database) where the NCBI ignore the retmode
and return an HTML help page:
>>> print Entrez.efetch(db="nonexistant", id="123456", retmode="xml").read()
>>> print Entrez.efetch(db="nonexistant", id="123456", retmode="text").read()

In a less clear-cut example, this would flag the following as an error
as the NCBI seem to return ASN.1 text instead of HTML here:
>>> print Entrez.efetch(db="nucleotide", retmode="html", id="123456").read()

Overall, I think this change should catch lots of errors which
otherwise may not be detected until later (e.g. while trying to parse
the file).

--------------------------------------------------------------------------------------------------

On another point, should we catch these responses as errors?

>>> efetch(db="snp", id="123456").read()
'<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
>>> efetch(db="snp", id="123456", retmode="html").read()
'<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
>>> efetch(db="snp", id="123456", retmode="xml").read()
'<?xml version="1.0"?>\n<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\nxmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"\nxsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum\nhttp://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">1: id: 123456 Error occurred: cannot get document summary\n\n</ExchangeSet>'
>>> efetch(db="snp", id="123456", retmode="text").read()
'1: id: 123456 Error occurred: cannot get document summary\n'

and,
>>> print efetch(db="homologene", retmode="html", id="fake").read()
<html>
<body>
<br/><h2>Error occurred: Empty id list - nothing todo</h2>...

Looking for the string "Error occurred: " looks fairly safe here, and
should cover a range of entries.  Of course, you can imagine false
positives too, e.g. a valid PUBMED plain text record for a tutorial
article with a title like "Yikes! An Error Occurred: A beginner's
Guide To Defensive Programming." could match.
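
As a rough sketch of what such a check might look like, the helper
below just searches the returned text for the marker.  The name
has_error_marker is hypothetical, and an exact, case-sensitive match
is only one possible choice.

```python
ERROR_MARKER = "Error occurred: "


def has_error_marker(data):
    """Return True if the NCBI-style error marker appears in the text.

    Hypothetical helper using an exact, case-sensitive substring test.
    """
    return ERROR_MARKER in data
```

This matches all of the snp and homologene responses shown above.  A
case-sensitive test happens to skip the title-cased example title
("Error Occurred: "), but the same title written as "Error occurred: "
would still be a false positive.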

Peter