[Biopython-dev] Fw: Re: Bio.Entrez catching more errors

Sun Mar 15 08:53:28 EDT 2009

--- On Sun, 3/15/09, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Whereas I think it's a good idea if Bio.Entrez catches
> more errors, I think the parser is a more suitable place to
> check for errors. See Bio.ExPASy.ScanProsite for an example
> of catching errors with an XML parser; this avoids using a
> File.UndoHandle.
> 
> --Michiel
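
To illustrate the parser-side approach suggested above, here is a minimal sketch using the standard library's xml.parsers.expat. It is not the Bio.ExPASy.ScanProsite code. The function name parse_or_raise and the choice of ValueError are assumptions made purely for illustration.

```python
from xml.parsers import expat


def parse_or_raise(data):
    """Parse `data` as XML and return the element names seen, in order.

    Hypothetical helper: raises ValueError if the text is not well-formed
    XML, so a plain-text or broken response is reported by the parser
    itself rather than by peeking at the start of the handle.
    """
    parser = expat.ParserCreate()
    elements = []
    # Record each start tag; a real parser would build records here instead.
    parser.StartElementHandler = lambda name, attrs: elements.append(name)
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        raise ValueError("Response is not well-formed XML: %s" % err) from err
    return elements
```

With this arrangement the error surfaces at parse time, which is why no File.UndoHandle is needed to look ahead in the data.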
> 
> --- On Tue, 3/10/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> 
> > From: Peter <biopython at maubp.freeserve.co.uk>
> > Subject: [Biopython-dev] Bio.Entrez catching more errors
> > To: "BioPython-Dev Mailing List" <biopython-dev at lists.open-bio.org>
> > Date: Tuesday, March 10, 2009, 7:40 PM
> > Hi All,
> > 
> > It occurred to me that the Bio.Entrez._open function can look at the
> > retmode argument (if present) and spot if there is a mismatch between
> > the requested format (e.g. XML, HTML, text or asn.1) and the actual
> > data the NCBI returned.  Something along the following lines could be
> > added to the end of the _open function in Bio/Entrez/__init__.py to
> > achieve this:
> > 
> >     elif "retmode" in params and params["retmode"].lower()=="html" \
> >     and not data.lower().startswith("<html") \
> >     and not data.lower().startswith("<!doctype html") :
> >         raise TypeError("Requested HTML, but didn't get it: %s..." % data)
> >     elif "retmode" in params and params["retmode"].lower()=="xml" \
> >     and not data.lower().startswith("<?xml") :
> >         raise TypeError("Requested XML, but didn't get it: %s..." % data)
> >     elif "retmode" in params and params["retmode"] \
> >     and params["retmode"].lower()!="xml" \
> >     and data.lower().startswith("<?xml") :
> >         raise TypeError("Didn't request XML, but got it: %s..." % data)
> >     elif "retmode" in params and params["retmode"] \
> >     and params["retmode"].lower()!="html" \
> >     and (data.lower().startswith("<html") or \
> >          data.lower().startswith("<!doctype html")):
> >         #Expected for some error pages (e.g. the Bad Gateway caught above)
> >         raise TypeError("Didn't request HTML, but got it: %s..." % data)
> > 
> > I'm sure my XML/HTML detection could be made more robust here - I hope
> > the principle is clear.  My motivation is that I have noticed the NCBI
> > can return HTML error pages, and while we do catch some of these
> > explicitly (e.g. Bad Gateway, or Service Unavailable), I think any
> > HTML page when the user asked for XML, text or asn.1 should be
> > treated as an error.  The same goes for not getting XML when you ask
> > for it, etc.
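
As one way the checks above might be pulled into a standalone function, here is a minimal sketch. The names looks_like_xml, looks_like_html and check_retmode are hypothetical and not part of Bio.Entrez. It assumes data is a text string holding the start of the response and params is the dictionary of URL parameters. The only change in behaviour from the fragment above is that leading whitespace is skipped before sniffing, and the "..." suffix is left out.

```python
def looks_like_xml(data):
    """True if the text starts with an XML declaration (leading whitespace ignored)."""
    return data.lstrip().lower().startswith("<?xml")


def looks_like_html(data):
    """True if the text starts with an <html> tag or an HTML doctype."""
    start = data.lstrip().lower()
    return start.startswith("<html") or start.startswith("<!doctype html")


def check_retmode(params, data):
    """Raise TypeError if `data` does not match the requested retmode.

    Hypothetical helper; the four checks are in the same order as in the
    proposed elif chain.
    """
    retmode = (params.get("retmode") or "").lower()
    if not retmode:
        return  # no retmode given, so there is nothing to compare against
    if retmode == "html" and not looks_like_html(data):
        raise TypeError("Requested HTML, but didn't get it: %s" % data)
    if retmode == "xml" and not looks_like_xml(data):
        raise TypeError("Requested XML, but didn't get it: %s" % data)
    if retmode != "xml" and looks_like_xml(data):
        raise TypeError("Didn't request XML, but got it: %s" % data)
    if retmode != "html" and looks_like_html(data):
        raise TypeError("Didn't request HTML, but got it: %s" % data)
```

For example, check_retmode({"retmode": "text"}, "<html><body>...") raises TypeError, while a call with no retmode in params returns without complaint.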
> > 
> > Note that by raising the exception including the message text it
> > should be much easier to diagnose these failures.  As a tiny
> > refinement to the above code, we should only add the "..." if there is
> > more text to follow - this isn't always the case.
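
The "..." refinement could be sketched roughly as follows. The helper name format_preview and the default limit of 80 characters are assumptions, not anything in Bio.Entrez. It decides by length alone, so it cannot know whether more data is still waiting on the handle.

```python
def format_preview(text, limit=80):
    """Return at most `limit` characters of `text` for an error message.

    Hypothetical helper: the "..." is appended only when the text was
    actually cut short, so short responses are shown exactly as received.
    """
    if len(text) > limit:
        return text[:limit] + "..."
    return text
```

An exception message could then be built as "Requested XML, but didn't get it: %s" % format_preview(data).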
> > 
> > e.g. The following give an HTML error page (while some databases like
> > "protein" are better behaved in this respect):
> > >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="text").read()
> > >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="asn.1").read()
> > 
> > Similarly, these give an XML-like fragment (which is not a valid XML
> > file in itself - arguably an NCBI bug; some databases like "protein"
> > are better behaved in this respect):
> > >>> print Entrez.efetch(db="pubmed", id="nonexistant", retmode="xml").read()
> > >>> print Entrez.efetch(db="homologene", id="nonexistant", retmode="xml").read()
> > >>> print Entrez.efetch(db="cdd", id="nonexistant", retmode="xml").read()
> > >>> print Entrez.efetch(db="taxonomy", id="nonexistant", retmode="xml").read()
> > 
> > My suggested change to Bio.Entrez would also catch the following
> > examples (using an invalid database) where the NCBI ignore the retmode
> > and return an HTML help page:
> > >>> print Entrez.efetch(db="nonexistant", id="123456", retmode="xml").read()
> > >>> print Entrez.efetch(db="nonexistant", id="123456", retmode="text").read()
> > 
> > In a less clear-cut example, this would flag the following as an error,
> > as the NCBI seem to return ASN.1 text instead of HTML here:
> > >>> print Entrez.efetch(db="nucleotide", retmode="html", id="123456").read()
> > 
> > Overall, I think this change should catch lots of errors which
> > otherwise may not be detected until later (e.g. while trying to parse
> > the file).
> > 
> > ----------------------------------------------------------------------
> > 
> > On another point, should we catch these responses as errors?
> > 
> > >>> efetch(db="snp", id="123456").read()
> > '<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
> > >>> efetch(db="snp", id="123456", retmode="html").read()
> > '<html><head><title>PmFetch response</title></head><body>\n<pre>\n1: id: 123456 Error occurred: cannot get document summary\n</pre></body></html>'
> > >>> efetch(db="snp", id="123456", retmode="xml").read()
> > '<?xml version="1.0"?>\n<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\nxmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"\nxsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum\nhttp://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">1: id: 123456 Error occurred: cannot get document summary\n\n</ExchangeSet>'
> > >>> efetch(db="snp", id="123456", retmode="text").read()
> > '1: id: 123456 Error occurred: cannot get document summary\n'
> > 
> > and,
> > >>> print efetch(db="homologene", retmode="html", id="fake").read()
> > <html>
> > <body>
> > <br/><h2>Error occurred: Empty id list - nothing todo</h2>...
> > 
> > Looking for the string "Error occurred: " looks fairly safe here, and
> > should cover a range of entries.  Of course, you can imagine false
> > positives too, e.g. a valid PUBMED plain text record for a tutorial
> > article with a title like "Yikes! An Error Occurred: A beginner's
> > Guide To Defensive Programming." could match.
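
The string search could be sketched as below. The names ERROR_MARKER and has_error_marker, and the ignore_case option, are hypothetical and only serve to show the trade-off. A case-sensitive search happens not to match the capitalised title in the example above, whereas a case-insensitive one would, so the choice affects how many false positives turn up.

```python
# The marker text as it appears in the NCBI responses quoted above.
ERROR_MARKER = "Error occurred: "


def has_error_marker(data, ignore_case=False):
    """Return True if `data` contains the NCBI-style error marker.

    Hypothetical helper: a plain substring test, so any record whose
    own text contains the marker will also be flagged.
    """
    if ignore_case:
        return ERROR_MARKER.lower() in data.lower()
    return ERROR_MARKER in data
```

For instance, has_error_marker('1: id: 123456 Error occurred: cannot get document summary\n') is true, while the tutorial title is only flagged when ignore_case=True.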
> > 
> > Peter