[Biopython-dev] Python 3 and encoding for online resources

Wed Aug 4 05:19:45 EDT 2010

Can you give an example script where you get an HTML error page? In the cases I've tried, the metadata revealed that an error had occurred, even if urllib2.urlopen didn't raise an HTTP error but returned a handle to XML containing the error message.
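
For reference, a rough sketch of inspecting the metadata on the handle follows. It assumes Python 3 and a handle as returned by urllib.request.urlopen, which provides getcode() and info(). The helper names are hypothetical, and whether the Content-Type header actually differs for NCBI's error responses is an assumption, not something established in this thread.

```python
def response_summary(handle):
    """Return (status_code, content_type) for a urlopen-style handle."""
    status = handle.getcode()
    # info() returns the response headers as a message-like object.
    content_type = handle.info().get("Content-Type", "")
    return status, content_type


def looks_suspicious(handle, expected_type="xml"):
    """Heuristic (assumption): flag a non-200 status or an unexpected type."""
    status, content_type = response_summary(handle)
    return status != 200 or expected_type not in content_type.lower()
```

A caller would run looks_suspicious(handle) before passing the handle on to a parser. The check says nothing about the body itself.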

--Michiel.

--- On Tue, 8/3/10, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] Python 3 and encoding for online resources
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Tuesday, August 3, 2010, 12:16 PM
> On Tue, Aug 3, 2010 at 4:44 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > Have you tried looking at handle.info(), where handle is the handle
> > returned by urllib.urlopen()? Another candidate is handle.getcode().
> 
> In the case of using the history support with a bad webenv, we get an
> HTML error page with HTTP status code 200 (OK), which explains why
> urllib doesn't raise an exception (see the example in my previous email).
> 
> In the case of using the history support with an invalid integer query
> key, we get an HTML error page with HTTP status code 200 (OK), e.g.
> 
> <html>
> <body>
> <br/><h2>Error occurred: Unable to obtain query #123456789</h2>
> ...
> </body>
> </html>
> 
> In the case of using the history support with a non-integer query key,
> we also get an HTML error page with HTTP status code 200 (OK), e.g.
> 
> <html>
> <body>
> <br/><h2>Error occurred: NCBI C++ Exception:
>     Error: CORELIB(CStringException::eConvert) "/pubmed_gen/rbuild/version/20100419.1/entrez/c++/src/corelib/ncbistr.cpp", line 666: ncbi::NStr::StringToInt8() --- Cannot convert string 'wrong' to Int8 (m_Pos = 0)
> </h2>
> ...
> </body>
> </html>
> 
> It puzzles me that they are still using HTTP status code 200 (OK) here.
> 
> > Otherwise, we could try to contact NCBI to see if their error messages
> > can be returned in a standard format, or at least in a format consistent
> > with the request.
> 
> This is definitely worth trying. We should also ask them about making
> more use of HTTP error codes like 400 when serving an error page.
> 
> Would you like to email the NCBI Entrez team about this (and CC me please)?
> 
> > Otherwise, we can also consider not parsing the HTML error message;
> > the SeqIO/Entrez parsers will notice a format problem and raise an
> > exception anyway.
> 
> As things stand, with the NCBI returning 200 (OK) HTML error messages,
> I'm not comfortable with this. It will break the use case of a batch
> download script which writes the data directly to disk without parsing
> it (or gives it to another tool as input). I believe the earlier we can
> catch any NCBI error messages the better, even if it does require some
> messy peeping at the data via a buffered handle.
> 
> Thanks,
> 
> Peter
>
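
The "peeping at the data via a buffered handle" idea mentioned above could look roughly like the sketch below. It assumes Python 3 and a binary file-like handle, and that an error page starts with an <html tag as in the quoted examples. The function name wrap_and_check is hypothetical and not part of Biopython.

```python
import io


def wrap_and_check(handle, peek_size=64):
    """Wrap a binary handle and raise if the data starts like an HTML page.

    Returns a buffered handle positioned at the start of the data, so the
    caller can still read (or save to disk) everything that was sent.
    """
    buffered = io.BufferedReader(handle)
    # peek() returns buffered bytes without advancing the read position;
    # it may return more or fewer bytes than requested.
    head = buffered.peek(peek_size).lstrip().lower()
    if head.startswith(b"<html"):
        raise IOError("Server returned an HTML page instead of data")
    return buffered
```

Because peek() does not consume anything, a batch download script could write the returned handle straight to disk after the check passes. A legitimate HTML response would be rejected too, so this only suits requests where HTML is never the expected format.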