[Biopython-dev] Python 3 and encoding for online resources

Tue Aug 3 12:16:44 EDT 2010

On Tue, Aug 3, 2010 at 4:44 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Have you tried looking at handle.info(), where handle is the handle
> returned by urllib.urlopen()? Another candidate is handle.getcode().

In the case of using the history support with a bad webenv, we get an
HTML error page with HTTP status code 200 (OK), which explains
why urllib doesn't raise an exception (see the example in my previous email).

In the case of using the history support with an invalid integer query key,
we get an HTML error page with HTTP status code 200 (OK), e.g.

<html>
<body>
<br/><h2>Error occurred: Unable to obtain query #123456789</h2>
...
</body>
</html>

In the case of using the history support with a non-integer query key,
we also get an HTML error page with HTTP status code 200 (OK), e.g.

<html>
<body>
<br/><h2>Error occurred: NCBI C++ Exception:
    Error:        CORELIB(CStringException::eConvert)
"/pubmed_gen/rbuild/version/20100419.1/entrez/c++/src/corelib/ncbistr.cpp",
line 666: ncbi::NStr::StringToInt8() --- Cannot convert string 'wrong'
to Int8 (m_Pos = 0)
</h2>
...
</body>
</html>

It puzzles me that they are still using HTTP status code 200 (OK) here.
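
For reference, here is a rough sketch of the kind of check Michiel
suggests, written for Python 3 (where urlopen lives in urllib.request).
The function name summarize_response and the FakeHandle stand-in are
made up for illustration. Whether the NCBI error pages carry a
Content-Type that differs from the successful responses is an assumption
I have not verified here.

```python
from email.message import Message


def summarize_response(handle):
    """Return (status_code, content_type) for a urlopen-style handle.

    Hypothetical helper: it only relies on the getcode() and info()
    methods mentioned above.
    """
    code = handle.getcode()
    headers = handle.info()  # an email.message.Message style object
    # get_content_type() drops parameters such as "; charset=UTF-8"
    return code, headers.get_content_type()


class FakeHandle:
    """Stand-in for the object returned by urlopen (illustration only)."""

    def __init__(self, code, content_type):
        self._code = code
        self._headers = Message()
        self._headers["Content-Type"] = content_type

    def getcode(self):
        return self._code

    def info(self):
        return self._headers


# An error page served with status 200 looks "OK" by status code alone,
# so only the headers (if they differ at all) could give it away.
code, content_type = summarize_response(
    FakeHandle(200, "text/html; charset=UTF-8"))
```

With a real request the handle would come from
urllib.request.urlopen(url) instead of FakeHandle.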

> Otherwise, we could try to contact NCBI to see if their error messages
> can be returned in a standard format, or at least in a format consistent
> with the request.

This is definitely worth trying. We should also ask them about
making more use of HTTP error codes like 400 when serving an error page.

Would you like to email the NCBI Entrez team about this (and CC me
please)?

> Otherwise, we can also consider not to parse the HTML error message;
> the SeqIO/Entrez parsers will notice a format problem and raise an
> exception anyway.

As things stand with the NCBI returning 200 (OK) HTML error messages
I'm not comfortable with this. It will break the use case of a batch
download script which writes the data directly to disk without parsing it
(or gives it to another tool as input). I believe the earlier we can catch
any NCBI error messages the better, even if it does require some messy
peeping at the data via a buffered handle.
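
To make the idea concrete, here is a minimal sketch of such a peek,
assuming Python 3 and a binary handle. The name check_for_html_error is
hypothetical (it is not an existing Biopython function), and the
"<html" / "Error occurred:" tests are only based on the error pages
quoted above, so they may well miss other failure modes.

```python
import io
import re


def check_for_html_error(handle, peek_size=1024):
    """Peek at the start of a binary handle; raise IOError on an HTML page.

    Hypothetical helper (not an existing Biopython function). Returns a
    handle which still yields the complete, unconsumed data.
    """
    # Wrap plain binary handles so that peek() is available.
    if not hasattr(handle, "peek"):
        handle = io.BufferedReader(handle)
    # peek() returns buffered bytes without advancing the read position;
    # it may return fewer or more bytes than requested.
    start = handle.peek(peek_size)[:peek_size]
    if start.lstrip().lower().startswith(b"<html"):
        # Assumed message layout, based on the error pages quoted above.
        match = re.search(rb"Error occurred:\s*([^<]*)", start)
        if match:
            detail = match.group(1).decode("ascii", "replace").strip()
        else:
            detail = "unrecognised HTML page"
        raise IOError("NCBI returned an HTML error page: " + detail)
    return handle
```

A batch download script could call this on the handle before copying
the data to disk, and then read from the returned handle as usual.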

Thanks,

Peter