[Biopython-dev] Python 3 and encoding for online resources

Michiel de Hoon mjldehoon at yahoo.com
Mon Aug 2 13:50:47 UTC 2010


> Or if you just want to grab some code for a quick play,
> I have a branch where I've been doing this on a
> semi-regular basis:
> 
> http://github.com/peterjc/biopython/tree/auto2to3

Thanks! I used this branch to test the Bio.Entrez and Bio.SwissProt parsers. The Bio.Entrez parser works as is; the Bio.SwissProt parser is easy to fix: just convert each line into a plain string inside the _read function in Bio.SwissProt.__init__. Perhaps we can do something similar for the other test_SeqIO_online.py failures (the ones appearing in Bio/SeqIO/FastaIO.py)?
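
A minimal sketch of that kind of per-line conversion, for illustration only. The helper name as_text_lines and the ASCII default are assumptions; this is not the actual code in Bio.SwissProt, and the sample records are made up.

```python
import io


def as_text_lines(handle, encoding="ascii"):
    """Yield each line of handle as a plain string.

    Hypothetical helper: lines read from a binary handle (bytes on
    Python 3) are decoded, lines that are already text pass through.
    """
    for line in handle:
        if isinstance(line, bytes):
            line = line.decode(encoding)
        yield line


# A binary handle, such as the one returned by urlopen on Python 3.
handle = io.BytesIO(b"ID   TEST_HUMAN\nAC   P12345;\n")
lines = list(as_text_lines(handle))
```

Doing the conversion at the point where lines are read means the rest of the parser keeps working on plain strings and does not need to know where the handle came from.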

> > So I'd suggest to not use File.UndoHandle (at all),
> ...
> I disagree. The NCBI return multiple different file
> formats, so there are multiple different parsers that may get
> an error page.
>
> Given the NCBI return HTML error pages regardless of what
> format the request was (XML, plain text, etc), I think we
> have to look for errors before giving the data to the
> parser.

Part of the problem solves itself when we change to Python 3. In Python 3, urllib.request.urlopen raises a urllib.error.HTTPError in cases where urllib.urlopen in Python 2 raises no exception:


mdehoon:~/Software/biopython2to3/peterjc-biopython-06c2ea6 $ python
Python 2.7 (r27:82500, Jul 19 2010, 00:08:00) 
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.urlopen("http://www.biopython.org/somethingimadeup")
<addinfourl at 19048968 whose fp = <socket._fileobject object at 0x7bf8f0>>
>>> 


mdehoon:~/Software/biopython2to3/peterjc-biopython-06c2ea6 $ python3
Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> urllib.request.urlopen("http://www.biopython.org/somethingimadeup")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 121, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 355, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 467, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 393, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 327, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/urllib/request.py", line 475, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
>>> 


which means that we can catch at least some errors without having to actually read from the handle. A 414 Request-URI Too Large is also caught. In this sense, urllib in Python 3 behaves like urllib2 in Python 2. I don't know, though, how to check whether all the HTTP errors we check for in Bio.Entrez are caught this way (does anybody know a magical way to trigger a particular HTTP error?). Nevertheless, this avoids having to go through a File.UndoHandle, and is safer than checking the HTML / text response from NCBI (at least the "download dataset is empty" response from NCBI has already changed).
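
One possible way to trigger a particular HTTP error without involving NCBI is a throwaway local server that answers with whatever status code is requested. The sketch below is Python 3 only and is an assumption about how such a check could look; StatusHandler and status_seen_by_urlopen are hypothetical names, not part of Biopython.

```python
import http.server
import threading
import urllib.error
import urllib.request


class StatusHandler(http.server.BaseHTTPRequestHandler):
    """Answer GET /<code> with that HTTP status code (test helper only)."""

    def do_GET(self):
        code = int(self.path.strip("/"))
        body = b"<html><body>test page</body></html>"
        self.send_response(code)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default request logging on stderr


def status_seen_by_urlopen(code):
    """Report how urllib reacts when a local server replies with code."""
    # Port 0 lets the operating system pick any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), StatusHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    # An empty ProxyHandler keeps the request away from any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/%d" % (server.server_address[1], code)
        try:
            with opener.open(url, timeout=10) as response:
                return ("ok", response.getcode())
        except urllib.error.HTTPError as err:
            err.close()
            return ("HTTPError", err.code)
    finally:
        server.shutdown()
        server.server_close()


for code in (200, 404, 414):
    print(code, status_seen_by_urlopen(code))
```

This only shows how urllib reacts to a given status code; it says nothing about which codes NCBI actually returns for which kind of failure.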

So I would suggest switching from urllib to urllib2 in Bio.Entrez and catching any HTTP errors (urllib2 is translated appropriately by 2to3), and handling any bytes/utf8/ascii conversion inside the parser (as in Bio.SwissProt).
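
A rough sketch of what catching the HTTP errors at open time might look like. The function open_or_fail and its opener parameter are hypothetical, not existing Bio.Entrez code, and the try/except import is used here only so the sketch runs unconverted on either Python version; with 2to3 the plain urllib2 import would be enough.

```python
try:
    # Python 2 names; 2to3 would rewrite this import automatically.
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Python 3 equivalents.
    from urllib.request import urlopen
    from urllib.error import HTTPError


def open_or_fail(url, opener=urlopen):
    """Return an open handle for url, or raise IOError on an HTTP error.

    Hypothetical helper. The opener argument exists only so that the
    function can be exercised without network access.
    """
    try:
        return opener(url)
    except HTTPError as err:
        # The error is detected here, before any parser sees the data.
        raise IOError("Request to %s failed: HTTP %d %s"
                      % (url, err.code, err.msg))
```

The handle returned on success is passed to the parser untouched, so any bytes to string conversion stays inside the parser.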

--Michiel.




--- On Sun, 8/1/10, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython-dev] Python 3 and encoding for online resources
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Cc: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Sunday, August 1, 2010, 1:54 PM
> On Sun, Aug 1, 2010 at 4:14 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > According to this post:
> >
> > http://stackoverflow.com/questions/1179305/expat-parsing-in-python-3
> >
> > we need only one parser which always parses a byte stream.
> > Bio.Entrez uses File.UndoHandle but just to look for potential
> > errors in the first few lines when opening the Entrez url, which
> > in my opinion we shouldn't be doing anyway since it's the
> > parser's job to decide whether the input is well-formed.
> > So I'd suggest to not use File.UndoHandle (at all),
> ...
>
> I disagree. The NCBI return multiple different file formats, so
> there are multiple different parsers that may get an error page.
> Given the NCBI return HTML error pages regardless of what
> format the request was (XML, plain text, etc), I think we
> have to look for errors before giving the data to the parser.
> But that can be done using byte strings just as easily as with
> unicode strings.
>
> > make sure our parser works with Python 3 byte streams, and
> > ask users to open any downloaded Entrez XML files in binary
> > mode.
>
> That sounds workable.
>
> > Is there a Biopython version (in trunk or otherwise) that is ready
> > for Python 3? If so, I can have a look at the parser to see if it
> > handles byte streams correctly.
>
> The trunk itself -- after running 2to3 on it (as described in the
> README file). Or if you just want to grab some code for a quick
> play, I have a branch where I've been doing this on a semi-regular
> basis:
>
> http://github.com/peterjc/biopython/tree/auto2to3
>
> Note that we are keeping the trunk as Python 2 code, which
> can make life interesting (Another option would be a Python
> 3 branch, but we'd then need to manually keep things in sync).
> To make life a little easier, we are probably going to need some
> python 3 compatibility functions (like bytes as unicode, unicode
> as bytes - see the NumPy project for other possible examples),
> which we are currently doing on a module by module basis.
> Here I'm thinking specifically of some of the things required in
> Bio/SeqIO/SffIO.py, but there are other python 3 hacks we may
> want to standardise.
>
> For the C code (which we haven't looked at yet, setup.py is
> ignoring the extensions on Python 3 for now) we should be
> able to use the normal #ifdef approach. Again, we can learn
> a lot from looking at NumPy here.
>
> Peter
