[Biopython-dev] Python 3 and encoding for online resources
biopython at maubp.freeserve.co.uk
Tue Jul 27 09:23:27 EDT 2010
One of the remaining (pure Python) problems with Biopython
under Python 3 relates to parsing online resources like the
NCBI Entrez API or even Bio.ExPASy.get_sprot_raw().
See for example test_SeqIO_online.py for a failure.
In Python 2, urlopen from urllib or urllib2 would give a
string handle. In Python 3, you get a bytes handle (not
a unicode handle), and choosing the encoding is tricky.
In the case of resources like the NCBI and ExPASy we
should be able to assume an encoding (maybe UTF-8
or Latin-1) for all the plain text output, while for XML/HTML
there are ways for the data to specify this itself.
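
As a rough illustration of assuming an encoding for plain text output,
here is a minimal sketch that tries UTF-8 first and falls back to Latin-1.
The function name decode_plain_text and the order of candidates are my own
assumptions, not anything Biopython provides:

```python
def decode_plain_text(data, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each candidate encoding in turn.

    Hypothetical helper for illustration only.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 bytes are decoded as UTF-8.
from_utf8 = decode_plain_text(b"caf\xc3\xa9")
# A lone 0xE9 byte is not valid UTF-8, so this falls back to Latin-1.
from_latin1 = decode_plain_text(b"caf\xe9")
```

Note that Latin-1 maps every byte to a character, so it never fails. That
makes it a safe last resort, but it also means a wrong guess shows up as
garbled text rather than as an error.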
I think we may need to transform the urllib bytes handle into
a unicode string handle for parsing. One option would be to
extend the Bio.File.UndoHandle class (or invent a subclass)
which applies the decoding. This seems simple since
Bio.Entrez and Bio.ExPASy already use this class.
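
To make the first option concrete, here is a minimal sketch of turning a
bytes handle into a unicode handle before parsing. It uses the standard
library's io.TextIOWrapper rather than Bio.File.UndoHandle itself, the
helper name as_text_handle is hypothetical, and an io.BytesIO object
stands in for the handle that urlopen would return:

```python
import io


def as_text_handle(bytes_handle, encoding="utf-8"):
    """Wrap a binary handle so that reads return decoded text.

    Hypothetical helper; the default encoding is an assumption.
    """
    return io.TextIOWrapper(bytes_handle, encoding=encoding)


# Stand-in for a network response (made-up record, not real data).
fake_response = io.BytesIO(b"ID   TEST_ENTRY\nDE   caf\xc3\xa9\n")
text_handle = as_text_handle(fake_response)

first_line = text_handle.readline()   # a str, not bytes
second_line = text_handle.readline()  # multi-byte character decoded
```

The same decoding step could presumably live inside an UndoHandle
subclass instead, so that callers of Bio.Entrez and Bio.ExPASy would not
need to change.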
Another option (which I suggested on the Bio.SeqIO.index()
thread) would be to extend our parsers to cope with both
bytes and unicode handles. That could be a lot of work though...
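
For comparison, here is a toy sketch of what coping with both kinds of
handle might look like inside a parser. This is not one of our real
parsers: iter_fasta_titles is a made-up function, and the UTF-8 default
is an assumption:

```python
import io


def iter_fasta_titles(handle, encoding="utf-8"):
    """Yield FASTA title lines from either a bytes or a unicode handle.

    Toy example for illustration only.
    """
    for line in handle:
        if isinstance(line, bytes):
            # Bytes handle: decode each line before looking at it.
            line = line.decode(encoding)
        if line.startswith(">"):
            yield line[1:].rstrip()


titles_from_bytes = list(iter_fasta_titles(io.BytesIO(b">seq1 demo\nACGT\n")))
titles_from_text = list(iter_fasta_titles(io.StringIO(">seq1 demo\nACGT\n")))
```

Every parser would need a check like this wherever it reads from the
handle, which is why this option looks like more work than decoding once
at the handle level.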