[Biopython-dev] [Bug 2448] New: Bio.EUtils can't handle accented author names
Andrew Dalke
dalke at dalkescientific.com
Sun Feb 10 16:50:13 EST 2008
On Feb 10, 2008, at 9:29 PM, bugzilla-daemon at portal.open-bio.org wrote:
> Summary: Bio.EUtils can't handle accented author names
...
> self.stack[-1].append(Text(text))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 4: ordinal not in range(128)
The EUtils code is old. It uses a DTD to XML parser that I found,
what, 6 years ago? The problem arises because the code uses
    class IndentedText(str):
        def __init__(self, data=""):
            self.data = unescape(unicode(data))
            self._level = 0
            self._parent = None
That derivation from str is suspicious. I don't think it's needed,
but I haven't reviewed the code well enough.
Getting rid of the 'str' *might* fix it. What's going on is that
str.__new__ is handed a unicode string containing non-ASCII characters
and tries to encode it as ASCII, which fails. So another solution might
be to change that base class to "unicode" and do the right decode calls.
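A minimal sketch of that second option, written for Python 3, where str is already the Unicode text type. It is not the EUtils code: the UTF-8 decoding is an assumption about the input, and the unescape step from the original class is left out.

```python
class IndentedText(str):
    """Text node with indentation bookkeeping (illustrative sketch only)."""

    def __new__(cls, data=""):
        # str is immutable, so its value must be set in __new__, not __init__.
        # Bytes are decoded explicitly (UTF-8 is assumed here) instead of
        # relying on an implicit ASCII conversion.
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        self = super().__new__(cls, data)
        self._level = 0
        self._parent = None
        return self


name = IndentedText(b"Mart\xc3\xadn")  # UTF-8 bytes for "Martín"
print(name, len(name))
```

The point of the sketch is that the text value is decided once, in __new__, with an explicit decode, so no later step has to guess an encoding.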
Note that the current parser doesn't handle &# notation (numeric
character references such as &#237;).
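For illustration, this is roughly what handling that notation looks like with the Python 3 standard library. html.unescape is a stand-in here, not something the EUtils parser uses, and it also expands HTML named entities, which is more than plain XML defines.

```python
import html

# Decimal and hexadecimal references for the same character, U+00ED.
decimal_form = html.unescape("Mart&#237;n")
hex_form = html.unescape("Mart&#xED;n")

print(decimal_form)
print(hex_form)
```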
Some years back I started work on an EUtils2. It used the
then-quite-new ElementTree library. Here's what I had:
http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
If anyone wants the code,
http://dalkescientific.com/EUtils-2.0a1.tar.gz
I don't plan on doing anything more with it until I have a pressing
need. Like someone wanting to pay me for it :)
This old mail might also be useful for someone working on non-ASCII
queries that are sent to NCBI.
> The following is the MEDLINE character table for the XML.
>
> http://www.nlm.nih.gov/databases/dtd/medline_character_database.utf8
>
> Diana Airozo
> NCBI Contractor
> dalke at dalkescientific.com wrote (Tue, Sep 7 2004 15:20:14):
>
>
>> Hi Diana,
>>
>> Thank you for your reply. For a clarification on the
>> non-ASCII query question
>>
>>
>>>> Also, how do I do non-ASCII queries? For example, suppose I want
>>>> to search for papers from "Göteborg Universitet" or "La Universidad
>>>> de España".
>>>>
>>
>>
>>
>>> You would search using Goteborg.
>>>
>>
>> I want to automate this so that a user query for Göteborg
>> gets converted into "Goteborg." I would prefer to use the
>> same algorithm for doing this that your indexer uses. I
>> looked online for a Unicode -> ASCII conversion table that
>> strips the accents and other diacriticals and expands
>> characters like ß into ss and æ into ae. I found
>> several, but I would prefer to use the same table your
>> indexer has so that queries are more likely to work.
>>
>> (Well, actually I would like your search code to perform
>> the same input normalization that your indexer does, but
>> I'll use this as a workaround.)
>>
>> Is the conversion table you use available?
>>
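A rough sketch of the kind of workaround described in the quoted mail, in Python 3. The EXPANSIONS table and the strip_accents name are hypothetical and are NOT NCBI's conversion table, so queries normalized this way may not match what their indexer does.

```python
import unicodedata

# Hypothetical expansions for letters that have no Unicode decomposition.
# This is an illustrative table, not the one NCBI's indexer uses.
EXPANSIONS = {"ß": "ss", "æ": "ae", "Æ": "AE"}


def strip_accents(text):
    """Expand a few special letters, then drop combining diacritical marks."""
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Göteborg Universitet"))
print(strip_accents("La Universidad de España"))
```

Characters that neither decompose nor appear in the table pass through unchanged, so the result is not guaranteed to be pure ASCII.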
Andrew
dalke at dalkescientific.com