[Biopython-dev] [Bug 2448] New: Bio.EUtils can't handle accented author names

Sun Feb 10 16:50:13 EST 2008

On Feb 10, 2008, at 9:29 PM, bugzilla-daemon at portal.open-bio.org wrote:
>            Summary: Bio.EUtils can't handle accented author names
   ...
>     self.stack[-1].append(Text(text))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 4: ordinal not in range(128)

The EUtils code is old.  It uses a DTD to XML parser that I found,
what, 6 years ago?  The problem arises because the code uses

class IndentedText(str):
    def __init__(self, data=""):
        self.data = unescape(unicode(data))
        self._level = 0
        self._parent = None

That derivation from str is suspicious.  I don't think it's needed,  
but I haven't reviewed the code well enough.

Getting rid of the 'str' *might* fix it.  What's going on is that
str.__new__ is handed a unicode string containing non-ASCII characters
and tries to encode it as ASCII, which fails.  So another solution might
be to change that base class to "unicode" and do the right decode calls.
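
To make the mechanism concrete, here is a minimal sketch, not the EUtils code itself. It is written for Python 3, where str is already Unicode, so the explicit .encode("ascii") stands in for the implicit conversion Python 2 performs when str.__new__ is handed a unicode object. The author name, the helper ascii_encode_error and the class PlainIndentedText are hypothetical, and the original unescape() call is left out because I haven't checked what it does.

```python
# Sketch only: Python 3 stand-in for the Python 2 failure described above.

def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


class PlainIndentedText(object):
    """Hypothetical str-free variant: the text lives only in .data."""

    def __init__(self, data=""):
        self.data = data      # kept as Unicode text, never encoded
        self._level = 0
        self._parent = None


name = "Mart\xednez"              # hypothetical author name; \xed is i-acute
err = ascii_encode_error(name)    # same codec and position as the bug report
node = PlainIndentedText(name)    # no str base class, so nothing gets encoded
```

Whether dropping the base class is safe depends on whether anything else relies on IndentedText behaving like a string, which is the part I haven't reviewed.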

Note that the current parser doesn't handle numeric character
references (the &#...; notation).
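
For anyone who wants to patch that in, a rough sketch of expanding decimal and hexadecimal references is below. It is Python 3, the function name expand_char_refs is made up for illustration, and it deliberately leaves named entities such as &amp; alone.

```python
import re

# Matches &#237; (decimal) and &#xED; (hexadecimal) character references.
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def expand_char_refs(text):
    """Replace numeric character references with the characters they name."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched.
            return match.group(0)
    return _CHAR_REF.sub(_replace, text)
```

For example, expand_char_refs("Mart&#237;nez") and expand_char_refs("Mart&#xED;nez") should both give the name with the accented i.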

Some years back I started work on an EUtils2.  It used the
then-quite-new ElementTree library.  Here's what I had:
   http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
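
As an illustration of why ElementTree sidesteps the encoding problem (this is not code from EUtils2), here is a small sketch using the ElementTree that now ships in the Python 3 standard library. The XML record and its element names are invented for the example and are not taken from the NCBI DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, loosely shaped like an author list; the element
# names are illustrative assumptions only.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<AuthorList>"
    "<Author><LastName>Mart\u00ednez</LastName><Initials>J</Initials></Author>"
    "<Author><LastName>Smith</LastName><Initials>A</Initials></Author>"
    "</AuthorList>"
).encode("utf-8")


def author_names(xml_bytes):
    """Return 'LastName Initials' strings as Unicode text."""
    root = ET.fromstring(xml_bytes)   # decodes using the declared encoding
    return [
        "%s %s" % (author.findtext("LastName"), author.findtext("Initials"))
        for author in root.findall("Author")
    ]


names = author_names(SAMPLE)
```

The parser hands back Unicode text, so the accented name survives without any explicit encode or decode step in the calling code.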

If anyone wants the code,
   http://dalkescientific.com/EUtils-2.0a1.tar.gz

I don't plan on doing anything more with it until I have a pressing  
need.  Like someone wanting to pay me for it :)

This old mail might also be useful for someone working on non-ASCII  
queries that are sent to NCBI.

> The following is the MEDLINE character table for the XML.
>
> http://www.nlm.nih.gov/databases/dtd/medline_character_database.utf8
>
> Diana Airozo
> NCBI Contractor
> dalke at dalkescientific.com wrote (Tue, Sep 7 2004 15:20:14):
>
>
>> Hi Diana,
>>
>>    Thank you for your reply.  Here is a clarification of the
>> non-ASCII query question:
>>
>>
>>>> Also, how do I do non-ASCII queries?  For example, suppose I want
>>>> to search for papers from "Göteborg Universitet" or "La Universidad
>>>> de España".
>>>>
>>
>>
>>
>>> You would search using Goteborg.
>>>
>>
>> I want to automate this so that a user query for Göteborg
>> gets converted into "Goteborg".  I would prefer to use the
>> same algorithm for doing this that your indexer uses.  I
>> looked online for a Unicode -> ASCII conversion table that
>> strips the accents and other diacriticals and expands
>> characters like ß into ss and æ into ae.  I found
>> several, but I would prefer to use the same table your
>> indexer has so that queries are more likely to work.
>>
>> (Well, actually I would like your search code to perform
>> the same input normalization that your indexer does, but
>> I'll use this as a workaround.)
>>
>> Is the conversion table you use available?
>>

				Andrew
				dalke at dalkescientific.com