[Biopython-dev] [Bug 2448] Bio.EUtils can't handle accented author names

bugzilla-daemon at portal.open-bio.org
Mon Feb 25 16:00:19 EST 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2448





------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2008-02-25 16:00 EST -------
This was Andrew Dalke's reply on the Biopython-dev mailing list, 10 Feb 2008,
which I'm adding to Bugzilla for future reference:

On Feb 10, 2008, at 9:29 PM, bugzilla-daemon at portal.open-bio.org wrote:
>            Summary: Bio.EUtils can't handle accented author names
  ...
>     self.stack[-1].append(Text(text))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in
> position 4:
> ordinal not in range(128)

The EUtils code is old.  It uses a DTD to XML parser that I found,
what, 6 years ago?  This problem is because the code uses

class IndentedText(str):
        def __init__(self, data=""):
                self.data = unescape(unicode(data))
                self._level = 0
                self._parent = None

That derivation from str is suspicious.  I don't think it's needed,
but I haven't reviewed the code well enough.

Getting rid of the 'str' *might* fix it.  Otherwise what's going on
is that str.__new__ is handed a string with non-ASCII characters and
it doesn't know what to do.  So another solution might be to change
that base class to "unicode" and do the right decode calls.
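
A minimal sketch of that second option, for illustration only.  The
class name UnicodeIndentedText, the "encoding" parameter, and the UTF-8
default are assumptions, not part of Bio.EUtils, and the unescape()
step from the original is left out.  The try/except shim just lets the
same code run on Python 3, where str is already Unicode.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3, where str is already Unicode


class UnicodeIndentedText(text_type):
    """Hypothetical variant of IndentedText built on a Unicode base class."""

    def __new__(cls, data=u"", encoding="utf-8"):
        # Decode byte strings explicitly instead of relying on the
        # implicit ASCII codec (the UTF-8 default is an assumption).
        if isinstance(data, bytes):
            data = data.decode(encoding)
        return text_type.__new__(cls, data)

    def __init__(self, data=u"", encoding="utf-8"):
        # The string value itself is set in __new__; only the extra
        # bookkeeping attributes from the original class are set here.
        self._level = 0
        self._parent = None
```

With this, UnicodeIndentedText(u"Garc\xeda") should construct without
going through the ASCII codec at all.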

Note that the current parser doesn't handle &# notation.
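
For reference, handling the &# notation could look roughly like the
sketch below.  The function name expand_numeric_refs is hypothetical,
and this is not how the existing parser is structured; it only shows
decimal (&#237;) and hexadecimal (&#xED;) references being turned into
characters.

```python
import re

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3

# Matches decimal (&#237;) and hexadecimal (&#xED;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def expand_numeric_refs(text):
    """Replace numeric character references in text with the characters."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            code = int(hex_digits, 16)
        else:
            code = int(dec_digits)
        try:
            return unichr(code)
        except (ValueError, OverflowError):
            # Out-of-range code point: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)
```

Named entities such as &amp; are deliberately left alone here.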

Some years back I started work on an EUtils2.  It used the
then-quite-new ElementTree library.  Here's what I had:
  http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html

If anyone wants the code,
  http://dalkescientific.com/EUtils-2.0a1.tar.gz
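
As a rough illustration of why an ElementTree-based parser avoids this
class of bug (this is not code from EUtils2): the XML parser decodes
UTF-8 input and expands &# references itself, so element text comes
back as Unicode.  The sample document and the last_names helper below
are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one name as raw UTF-8 bytes, one as a &# reference.
SAMPLE_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<AuthorList>"
    b"<Author><LastName>Garc\xc3\xada</LastName></Author>"
    b"<Author><LastName>Mart&#237;nez</LastName></Author>"
    b"</AuthorList>"
)


def last_names(xml_bytes):
    """Return the text of every LastName element in the document."""
    root = ET.fromstring(xml_bytes)
    return [node.text for node in root.iter("LastName")]
```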

I don't plan on doing anything more with it until I have a pressing
need.  Like someone wanting to pay me for it :)

This old mail might also be useful for someone working on non-ASCII
queries that are sent to NCBI.


> The following is the MEDLINE character table for the XML.
>
> http://www.nlm.nih.gov/databases/dtd/medline_character_database.utf8
>
> Diana Airozo
> NCBI Contractor
> dalke at dalkescientific.com wrote (Tue, Sep 7 2004 15:20:14):
>
>
>> Hi Diana,
>>
>>    Thank you for your reply.  For a clarification on the
>> non-ASCII query question
>>
>>
>>>> Also, how do I do non-ASCII queries?  For example, suppose I want
>>>> to search for papers from "Göteborg Universitet" or "La Universidad
>>>> de España".
>>>>
>>
>>
>>
>>> You would search using Goteborg.
>>>
>>
>> I want to automate this so that a user query for Göteborg
>> gets converted into "Goteborg."  I would prefer to use the
>> same algorithm for doing this that your indexer uses.  I
>> looked online for a Unicode -> ASCII conversion table that
>> strips the accents and other diacriticals and expands
>> characters like ß into ss and æ into ae.  I found
>> several, but I would prefer to use the same table your
>> indexer has so that queries are more likely to work.
>>
>> (Well, actually I would like your search code to perform
>> the same input normalization that your indexer does, but
>> I'll use this as a workaround.)
>>
>> Is the conversion table you use available?
>>
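
A sketch of the workaround described in that old mail, using Python's
unicodedata module.  This is NOT NCBI's conversion table: the
_EXPANSIONS mapping and the to_ascii_query name are assumptions for
illustration, and the MEDLINE character table linked above would be the
authoritative source for what the indexer actually does.

```python
import unicodedata

# Characters that have no accent to strip and need explicit expansion.
# This small table is an assumption, not the table NCBI uses.
_EXPANSIONS = {
    u"\xdf": u"ss",  # sharp s
    u"\xe6": u"ae",  # ae ligature
    u"\xc6": u"AE",  # capital AE ligature
}


def to_ascii_query(text):
    """Approximate a Unicode query string with plain ASCII."""
    expanded = u"".join(_EXPANSIONS.get(ch, ch) for ch in text)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    stripped = u"".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII is dropped rather than guessed at.
    return stripped.encode("ascii", "ignore").decode("ascii")
```

With this, a user query for u"G\xf6teborg" would be sent as
"Goteborg", matching the advice in the reply quoted above.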

