[Biopython-dev] [Bug 2678] Bio.Entrez module does not always retrieve or find DTD files

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Fri Mar 20 13:57:00 UTC 2009


http://bugzilla.open-bio.org/show_bug.cgi?id=2678





------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk  2009-03-20 09:57 EST -------
(In reply to comment #10)
> 
> In hindsight, I wonder if trying to download missing DTD files is really a
> good idea. Suppose a user does a large number of Entrez queries, and saves
> the results as XML files. Then, he tries to parse each of those XML files.
> If a DTD file is missing, then Bio.Entrez will try to download the same DTD
> file for each XML file it is trying to parse. This is not only wasteful, but
> also bypasses Entrez's rule of no more than three accesses per second.

Very true.  We should be able to enforce the access limit here without too much
trouble.  More generally, it would make sense for the DTD file to be saved -
ideally to the Python site-packages directory, but as we may not have write
access there, at least to a cache.
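
To make the two ideas above concrete, here is a minimal sketch of a rate
limiter plus an on-disk DTD cache.  It is not the Bio.Entrez code: the names
RateLimiter, get_dtd and default_download, and the cache directory argument,
are all hypothetical and chosen only for illustration.

```python
import os
import time
from urllib.parse import urlparse
from urllib.request import urlopen


class RateLimiter:
    """Hypothetical helper: allow at most max_calls calls per period seconds."""

    def __init__(self, max_calls=3, period=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)


def default_download(url):
    """Fetch the raw bytes at url (plain urllib, no retries)."""
    with urlopen(url) as handle:
        return handle.read()


def get_dtd(url, cache_dir, download=default_download, limiter=None):
    """Return the DTD at url, downloading it at most once per cache_dir."""
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        # Cache hit: no network access and no rate-limit slot used.
        with open(path, "rb") as handle:
            return handle.read()
    if limiter is not None:
        limiter.wait()
    data = download(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)
    return data
```

With something like this, parsing a thousand saved XML files that share one
missing DTD would trigger a single download rather than a thousand.  The
sketch keys the cache on the file name alone, which assumes that no two DTDs
from different locations share a name.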

> In addition, this is fragile. The XML files typically contain a full url to
> the needed DTD.   But many of Entrez's DTD files contain references to other
> DTD files, and those references can be relative. When Bio.Entrez gets such a
> relative path to where the DTD file is located, it is difficult to figure out
> the absolute path to the DTD. Now we are looking for it in
> http://www.ncbi.nlm.nih.gov/dtd/, but this does not seem to contain all
> required DTDs.

When I looked into the DTD URLs, I didn't see the NCBI using any relative
links, but they may have changed things since.  Additionally, the NCBI have a
(different but overlapping) set of DTD files at:
http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/

Can we get some Python XML/DTD library to resolve these links for us?
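
As a partial answer, the URL arithmetic itself needs no XML library: the
standard library's urljoin resolves a relative reference against the URL of
the file that contains it.  The sketch below assumes we know the URL the
referring DTD was fetched from; the function name and the file names in the
usage note are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_dtd_reference(base_url, system_id):
    """Return an absolute URL for a DTD reference.

    base_url is the URL of the XML or DTD file containing the reference;
    system_id is the (possibly relative) reference found inside it.
    Absolute references are returned unchanged.
    """
    return urljoin(base_url, system_id)
```

For example, resolving "example_module.mod" against
"http://www.ncbi.nlm.nih.gov/dtd/example.dtd" gives a URL in the same /dtd/
directory.  This only helps if the base URL is tracked while following nested
references; it does not solve the problem of a DTD that is simply absent from
the directory it points to.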

> It may therefore make sense not to download the DTD file, but to raise an
> Exception with a helpful error message, specifying which DTD file is missing,
> where it can possibly be found, and where the DTD file can be installed. It
> requires some more effort from the user, but it is more robust, won't break
> Entrez' rules, and is more efficient.

Biopython 1.49 generally failed to download missing DTD files.  The current
code in CVS copes much better with missing DTD files, but in a very wasteful
way.  In either version, it does at least issue warnings, indicating that
something is not right.

As a user, I would prefer Bio.Entrez to download missing DTD files on demand
AND SAVE THEM.  As a developer I can see this is rather complicated, and you
are right, Michiel - a simple error message with instructions is much more
straightforward.

Note that the error might also suggest upgrading to the latest Biopython, or
reporting the issue to us - but it would then be a very long error message!
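
For discussion, the error-message approach might look roughly like the
sketch below.  The exception class MissingDTDError, the helper open_local_dtd
and the wording of the message are all hypothetical; nothing here is existing
Bio.Entrez API.

```python
import os


class MissingDTDError(Exception):
    """Hypothetical exception raised when a required DTD is not installed."""

    def __init__(self, filename, url, install_dir):
        self.filename = filename
        self.url = url
        self.install_dir = install_dir
        message = (
            "Missing DTD file %s. It may be available from %s; "
            "save it in %s and try again. If the problem persists, "
            "consider upgrading Biopython or reporting the issue."
            % (filename, url, install_dir)
        )
        super().__init__(message)


def open_local_dtd(filename, url, dtd_dir):
    """Open a locally installed DTD, or raise MissingDTDError.

    No download is attempted, so this never touches the network.
    """
    path = os.path.join(dtd_dir, filename)
    if not os.path.exists(path):
        raise MissingDTDError(filename, url, dtd_dir)
    return open(path, "rb")
```

Keeping the file name, URL and directory as attributes on the exception lets
calling code act on them, so the printed message itself can stay fairly short.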

If you want to switch to a helpful error message for missing DTD files, I'm OK
with that.  We could also ship the current code for Biopython 1.50.

