[BioPython] Fwd: NCBI Abuse Activity with BioPython
Andrew Dalke
dalke at dalkescientific.com
Thu Jun 26 01:15:50 UTC 2008
Hi Chris,
I'm no longer part of the Biopython dev team, but I read at least
the subject line on the mailing list.
I wrote the Biopython EUtils package around December 2002 and,
according to the CVS logs, it was added to Biopython in June 2003, so
more than 5 years ago. Looking at the commit logs, there haven't been
any changes to the relevant code since 2004, and that was a minor patch.
I thought I put a rate limiter into the code, but looking at it
now I see I didn't. The documentation clearly states that users must
follow NCBI's recommendations, but who actually reads documentation?
>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not
>> the standard NCBI Web address.
That change was announced on May 21, 2003, and most likely no one on
the Biopython dev group tracks the EUtils mailing list. It was also
after I wrote the code, but to be fair I was subscribed to the
utilities list at the time and should have caught the change.
I think the correct fix is to change this code in ThinClient.py:
def __init__(self,
             opener = None,
             tool = TOOL,
             email = EMAIL,
             baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):
Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/".
I have not tested this.
>> * Make no more than one request every 3 seconds.
There are a couple of points here. The quickest and most direct way to
force a fix in the code is to change "def _get()" in ThinClient.py.
The current code is
def _get(self, program, query):
    """Internal function: send the query string to the program as GET"""
    # NOTE: epost uses a different interface
    q = self._fixup_query(query)
    url = self.baseurl + program + "?" + q
    if DUMP_URL:
        print "Opening with GET:", url
    if DUMP_RESULT:
        print " ================== Results ============= "
        s = self.opener.open(url).read()
        print s
        print " ================== Finished ============ "
        return cStringIO.StringIO(s)
    return self.opener.open(url)
Here's one possible fix: add the following two lines to module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function:
def _get(self, program, query):
    """Internal function: send the query string to the program as GET"""
    # NOTE: epost uses a different interface
    global _prev_time
    q = self._fixup_query(query)
    url = self.baseurl + program + "?" + q
    if DUMP_URL:
        print "Opening with GET:", url
    # Follow NCBI's one-request-every-3-seconds restriction: sleep for
    # whatever is left of the 3-second window since the previous request.
    if time.time() - _prev_time < 3:
        time.sleep(3 - (time.time() - _prev_time))
    _prev_time = time.time()
    if DUMP_RESULT:
        print " ================== Results ============= "
        s = self.opener.open(url).read()
        print s
        print " ================== Finished ============ "
        return cStringIO.StringIO(s)
    return self.opener.open(url)
(I recall that I had something like that, and it made my unit tests -
which I did during the off hours - interminable.)
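For anyone who wants to sanity-check the timing without hitting NCBI,
here's a standalone sketch of the same throttling idea. The _throttle and
fake_request names are mine, made up for the demonstration; only the
3-second value and the module-level _prev_time mirror the patch above.

    import time

    _prev_time = 0

    def _throttle():
        # Sleep for whatever remains of the 3-second window since the
        # last request, then record the new request time.
        global _prev_time
        elapsed = time.time() - _prev_time
        if elapsed < 3:
            time.sleep(3 - elapsed)
        _prev_time = time.time()

    def fake_request(i):
        # Stand-in for an actual EUtils request.
        _throttle()
        print "request", i, "at", time.time()

    for i in range(3):
        fake_request(i)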
When I wrote this module I think I assumed that whoever would use the
library would use the code correctly. Using it correctly means a few
things:
- obey the restrictions set by NCBI
- change the 'tool' and 'email' settings, so NCBI complains to the
right person.
(The default is to say 'EUtils_Python_client' and
'biopython-dev at biopython.org')
This isn't happening. The patch above force-fixes the first. Should
Biopython do a better job of the second? It's not easy to figure out
the correct email. I couldn't then and can't now think of a better
solution. Perhaps use the result of getpass.getuser()? But that
doesn't get the rest of the domain for a proper email. Though NCBI
should be able to guess the site from the IP address.
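For what it's worth, here's a rough sketch of that guess. The
socket.getfqdn() part is my assumption for filling in the domain; the
result is often not a deliverable address, which is exactly the problem.

    import getpass
    import socket

    def guess_email():
        # Best-effort guess: local username plus the machine's fully
        # qualified host name. Frequently wrong, but at least it points
        # back at the site rather than at biopython-dev.
        return "%s@%s" % (getpass.getuser(), socket.getfqdn())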
The reason I made this assumption is that I meant EUtils to be used
by conscientious developers. I've since learned that that's seldom the
case, and because it was imported into Biopython it's been exposed to
a wider audience.
>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the Esearch and use the search
>> history and webenv to retrieve sets of sequences at 200 at a time.
It does exactly that. There's an entire interface for handling
search history - and it took some non-trivial work and questions to
NCBI to get things working right. More precisely, there are two layers:
one is for the low-level protocol ("ThinClient") that EUtils offers, and
another wraps around the history mechanism ("HistoryClient").
>>> from Bio import EUtils
>>> from Bio.EUtils import HistoryClient
>>> client = HistoryClient.HistoryClient()
>>> result = client.search("polio AND picornavirus")
>>> len(result)
3437
>>> f = result.efetch()
>>> print f.read(1000)
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2008//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_080101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Owner="NLM" Status="In-Process">
        <PMID>18540199</PMID>
        <DateCreated>
            <Year>2008</Year>
            <Month>06</Month>
            <Day>10</Day>
        </DateCreated>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0041-3771</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>50</Volume>
                    <Issue>2</Issue>
                    <PubDate>
                        <Year>2008</Year>
                    </PubDate>
                </JournalIssue>
                <Title>Tsitologiia</Title>
                <ISOAbbreviation>Tsitologiia</ISOAbbreviation>
            </Journal>
            <ArticleTitle>[The enter of viruses family Picornaviridae in
There's also a way to populate the history with a list of records,
then fetch those records in a block:
>>> result = client.from_dbids(EUtils.DBIds("pubmed",
...     ["100","200","300","400","500"]))
>>> f = result.efetch("text", "brief")
>>> print f.read()
1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
5: Nourse ES. The regional workshops on pri...[PMID: 500]
If I had to guess, more people likely find the ThinClient code easier
to understand, because the NCBI interface has a simple way to get the
result for a single record without using the history interface. The
NCBI interface doesn't guide people toward the right way to use it
effectively.
I started working on an update to EUtils which improved the API to
include a few helper functions, like "EUtils.search()", instead of
having to create a HistoryClient by hand. That might help guide people
toward using it better. I wrote up something about it a few years ago:
http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
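To make the idea concrete, something along these lines; this is a
hypothetical sketch of the kind of helper I mean, not the API from that
writeup, and it only uses the HistoryClient calls shown earlier:

    from Bio.EUtils import HistoryClient

    def search(term):
        # Convenience wrapper: run a search without making the caller
        # create and keep track of a HistoryClient themselves.
        client = HistoryClient.HistoryClient()
        return client.search(term)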
But a problem with completing that is that I never got any sort of
funding or user feedback on how people were using the software, and
as I moved over to chemistry it dropped lower and lower on my list.
That's still the obstacle to me working on this again.
I don't know about this next point, but there might also be a lack of
documentation on how to use the Biopython interface effectively? The
NCBI documentation isn't meant for non-programmers (it's more of a
bytes-on-the-wire document), so perhaps people are pattern matching on
what looks right and going with what works, rather than what works well.
Then, because there was no 3-second limit, they had no incentive to
find a better or faster solution.
Andrew
dalke at dalkescientific.com