[BioPython] Fwd: NCBI Abuse Activity with BioPython
Andrew Dalke
dalke at dalkescientific.com
Thu Jun 26 01:15:50 UTC 2008
Hi Chris,
I'm no longer part of the Biopython dev team, but I read at least
the subject line on the mailing list.
I wrote the Biopython EUtils package around December 2002 and,
according to the CVS logs, it was added to Biopython in June 2003, so
more than 5 years ago. Looking at the commit logs, there haven't been
any changes to the relevant code since 2004, and that was a minor patch.
I thought I put a rate limiter into the code, but looking at it
now I see I didn't. The documentation clearly states that users must
follow NCBI's recommendations, but who actually reads documentation?
>> * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not
>> the standard NCBI Web address.
That change was announced on May 21, 2003, and most likely no one on
the Biopython dev group tracks the EUtils mailing list. It was also
after I wrote the code, but to be fair I was subscribed to the
utilities list at the time and should have caught the change.
I think the correct fix is to change this code in ThinClient.py:
def __init__(self,
             opener = None,
             tool = TOOL,
             email = EMAIL,
             baseurl = "http://www.ncbi.nlm.nih.gov/entrez/eutils/"):
Change the baseurl to "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/".
I have not tested this.
>> * Make no more than one request every 3 seconds.
There are a couple of points here. The quickest and most direct way to
force a fix in the code is to change "def _get()" in ThinClient.py.
The current code is
def _get(self, program, query):
    """Internal function: send the query string to the program as GET"""
    # NOTE: epost uses a different interface
    q = self._fixup_query(query)
    url = self.baseurl + program + "?" + q
    if DUMP_URL:
        print "Opening with GET:", url
    if DUMP_RESULT:
        print " ================== Results ============= "
        s = self.opener.open(url).read()
        print s
        print " ================== Finished ============ "
        return cStringIO.StringIO(s)
    return self.opener.open(url)
Here's one possible fix: add the following two lines to module scope:

    import time
    _prev_time = 0

and insert a few lines in the _get function:
def _get(self, program, query):
    """Internal function: send the query string to the program as GET"""
    # NOTE: epost uses a different interface
    global _prev_time
    q = self._fixup_query(query)
    url = self.baseurl + program + "?" + q
    if DUMP_URL:
        print "Opening with GET:", url
    # Follow NCBI's one-request-every-3-seconds restriction: sleep for
    # whatever is left of the 3-second window since the previous request.
    if time.time() - _prev_time < 3:
        time.sleep(3 - (time.time() - _prev_time))
    _prev_time = time.time()
    if DUMP_RESULT:
        print " ================== Results ============= "
        s = self.opener.open(url).read()
        print s
        print " ================== Finished ============ "
        return cStringIO.StringIO(s)
    return self.opener.open(url)
(I recall that I had something like that, and it made my unit tests -
which I did during the off hours - interminable.)
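For anyone who wants to sanity-check the timing without hitting NCBI,
here's a standalone sketch of the same throttling idea. The _throttle and
fake_request names are mine, made up for the demonstration; only the
3-second value and the module-level _prev_time mirror the patch above.

    import time

    _prev_time = 0

    def _throttle():
        # Sleep for whatever remains of the 3-second window since the
        # last request, then record the new request time.
        global _prev_time
        elapsed = time.time() - _prev_time
        if elapsed < 3:
            time.sleep(3 - elapsed)
        _prev_time = time.time()

    def fake_request(i):
        # Stand-in for an actual EUtils request.
        _throttle()
        print "request", i, "at", time.time()

    for i in range(3):
        fake_request(i)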
When I wrote this module I think I assumed that whoever would use the
library would use the code correctly. Using it correctly means a few
things:
- obey the restrictions set by NCBI
- change the 'tool' and 'email' settings, so NCBI complains to the
right person.
(The default is to say 'EUtils_Python_client' and
'biopython-dev at biopython.org')
This isn't happening. The patch above force-fixes the first. Should
Biopython do a better job of the second? It's not easy to figure out
the correct email. I couldn't then and can't now think of a better
solution. Perhaps use the result of getpass.getuser()? But that
doesn't get the rest of the domain for a proper email. Though NCBI
should be able to guess the site from the IP address.
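For what it's worth, here's a rough sketch of that guess. The
socket.getfqdn() part is my assumption for filling in the domain; the
result is often not a deliverable address, which is exactly the problem.

    import getpass
    import socket

    def guess_email():
        # Best-effort guess: local username plus the machine's fully
        # qualified host name. Frequently wrong, but at least it points
        # back at the site rather than at biopython-dev.
        return "%s@%s" % (getpass.getuser(), socket.getfqdn())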
The reason I made this assumption is that I meant EUtils to be used
by conscientious developers. I've since learned that that's seldom the
case, and because it was imported into Biopython it's been exposed to
a wider audience.
>> Also, there is the problem of huge searches in order to build local
>> databases. With your package it seems that if one were so inclined you
>> would send a search for all human sequences (over 10,000,000 sequences)
>> and your program would then retrieve these one ID at a time. Regardless
>> of the fact that this is an extreme example, we would much prefer if
>> your program could use the webenv from the Esearch and use the search
>> history and webenv to retrieve sets of sequences at 200 at a time.
It does exactly that. There's an entire interface for handling
search history - and it took some non-trivial work and questions to
NCBI to get things working right. More precisely, there are two layers:
one is for the low-level protocol ("ThinClient") that EUtils offers, and
another wraps around the history mechanism ("HistoryClient").
>>> from Bio import EUtils
>>> from Bio.EUtils import HistoryClient
>>> client = HistoryClient.HistoryClient()
>>> result = client.search("polio AND picornavirus")
>>> len(result)
3437
>>> f = result.efetch()
>>> print f.read(1000)
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2008//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_080101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Owner="NLM" Status="In-Process">
        <PMID>18540199</PMID>
        <DateCreated>
            <Year>2008</Year>
            <Month>06</Month>
            <Day>10</Day>
        </DateCreated>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0041-3771</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>50</Volume>
                    <Issue>2</Issue>
                    <PubDate>
                        <Year>2008</Year>
                    </PubDate>
                </JournalIssue>
                <Title>Tsitologiia</Title>
                <ISOAbbreviation>Tsitologiia</ISOAbbreviation>
            </Journal>
            <ArticleTitle>[The enter of viruses family Picornaviridae in
There's also a way to populate the history with a list of records,
then fetch those records in a block:
>>> result = client.from_dbids(EUtils.DBIds("pubmed",
...     ["100","200","300","400","500"]))
>>> f = result.efetch("text", "brief")
>>> print f.read()
1: Jolly RD et al. Bovine mannosidosis--a model ...[PMID: 100]
2: El Halawani ME et al. The relative importance of mo...[PMID: 200]
3: Amdur MA. Alcohol-related problems in a...[PMID: 300]
4: Regitz G et al. Trypsin-sensitive photosynthe...[PMID: 400]
5: Nourse ES. The regional workshops on pri...[PMID: 500]
If I had to guess, more people likely find the ThinClient code easier
to understand, because the NCBI interface has a simple way to get the
result for a single record without using the history interface. The
NCBI interface doesn't guide people toward the right way to use it
effectively.
I started working on an update to EUtils which improved the API to
include a few helper functions, like "EUtils.search()", instead of
having to create a HistoryClient by hand. That might help guide people
toward using it better. I wrote up something about it a few years ago:
http://www.dalkescientific.com/writings/diary/archive/2005/09/30/using_eutils.html
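To make the idea concrete, something along these lines; this is a
hypothetical sketch of the kind of helper I mean, not the API from that
writeup, and it only uses the HistoryClient calls shown earlier:

    from Bio.EUtils import HistoryClient

    def search(term):
        # Convenience wrapper: run a search without making the caller
        # create and keep track of a HistoryClient themselves.
        client = HistoryClient.HistoryClient()
        return client.search(term)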
But a problem with completing that is that I never got any sort of
funding or user feedback on how people were using the software, and
as I moved over to chemistry it dropped lower and lower on my list.
That's still the obstacle to me working on this again.
I don't know about this next point, but there might also be a lack of
documentation on how to use the Biopython interface effectively? The
NCBI documentation isn't meant for non-programmers (it's more of a
bytes-on-the-wire document), so perhaps people are pattern matching on
what looks right and going with what works, rather than what works well.
Then, because there was no 3-second limit, they had no incentive to
find a better or faster solution.
Andrew
dalke at dalkescientific.com