[Biopython-dev] NCBI Abuse activity with Biopython
Michiel de Hoon
mjldehoon at yahoo.com
Wed Jun 25 20:01:07 EDT 2008
Dear all,
Recently NCBI blocked access for a Biopython user who was making 50,000 requests to NCBI at a rate of 18 requests per second during peak hours. This user was using the search_for function in Bio.GenBank, which internally uses Bio.EUtils. Apparently, Bio.EUtils does not follow the rule of sleeping 3 seconds between requests. NCBI also asked us to send requests for the Entrez E-Utilities to the EUtils web address, and not to the regular NCBI web address. I don't know whether Bio.EUtils does that.
Bio.Entrez does follow the 3-second sleep rule, and the eight E-Utilities functions all use the EUtils web address, though it is possible to pass a different web address as one of the arguments. The "query" function, which is not part of the E-Utilities, uses the standard NCBI web address.
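For illustration, the 3-second sleep rule mentioned above could be enforced with something like the sketch below. This is not the actual Bio.Entrez code; the Throttle class and its parameter names are hypothetical, and the clock and sleep functions are injectable only so that the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between successive requests."""

    def __init__(self, delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # minimum number of seconds between requests
        self._clock = clock     # returns the current time in seconds
        self._sleep = sleep     # blocks for the given number of seconds
        self._last = None       # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create one Throttle shared by all request functions and call wait() immediately before each request, so that a loop over many identifiers cannot exceed one request per 3 seconds.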
To avoid such problems in the future, I'd like to propose the following:
1) Deprecate Bio.EUtils. Its functionality is covered by Bio.Entrez, which (from release 1.46) will have a parser.
Bio.EUtils is currently used by the following modules:
Bio/config/DBRegistry.py
Bio/dbdefs/fasta.py
Bio/dbdefs/genbank.py
Bio/dbdefs/medline.py
Bio/GenBank/__init__.py
We were already planning to remove Bio.config and Bio.dbdefs, so we'd only have to modify Bio.GenBank.
2) Remove the 'query' function from Bio.Entrez. Accessing NCBI's web site from Python just to get HTML back doesn't make a lot of sense anyway.
3) Remove the argument for a user-specified web address, so that the E-Utilities address is always used.
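As a rough sketch of what point 3 could look like (not the existing Bio.Entrez code), the base address would become a module-level constant instead of a function argument. The name build_eutils_url, the EUTILS_BASE constant, and the exact address are assumptions for illustration; the address should be checked against NCBI's own documentation.

```python
from urllib.parse import urlencode

# Assumed address of the E-Utilities service; verify against NCBI's documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(cgi, **params):
    """Hypothetical helper: build an E-Utilities URL with a fixed base address.

    `cgi` is a bare script name such as 'esearch.fcgi'. There is deliberately
    no argument for overriding the base address.
    """
    if "/" in cgi or ":" in cgi:
        raise ValueError("cgi must be a bare script name such as 'esearch.fcgi'")
    # Sort the parameters so the resulting URL is deterministic.
    return EUTILS_BASE + cgi + "?" + urlencode(sorted(params.items()))
```

With this shape, callers can only choose which E-Utilities script to call and which parameters to send; they cannot redirect the request to the regular NCBI web address.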
--Michiel.