[Biopython-dev] NCBI Abuse activity with Biopython

Thu Jun 26 13:51:55 UTC 2008

[Sorry, hit the send button too soon]

> The Bio.GenBank.search_for() still seems somewhat
> useful, but without a default limit on the number
> of returned IDs, this could easily be abused.
> Again, we could deprecate this and direct people
> to Bio.Entrez.esearch() instead.
As always, I am in favor of deprecating functions whose purpose is dubious.
As an example, here is the same GenBank search done via Bio.GenBank and then via Bio.Entrez:

# Using Bio.GenBank
>>> from Bio import GenBank
>>> gi_list = GenBank.search_for("Opuntia AND rpl16")
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284']

# Same thing, using Bio.Entrez
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284']

I believe that GenBank.search_for automatically takes care of the retmax parameter (the maximum number of IDs to return), but I agree that this can easily be abused.
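
For illustration, here is a rough sketch of how a caller could pass an explicit, bounded retmax to Bio.Entrez.esearch() rather than relying on whatever a wrapper chooses. The helper names (capped_esearch_params, search_ids) and the default and ceiling values are hypothetical, picked only for this example; retmax itself is the E-utilities parameter mentioned above.

```python
def capped_esearch_params(term, db="nucleotide", retmax=20, hard_limit=500):
    """Build keyword arguments for Entrez.esearch with a bounded retmax.

    hard_limit is an arbitrary ceiling chosen for this sketch; it is not
    a value taken from NCBI or from Biopython.
    """
    if retmax < 1:
        raise ValueError("retmax must be a positive integer")
    return {"db": db, "term": term, "retmax": min(retmax, hard_limit)}


def search_ids(term, **kwargs):
    """Run a single esearch request and return the list of matching IDs."""
    # Imported here so the helper above can be used without Biopython.
    from Bio import Entrez

    handle = Entrez.esearch(**capped_esearch_params(term, **kwargs))
    try:
        return Entrez.read(handle)["IdList"]
    finally:
        handle.close()
```

With this, search_ids("Opuntia AND rpl16", retmax=10) would issue one esearch call that asks for at most 10 IDs.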

> Brad also appears to have changed the functionality of 
> Bio.GenBank.download_many() from a call back mechanism 
> to returning a handle.  We could still return a handle, but it would
> require fetching all the records (perhaps in batches), and
> concatenating them.  I think it would make more sense to deprecate
> the Bio.GenBank.download_many() function, and direct people to
> Bio.Entrez.efetch() instead.

Agree.

Btw, NCBIDictionary definitely needs to go.
From the documentation, continuing the example above:
>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
>>> gb_record = ncbi_dict[gi_list[0]]
Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against.
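
By contrast, a minimal sketch of fetching several records per request with Bio.Entrez.efetch() might look like the following. The helper names (chunks, fetch_genbank_text), the batch size of 100, and the rettype/retmode values are assumptions made for this example, and the fetch argument is a hypothetical hook so the batching logic can be tried without touching the network.

```python
def chunks(ids, size=100):
    """Yield successive slices of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_genbank_text(ids, batch_size=100, fetch=None):
    """Return concatenated GenBank text, using one request per batch of IDs.

    `fetch` takes a list of IDs and returns text; if omitted, a default
    based on Bio.Entrez.efetch is used.
    """
    if fetch is None:
        # Imported here so the batching logic works without Biopython.
        from Bio import Entrez

        def fetch(batch):
            # One efetch call covers the whole batch via comma-separated IDs.
            handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                                   rettype="gb", retmode="text")
            try:
                return handle.read()
            finally:
                handle.close()

    return "".join(fetch(batch) for batch in chunks(ids, batch_size))
```

For the nine IDs in gi_list above, fetch_genbank_text(gi_list) would make a single efetch request instead of nine.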

--Michiel.