[Biopython-dev] NCBI Abuse activity with Biopython
Michiel de Hoon
mjldehoon at yahoo.com
Thu Jun 26 09:51:55 EDT 2008
[Sorry, hit the send button too soon]
> The Bio.GenBank.search_for() still seems somewhat
> useful, but without a default limit on the number
> of returned IDs, this could easily be abused.
> Again, we could deprecate this and direct people
> to Bio.Entrez.esearch() instead.
As always, I am in favor of deprecating functions whose purpose is dubious.
As an example, this is the same GenBank search done via Bio.GenBank and via Bio.Entrez:
# Using Bio.GenBank
>>> from Bio import GenBank
>>> gi_list = GenBank.search_for("Opuntia AND rpl16")
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284']
# Same thing, using Bio.Entrez
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db='nucleotide', term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284']
I believe that GenBank.search_for handles the retmax parameter (the maximum number of IDs to return) automatically, but I agree that without a default limit it can easily be abused.
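For comparison, a caller using Bio.Entrez can cap the result size explicitly. This is only a rough sketch: the wrapper name search_ids, its default of 20 and the placeholder email address are my own assumptions for illustration, not anything in Biopython. The import sits inside the function so the argument check runs without touching the network.

```python
def search_ids(term, db="nucleotide", retmax=20, email="you@example.org"):
    """Return at most `retmax` IDs matching `term` (hypothetical helper)."""
    if retmax < 1:
        raise ValueError("retmax must be a positive integer")

    # Imported lazily; needs Biopython and network access.
    from Bio import Entrez

    Entrez.email = email  # placeholder address, replace with your own
    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

Something like search_ids("Opuntia AND rpl16", retmax=5) would then return no more than five IDs, whatever the total hit count is.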
> Brad also appears to have changed the functionality of
> Bio.GenBank.download_many() from a call back mechanism
> to returning a handle. We could still return a handle, but it would
> require fetching all the records (perhaps in batches), and
> concatenating them. I think it would make more sense to deprecate
> the Bio.GenBank.download_many() function, and direct people to
> Bio.Entrez.efetch() instead.
Agree.
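If someone still wants a single handle back, the batching-and-concatenating idea from the quoted text could look roughly like the sketch below. The names chunks and fetch_concatenated, the batch size of 100 and the placeholder email are assumptions of mine. I am also assuming rettype="gb" with retmode="text", which is what current efetch expects for GenBank flat files.

```python
def chunks(ids, size):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_concatenated(ids, db="nucleotide", batch_size=100,
                       email="you@example.org"):
    """Fetch records in batches and return one in-memory handle (sketch)."""
    from io import StringIO

    # Imported lazily; needs Biopython and network access.
    from Bio import Entrez

    Entrez.email = email  # placeholder address, replace with your own
    parts = []
    for batch in chunks(ids, batch_size):
        # One efetch request per batch, not one per ID.
        handle = Entrez.efetch(db=db, id=",".join(batch),
                               rettype="gb", retmode="text")
        try:
            parts.append(handle.read())
        finally:
            handle.close()
    return StringIO("".join(parts))
```

This holds everything in memory, so it only makes sense for modest numbers of records.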
Btw, NCBIDictionary definitely needs to go.
From the documentation, continuing the example above:
>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
>>> gb_record = ncbi_dict[gi_list[0]]
Hence, we're running efetch once for each key separately; this is exactly what NCBI advised against.
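A dictionary-like result can still be had with a single request. The sketch below is one way to do it, not a proposed API: index_by_id and fetch_as_dict are hypothetical names, and the email is a placeholder. Note that the keys are the parsed record IDs (accession.version), not the GI numbers passed in.

```python
def index_by_id(records):
    """Build a dict mapping record.id to the record itself."""
    return {record.id: record for record in records}


def fetch_as_dict(ids, db="nucleotide", email="you@example.org"):
    """Fetch all `ids` with ONE efetch call and index them (sketch)."""
    # Imported lazily; needs Biopython and network access.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # placeholder address, replace with your own
    # All IDs go into a single comma-separated request.
    handle = Entrez.efetch(db=db, id=",".join(ids),
                           rettype="gb", retmode="text")
    try:
        # The dict is fully built before the handle is closed.
        return index_by_id(SeqIO.parse(handle, "genbank"))
    finally:
        handle.close()
```

For long ID lists this should of course be combined with batching, so that no single request grows too large.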
--Michiel.