[Biopython-dev] Online access, Bio.PubMed & Bio.GenBank vs Bio.Entrez

Fri Aug 15 16:28:21 UTC 2008

Hello,

This is a slightly long email covering what to do with the online code
in Bio.PubMed and Bio.GenBank, and how to make Bio.Entrez easier to
use.  All these modules are essentially wrapping access to the NCBI
Entrez database via the Entrez Programming Utilities (EUtils).
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

The Bio.PubMed module is now essentially a wrapper for Bio.Entrez,
offering a simple and useful subset of functionality.

e.g.
>>> from Bio import PubMed
>>> pubmed_id_list = PubMed.search_for("orchids")
>>> print pubmed_id_list
['18701671', '18687799', '18627489', '18627452', '18586527', ...., '17751120']
>>> print len(pubmed_id_list)
226

(I've included a Bio.Entrez version at the end of this email)

While this works fine, there is (currently) no way to provide your
email address to the NCBI as they encourage (in case they need to
contact you).  We could add this as another optional argument, I
suppose.

Also, if you then want to download some or all of these records (say
as MedLine format files to parse with Bio.Medline), doing this with
Bio.PubMed.download_many() or the Dictionary class does not take
advantage of the NCBI's history system (as they encourage).  There are
similar concerns with the Bio.GenBank.search_for(), download_many()
and NCBIDictionary classes.

Because search_for() and the downloading functions are decoupled,
there is currently no way for them to use the EUtils session history.
So while they are nice and fairly easy to program with, they actively
discourage users from following the NCBI's preferred usage for large
downloads.

You can do a linked search/retrieve using Bio.Entrez as documented in
our tutorial for an esearch/efetch example using nucleotide sequences.
This is currently done as the last example in the chapter, so I'm
considering making this topic a little more high profile (and moving
it before the examples).
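
For anyone not familiar with the history system, here is a rough
sketch of what a linked search/retrieve looks like at the EUtils
level.  It is not the tutorial example and does not use Bio.Entrez: it
only builds the request URLs and reads the esearch XML with the
standard library (Python 3 syntax).  The helper names (esearch_url,
read_history, efetch_urls) and the batch size are made up for
illustration, and the base URL is the current NCBI address as far as I
know.

```python
# Sketch only: hand-built EUtils requests, helper names are hypothetical.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed base address of the EUtils CGI scripts.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, email):
    """URL for an esearch whose results are kept on the history server."""
    params = {"db": db, "term": term, "usehistory": "y", "email": email}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def read_history(esearch_xml):
    """Extract the hit count and history identifiers from esearch XML."""
    root = ET.fromstring(esearch_xml)
    return {
        "count": int(root.findtext("Count")),
        "webenv": root.findtext("WebEnv"),
        "query_key": root.findtext("QueryKey"),
    }


def efetch_urls(db, history, email, batch_size=100, rettype="medline"):
    """One efetch URL per batch, each pointing back at the stored search."""
    urls = []
    for start in range(0, history["count"], batch_size):
        params = {
            "db": db,
            "WebEnv": history["webenv"],
            "query_key": history["query_key"],
            "retstart": start,
            "retmax": batch_size,
            "rettype": rettype,
            "retmode": "text",
            "email": email,
        }
        urls.append(EUTILS + "efetch.fcgi?" + urlencode(params))
    return urls
```

The point is that the ID list never has to be sent back to the NCBI:
the efetch requests carry only the WebEnv and query_key returned by the
search, which is exactly what the separate search_for() and
download_many() functions cannot do.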

In addition to encouraging the use of Bio.Entrez by documenting it
prominently in the tutorial, we could go further and deprecate the
"user friendly" Bio.PubMed and Bio.GenBank wrapper functions.  What do
people think of this?  Deprecating the Dictionary classes in
particular could be a good idea as they use the old fashioned parser
objects.

I also think it would help to make Bio.Entrez a little easier to use.
One suggestion I made back in June was to include alternative versions
of the EUtils functions which also parse the XML using
Bio.Entrez.read():
http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003859.html
I rather liked Andrew Dalke's naming idea, that
Entrez.read(Entrez.esearch(...)) becomes Entrez.search(...) etc:
http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003861.html

Returning to my earlier example, right now you can write:
>>> from Bio import Entrez
>>> entrez_id_list = Entrez.read(Entrez.esearch(db="pubmed", term="orchids", retmax="300",
...                              email="A.N.Other@Example.com"))["IdList"]
>>> len(entrez_id_list)
226

I think it would be a usability improvement to do:
>>> from Bio import Entrez
>>> entrez_id_list = Entrez.search(db="pubmed", term="orchids", retmax="300",
...                                email="A.N.Other@Example.com")["IdList"]

This is still more complicated than the Bio.PubMed example above, but
not by as much.

In pseudocode, the implementation would be something like this:

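def search(**keywords):
    """Call esearch requesting XML output and parse the result."""
    # Force XML, since that is the only output Entrez.read() understands
    keywords["retmode"] = "xml"
    return read(esearch(**keywords))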
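
Since the same pattern would apply to the other EUtils functions, the
wrappers could perhaps be generated rather than written out one by
one.  The following is only a sketch of that idea (Python 3 syntax):
make_parsed_version is a hypothetical helper, and fake_esearch and
fake_read are stand-ins for Entrez.esearch and Entrez.read so that the
example runs without Biopython or network access.

```python
import io
import xml.etree.ElementTree as ET


def make_parsed_version(efunction, reader):
    """Return a function that calls efunction for XML output and parses it.

    efunction plays the role of e.g. Entrez.esearch and reader the role
    of Entrez.read; both are passed in, so nothing here depends on
    Biopython itself.
    """
    def parsed(**keywords):
        keywords["retmode"] = "xml"  # the reader only understands XML
        handle = efunction(**keywords)
        try:
            return reader(handle)
        finally:
            handle.close()
    return parsed


# Stand-ins for demonstration only, not the real Bio.Entrez functions.
def fake_esearch(**keywords):
    assert keywords["retmode"] == "xml"
    return io.StringIO(
        "<eSearchResult><IdList>"
        "<Id>18701671</Id><Id>18687799</Id>"
        "</IdList></eSearchResult>"
    )


def fake_read(handle):
    root = ET.parse(handle).getroot()
    return {"IdList": [element.text for element in root.iter("Id")]}


search = make_parsed_version(fake_esearch, fake_read)
id_list = search(db="pubmed", term="orchids")["IdList"]
```

With the real functions this would presumably read something like
search = make_parsed_version(Entrez.esearch, Entrez.read), but I have
not tried that against the current code.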

Alternatively, Michiel had suggested having the Bio.Entrez.e*
functions automatically parse the output depending on their arguments,
but I'm not keen on this.

Peter