[Biopython-dev] Strange behaviour in efetching Pubmed citations

Peter Cock p.j.a.cock at googlemail.com
Mon Nov 26 15:36:13 UTC 2012


On Mon, Nov 26, 2012 at 3:23 PM, Maurice Ling <mauriceling at gmail.com> wrote:
> Hi
>
> I found something strange in my download script to pull a list of pubmed
> citations. This was working in the past (back in 2008 period)...
>
> The script is
>
> ID_start = 19000000
> ID_stop = 19000010
> downtime = 1.2
>
> from Bio import Entrez
> from Bio import Medline
> import string
> import time
> import cPickle
>
> Entrez.email = 'maurice.ling at sdstate.edu'
>
> while (ID_start < ID_stop):
>     try:
>         handle = Entrez.efetch(db="pubmed", id=[str(ID_start)],
> rettype="medline",
>                            retmode="text")
>         records = list(Medline.parse(handle))[0]
>         print records
>         cPickle.dump(records, open(str(ID_start) + '.txt', 'w'), -1)
>         ID_start = ID_start + 1
>         time.sleep(downtime)
>         print 'ID count: ', str(ID_start)
>     except:
>         print 'ID count: error ', str(ID_start)
>         ID_start = ID_start + 1

Are you sure you didn't run something slightly different? The
simplest possibility would be a line accidentally setting
ID_start to equal 1, rather than increasing it.

Also, using a for loop would be much cleaner (with the identifiers
as either integers or as strings). For instance,

for identifier in range(19000000, 19000010):
    # Do stuff

Note you have a discrepancy between ID_stop and ID_end.

This seems to work for me:

ID_start = 19000000
ID_stop = 19000010
downtime = 1.2
from Bio import Entrez
from Bio import Medline
import string
import time
import cPickle
Entrez.email = 'maurice.ling at sdstate.edu'
for identifier in range(ID_start, ID_stop):
    identifier = str(identifier)
    try:
        handle = Entrez.efetch(db="pubmed", id=identifier,
                               rettype="medline", retmode="text")
        records = list(Medline.parse(handle))[0]
        print records
        cPickle.dump(records, open('%s.txt' % identifier, 'wb'), -1)
    except Exception, error:
        print "Error for %s - %s" % (identifier, error)

However, rather than parsing the Medline records and saving
the pickled object, I would save the plain text Medline data itself.
That way you can use the files outside of Python (e.g. working at
the Unix command line with grep).
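
As a rough sketch of that last suggestion (not tested against the live
NCBI servers), something along these lines would write the raw Medline
text to one file per identifier. The function names entrez_fetch and
save_medline_text, the fetch and directory parameters, and the
placeholder email address are all made up for illustration; only
Entrez.efetch and its arguments come from the script above.

```python
import io
import os


def entrez_fetch(identifier):
    """Return a handle of plain text Medline data for one PubMed ID."""
    # Imported here so the rest of the sketch can be tried without Biopython.
    from Bio import Entrez
    Entrez.email = "your.name@example.org"  # placeholder, use your own address
    return Entrez.efetch(db="pubmed", id=identifier,
                         rettype="medline", retmode="text")


def save_medline_text(identifier, fetch=entrez_fetch, directory="."):
    """Fetch one record and write the unparsed text to <identifier>.txt."""
    identifier = str(identifier)
    handle = fetch(identifier)
    try:
        text = handle.read()
    finally:
        handle.close()
    # Some handles give bytes rather than text; assume UTF-8 in that case.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    filename = os.path.join(directory, "%s.txt" % identifier)
    with io.open(filename, "w", encoding="utf-8") as output:
        output.write(text)
    return filename
```

Calling save_medline_text(identifier) inside the for loop above, with a
pause such as time.sleep(downtime) between requests, should leave one
grep-friendly text file per PubMed ID. Passing the fetch function in as
an argument is just a convenience so the file handling can be tried
with a fake handle, without network access.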

Peter


