[Biopython] Entrez.read return value is typed as a string??

Peter biopython at maubp.freeserve.co.uk
Tue Oct 27 15:42:18 UTC 2009


On Tue, Oct 27, 2009 at 3:12 PM, Ben O'Loghlin <bassbabyface at yahoo.com> wrote:
> Hi all,
>
> I'm new to BioPython, having spent < 4 hours playing with it, and I'm mighty
> impressed with what it can do for me once I get it working. Unfortunately
> I've spent about 3.5 of those hours inanely grappling with Entrez.read, so I
> turn to more experienced BioPythoneers for assistance.

Oh dear - were you working through the Entrez chapter in the Tutorial?
If not, what were you looking at?

> I'm trying to use Entrez to extract and manipulate records from PubMed, and
> I'm stumped. I was expecting the return value of Entrez.read to be a
> structured object, and instead it seems to return a string which would
> require further parsing to do anything useful with.

That doesn't sound right. Bio.Entrez.read() should take a handle to
data in XML format, and return a nested collection of Python objects.

> I'm not sure if this is the expected output and I have misunderstood, or if
> PubMed is just returning results in unexpected formats which break the
> parser in Entrez.read, or if Bio just doesn't work after midnight (2:06 am
> Australian EST).
>
> Is anyone able/willing to assist? The goal here is to have some way of
> extracting individual fields from the returned records, e.g. print out the
> Abstract for PMID 17206916.

First of all, handles give access to data via read() and other methods,
like readline():

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="pubmed", id="17206916")
>>> print handle.readline()
<html><head><title>PmFetch response</title></head><body>

So you see by default, the NCBI is returning HTML. We can ask for XML:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> print handle.readline()
<?xml version="1.0"?>

You could parse this with Bio.Entrez.read() if you wanted to:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record
[{u'MedlineCitation': ... ]
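
To get at the abstract itself, you walk down the nested objects. Here is a
minimal sketch, not a definitive recipe. The key path (MedlineCitation,
Article, Abstract, AbstractText) follows the PubMed XML element names. The
helper names extract_abstract and fetch_abstract are made up for illustration,
the email address is a placeholder, and accepting either a plain list (as
printed above) or a dict with a "PubmedArticle" key is an assumption meant to
cover different Biopython versions:

```python
def extract_abstract(parsed):
    """Return the abstract of the first article in a parsed PubMed result.

    `parsed` is whatever Entrez.read() returned. Hypothetical helper,
    written against the nested structure shown in the session above.
    """
    # Assumption: some Biopython versions return a list of articles,
    # others a dict holding that list under "PubmedArticle".
    if isinstance(parsed, dict):
        articles = parsed.get("PubmedArticle", [])
    else:
        articles = parsed
    if not articles:
        return None

    article = articles[0]["MedlineCitation"]["Article"]
    abstract = article.get("Abstract")
    if abstract is None:
        # Not every PubMed entry has an abstract.
        return None

    text = abstract["AbstractText"]
    if isinstance(text, str):
        return text
    # Structured abstracts arrive as several sections; join them.
    return " ".join(str(part) for part in text)


def fetch_abstract(pmid):
    """Fetch one PubMed entry as XML and return its abstract (needs network)."""
    # Imported here so extract_abstract() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder, use your own address
    handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
    try:
        return extract_abstract(Entrez.read(handle))
    finally:
        handle.close()
```

With network access, fetch_abstract("17206916") should then give you the
abstract text for that PMID, or None if the entry has no abstract.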

Or, rather than XML designed for a computer to parse, you could ask for
the plain-text MEDLINE format:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="text", rettype="medline")
>>> print handle.read()
PMID- 17206916
OWN - NLM
STAT- MEDLINE
DA  - 20070108
DCOM- 20070130
...
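
The MEDLINE text is a simple tagged format: a tag padded to four characters,
then "- ", then the value, with continuation lines indented by six spaces.
Biopython ships a Bio.Medline module for parsing this format. The sketch below
is not that module. It is only a rough pure-Python illustration of the layout,
and parse_medline is a hypothetical name. It assumes a single record and keeps
every value as a list, because tags can repeat:

```python
def parse_medline(text):
    """Parse one MEDLINE-format record into a dict of tag -> list of values.

    Illustrative sketch only, assuming the layout "TAG - value" with the
    dash in column 5 and continuation lines indented by six spaces.
    """
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("      ") and tag is not None:
            # Continuation of the previous field (e.g. a long abstract).
            record[tag][-1] += " " + line.strip()
        elif len(line) > 5 and line[4] == "-":
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
    return record


# The first few lines of the output shown above:
sample = (
    "PMID- 17206916\n"
    "OWN - NLM\n"
    "STAT- MEDLINE\n"
    "DA  - 20070108\n"
    "DCOM- 20070130\n"
)
fields = parse_medline(sample)
print(fields["PMID"][0])
```

The abstract, when present, is stored under the AB tag, so with the full
record you would look at fields["AB"].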

Does that help?

Peter


