[Biopython] Parsing records off PubMed vs. PubMedCentral.

Thu Nov 19 11:03:04 UTC 2009

On Wed, Nov 18, 2009 at 11:19 PM, Jose C. Lacal <Jose.Lacal at openphi.com> wrote:
> Greetings:
>
> I'm just starting to use Biopython and this may be a dumb question.
>
> I've been following the excellent tutorial at
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc88
>
> My question refers to section 8.11.1
>
>
> a.) I am able to query, retrieve and parse files from db="pubmed" as per
> the code below. This works.
>
>
> from Bio import Entrez, Medline
> Entrez.email = "Jose.Lacal at OpenPHI.com"
>
> handle = Entrez.esearch(db="pubmed",
> term="hypertension[all]&George+Mason+University[affl]",
> rettype="medline", retmode="text")
>
> record = Entrez.read(handle)
> print record["IdList"]
>
> idlist = record["IdList"]
> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",
>                        retmode="text")
>
> records = Medline.parse(handle)
> for record in records:
>        print record["AU"]
>

OK, good :)

> b.) But when I change db="pubmed" to db="pmc" I get an error message:
> KeyError: 'AU'
>
> It looks like "pmc" does not have the same keys as "pubmed", and I've
> been unable to find the equivalent format to parse files downloaded from
> "pmc".
>
> Pointers and suggestions most appreciated. regards.

Correct - PubMed and PubMedCentral (PMC) are different databases and use
different identifiers. You can use Entrez ELink to map between them.
For example, the Biopython application note has PMID 19304878, but its
PMCID is 2682512.
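
The ELink mapping could be sketched roughly like this. The helper
names linked_ids and pmid_to_pmcids are made up for illustration, the
email argument is whatever address you give NCBI, and the nested layout
is an assumption about what Entrez.read returns for an ELink result
(a list of link sets, each with a "LinkSetDb" list):

```python
def linked_ids(linksets, linkname):
    """Collect target IDs from a parsed ELink result for one link name."""
    ids = []
    for linkset in linksets:
        for linksetdb in linkset.get("LinkSetDb", []):
            if linksetdb.get("LinkName") == linkname:
                ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids


def pmid_to_pmcids(pmid, email):
    """Ask ELink which PMC records correspond to a PubMed ID (needs network)."""
    from Bio import Entrez  # imported here so linked_ids works without Biopython
    Entrez.email = email
    # "pubmed_pmc" is assumed to be the link name for PubMed -> PMC full text.
    handle = Entrez.elink(dbfrom="pubmed", db="pmc", id=pmid)
    linksets = Entrez.read(handle)
    handle.close()
    return linked_ids(linksets, "pubmed_pmc")
```

If the link exists, pmid_to_pmcids("19304878", "you@example.com") would
be expected to include the PMCID quoted above, but I have not shown that
output here.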

>>> from Bio import Entrez
>>> print Entrez.efetch(db="pubmed",id="19304878",rettype="medline",retmode="text").read()
PMID- 19304878
OWN - NLM
STAT- MEDLINE
DA  - 20090515
DCOM- 20090709
LR  - 20091104
IS  - 1367-4811 (Electronic)
VI  - 25
IP  - 11
DP  - 2009 Jun 1
TI  - Biopython: freely available Python tools for computational molecular
      biology and bioinformatics.
PG  - 1422-3
...

Now, according to the documentation for EFetch, PMC should support
rettype="medline" (just like PubMed):
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html

>>> print Entrez.efetch(db="pmc",id="2682512", retmode="medline", rettype="text").read()
<html>
<body>
<br/><h2>Error occurred: Report 'text' not found in 'pmc'
presentation</h2><br/><ul title="some params from request:">
...
</html>

Odd. I also tried the XML from EFetch for PMC, but it fails to
validate. I wonder if this is an NCBI glitch? I have emailed them
about this.
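
One thing worth checking: the PMC call above passes retmode="medline"
and rettype="text", the reverse of the PubMed call, and the error
message complains about report 'text'. Here is a sketch with the two
arguments in the same order as the PubMed example. The helper names
medline_fetch_args and fetch_medline are hypothetical, and whether PMC
then returns MEDLINE text is an assumption based on the EFetch
documentation, not something demonstrated here:

```python
def medline_fetch_args(db, record_id):
    """Keyword arguments for an EFetch request in MEDLINE text format."""
    # rettype selects the report (medline); retmode selects the encoding (text).
    return {"db": db, "id": record_id, "rettype": "medline", "retmode": "text"}


def fetch_medline(db, record_id, email):
    """Fetch one record as MEDLINE text (needs network access and Biopython)."""
    from Bio import Entrez  # imported lazily; only needed for the real request
    Entrez.email = email
    handle = Entrez.efetch(**medline_fetch_args(db, record_id))
    try:
        return handle.read()
    finally:
        handle.close()
```

Building the arguments in one place means the PubMed and PMC requests
differ only in db and id, so a swapped pair cannot creep into one of them.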

In the meantime, I would suggest you just use PubMed rather than PMC.
PubMed covers more journals, but in less depth (citations and abstracts
rather than full text).
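
Even within PubMed, not every record carries every MEDLINE field, so
indexing record["AU"] directly can raise the same KeyError. A small
sketch of a more tolerant loop follows. The function names are made up
for illustration, and it relies on the records from Medline.parse
behaving like dictionaries:

```python
def authors(record):
    """Return the author list of a MEDLINE record, or [] if it has no AU field."""
    return record.get("AU", [])


def print_authors(records):
    """Print one line of authors per record, flagging records without any."""
    for record in records:
        names = authors(record)
        if names:
            print(", ".join(names))
        else:
            print("(no AU field for PMID %s)" % record.get("PMID", "?"))
```

You would call print_authors(Medline.parse(handle)) in place of the
for loop in your script.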

Peter