[Biopython] need help! how to retrieve full text from Pubmed central ?

Brad Chapman chapmanb at 50mail.com
Mon Jan 4 12:51:54 UTC 2010


Ning;

>      From http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html,
> I can learn:
>       PubMed Central contains a number of articles classified as "open
> access" for which you may download the full text as XML. For the
> remaining articles in PMC you may download only the abstracts as XML.
> 
> but when I try to run
> handle=Entrez.efetch(db='pmc',id=idlist,rettype='full',retmode='xml')
> record=Entrez.read(handle)
> 
> I get the following error:
>     Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py",
> line 258, in read
>     record = handler.read(handle)
>   File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/Parser.py",
> line 114, in read
>     raise CorruptedXMLError
> Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data.
> Please make sure that the input data are in XML format, and that the
> data are not corrupted.
> 
> the Biopython version is 1.53 and my system is Ubuntu 9.10.

Following your example, doing:

from Bio import Entrez
Entrez.email = 'yours@blah.com'  # contact address sent along with the request
handle = Entrez.efetch(db='pmc', id=2747014, rettype='full', retmode='xml')
handle.read()  # returns the raw XML text, unparsed

gives back the full XML text, as you wanted. Your next step, calling
Entrez.read, asks Biopython to parse that XML into a record object.
Biopython does not currently support parsing this full-text format,
which is why the call fails, and realistically a parsed record
probably isn't what you want anyway. If you are pulling down full
text like this, you are best served by parsing the XML directly with
something like ElementTree:

http://docs.python.org/library/xml.etree.elementtree.html

and pulling out the items you are interested in.
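
As a rough sketch of what that could look like: the snippet below parses
a small, made-up XML string rather than a real download, so it runs
without network access. The tag names (article-title, abstract, body, p)
are my assumption about the PMC article layout, and SAMPLE_XML,
element_text and extract_article are names invented for this example,
so check them against the XML you actually get back.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the XML returned by
# efetch with db='pmc'. Real records are much larger and may differ.
SAMPLE_XML = """<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <title-group>
          <article-title>An <italic>example</italic> title</article-title>
        </title-group>
        <abstract><p>Short abstract.</p></abstract>
      </article-meta>
    </front>
    <body>
      <sec><title>Introduction</title><p>First paragraph.</p></sec>
      <sec><title>Results</title><p>Second paragraph.</p></sec>
    </body>
  </article>
</pmc-articleset>"""


def element_text(elem):
    """Return all text inside elem, including nested inline tags,
    with runs of whitespace collapsed to single spaces."""
    return " ".join("".join(elem.itertext()).split())


def extract_article(xml_text):
    """Pull the title, abstract and body paragraphs out of the first article."""
    root = ET.fromstring(xml_text)
    title = root.find(".//article-title")
    abstract = root.find(".//abstract")
    body = root.find(".//body")
    paragraphs = []
    if body is not None:
        # iter() walks every <p> below <body>, at any nesting depth
        paragraphs = [element_text(p) for p in body.iter("p")]
    return {
        "title": element_text(title) if title is not None else None,
        "abstract": element_text(abstract) if abstract is not None else None,
        "paragraphs": paragraphs,
    }


article = extract_article(SAMPLE_XML)
print(article["title"])
for paragraph in article["paragraphs"]:
    print(paragraph)
```

With a real record you would pass the result of handle.read() to
extract_article in place of SAMPLE_XML. Articles that are not open
access may come back without a body, which is why the function checks
for missing elements instead of assuming they are present.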

Hope this helps,
Brad


