[Biopython] need help! how to retrieve full text from Pubmed central ?

Tue Jan 5 11:33:26 UTC 2010

On Sat, Dec 26, 2009 at 2:37 PM, ning luwen <bioinformaticsing at gmail.com> wrote:
> Dear everyone,
>    I need to download full text from PubMed Central. After reading the
> Entrez manual, it seems that Entrez (not the web interface) doesn't give
> a way to download the full text as a .pdf file. Is this true?
>

According to the EFetch help, for PMC you can only retrieve XML
(although this does seem to give the full text):
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html
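
For what it's worth, fetching that XML can be sketched with just the standard library. This is a rough sketch rather than an official recipe: the helper names build_pmc_efetch_url and fetch_pmc_xml are made up for illustration, 2682512 is only an example ID, and I am assuming the record is addressed by the numeric part of its PMC ID. Bio.Entrez.efetch talks to the same EFetch service, so it should work equally well.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pmc_efetch_url(pmc_id, email=None):
    """Return an EFetch URL asking for one PMC record as XML.

    pmc_id is assumed to be the numeric part of the PMC ID.
    email is optional; NCBI asks scripts to identify themselves.
    """
    params = {"db": "pmc", "id": str(pmc_id), "retmode": "xml"}
    if email:
        params["email"] = email
    return EFETCH_URL + "?" + urlencode(params)


def fetch_pmc_xml(pmc_id, email=None):
    """Download the record and return the raw XML as bytes."""
    with urlopen(build_pmc_efetch_url(pmc_id, email)) as handle:
        return handle.read()


# Example (needs network access):
# xml = fetch_pmc_xml("2682512", email="you@example.org")
# open("PMC2682512.xml", "wb").write(xml)
```

Keeping the URL construction separate from the download makes it easy to check what will be requested before contacting the server.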

I had a look at the ELink documentation, and don't see any way
to use it to get a PDF link (e.g. to the publisher's site). You could
use the DOI, but that doesn't allow control over HTML vs PDF.

I think you should email the Entrez support team for advice
(and if you find out more, please let us know).

From playing with the PMC website, I eventually found a
URL which will work to get a PDF file, both in my browser
and via the command line tool wget:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf

However, it seems the default Python urllib user agent is
blocked for some reason. A quick search online shows
one way to override the user agent in Python, and if we
pretend to be the Firefox browser this now works:

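# Python 2 code: FancyURLopener lives in the urllib module there.
from urllib import FancyURLopener

url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf"
filename = "PMC2682512.pdf"

class FakeMozilla(FancyURLopener):
    # The opener sends this class attribute as its User-Agent header.
    version = ("Mozilla/5.0 (Windows; U; Windows NT 5.2; rv:1.9.2) "
               "Gecko/20100101 Firefox/3.6")

# Download the PDF to the local file.
FakeMozilla().retrieve(url, filename)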
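
On Python 3, where FancyURLopener is deprecated, the same idea can be expressed by setting the User-Agent header on a urllib.request.Request. Again this is only a sketch: the helper names build_pdf_request and download_pdf are hypothetical, the user-agent string is simply the one from the example above, and whether the server accepts it is an assumption on my part.

```python
import shutil
from urllib.request import Request, urlopen

# Browser-like user agent string, copied from the example above.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.2; rv:1.9.2) "
              "Gecko/20100101 Firefox/3.6")


def build_pdf_request(pmc_id, user_agent=USER_AGENT):
    """Build a request for the PDF of one PMC article.

    pmc_id includes the prefix, e.g. "PMC2682512".
    """
    url = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf" % pmc_id
    # Replace urllib's default "Python-urllib/x.y" user agent.
    return Request(url, headers={"User-Agent": user_agent})


def download_pdf(pmc_id, filename):
    """Stream the response body into a local file."""
    request = build_pdf_request(pmc_id)
    with urlopen(request) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)


# Example (needs network access):
# download_pdf("PMC2682512", "PMC2682512.pdf")
```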

So, while that does seem to work, it is *NOT* endorsed by the
NCBI. If you just want to download a few files, it may do the
trick, but I do think you should email the Entrez support team
for advice on how this *should* be done.

Regards,

Peter