[Biopython] save efetch results in different files

Peter biopython at maubp.freeserve.co.uk
Thu Apr 29 09:08:00 UTC 2010


On Wed, Apr 28, 2010 at 5:56 PM, Silvio Tschapke wrote:
>
> On Wed, Apr 28, 2010 at 11:57 AM, Peter wrote:
>>
>> On Wed, Apr 28, 2010 at 10:24 AM, Silvio Tschapke wrote:
>> > Hi all,
>> >
>> > I'd like to download hundreds of pubmed entries in one turn, but save
>> > every entry in a single file for further processing with e.g. NLTK.
>> > Is this possible? Or what is the common way to do this? Or do I have to
>> > call efetch for every single pmid? I don't know how.
>>
>> Personally I would probably save each pubmed result to a separate file
>> named using the pmid - a Unix filesystem should cope fine with a few
>> thousand files in a single directory. This is simple and lets you add more
>> entries at a later date, and you have simple access to any record.
>
> This is what I thought: to save each pubmed result to a separate file named
> using the pmid, as you can see in the code snippet.
> But it isn't working so far. Could you help me with the efetch_handle? I
> have called efetch one time with all pmids. So the efetch_handle contains
> all results. But now I need to pull out every single result from this handle
> to save it in a separate file with its pmid. And I don't know how to do it.
> Or is there no other way: do I have to call efetch for every pmid and then
> save it into a file inside the loop?
> Because Biopython recommends not making many queries per second, I
> thought it would be better to call efetch only once for all pmids.

The simplest answer is to make one efetch call per PMID, giving a single
record at a time which you can save to individual files. You can still do
this with the esearch+efetch history support. This does mean making
many small queries to the NCBI, rather than batching them together -
but the NCBI do not have any explicit guidelines on batch sizes.

Note - you would be making over 100 queries, so make sure you don't
run this during USA office hours!
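
As a rough sketch of the one-call-per-PMID approach (not tested against
the live NCBI servers), something like the following might do. The
function names, the output directory layout, the ".txt" suffix, the
one second delay and the placeholder email address are all assumptions
made for illustration. It is written for Python 3; Entrez.efetch is the
real Biopython call, and the fetch function is passed in as an argument
so the saving logic can be tried out without touching the network.

```python
import os
import time


def fetch_medline(pmid, email="you@example.com"):
    """Fetch one PubMed record as MEDLINE text using Bio.Entrez.efetch."""
    from Bio import Entrez  # imported here so the rest runs without Biopython

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    handle = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def save_one_per_file(pmids, out_dir, fetch=fetch_medline, delay=1.0):
    """Save each PMID's record to out_dir/<pmid>.txt, skipping existing files.

    Returns the list of paths written on this run.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for pmid in pmids:
        path = os.path.join(out_dir, "%s.txt" % pmid)
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        text = fetch(pmid)
        with open(path, "w") as out:
            out.write(text)
        written.append(path)
        time.sleep(delay)  # pause between queries (delay value is an assumption)
    return written
```

Skipping files that already exist means you can re-run the script after
adding more PMIDs to the list without fetching the old ones again.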

The more complex approach (which the NCBI might prefer) is to
download batches of records together (e.g. 50 PMID results at once).
If you wanted to save these to separate files, you would have to divide
the text up yourself. I think you just need to look for lines starting
"PMID-" so this shouldn't be too hard.
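
A minimal sketch of that splitting step, assuming the batch was fetched
as MEDLINE plain text and that every record really does begin with a
"PMID-" line (I have not checked this against every record type). The
function name split_medline is made up for this example.

```python
def split_medline(text):
    """Return a dict mapping each PMID to its own record text.

    Assumes each record starts with a line beginning "PMID-"; anything
    before the first such line is ignored.
    """
    records = {}
    current_pmid = None
    current_lines = []
    for line in text.splitlines(True):  # keep the line endings
        if line.startswith("PMID-"):
            if current_pmid is not None:
                # Close off the previous record before starting a new one
                records[current_pmid] = "".join(current_lines).rstrip() + "\n"
            current_pmid = line[len("PMID-"):].strip()
            current_lines = []
        if current_pmid is not None:
            current_lines.append(line)
    if current_pmid is not None:
        records[current_pmid] = "".join(current_lines).rstrip() + "\n"
    return records
```

Each value in the returned dict can then be written to a file named
after its key, e.g. open(pmid + ".txt", "w").write(record).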

Peter


