[BioRuby] Bio::PubMed efetch xml support and other options
Toshiaki Katayama
ktym at hgc.jp
Tue Nov 20 15:38:32 UTC 2007
Hi Kaustubh,
I've just committed a change so that Bio::PubMed.efetch and esearch wait for 3 seconds between consecutive queries.
I also renamed efetch2 and esearch2 (the newer versions, which accept E-Utils options as a hash) to efetch and esearch, replacing the old versions. The new efetch method breaks backward compatibility with the old one, which accepted a list of ids as variable-length arguments.
>>>> 1. efetch("123") --> OK
>>>> 2. efetch("123", "456") --> NG
>>>> 3. efetch(["123", "456"]) --> OK
Here, the PubMed IDs can be given as strings or numbers (or an array of them).
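Internally, that normalization could be done with something like the following (normalize_ids is a hypothetical helper for illustration, not the actual BioRuby code):

```ruby
# Hypothetical helper: accepts a single String/Integer or an Array of them
# and returns an Array of id strings, matching the accepted forms above.
def normalize_ids(ids)
  Array(ids).map(&:to_s)
end
```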
By the way, the efetch method currently returns the following error:
% ruby lib/bio/io/pubmed.rb
:
--- Retrieve PubMed entry by E-Utils ---
Wed Nov 21 00:23:20 +0900 2007
1: id: 16381885 Error occurred: PubMed article server is not avaliable
Wed Nov 21 00:23:23 +0900 2007
1: id: 16381885 Error occurred: PubMed article server is not avaliable
Is this a temporary problem?
I believe efetch2 was working when I implemented it.
Regards,
Toshiaki Katayama
On 2007/11/20, at 23:27, Kaustubh Patil wrote:
> Hi Toshiaki,
>
> Thanks for your email. Please find my answers embedded below;
>
> Thanks,
> kaustubh
>
> Toshiaki Katayama wrote:
>
>> Hi Kaustubh,
>>
>> On 2007/11/15, at 18:53, Kaustubh Patil wrote:
>>
>>
>>> Hi Toshiaki,
>>>
>>> Thank you very much for the improvements. There are some other desirable improvements;
>>>
>>> 1. PubMed has some timing restrictions on two consecutive queries. So it would be very nice if this could be implemented inside the functions themselves, like esearch/efetch.
>>>
>>
>> How about having the following method and calling it within the efetch and esearch methods, before Bio::Command.post_form?
>>
>> --------------------------------------------------
>> # Make no more than one request every 3 seconds.
>> @@ncbi_interval = 3
>> @@last_accessed = nil
>>
>> def wait_access
>>   if @@last_accessed
>>     duration = Time.now - @@last_accessed
>>     if duration < @@ncbi_interval
>>       sleep(@@ncbi_interval - duration)
>>     end
>>   end
>>   # remember the time of this access for the next call
>>   @@last_accessed = Time.now
>> end
>> --------------------------------------------------
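[A standalone version of this throttle can be exercised as below; the Throttle class name is illustrative, not BioRuby internals, and the interval is configurable so a demonstration need not wait the full 3 seconds.]

```ruby
# Standalone throttle sketch mirroring the logic above, with a configurable
# interval (3 seconds in the real NCBI case).
class Throttle
  def initialize(interval)
    @interval = interval
    @last_accessed = nil
  end

  # Block until at least @interval seconds have passed since the last call.
  def wait
    if @last_accessed
      duration = Time.now - @last_accessed
      sleep(@interval - duration) if duration < @interval
    end
    @last_accessed = Time.now
  end
end
```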
>>
> This could be a very good and quick implementation. In fact, I currently use something similar myself.
>
>> By the way, NCBI also have another restriction:
>>
>> http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html
>>
>>> Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.
>>>
>>> Do you think this should also be taken care of automatically?
>>>
>>>
> I am aware of those restrictions. It would be very nice if this could be taken care of automatically. There is a very good library for accessing/using Medline through R, called MedlineR (by the way, it is currently not downloadable, as their server is down). MedlineR handles this automatically.
>
> There is another improvement I am thinking about. It is not possible to fetch a large number of documents in one go. I suppose this is mainly because of practical restrictions on URL length, e.g. IE supports a maximum of 2,048 characters (although I am not aware of PubMed imposing any limits of its own). It would be useful (under some conditions) to split the fetch into a number of parts and then return the combined result. What do you think?
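[The splitting idea could be sketched like this; fetch_in_chunks is a hypothetical helper, and the block stands in for one efetch round-trip per batch.]

```ruby
# Sketch: split a long id list into fixed-size batches, yield each batch to
# the caller (one efetch request per batch), and concatenate the results.
def fetch_in_chunks(ids, chunk_size = 100)
  ids.each_slice(chunk_size).flat_map { |batch| yield(batch) }
end
```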
>
>>> 2. Mapping terms to MeSH (I couldn't find this!).
>>>
>>
>>
>> I'm not sure how to accomplish this.
>>
> I will do a bit more research on this and then get back to you.
>
>>
>>
>>> I will post other comments as I recollect them. I have another question (though this is perhaps not the most appropriate place for it):
>>>
>>> Is there any Ruby library which can do some basic text mining tasks, like tokenization, sentence boundary detection, etc.?
>>>
>>
>> I think so, but I'm not doing any text mining at the moment, sorry ;-)
>>
> I haven't found a Ruby library for that yet. I will keep searching.
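[For very simple needs, a naive regex-based sketch can go some way; this is purely an illustration, not a library recommendation, and real text mining needs proper handling of abbreviations, Unicode, etc.]

```ruby
# Naive, illustration-only tokenizer: pull out runs of word characters.
def tokenize(text)
  text.scan(/[A-Za-z0-9']+/)
end

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end
```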
>
> Cheers,
> Kaustubh
>
>> Thanks,
>> Toshiaki
>>
>>
>>
>>> Cheers,
>>> Kaustubh
>>>
>>> Toshiaki Katayama wrote:
>>>
>>>
>>>> Hi Patil,
>>>>
>>>> On 2007/11/12, at 20:09, Kaustubh Patil wrote:
>>>>
>>>>> XML is very nice for searching etc. PubMed documents can be fetched in various formats, including XML. I have changed the efetch method in the Bio::PubMed class in order to implement this. Here is the modified method:
>>>>>
>>>> An enhancement to accept retmode=xml sounds like a good idea, so I have just committed efetch2 and esearch2 methods, which can be better replacements for the efetch and esearch methods.
>>>>
>>>> Both methods are able to accept any E-Utils options as a hash.
>>>>
>>>> I will remove the suffix "2" from these methods if the following incompatibility is acceptable.
>>>>
>>>> * changing efetch(*ids) to efetch(ids, hash = {}) breaks compatibility
>>>> currently all of
>>>> 1. efetch("123")
>>>> 2. efetch("123", "456")
>>>> 3. efetch(["123", "456"])
>>>> are accepted but 2. will be unavailable.
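[One way all three calling forms could have been kept while still allowing an options hash is sketched below; efetch_compat is illustrative only, not the committed BioRuby code.]

```ruby
# Illustrative only: accept efetch("123"), efetch("123", "456"),
# efetch(["123", "456"]), each with an optional trailing options hash.
def efetch_compat(*args)
  hash = args.last.is_a?(Hash) ? args.pop : {}
  ids  = args.flatten.map(&:to_s)
  [ids, hash] # the real method would pass these on to E-Utils
end
```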
>>>>
>>>> Other notes:
>>>>
>>>> * the default value for the retmode option remains "text" for backward compatibility
>>>> * both methods are rewritten to use Bio::Command.post_form to make the code clear
>>>> * Bio::FlatFile is updated to accept recent MEDLINE entry format (UI -> PMID)
>>>>
>>>>
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics")
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics", {"retmax" => "500"})
>>>> puts Bio::PubMed.esearch2("(genome AND analysis) OR bioinformatics", {"rettype" => "count"})
>>>>
>>>> puts Bio::PubMed.efetch2("10592173")
>>>> puts Bio::PubMed.efetch2(["10592173", "14693808"], {"retmode" => "xml"})
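[Once retmode=xml output is in hand, it can be parsed with REXML from Ruby's standard library; the XML below is a minimal hand-made fragment for illustration, not real E-Utils output.]

```ruby
require 'rexml/document'

# Sketch: extract PMIDs from an efetch retmode=xml style result.
xml = <<XML
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>10592173</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>14693808</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>
XML

doc = REXML::Document.new(xml)
pmids = REXML::XPath.match(doc, "//PMID").map(&:text)
```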
>>>>
>>>>
>>>> Thanks,
>>>> Toshiaki Katayama
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioruby
>>>>
>>
>>